Mar 22, 2024


Event Description

Abstract: While human-in-the-loop applications like search and chatbots are highly visible and impactful to everyday users, much of our economy also depends on batch computing systems that run behind the scenes to process and analyze massive amounts of financial, health, and educational data. Research and industry are increasingly excited about the potential of LLMs for data management. However, naively applying LLMs to thousands or millions of entries or documents in a database is expensive in both money and time. To build our understanding, we explored how to use LLMs for the long-studied data management task of generating structured tables from corpora of unstructured and semi-structured documents. We proposed Evaporate [ICLR 2023, VLDB 2024], an LLM-powered system for this data problem, which processes corpora 110x faster than natural LLM baselines while improving quality over previously proposed systems. Towards extending Evaporate-like systems, we note that the prevailing LLM architecture imposes a fundamental limitation on the throughput we can achieve in our batch computing workloads. Specifically, attention in the Transformer architecture consumes memory that grows with sequence length during inference. We thus explored whether attention-free architectures, which use fixed memory during inference, could help. Unfortunately, our analysis [ICLR 2024] provides theoretical and empirical arguments that pinpoint why prior attention-free proposals (built from gated convolutions) cannot perform recall, the language modeling skill of grounding generations in the information provided in-context, as efficiently as attention. Intuitively, recall is key to in-context learning tasks like data processing. We use the understanding from our analysis to build Based [ArXiv 2024], a new fixed-memory architecture that outperforms leading attention-free proposals (e.g. Mamba, GLA) on recall in language modeling pretraining and downstream data management tasks, with 24x faster generation than the well-optimized FlashAttention-2 implementation (1024 tokens, 1.3B parameters, batch size 128). We’ve learned some lessons through studying and building attention-free architectures and are excited to discuss them and hear feedback during this talk session!

Bio: Simran Arora is a PhD student at Stanford University advised by Chris Ré. Her research is in machine learning systems, and lately she has been excited about building efficient and usable systems for data management. Simran also loves teaching, and recently co-created and taught a new Systems for Machine Learning course to 100+ students at Stanford. She is grateful for the support of the SGF Sequoia Fellowship. 

Event Organized by PLI
Special Event