Apr 25, 2024


Event Description

Abstract: What data should we train large-scale models on? Standard data selection approaches tend to filter for data that matches handpicked notions of quality. Such filtering yields data that *looks* cleaner, but does it improve model performance?

In this talk, we first show that in practice, these methods often do not help (and can even hurt) performance compared to selecting data at random. To better select data, we move away from such handpicked notions of quality, and instead present an optimization-based perspective on data selection. Finally, we show that our resulting "model-aware" data selection strategies greatly improve language model performance.

Bio: Logan Engstrom is a PhD student at MIT advised by Prof. Aleksander Mądry. His research focusses on improving and understanding machine learning through the lens of the training data. He is a recipient of the Google PhD fellowship.

Event Organized by PLI
Special Event