Date
Sep 5, 2024

Speakers
Noam Razin and Xi Ye

Details

Event Description

Video Link: Vanishing Gradients in Reinforcement Finetuning of Language Models

Abstract: Language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using policy gradient algorithms. In this talk, I will present recent work identifying a fundamental optimization obstacle in RFT. Namely, the expected gradient for an input vanishes when its reward standard deviation is small, even if the expected reward is far from optimal. Through a combination of theory and experiments, we demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful of inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for the successful execution of RFT.
The work covered in this talk was done in collaboration with Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, and Etai Littwin.
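To make the vanishing-gradient claim concrete, here is the standard policy gradient for a single input x, written in generic REINFORCE-style notation; this is a sketch for orientation, not an excerpt from the paper:

\nabla_\theta \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y) \right]
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ \left( r(x, y) - \mathbb{E}_{y' \sim \pi_\theta(\cdot \mid x)}\!\left[ r(x, y') \right] \right) \nabla_\theta \log \pi_\theta(y \mid x) \right]

Subtracting the mean reward as a baseline leaves the gradient unchanged, so when the reward standard deviation under \pi_\theta(\cdot \mid x) is close to zero, the centered reward is small for almost every sampled output and the whole expectation shrinks toward zero, no matter how far the expected reward is from its optimum.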

Bio: Noam Razin


Video Link: Interpretability Insights for Localized Fine-tuning on LLM Representations

Abstract: In this talk, we will discuss how interpretability insights can guide the effective and efficient steering of Large Language Models (LLMs) towards desired behaviors. We will focus on adapting LLMs through localized fine-tuning on LLM representations (activations of intermediate modules), inspired by mechanistic interpretability insights on the localization of knowledge and skills within LLMs. We introduce LoFiT (Localized Fine-Tuning), which first identifies a sparse set of attention heads that are most important for learning a desired behavior and then trains offset vectors that are added to the model’s hidden representations at these selected heads. Our experiments on truthfulness and reasoning tasks demonstrate that LoFiT can effectively adapt LLMs using limited training data and computational resources. Through a series of analyses, we highlight the effectiveness of the localization step, underscoring the importance of interpretability insights.
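
As a rough illustration of the mechanism described above, the sketch below adds trainable offset vectors to the outputs of a hand-picked set of attention heads while the base model stays frozen. The class name, (layer, head) indices, and dimensions are illustrative assumptions, not the released LoFiT code.

# Minimal sketch (assumed names, not the official LoFiT implementation):
# learn additive offset vectors for a sparse, pre-selected set of attention heads.
import torch
import torch.nn as nn

class HeadOffsets(nn.Module):
    # selected_heads: (layer, head) pairs judged most important for the target behavior
    # head_dim: dimensionality of each head's output
    def __init__(self, selected_heads, head_dim):
        super().__init__()
        self.offsets = nn.ParameterDict({
            f"{layer}_{head}": nn.Parameter(torch.zeros(head_dim))
            for layer, head in selected_heads
        })

    def forward(self, layer, head, head_output):
        # head_output: (batch, seq_len, head_dim) activations of one attention head
        key = f"{layer}_{head}"
        if key in self.offsets:
            # steer the hidden representation by adding the learned offset
            return head_output + self.offsets[key]
        return head_output

# Usage: freeze the base model and train only the offsets on downstream data
# (e.g., a truthfulness or reasoning task).
offsets = HeadOffsets(selected_heads=[(12, 3), (15, 7)], head_dim=64)
steered = offsets(12, 3, torch.randn(2, 10, 64))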

Bio: Xi Ye

Special Event