Yu Meng1*, Mengzhou Xia2*, Danqi Chen2
*Equal Contribution
1Computer Science Department, University of Virginia
2Princeton Language and Intelligence (PLI), Princeton University
The evolution of language models has been driven primarily by large-scale training, utilizing models with billions of parameters and vast datasets. While pre-training enables these models to predict tokens given any context, transforming them into effective conversational agents requires a technique known as Reinforcement Learning from Human Feedback (RLHF). This additional step helps shape models to follow human instructions and ensures they generate safe and helpful responses. However, traditional RLHF presents significant challenges, being both computationally demanding and complex to optimize.
Recent attempts to improve instruction-following rely on a self-improvement process: given a set of queries, the model generates multiple answers, which are then rated by another AI model, called the reward model. The rated answers are used to form preference pairs for further training of the instruction-following model. The crux of the issue is which training method to use on these pairs.
Together with Yu Meng from the University of Virginia and Danqi Chen from Princeton University, we developed a new training objective, SimPO---a simple yet effective alternative to RLHF. We applied SimPO to Google's gemma-2-9b-it model, which had already undergone extensive RLHF training. Further training the model using only 50,000 pairwise preference data points annotated by a strong reward model, ArmoRM, we developed a significantly improved chat model, gemma-2-9b-it-SimPO. This model is now the top-performing <10B parameter model on the LMSys Chatbot Arena, surpassing the original Google model by 14 places based on real human preferences. Notably, human users prefer this model over much larger models, including Llama-3-70B-Instruct, Claude 3 Sonnet, and Yi-Large. The model was trained for only 3 hours on a single H100 node (8 GPUs). It has resonated strongly with the community, achieving 100K monthly downloads on Hugging Face. To advance open-source research in preference optimization, we've released a comprehensive set of checkpoints trained with various algorithms. Our model series has now reached over 400K total downloads.
Please refer to our NeurIPS 2024 paper, open-source repository, and series of models for more details on the SimPO method.
Background on RLHF and Preference Learning
Training an AI chatbot isn't just about teaching it to generate responses—we need to ensure those responses are helpful, harmless, and honest. This raises a fundamental question: how do we determine which responses are "better"? Much like a boxing match needs a referee to determine the winner, we need systematic ways to evaluate and improve model responses. A common approach now uses advanced language models as judges to compare responses from two models, calculating a "win-rate"—the percentage of evaluations where one model's response is considered better than the other model's. For example, AlpacaEval 2 uses GPT-4 Preview (11/06) as the baseline model and uses the same model to judge which responses are better.
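As a toy illustration, the win-rate is simply the fraction of pairwise comparisons that the judge decides in one model's favor; the sketch below uses a hypothetical list of judge verdicts:

```python
# A minimal sketch of how a win-rate is computed from pairwise judge verdicts.
# `verdicts` is a hypothetical list where each entry is True if the judge
# preferred model A's response over model B's for that prompt.
def win_rate(verdicts):
    # Fraction of comparisons decided in model A's favor.
    return sum(verdicts) / len(verdicts)

print(win_rate([True, True, False, True]))  # model A wins 3 of 4 -> 0.75
```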
RLHF (Reinforcement Learning from Human Feedback) is a key technique to train models to produce better responses. It uses reinforcement learning to train models to generate responses that maximize rewards according to an external reward model. While this approach has been shown to be effective, it presents significant challenges: it's computationally expensive and difficult to optimize due to the multiple models involved in the process.
Recent advancements have simplified this process by eliminating the need for an external reward model and instead training directly on pairwise preference data in an end-to-end fashion. To illustrate, consider the following example prompt and two candidate responses:
- Question: What activity was Albert Einstein always known to do while he was at the Institute for Advanced Study in Princeton?
- Response 1 (Preferred): Einstein was known for taking long walks around the campus and town, often lost in thought.
- Response 2 (Not Preferred): Einstein was known for challenging students to impromptu games of chess on campus, claiming that winning would reveal the secret to faster-than-light travel.
In this case, Response 1 is preferred because it's factually accurate, while Response 2 contains fabricated information. This approach, known as preference learning, trains the language model directly on such pairs to maximize the probability of preferred responses while minimizing the probability of less preferred ones. To create these training pairs, we can either collect ratings from human evaluators or use off-the-shelf reward models for automated annotation. Both approaches help build the large-scale preference datasets needed for effective training.
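Concretely, a single example in such a preference dataset pairs a prompt with a preferred and a dispreferred response. Here is the illustration above written out in a common (but not standardized) open-source format; the field names are an assumption for illustration:

```python
# One preference-learning example from the illustration above, stored as a
# prompt with a preferred ("chosen") and a dispreferred ("rejected") response.
preference_example = {
    "prompt": (
        "What activity was Albert Einstein always known to do while he was "
        "at the Institute for Advanced Study in Princeton?"
    ),
    "chosen": (
        "Einstein was known for taking long walks around the campus and town, "
        "often lost in thought."
    ),
    "rejected": (
        "Einstein was known for challenging students to impromptu games of "
        "chess on campus, claiming that winning would reveal the secret to "
        "faster-than-light travel."
    ),
}
```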
SimPO: A Simpler Alternative to RLHF
SimPO simplifies the training objective by turning the reinforcement learning process into a supervised learning process, much like Direct Preference Optimization (DPO). SimPO is even simpler than DPO: it focuses on a single key quantity, the average log-likelihood of a sequence. To build the training dataset, we generate multiple responses from the model itself and use a strong off-the-shelf reward model to select the best and the worst response for each prompt. We then train the instruction-following model so that the average log-likelihood of the best response exceeds that of the worst response by a margin. For further details on the method, please refer to our paper.
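For concreteness, here is a minimal PyTorch sketch of this objective; the tensor names and hyperparameter values are illustrative assumptions, and the official training code is available in the SimPO GitHub repo:

```python
import torch
import torch.nn.functional as F

def simpo_loss(
    chosen_logps: torch.Tensor,    # summed token log-probs of the preferred responses
    rejected_logps: torch.Tensor,  # summed token log-probs of the dispreferred responses
    chosen_lens: torch.Tensor,     # token counts of the preferred responses
    rejected_lens: torch.Tensor,   # token counts of the dispreferred responses
    beta: float = 2.0,             # reward scaling factor (illustrative value)
    gamma: float = 0.5,            # target reward margin (illustrative value)
) -> torch.Tensor:
    # The length-normalized (average) log-likelihood acts as the implicit reward.
    chosen_reward = beta * chosen_logps / chosen_lens
    rejected_reward = beta * rejected_logps / rejected_lens
    # Push the preferred response's reward above the dispreferred one's by at least gamma.
    return -F.logsigmoid(chosen_reward - rejected_reward - gamma).mean()
```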
Through extensive evaluation, we show that SimPO outperforms DPO across multiple benchmarks and settings.
gemma-2-9b-it-SimPO: A Strong Chat Model
Using SimPO, we developed a strong chat model, gemma-2-9b-it-SimPO. In simple terms, we take prompts from the open-source dataset UltraFeedback and generate multiple responses for each prompt with Google's gemma-2-9b-it model. We then use the reward model ArmoRM to select the best and the worst response for each prompt. Notably, ArmoRM is trained on a large-scale preference dataset annotated by real human raters or by advanced language models such as GPT-4. This process yields 50k preference pairs, on which we train the gemma-2-9b-it model with SimPO for 1 epoch. The whole training run is highly efficient, taking less than 3 hours on 8 H100 GPUs, which makes it easy for the community to reproduce.
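The data-construction step can be summarized with a short sketch; `generate_fn` and `score_fn` below are hypothetical stand-ins for sampling from gemma-2-9b-it and scoring with ArmoRM, and only the best/worst selection logic is shown:

```python
from typing import Callable, Dict, List

def build_preference_pairs(
    prompts: List[str],
    generate_fn: Callable[[str, int], List[str]],  # prompt, n -> n sampled responses
    score_fn: Callable[[str, str], float],         # (prompt, response) -> scalar reward
    n_samples: int = 8,                            # samples per prompt (illustrative value)
) -> List[Dict[str, str]]:
    pairs = []
    for prompt in prompts:
        responses = generate_fn(prompt, n_samples)
        # Rank the sampled responses by their reward-model scores.
        ranked = sorted(responses, key=lambda r: score_fn(prompt, r))
        # The highest-scoring response is "chosen", the lowest is "rejected".
        pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    return pairs
```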
Surprisingly, this simple training procedure yields a much stronger chat model than the initial gemma-2-9b-it, which had already undergone a significant RLHF process. The resulting model, gemma-2-9b-it-SimPO, substantially improves on the original gemma-2-9b-it across a variety of benchmarks.
Below is a summary of gemma-2-9b-it-SimPO's rankings on different benchmarks:
Benchmark | Overall ranking | <10B ranking |
---|---|---|
AlpacaEval 2 | 8 | 8 (1 at the time of release) |
Arena-Hard | 14 | 1 |
WildBench | 21 | 1 |
LMSys Chatbot Arena | 28 | 1 |
AlpacaEval 2, Arena-Hard, and WildBench are chat-focused benchmarks that use model-based evaluations to compare outputs from different models and establish rankings. In contrast, LMSys Chatbot Arena relies on real human preferences to assess chat model performance. Our findings show that the SimPO-enhanced model achieves significant improvements over the original gemma-2-9b-it model across all benchmarks, with consistent gains observed in both model-based and human evaluations.
If you are interested in using SimPO for your development, please refer to the SimPO GitHub repo, the TRL repo, or the LLaMA-Factory repo for more details!
Acknowledgements
We thank Sanjeev Arora for his thoughtful guidance on polishing this blog post, and Danqi Chen, Yu Meng, and Sadhika Malladi for their careful proofreading and suggestions. We also appreciate Wei-Lin Chiang's help in evaluating our model on the Chatbot Arena platform.