Reinforcement learning from human feedback (RLHF) is a central technique for aligning language models, and its success hinges on the quality of the underlying reward model. To achieve strong performance and alignment, researchers have been working to develop reward models that accurately reflect human preferences.
Researchers from ETH Zurich, the Max Planck Institute for Intelligent Systems in Tübingen, and Google Research have introduced a new method called West-of-N that enhances reward model quality by incorporating synthetic preference data into the training dataset. The method has been shown to significantly improve reward model performance compared to other synthetic preference generation approaches.
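The name plays on Best-of-N sampling: roughly, the best and worst of N sampled responses to a query are paired up as a synthetic preference example. The sketch below is a minimal illustration of that idea, not the paper's implementation; the `generate` and `reward` callables are hypothetical stand-ins for the policy model and the base preference (reward) model.

```python
def west_of_n_pair(prompt, generate, reward, n=8):
    """Build one synthetic preference pair, West-of-N style.

    Samples n candidate responses for the prompt, scores them with the
    base reward model, and returns (chosen, rejected) — the highest- and
    lowest-scoring candidates.
    """
    candidates = [generate(prompt) for _ in range(n)]
    ranked = sorted(candidates, key=lambda r: reward(prompt, r))
    # Best-of-n becomes the "chosen" response, worst-of-n the "rejected" one.
    return ranked[-1], ranked[0]
```

Pairs produced this way can then be added to the human preference data used to train the reward model.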
Experimental results demonstrate the effectiveness of West-of-N in enhancing reward model performance: its gains surpass those from collecting additional human feedback data, and it outperforms other synthetic preference generation methods. The study suggests further exploring techniques such as noisy student training to raise reward model performance in conjunction with West-of-N.
The study evaluates the West-of-N synthetic preference data generation method on the Reddit TL;DR summarization and Anthropic Helpful and Harmless dialogue datasets. Results indicate that West-of-N consistently improves reward model accuracy, Best-of-N sampling, and RL fine-tuning across different types of base preference data, demonstrating its effectiveness for language model alignment. If you are interested in the full details, check out the Paper.
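Best-of-N sampling, one of the evaluation axes above, is itself simple to state: draw N candidate responses and keep the one the reward model scores highest, so a better reward model directly yields better selections. A minimal sketch, again with hypothetical `generate` and `reward` callables:

```python
def best_of_n(prompt, generate, reward, n=4):
    """Best-of-N sampling: draw n candidates and return the one
    the reward model scores highest."""
    return max((generate(prompt) for _ in range(n)),
               key=lambda r: reward(prompt, r))
```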