Language models have achieved remarkable progress in recent years, demonstrating an impressive ability to generate diverse and compelling text from human prompts. However, defining what constitutes “good” text remains a complex challenge, as it is inherently subjective and context-dependent. Applications range from creative storytelling, which demands imagination, to informative articles, which require factual accuracy, to code snippets, which need to be executable.
Creating a loss function that effectively captures these diverse attributes is incredibly difficult. Consequently, most language models are still trained using a straightforward next-token prediction loss, such as cross-entropy. To address the limitations of this loss function, metrics like BLEU and ROUGE have been developed to better assess human preferences. While these metrics offer improvements over the basic loss function in performance measurement, they rely on simple rules to compare generated text against references, thus remaining limited. Imagine if we could leverage human feedback directly to evaluate generated text and, even more powerfully, use this feedback as a loss function to guide model optimization. This is the core concept behind Reinforcement Learning From Human Feedback (RLHF): employing reinforcement learning techniques to directly optimize a language model based on human feedback. RLHF empowers language models to align their training on vast text datasets with nuanced human values.
The recent success of RLHF is prominently highlighted by its use in ChatGPT. Given ChatGPT’s exceptional capabilities, we asked it to explain RLHF:
It provides a surprisingly good overview, yet it misses some key details. Let’s delve deeper and fill in those gaps!
RLHF: A Step-by-Step Breakdown
Reinforcement Learning from Human Feedback, also known as Reinforcement Learning from human preferences, is a multifaceted concept involving a multi-model training process and distinct deployment stages. In this article, we will dissect the training process into three fundamental steps:
- Pretraining a Language Model (LM)
- Gathering data and training a Reward Model (RM)
- Fine-tuning the LM with Reinforcement Learning
Let’s begin by examining the pretraining of language models.
Pretraining Language Models
RLHF typically starts with a language model that has already undergone pretraining using standard pretraining objectives. For a detailed understanding of this process, refer to this blog post. OpenAI utilized a scaled-down version of GPT-3 for their pioneering RLHF model, InstructGPT. Anthropic, in their published research, employed transformer models ranging from 10 million to 52 billion parameters for this initial stage. DeepMind has documented using models as large as their 280 billion parameter Gopher. It’s highly probable that these organizations are employing even larger models in their RLHF-powered products today.
This initial model can be further fine-tuned on additional text or under specific conditions, although this is not always necessary. For instance, OpenAI fine-tuned their model on human-generated text deemed “preferable,” while Anthropic created their starting LM for RLHF by distilling an original LM based on context clues aligned with their “helpful, honest, and harmless” criteria. These methods represent sources of valuable, augmented data, but they are not essential for grasping the fundamental principles of RLHF. The crucial starting point for RLHF is having a model that demonstrates robust responsiveness to diverse instructions.
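To make this step concrete, below is a minimal sketch of that kind of supervised fine-tuning with the Transformers library: the model continues next-token prediction, but on a small set of curated demonstrations. The model name, demonstration data, and hyperparameters are placeholders, not what any of the labs above actually used.

```python
# Minimal supervised fine-tuning sketch: continue next-token prediction on a
# small set of "preferable" demonstrations. Model name and data are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for a much larger pretrained LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

demonstrations = [
    "Prompt: Explain RLHF briefly.\nResponse: RLHF fine-tunes a language model with human feedback ...",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in demonstrations:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # Causal LM loss: labels are the input ids, shifted internally by the model.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```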
Currently, there’s no definitive answer to the question of “which model” serves as the optimal starting point for RLHF. This lack of clear-cut answers will be a recurring theme throughout this discussion, as the design space of options within RLHF training remains largely unexplored.
Once a language model is in place, the next step involves generating data to train a reward model, which is the mechanism for incorporating human preferences into the system.
Caption: The initial phase of RLHF involves pretraining a language model on a vast corpus of text data, equipping it with a foundational understanding of language.
Reward Model Training
Generating a reward model (RM, also known as a preference model) calibrated with human preferences is where the relatively new research in RLHF begins. The primary objective is to create a model or system that accepts a text sequence as input and produces a scalar reward that numerically represents the degree of human preference for the given text. This system could be an end-to-end LM or a modular system that outputs a reward, for example, a model that ranks outputs and converts the ranking into a reward. The scalar nature of the reward is essential for seamless integration with existing RL algorithms later in the RLHF process.
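As a rough illustration of such a system, a reward model can be built by placing a scalar head on top of a pretrained transformer backbone. The sketch below uses a placeholder backbone and is a generic construction, not the exact architecture used in any of the papers discussed here.

```python
# Generic reward model sketch: a pretrained transformer backbone plus a linear
# head mapping the final hidden state to a single scalar reward per sequence.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    def __init__(self, backbone_name: str = "gpt2"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Summarize the sequence with the hidden state of its last non-padding token.
        last_index = attention_mask.sum(dim=1) - 1
        summary = hidden[torch.arange(hidden.size(0)), last_index]
        return self.reward_head(summary).squeeze(-1)  # shape: (batch,)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = RewardModel()
batch = tokenizer(["A prompt and its generated answer."], return_tensors="pt", padding=True)
print(reward_model(batch["input_ids"], batch["attention_mask"]))
```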
Reward models can be either fine-tuned LMs or LMs trained from scratch using preference data. Anthropic, for example, has utilized a specialized fine-tuning method (preference model pretraining, PMP) to initialize these models after pretraining. They found PMP to be more sample-efficient compared to standard fine-tuning. However, no single base model is universally recognized as the best choice for reward models.
The training dataset for the RM, consisting of prompt-generation pairs, is created by sampling prompts from a predefined dataset. Anthropic’s data, primarily generated using a chat tool on Amazon Mechanical Turk, is publicly available on the Hub. OpenAI utilized prompts submitted by users to the GPT API. These prompts are then fed into the initial language model to generate text outputs.
Human annotators are then tasked with ranking these generated text outputs. While it might seem intuitive to have humans directly assign scalar scores to each piece of text to train a reward model, this approach is practically challenging. The subjective nature of human values leads to uncalibrated and noisy scores. Instead, employing rankings to compare outputs from multiple models yields a significantly better-regularized dataset.
Several methods exist for ranking text. One successful approach involves having users compare generated text from two language models, both conditioned on the same prompt. By comparing model outputs in head-to-head matchups, an Elo system can be used to establish a ranking of models and outputs relative to each other. These diverse ranking methods are then normalized into a scalar reward signal, suitable for training the reward model.
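A common way to turn such rankings into a training signal for the reward model (used, for example, in the InstructGPT work) is a pairwise loss that pushes the score of the preferred ("chosen") output above that of the less-preferred ("rejected") one. The sketch below assumes a `reward_model` like the one above and hypothetical tokenized `chosen_batch` / `rejected_batch` inputs.

```python
# Pairwise preference loss sketch: train the reward model so that the chosen
# (human-preferred) completion scores higher than the rejected one.
# `reward_model`, `chosen_batch`, and `rejected_batch` are assumed to exist.
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    r_chosen = reward_model(chosen_batch["input_ids"], chosen_batch["attention_mask"])
    r_rejected = reward_model(rejected_batch["input_ids"], rejected_batch["attention_mask"])
    # -log sigmoid(r_chosen - r_rejected): minimized when chosen outscores rejected.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```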
An interesting observation from successful RLHF systems is the variability in the sizes of reward language models relative to the text generation models. For example, OpenAI used a 175B LM with a 6B reward model, while Anthropic employed LMs and reward models ranging from 10B to 52B. DeepMind utilizes 70B Chinchilla models for both the LM and the reward model. Intuitively, a preference model likely needs capacity comparable to the generating model in order to comprehend the text it is evaluating.
At this stage of the RLHF system, we have an initial language model capable of generating text and a reward model that can assess any text and assign a score reflecting human perception of its quality. The next step is to use reinforcement learning (RL) to optimize the original language model based on the reward model’s feedback.
Fine-tuning with RL
Training a language model using reinforcement learning was long considered an insurmountable challenge, both from engineering and algorithmic perspectives. However, multiple organizations have successfully fine-tuned some or all parameters of a copy of the initial LM using a policy-gradient RL algorithm, Proximal Policy Optimization (PPO). Freezing some parameters of the LM is often necessary because fine-tuning an entire 10B or 100B+ parameter model is prohibitively expensive. Techniques like Low-Rank Adaptation (LoRA) for LMs and DeepMind’s Sparrow LM address this challenge. The optimal strategy for determining how many parameters to freeze remains an open research question. PPO is a well-established algorithm with abundant resources, guides, and tutorials available. Its maturity made it a favorable choice for scaling up to the novel application of distributed training for RLHF. Many core advancements in applying RL to RLHF have focused on adapting familiar algorithms to update such large models efficiently.
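To illustrate the freezing idea alone (not LoRA or any specific lab's recipe), here is a small sketch that leaves only the top transformer blocks and output head of a GPT-2-sized policy trainable; layer names follow the Transformers GPT-2 implementation and other architectures differ.

```python
# Parameter-freezing sketch: only the last few transformer blocks and the LM
# head receive gradients during RL fine-tuning. Layer names are GPT-2 specific.
from transformers import AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("gpt2")

for param in policy.parameters():
    param.requires_grad = False

# Unfreeze the last two transformer blocks and the LM head.
for block in policy.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True
for param in policy.lm_head.parameters():
    param.requires_grad = True

trainable = sum(p.numel() for p in policy.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```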
Let’s formalize this fine-tuning task as an RL problem. The policy is a language model that takes a prompt and outputs a text sequence (or probability distributions over text). The action space encompasses all tokens in the language model’s vocabulary (often around 50k tokens). The observation space is the vast distribution of possible input token sequences. The reward function combines the preference model’s output with a constraint on policy shift.
The reward function is where all the previously discussed models integrate within the RLHF process. Given a prompt, $x$, from the dataset, the text $y$ is generated by the current iteration of the fine-tuned policy. This text, combined with the original prompt, is fed into the preference model, which returns a scalar “preferability” score, $r_\theta$. Furthermore, per-token probability distributions from the RL policy are compared to those from the initial model to calculate a penalty for their divergence. In numerous papers from OpenAI, Anthropic, and DeepMind, this penalty is designed as a scaled version of the Kullback–Leibler (KL) divergence between these sequences of token distributions, $r_\text{KL}$. The KL divergence term discourages the RL policy from deviating too drastically from the initial pretrained model in each training batch. This helps ensure the model continues to generate reasonably coherent text. Without this penalty, the optimization might lead to the generation of nonsensical text that, paradoxically, receives high rewards from the reward model. In practice, the KL divergence is approximated through sampling from both distributions, as explained by John Schulman here. The final reward signal used for the RL update rule is $r = r_\theta - \lambda r_\text{KL}$.
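Concretely, the penalty can be computed from sampled per-token log-probabilities of the policy and of the frozen initial model. The sketch below shows one common way to combine it with the preference score; the tensor shapes and the $\lambda$ coefficient (`kl_coef`) are illustrative assumptions rather than any paper's exact values.

```python
# KL-penalized reward sketch: combine the scalar preference score with a
# sampled approximation of KL(policy || initial model). Shapes and the
# kl_coef (lambda) value are illustrative.
def rlhf_reward(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    """
    rm_score:        preference-model score for the full text, shape (batch,)
    policy_logprobs: log-probs of the sampled tokens under the RL policy, shape (batch, seq)
    ref_logprobs:    log-probs of the same tokens under the frozen initial LM, shape (batch, seq)
    """
    # Sampled per-token approximation of the KL divergence.
    kl_per_token = policy_logprobs - ref_logprobs
    kl_penalty = kl_per_token.sum(dim=-1)   # r_KL, one value per sequence
    return rm_score - kl_coef * kl_penalty  # r = r_theta - lambda * r_KL
```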
Some RLHF systems incorporate additional terms into the reward function. For instance, OpenAI successfully experimented with InstructGPT by integrating additional pre-training gradients (from the human annotation set) into the PPO update rule. As RLHF research advances, the formulation of this reward function is likely to continue evolving.
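In code, that idea amounts to adding a scaled language-modeling loss on pretraining data to the RL objective. The function below is only a sketch: the policy model, the pretraining batch, and the mixing coefficient are assumed placeholders, not the InstructGPT configuration.

```python
# Sketch of mixing pretraining gradients into the RL objective. `ppo_loss` is
# the PPO loss from the surrounding training step; `policy_model` is a
# Transformers causal LM; `ptx_coef` is a placeholder mixing coefficient.
def ppo_with_pretraining_loss(ppo_loss, policy_model, pretrain_batch, ptx_coef=1.0):
    lm_outputs = policy_model(**pretrain_batch, labels=pretrain_batch["input_ids"])
    return ppo_loss + ptx_coef * lm_outputs.loss
```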
Finally, the update rule is the parameter update derived from PPO, aimed at maximizing the reward metrics within the current data batch (PPO is an on-policy algorithm, meaning parameters are updated only with the current batch of prompt-generation pairs). PPO is a trust region optimization algorithm that applies constraints to the gradient to ensure the update step does not destabilize the learning process. DeepMind employed a similar reward setup for Gopher but utilized synchronous advantage actor-critic (A2C) to optimize gradients, a notable difference that has not been externally reproduced.
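For reference, the heart of PPO's constraint is the clipped surrogate objective, shown below in its textbook form (policy term only, without the value and entropy terms) rather than any lab's exact implementation.

```python
# PPO clipped surrogate objective sketch (policy term only).
import torch

def ppo_policy_loss(new_logprobs, old_logprobs, advantages, clip_range=0.2):
    # Probability ratio between the updated policy and the policy that sampled the data.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # Maximize the surrogate objective = minimize its negation.
    return -torch.min(unclipped, clipped).mean()
```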
Caption: In the RL fine-tuning phase, the pretrained language model is optimized using reinforcement learning, guided by the reward model and human preferences, to generate more aligned and desirable text outputs.
Technical detail note: The diagram above might suggest that both models generate distinct responses for the same prompt. However, in reality, the RL policy generates text, and this text is then fed into the initial model to obtain its relative probabilities for the KL penalty. The initial model remains untouched by gradient updates during training.
Optionally, RLHF can be extended iteratively by continuously updating both the reward model and the policy. As the RL policy evolves, users can continue to rank its outputs against earlier versions of the model. Most research papers have not yet extensively discussed this iterative online RLHF approach, as the data collection mode required is primarily applicable to dialogue agents with access to an active user base. Anthropic refers to this option as Iterated Online RLHF (detailed in their original paper). Because the policy and reward model evolve together, this introduces complicated dynamics and remains an open area of research.
Open-Source Tools for RLHF
The first publicly released code for performing RLHF on LMs was from OpenAI in TensorFlow in 2019.
Currently, several active PyTorch repositories have emerged for RLHF, building upon this foundation. The primary repositories include Transformers Reinforcement Learning (TRL), TRLX, which originated as a fork of TRL, and Reinforcement Learning for Language models (RL4LMs).
TRL is designed for fine-tuning pretrained LMs within the Hugging Face ecosystem using PPO. TRLX, an expanded fork of TRL developed by CarperAI, is designed to handle larger models for both online and offline training. Currently, TRLX offers an API suitable for production-ready RLHF with PPO and Implicit Language Q-Learning (ILQL) at the scales required for LLM deployment (e.g., 33 billion parameters). Future iterations of TRLX will support language models with up to 200B parameters. Interfacing with TRLX is therefore optimized for machine learning engineers experienced with models at this scale.
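To give a feel for what this looks like in practice, here is a rough sketch of a single PPO step with TRL; the model name, prompt, and reward value are placeholders, and the exact class and method signatures may differ between TRL versions.

```python
# Rough sketch of one RLHF optimization step with TRL. Model name, prompt, and
# reward value are placeholders; the API may differ across TRL versions.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # stand-in for the pretrained LM being fine-tuned
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)  # frozen copy for the KL penalty

ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1), model, ref_model, tokenizer)

query = tokenizer("Explain RLHF in one sentence.", return_tensors="pt").input_ids
response = model.generate(query, max_new_tokens=32)[:, query.shape[1]:]

# In a real pipeline the reward comes from the trained reward model; here it is a dummy value.
reward = [torch.tensor(1.0)]
stats = ppo_trainer.step([query[0]], [response[0]], reward)
```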
RL4LMs provides building blocks for fine-tuning and evaluating LLMs using a diverse array of RL algorithms (PPO, NLPO, A2C, and TRPO), reward functions, and metrics. Its customizable nature facilitates the training of any encoder-decoder or encoder transformer-based LM on arbitrary user-defined reward functions. Notably, it is well-tested and benchmarked across a wide range of tasks in recent research, encompassing up to 2000 experiments. This research highlights practical insights into data budget comparisons (expert demonstrations vs. reward modeling), managing reward hacking, and addressing training instabilities. RL4LMs’ future plans include distributed training for larger models and the integration of new RL algorithms.
Both TRLX and RL4LMs are under active development, with more features anticipated in the near future.
A large dataset created by Anthropic is available on the Hub, providing valuable resources for RLHF research and development.
What’s Next for RLHF?
While RLHF techniques are incredibly promising and impactful, capturing the attention of major AI research labs, they still have clear limitations. Current models, although improved, can still generate harmful or factually inaccurate text without expressing uncertainty. This imperfection represents a long-term challenge and ongoing motivation for RLHF. Operating within the inherently human domain of values means there will likely never be a definitive endpoint for model “completeness.”
Deploying RLHF systems involves a significant cost in gathering human preference data due to the direct integration of human input outside the training loop. RLHF’s effectiveness is directly tied to the quality of human annotations, which come in two primary forms: human-generated text, as in fine-tuning the initial LM in InstructGPT, and human preference labels between model outputs.
Generating high-quality, human-written text in response to specific prompts is expensive, often requiring dedicated part-time staff rather than relying on product users or crowdsourcing. Fortunately, the data scale needed for training reward models in most RLHF applications (around 50k labeled preference samples) is less prohibitive. However, it still represents a higher cost than many academic labs can readily afford. Currently, only one large-scale dataset exists for RLHF on a general language model (from Anthropic), along with a few smaller task-specific datasets (such as summarization data from OpenAI). A further challenge in RLHF data is the potential for disagreement among human annotators, introducing substantial variance into the training data without a clear ground truth.
Despite these limitations, vast unexplored design options could enable RLHF to achieve significant progress. Many of these lie within improving the RL optimizer. PPO, while a relatively mature algorithm, is not structurally mandated, and other algorithms could offer benefits or variations to the existing RLHF workflow. A major computational cost in fine-tuning the LM policy with feedback is the need to evaluate every generated text piece from the policy on the reward model (as it acts as part of the environment in the standard RL framework). Offline RL could be employed as a policy optimizer to mitigate these costly forward passes of large models. Recent algorithms, such as implicit language Q-learning (ILQL) [Talk on ILQL at CarperAI], are particularly well-suited to this type of optimization. Other core trade-offs in the RL process, such as exploration-exploitation balance, remain largely undocumented. Exploring these avenues could significantly deepen our understanding of RLHF mechanisms and potentially lead to performance improvements.
We hosted a lecture on Tuesday, December 13, 2022, further elaborating on this post; you can watch it here!
Further Reading
Here is a compilation of influential papers on RLHF to date. The field gained prominence with the rise of DeepRL (around 2017) and has evolved into a broader study of LLM applications across major technology companies. Here are some RLHF papers predating the LM focus:
And here is a snapshot of the growing set of “key” papers demonstrating RLHF’s performance for LMs:
This field represents a convergence of multiple disciplines, and relevant resources can also be found in these areas:
Citation: If you find this article valuable for your academic work, please cite it as follows:
Lambert, et al., "Illustrating Reinforcement Learning from Human Feedback (RLHF)", Hugging Face Blog, 2022.
BibTeX citation:
@article{lambert2022illustrating,
  author  = {Lambert, Nathan and Castricato, Louis and von Werra, Leandro and Havrilla, Alex},
  title   = {Illustrating Reinforcement Learning from Human Feedback (RLHF)},
  journal = {Hugging Face Blog},
  year    = {2022},
  note    = {https://huggingface.co/blog/rlhf},
}
Acknowledgments to Robert Kirk for correcting factual inaccuracies regarding specific RLHF implementations. Thanks to Stas Bekman for fixing typos and clarifying confusing phrases. Gratitude to Peter Stone, Khanh X. Nguyen, and Yoav Artzi for expanding the related works section with historical context. Thanks to Igor Kotenkov for identifying a technical error in the KL-penalty term of the RLHF procedure, its diagram, and the accompanying description.