Language models have demonstrated remarkable abilities in recent years, generating diverse and compelling text from simple human prompts. However, defining “good” text is challenging because it is subjective and context-dependent: you might want creativity in a story, truthfulness in an informative article, and executability in a code snippet.
Creating a loss function to capture these varied attributes is complex, and most language models are still trained with a simple next-token prediction loss such as cross-entropy. To compensate for this limitation, metrics like BLEU and ROUGE were designed to better reflect human preferences. Yet these metrics merely compare generated text to references using simple rules, which is still limiting. Ideally, human feedback on generated text would serve as the performance measure, or even better, as a loss used to optimize the model. This is the core idea behind Reinforcement Learning from Human Feedback (RLHF): leveraging reinforcement learning to directly optimize language models based on human feedback. RLHF enables aligning models trained on general text corpora with nuanced human values.
RLHF gained prominence through its successful application in ChatGPT. Given ChatGPT’s impressive capabilities, we asked it to explain RLHF:
It provides a decent overview, but misses some details. Let’s delve deeper and fill in the gaps!
RLHF: A Step-by-Step Breakdown
Reinforcement Learning from Human Feedback, also known as RL from human preferences, is intricate due to its multi-model training process and distinct deployment stages. We’ll simplify the training into three key steps:
- Pretraining a language model (LM).
- Gathering data and training a reward model.
- Fine-tuning the LM using reinforcement learning.
First, let’s examine language model pretraining.
Pretraining Language Models
RLHF starts with a language model already pretrained using standard pretraining objectives (refer to this blog post for details). OpenAI utilized a smaller version of GPT-3 for InstructGPT, their pioneering RLHF model. Anthropic’s research employed transformer models ranging from 10 million to 52 billion parameters for this initial stage. DeepMind documented using models up to 280 billion parameters like Gopher. It’s plausible these organizations now use even larger models in their RLHF-powered products.
This initial model can be further fine-tuned on additional text or conditions, though it is not always necessary. For instance, OpenAI fine-tuned on human-generated text deemed “preferable,” and Anthropic created their initial LM for RLHF by distilling an original LM on context clues for their “helpful, honest, and harmless” criteria. Both are expensive sources of augmented data, but neither is essential for understanding the fundamentals of RLHF. The crucial starting point is simply a model that responds well to diverse instructions.
Determining the “best” model for RLHF initiation lacks a definitive answer, a recurring theme in RLHF training where the design space remains largely unexplored.
Once a language model is established, the next step is generating data to train a reward model, integrating human preferences into the system.
Reward Model Training
Developing a reward model (RM), also known as a preference model, that accurately reflects human preferences is a relatively recent area of RLHF research. The primary objective is to create a model or system that receives text input and outputs a scalar reward, numerically representing human preference. This system could be an end-to-end LM or a modular setup. The scalar reward output is essential for seamless integration with existing RL algorithms later in the RLHF process.
Reward models can be fine-tuned LMs or models trained from scratch on preference data. Anthropic, for example, utilized a specialized fine-tuning method (preference model pretraining, PMP) for initialization, finding it more sample-efficient than standard fine-tuning. However, no single base model is universally recognized as optimal for reward models.
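As a rough illustration of what “text in, scalar out” can look like, here is a minimal sketch of a reward model built from a pretrained transformer backbone with a scalar head. The choice of `gpt2` as a backbone and the last-token pooling are illustrative assumptions, not the setup used by any of the labs mentioned above.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """Minimal preference/reward model: pretrained backbone + linear head producing one scalar."""

    def __init__(self, backbone_name: str = "gpt2"):  # backbone choice is illustrative
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Summarize the (prompt, response) pair with the hidden state of its last non-padding token.
        last_token = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last_token]
        return self.reward_head(pooled).squeeze(-1)  # shape: (batch_size,), one scalar per sequence

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model = RewardModel()

batch = tokenizer(["Prompt text ... candidate response"], return_tensors="pt", padding=True)
score = reward_model(batch["input_ids"], batch["attention_mask"])  # scalar "preferability" score
```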
The RM training dataset consists of prompt-generation pairs. Prompts are sampled from a predefined dataset (Anthropic’s data, generated with a chat tool on Amazon Mechanical Turk, is available on the Hub, while OpenAI used prompts submitted by users to the GPT API). These prompts are fed into the initial language model to generate text.
Human annotators then rank these generated text outputs. Initially, one might consider having humans directly assign scalar scores to each text segment to train the reward model. However, this approach is practically challenging due to inconsistent human scoring values, leading to uncalibrated and noisy data. Instead, ranking outputs and comparing them creates a more robust and regularized dataset.
Various text ranking methods exist. A successful approach involves users comparing text generated by two language models given the same prompt. By comparing model outputs head-to-head, an Elo system can rank models and outputs relative to each other. These diverse ranking methods are then normalized into a scalar reward signal for training.
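One common way to turn such pairwise rankings into a training signal for the reward model is a Bradley–Terry-style loss that pushes the score of the preferred completion above the rejected one. The snippet below is a generic sketch of that loss, not the exact objective used in any specific paper.

```python
import torch
import torch.nn.functional as F

def preference_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with scalar rewards the reward model assigned to two completions of the same prompts.
chosen = torch.tensor([1.2, 0.3])     # human-preferred completions
rejected = torch.tensor([0.4, -0.1])  # dispreferred completions
loss = preference_loss(chosen, rejected)  # backpropagated into the reward model during training
```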
An interesting observation in successful RLHF systems is the varying size of the reward model relative to the text-generation model. For instance, OpenAI used a 175B LM with a 6B reward model, while Anthropic used models from 10B to 52B for both, and DeepMind uses 70B Chinchilla models for both. Intuitively, a preference model needs comparable capacity to understand the text it scores as the model that generates it.
At this stage, the RLHF system includes an initial language model for text generation and a preference model scoring text based on human perception. Next, reinforcement learning (RL) is employed to optimize the original language model using the reward model.
Fine-tuning with RL
Training language models with reinforcement learning was once considered impractical due to engineering and algorithmic challenges. Organizations have successfully fine-tuned some or all parameters of a copy of the initial LM using Proximal Policy Optimization (PPO), a policy-gradient RL algorithm. Freezing some LM parameters is often necessary because fine-tuning massive models (10B+ parameters) is prohibitively expensive. (See Low-Rank Adaptation (LoRA) for LMs or Sparrow LM from DeepMind for more details). The optimal number of parameters to freeze remains an open research question. PPO, a mature algorithm with extensive guides, has been favored for scaling RLHF through distributed training. Key RL advancements in RLHF have focused on adapting familiar algorithms to update such large models.
Let’s frame fine-tuning as an RL problem. The policy is a language model taking a prompt and outputting text (or token probability distributions). The action space includes all tokens in the language model’s vocabulary (often ~50k tokens). The observation space is the vast range of input token sequences. The reward function combines the preference model and a policy shift constraint.
The reward function integrates all models into the RLHF process. For a prompt x from the dataset, text y is generated by the current fine-tuned policy iteration. This text, combined with the prompt, is fed into the preference model, yielding a scalar “preferability” score, r_θ. Additionally, per-token probability distributions from the RL policy are compared to those from the initial model to penalize policy drift. OpenAI, Anthropic, and DeepMind papers commonly use a scaled Kullback–Leibler (KL) divergence, r_KL, between these token distribution sequences. The KL divergence term prevents the RL policy from deviating too much from the initial pretrained model in each training batch, ensuring coherent text output. Without this penalty, optimization might generate nonsensical text that still tricks the reward model into giving high scores. In practice, KL divergence is approximated by sampling from both distributions (explained by John Schulman here). The final reward for the RL update rule is r = r_θ − λ·r_KL.
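To make the reward concrete, here is a rough sketch of how r = r_θ − λ·r_KL might be computed for a single sampled sequence, using the simple sample-based KL approximation mentioned above. The names `policy_model`, `initial_model`, and `preference_score` are placeholders, and the models are assumed to be Hugging Face-style causal LMs that return logits; this is not the exact code from any of the referenced papers.

```python
import torch
import torch.nn.functional as F

def per_token_logprobs(model, sequence_ids):
    """Log-probability the model assigns to each token of the (prompt + completion) sequence."""
    with torch.no_grad():
        logits = model(sequence_ids).logits[:, :-1, :]  # logits at position t predict token t+1
    logprobs = F.log_softmax(logits, dim=-1)
    return logprobs.gather(-1, sequence_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

def rlhf_reward(policy_model, initial_model, preference_score, sequence_ids, kl_coef=0.1):
    """Combined reward r = r_theta - lambda * r_KL for one sequence.
    The KL term is approximated by the summed per-token log-prob difference between
    the current RL policy and the frozen initial model."""
    policy_logprobs = per_token_logprobs(policy_model, sequence_ids)
    initial_logprobs = per_token_logprobs(initial_model, sequence_ids)
    approx_kl = (policy_logprobs - initial_logprobs).sum(dim=-1)  # non-negative in expectation
    return preference_score - kl_coef * approx_kl
```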
Some RLHF systems incorporate additional reward function terms. For instance, InstructGPT successfully experimented with blending pre-training gradients (from the human annotation set) into the PPO update rule. The reward function’s formulation will likely continue to evolve as RLHF research progresses.
Finally, the update rule is the PPO parameter update that maximizes reward metrics within the current data batch. PPO is a trust region optimization algorithm that constrains gradients to prevent destabilizing the learning process. DeepMind’s Gopher employed a similar reward setup but used synchronous advantage actor-critic (A2C) for gradient optimization, a notable difference not yet externally reproduced.
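The “trust region” behavior comes from PPO’s clipped surrogate objective, which caps how far the updated policy’s probability ratio can move away from the policy that generated the data. Below is a minimal sketch of the standard clipped objective (generic PPO, not any lab’s exact implementation).

```python
import torch

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate loss (to be minimized with gradient descent).
    new_logprobs / old_logprobs: per-token log-probs under the current and rollout policies.
    advantages: advantage estimates for each generated token (e.g., from GAE)."""
    ratio = torch.exp(new_logprobs - old_logprobs)                        # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```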
Note: The diagram simplifies the process. The RL policy generates text, which is then fed into the initial model to obtain probability distributions for the KL penalty. The initial model remains unchanged by gradient updates during training.
RLHF can be iteratively refined by jointly updating the reward model and the policy. As the RL policy evolves, users can continue ranking its outputs against earlier versions of the model. Anthropic refers to this as Iterated Online RLHF (see their original paper), where policy iterations are folded into the Elo ranking system across models. This introduces complex dynamics in the co-evolution of the policy and reward model, which remains an open and active research question.
Open-Source Tools for RLHF
OpenAI released the first code for RLHF on LMs in TensorFlow in 2019.
Currently, several active PyTorch repositories have emerged for RLHF, building upon this foundation. Key repositories include Transformers Reinforcement Learning (TRL), TRLX (a TRL fork), and Reinforcement Learning for Language models (RL4LMs).
TRL is designed for fine-tuning pretrained Hugging Face LMs with PPO. TRLX, an expanded fork of TRL built by CarperAI, handles larger models for online and offline training. TRLX currently offers a production-ready RLHF API with PPO and Implicit Language Q-Learning (ILQL) at the scales required for LLM deployment (e.g., 33 billion parameters). Future versions will support models up to 200B parameters, aimed at machine learning engineers with experience at that scale.
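For orientation, here is a rough sketch of what a single PPO step with TRL can look like. Class and method names follow TRL’s documented interface at the time of writing, but the exact API changes between versions, so treat this as an illustrative outline rather than copy-paste-ready code.

```python
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# NOTE: illustrative only -- check the TRL docs for the version you install.
config = PPOConfig(model_name="gpt2", learning_rate=1.41e-5, batch_size=1, mini_batch_size=1)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)      # RL policy (+ value head)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)  # frozen copy for the KL penalty
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = tokenizer("Explain RLHF in one sentence:", return_tensors="pt").input_ids[0]
generation = model.generate(query.unsqueeze(0), max_new_tokens=32, do_sample=True,
                            pad_token_id=tokenizer.eos_token_id)
response = generation[0][query.shape[0]:]   # keep only the newly generated tokens

# In a real pipeline this scalar comes from the trained reward model;
# TRL adds the KL penalty against ref_model internally.
reward = [torch.tensor(1.0)]

stats = ppo_trainer.step([query], [response], reward)  # one PPO update on this (query, response, reward) triple
```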
RL4LMs provides tools for fine-tuning and evaluating LLMs with a diverse set of RL algorithms (PPO, NLPO, A2C, TRPO), reward functions, and metrics. Its customizability allows training any encoder-decoder or encoder transformer LM on user-defined reward functions. It is well-tested and benchmarked across a broad range of tasks in recent work spanning 2000 experiments, providing practical insights on data budget comparisons (expert demonstrations vs. reward modeling), reward hacking, and training instabilities. Future plans for RL4LMs include distributed training of larger models and new RL algorithms.
Both TRLX and RL4LMs are under active development, promising further features.
A large dataset from Anthropic is available on the Hub.
What’s Next for RLHF?
Despite the promise and impact of RLHF, attracting significant attention from major AI research labs, limitations remain. Models, while improved, can still produce harmful or factually inaccurate text with undue confidence. This imperfection is a long-term challenge and motivation for RLHF – operating in a human domain means there’s no definitive endpoint for model “completion.”
Deploying RLHF systems involves costly human preference data collection due to the direct integration of human workers outside the training loop. RLHF performance is limited by the quality of human annotations, which are of two types: human-generated text (like in InstructGPT’s initial LM fine-tuning) and human preference labels between model outputs.
Generating high-quality human text for specific prompts is expensive, often requiring hiring staff rather than relying on product users or crowdsourcing. Fortunately, the scale of data used to train the reward model (~50k labeled preference samples) is not as expensive, though it still exceeds most academic lab budgets. Currently, only one large-scale dataset exists for RLHF on a general language model (from Anthropic), along with a few smaller task-specific datasets (such as summarization data from OpenAI). Another data challenge is annotator disagreement, which introduces variance into the training data without a single ground truth.
Given these limitations, numerous unexplored design options could significantly advance RLHF. Many lie in improving the RL optimizer. PPO is relatively old, and other algorithms could offer benefits. A major feedback fine-tuning cost is evaluating every generated text piece from the policy on the reward model. Offline RL could mitigate this by avoiding costly forward passes. Emerging algorithms like implicit language Q-learning (ILQL) [Talk on ILQL at CarperAI] are well-suited for this optimization type. Other RL trade-offs, like exploration-exploitation balance, are also under-documented. Exploring these directions could deepen our understanding of RLHF and potentially enhance performance.
We hosted a lecture on December 13, 2022, expanding on this post, available here!
Further Reading
Here are key RLHF papers to date. The field gained traction with the emergence of deep RL (around 2017) and has grown into the broader study of LLM applications at many large technology companies. Some RLHF papers that predate the focus on LMs:
And a snapshot of key papers showcasing RLHF’s performance for LMs:
The field merges multiple disciplines, offering resources in related areas:
Citation: If this was helpful for academic work, please cite:
Lambert, et al., "Illustrating Reinforcement Learning from Human Feedback (RLHF)", Hugging Face Blog, 2022.
BibTeX citation:
@article{lambert2022illustrating,
  author = {Lambert, Nathan and Castricato, Louis and von Werra, Leandro and Havrilla, Alex},
  title = {Illustrating Reinforcement Learning from Human Feedback (RLHF)},
  journal = {Hugging Face Blog},
  year = {2022},
  note = {https://huggingface.co/blog/rlhf},
}
Thanks to Robert Kirk, Stas Bekman, Peter Stone, Khanh X. Nguyen, Yoav Artzi, and Igor Kotenkov for their contributions to this article.