Human Feedback Reinforcement Learning: Aligning AI with Human Values

The past few years have witnessed remarkable advancements in language models, showcasing their ability to generate diverse and compelling text from simple human prompts. However, defining what constitutes “good” text remains a complex challenge, as it is inherently subjective and context-dependent. Whether it’s the creativity desired in storytelling, the factual accuracy needed in informative content, or the executability required in code snippets, the criteria vary significantly.

Creating a loss function that effectively captures these nuanced attributes has proven to be incredibly difficult. Consequently, most language models are still trained using a basic next token prediction loss, such as cross-entropy. To address the limitations of this approach, metrics like BLEU and ROUGE have been developed to better assess text quality by comparing generated text to references. While these metrics offer improvements over simple loss functions, they are still rule-based comparisons and fall short of truly reflecting human preferences. Imagine if we could directly leverage human feedback to evaluate generated text and, even more powerfully, use this feedback as a loss function to optimize the model itself. This is the core idea behind Reinforcement Learning from Human Feedback (RLHF), a methodology that employs reinforcement learning techniques to directly optimize language models based on human feedback. RLHF is pivotal in enabling language models to align their training on vast datasets with the intricate and often subtle values of human beings.

The recent success of RLHF is most prominently illustrated by its application in ChatGPT. Given ChatGPT’s impressive capabilities, we prompted it to explain RLHF in its own words.

While ChatGPT’s explanation is surprisingly insightful, it doesn’t encompass the full picture. Let’s delve deeper and fill in the gaps to provide a comprehensive understanding of RLHF.

RLHF: A Step-by-Step Breakdown

Reinforcement Learning from Human Feedback, also known as reinforcement learning from human preferences, is a multifaceted concept involving a multi-model training process and distinct deployment phases. In this article, we will dissect the training process into three fundamental steps:

  1. Pre-training a language model (LM).
  2. Gathering data and training a reward model.
  3. Fine-tuning the LM using reinforcement learning.

Let’s begin by examining the initial stage: pre-training language models.

Pre-training Language Models

RLHF typically starts with a language model that has already undergone pre-training using conventional pre-training objectives. For a deeper understanding of this process, refer to this blog post. OpenAI utilized a smaller version of GPT-3 for their pioneering RLHF model, InstructGPT. Anthropic’s research papers detail the use of transformer models ranging from 10 million to 52 billion parameters for this initial phase. DeepMind has documented the use of models as large as their 280 billion parameter model, Gopher. It is highly probable that these leading AI organizations employ even larger models in their RLHF-powered products today.

This initial model can be further fine-tuned on additional text or under specific conditions, although this is not always necessary. For instance, OpenAI fine-tuned their model on human-generated text considered “preferable,” and Anthropic created their initial LM for RLHF by distilling an original LM based on context clues related to their “helpful, honest, and harmless” criteria. These methods represent sources of valuable, augmented data, but they are not essential for grasping the core principles of RLHF. The critical starting point for RLHF is having a model that demonstrates robust responsiveness to diverse instructions.
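
To make this starting point concrete, here is a minimal sketch, assuming the Hugging Face transformers library and a small placeholder model, of loading a conventionally pre-trained LM and optionally fine-tuning it on curated demonstration text with the standard next-token cross-entropy loss:

```python
# A minimal sketch of the RLHF starting point: load a conventionally pre-trained
# causal LM and, optionally, fine-tune it on curated "preferable" demonstrations
# with the usual next-token cross-entropy loss. The model name and the
# demonstration data are placeholders for illustration only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in; production RLHF systems use far larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

demonstrations = ["Prompt: ...\nResponse: ..."]  # hypothetical curated text

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for text in demonstrations:
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    # Passing labels makes the model compute the shifted next-token
    # cross-entropy loss internally.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```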

Determining the “best” model for initiating RLHF is not straightforward. This ambiguity is a recurring theme in RLHF, as the design space of options within RLHF training remains largely unexplored.

Once a suitable language model is established, the next crucial step is to generate data for training a reward model. This reward model is the mechanism through which human preferences are integrated into the system.

Illustration of the pre-training phase in Reinforcement Learning from Human Feedback (RLHF), where a base language model is initially trained on a large corpus of text data.

Reward Model Training

The development of a reward model (RM), also known as a preference model, that aligns with human preferences is a key innovation within RLHF research. The primary objective is to create a model or system that can take a text sequence as input and output a scalar reward. This reward should numerically represent the degree of human preference for the given text. This system could be an end-to-end language model or a modular setup that outputs a reward, such as a model that ranks outputs and converts the ranking into a reward score. The scalar nature of the reward is essential for seamless integration with existing Reinforcement Learning algorithms in subsequent RLHF stages.
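
Concretely, one common way to build such a system is to attach a single-output head to an LM backbone so that any text maps to one scalar score. The sketch below assumes the Hugging Face transformers library and a placeholder base model, purely for illustration:

```python
# A minimal sketch of a reward/preference model: an LM backbone with a
# single-output head that maps any (prompt, completion) text to one scalar.
# The base model is a placeholder, not what any of the labs above used.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

base = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

reward_model = AutoModelForSequenceClassification.from_pretrained(
    base, num_labels=1  # one logit == one scalar "preferability" score
)
reward_model.config.pad_token_id = tokenizer.pad_token_id

text = "Prompt: Explain RLHF.\nResponse: RLHF optimizes an LM with human feedback."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    reward = reward_model(**inputs).logits[0, 0]  # scalar reward for this text
```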

Reward models can be either fine-tuned language models or models trained from scratch using preference data. Anthropic, for example, has utilized a specialized fine-tuning method called preference model pre-training (PMP) for initializing these models after pre-training. They found PMP to be more sample-efficient than standard fine-tuning. However, no single base model is universally recognized as the optimal choice for reward models.

The training dataset for the reward model, consisting of prompt-generation pairs, is created by sampling prompts from a predefined dataset. (Anthropic’s data, primarily generated using a chat tool on Amazon Mechanical Turk, is publicly available on the Hugging Face Hub as the Anthropic/hh-rlhf dataset. OpenAI utilized prompts submitted by users to the GPT API.) These prompts are then fed into the initial language model to generate text outputs.

Human annotators are then tasked with ranking these generated text outputs based on preference. Initially, one might consider having humans directly assign scalar scores to each text piece to train the reward model. However, this approach is often impractical due to the inherent subjectivity and variability in human scoring, leading to uncalibrated and noisy data. Instead, using rankings to compare outputs from multiple models provides a more robust and regularized dataset.

Various methods exist for ranking text. A successful approach involves having users compare generated text from two language models conditioned on the same prompt in head-to-head matchups. By comparing model outputs in this manner, an Elo system can be used to establish a ranking of models and their outputs relative to each other. These different ranking methodologies are then normalized to produce a scalar reward signal suitable for training the reward model.
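
Although the details vary by lab, a common way to turn these comparisons into a training signal is a pairwise loss that pushes the reward of the preferred completion above that of the rejected one; a minimal PyTorch sketch (with made-up scores) looks like this:

```python
# A sketch of the pairwise preference loss commonly used to train reward models:
# for each prompt with a preferred ("chosen") and a less-preferred ("rejected")
# completion, push the chosen score above the rejected score. The scores below
# are made-up numbers for illustration.
import torch
import torch.nn.functional as F

def preference_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

chosen = torch.tensor([1.2, 0.3, 2.0])    # reward-model scores for preferred texts
rejected = torch.tensor([0.4, 0.5, 1.1])  # scores for the rejected alternatives
loss = preference_loss(chosen, rejected)  # small when chosen scores are clearly higher
```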

An interesting observation from successful RLHF implementations is the varying sizes of reward language models relative to the text generation models. For example, OpenAI used a 175B parameter LM with a 6B reward model, while Anthropic employed LMs and reward models ranging from 10B to 52B. DeepMind utilized 70B Chinchilla models for both LM and reward functions. Intuitively, preference models may require a similar capacity to comprehend the text they evaluate as the models that generate the text in the first place.

At this stage of the RLHF process, we have an initial language model capable of generating text and a reward model that can assess text and assign a score reflecting human preference. The next step involves using reinforcement learning (RL) to optimize the initial language model based on the feedback from the reward model.

Diagram illustrating the reward model training phase of Reinforcement Learning from Human Feedback (RLHF), where human feedback is used to train a model to predict human preferences.

Fine-tuning with RL

For a considerable period, training language models with reinforcement learning was considered a daunting task, both from an engineering and algorithmic standpoint. However, several organizations have successfully fine-tuned language models using policy-gradient RL algorithms, specifically Proximal Policy Optimization (PPO). This fine-tuning typically involves a copy of the initial language model, where some or all parameters are adjusted. Freezing some parameters is often necessary because fine-tuning an entire 10B or 100B+ parameter model can be computationally prohibitive. (For more efficient fine-tuning techniques, explore Low-Rank Adaptation (LoRA) for LMs or DeepMind’s Sparrow LM.) The optimal strategy for determining how many parameters to freeze or fine-tune remains an active area of research.

PPO is a well-established algorithm with extensive resources and guides available (e.g., OpenAI Spinning Up and the Hugging Face blog on PPO). Its maturity made it a suitable choice for scaling RLHF to large language models using distributed training. Many core RL advancements in RLHF have focused on adapting and scaling familiar algorithms like PPO to update these massive models effectively.
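
To make the parameter-freezing point above concrete, here is a minimal sketch, assuming a GPT-style transformers model; the choice of which blocks to unfreeze is arbitrary and purely illustrative:

```python
# A sketch of partial-parameter fine-tuning: freeze everything, then unfreeze
# only the last few transformer blocks. The model and the cutoff of two blocks
# are arbitrary choices for illustration.
from transformers import AutoModelForCausalLM

policy = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder policy model

# Freeze all parameters first.
for param in policy.parameters():
    param.requires_grad = False

# Unfreeze the last two transformer blocks so only they receive RL gradients.
for block in policy.transformer.h[-2:]:
    for param in block.parameters():
        param.requires_grad = True

# Note: GPT-2 ties the LM head to the input embeddings, so unfreezing the head
# would also unfreeze the embedding matrix; here we leave both frozen.
```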

Let’s formalize this fine-tuning task as a reinforcement learning problem. The policy is the language model, which takes a prompt as input and generates a text sequence (or probability distributions over text). The action space encompasses all tokens in the language model’s vocabulary (often around 50,000 tokens). The observation space is the distribution of possible input token sequences, which is vast compared to typical RL applications. The reward function integrates the preference model with a constraint on policy shift.

The reward function is where all components of the RLHF system converge. Given a prompt, x, from the dataset, the current iteration of the fine-tuned policy generates text y. This combined prompt and text is fed into the preference model, which outputs a scalar “preferability” score, r_θ. Additionally, to prevent drastic deviations from the initial pre-trained model, per-token probability distributions from the RL policy are compared to those from the initial model. A penalty based on the Kullback–Leibler (KL) divergence, r_KL, between these distributions is applied. This KL divergence term, scaled by a factor λ, discourages the RL policy from significantly diverging from the initial pre-trained model in each training batch. This helps maintain the coherence of the generated text, preventing the optimization process from generating nonsensical text that might still trick the reward model into giving a high score. In practice, KL divergence is approximated through sampling from both distributions, as explained by John Schulman here. The final reward signal used for the RL update rule is calculated as r = r_θ − λ·r_KL.
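
As a rough sketch of how this combined reward might be computed (variable names, tensor shapes, and the KL coefficient are illustrative assumptions, not any lab’s exact implementation):

```python
# A sketch of the KL-penalized reward: compare the RL policy's log-probs on its
# own generated tokens with the frozen initial model's log-probs, and subtract
# the scaled penalty from the preference model's score.
import torch

def kl_penalized_reward(policy_logprobs: torch.Tensor,    # (gen_len,) log p_RL(y_t)
                        ref_logprobs: torch.Tensor,       # (gen_len,) log p_init(y_t)
                        preference_score: torch.Tensor,   # scalar r_theta from the reward model
                        kl_coef: float = 0.1) -> torch.Tensor:
    # Sample-based per-token KL estimate: log p_RL(y_t) - log p_init(y_t).
    kl_per_token = policy_logprobs - ref_logprobs
    # r = r_theta - lambda * r_KL, with the KL penalty summed over generated tokens.
    return preference_score - kl_coef * kl_per_token.sum()
```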

Some RLHF systems incorporate additional terms into the reward function. For instance, OpenAI successfully experimented with InstructGPT by incorporating additional pre-training gradients (from the human annotation set) into the PPO update rule. As RLHF research progresses, the formulation of this reward function is likely to continue evolving.

Finally, the update rule is the parameter update derived from PPO, which aims to maximize the reward metrics within the current data batch. (PPO is an on-policy algorithm, meaning parameters are updated based only on the current batch of prompt-generation pairs.) PPO is a trust region optimization algorithm that uses gradient constraints to ensure stable learning. DeepMind’s Gopher used a similar reward setup but employed synchronous advantage actor-critic (A2C) for gradient optimization, a notable difference that has not yet been widely replicated.
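
For reference, the heart of PPO’s policy update is the clipped surrogate objective; a minimal PyTorch sketch, with illustrative argument names, looks like this:

```python
# A minimal sketch of PPO's clipped surrogate objective: the probability ratio
# between the current and the sampling-time policy is clipped so a single batch
# cannot push the policy too far.
import torch

def ppo_policy_loss(new_logprobs: torch.Tensor,   # log-probs under the current policy
                    old_logprobs: torch.Tensor,   # log-probs when the text was sampled
                    advantages: torch.Tensor,     # per-token advantage estimates
                    clip_range: float = 0.2) -> torch.Tensor:
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantages
    # PPO maximizes the element-wise minimum; we return the negated mean as a loss.
    return -torch.min(unclipped, clipped).mean()
```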

Visual representation of the fine-tuning stage in Reinforcement Learning from Human Feedback (RLHF), where reinforcement learning algorithms are used to optimize the language model based on rewards from the reward model.

Technical Note: While the diagram might suggest separate responses from both models for the same prompt, in reality, the RL policy generates text, and this text is then fed into the initial model to obtain its relative probabilities for the KL penalty calculation. The initial model remains unchanged by gradient updates during this training phase.

Optionally, RLHF can be extended through iterative updates of both the reward model and the policy. As the RL policy evolves, human annotators can continue to rank outputs from the current policy against earlier versions. Most research papers have yet to extensively explore this iterative approach, primarily because the data collection mode required for this type of continuous feedback is best suited for dialogue agents with an active user base. Anthropic refers to this as Iterated Online RLHF (detailed in their original paper), where different policy iterations are integrated into the Elo ranking system across models. This introduces complex dynamics as both the policy and reward model evolve concurrently, presenting a challenging and open research area.

Open-Source Tools for RLHF

The first publicly available code for performing RLHF on LMs was released by OpenAI in TensorFlow in 2019.

Currently, several active repositories for RLHF in PyTorch have emerged, building upon this initial work. Key repositories include Transformers Reinforcement Learning (TRL), TRLX, which originated as a fork of TRL, and Reinforcement Learning for Language models (RL4LMs).

TRL is designed for fine-tuning pre-trained LMs within the Hugging Face ecosystem using PPO. TRLX, an expanded fork of TRL by CarperAI, is engineered to handle larger models for both online and offline training. TRLX currently offers a production-ready API for RLHF with PPO and Implicit Language Q-Learning (ILQL) at the scales necessary for large language model deployment (e.g., 33 billion parameters). Future versions of TRLX aim to support models up to 200B parameters. Interfacing with TRLX is optimized for machine learning engineers experienced with large-scale models.
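
As a rough idea of what using these libraries looks like, here is a sketch of a single TRL-style PPO step. It follows the general shape of TRL’s documented workflow, but class names, configuration fields, and method signatures vary between versions, so treat it as an outline rather than copy-paste code:

```python
# A rough sketch of a single TRL-style PPO step, following the general shape of
# TRL's documented workflow. Class names, config fields, and method signatures
# vary between TRL versions, so consult the TRL documentation for the exact API.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

config = PPOConfig(model_name="gpt2", batch_size=1, mini_batch_size=1)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

# Policy (with a value head for PPO) plus a frozen reference copy for the KL penalty.
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query = tokenizer("Explain RLHF in one sentence.", return_tensors="pt").input_ids
output = model.generate(query, max_new_tokens=32, pad_token_id=tokenizer.eos_token_id)
response = output[0, query.shape[1]:]  # keep only the newly generated tokens

# In a real run this score would come from the trained reward model.
reward = torch.tensor(0.9)
stats = ppo_trainer.step([query[0]], [response], [reward])
```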

RL4LMs provides building blocks for fine-tuning and evaluating LLMs with a broad spectrum of RL algorithms (PPO, NLPO, A2C, and TRPO), reward functions, and metrics. Its highly customizable nature allows for training any encoder-decoder or encoder transformer-based LM with user-defined reward functions. Notably, RL4LMs is well-tested and benchmarked across diverse tasks in recent research, encompassing up to 2000 experiments and offering practical insights into data budget comparisons (expert demonstrations vs. reward modeling), handling reward hacking, and addressing training instabilities. Future development plans for RL4LMs include distributed training for larger models and the integration of new RL algorithms.

Both TRLX and RL4LMs are under active development, with more features anticipated soon.

A significant dataset created by Anthropic is also available on the Hugging Face Hub, providing valuable resources for the RLHF community.

What’s Next for RLHF?

While RLHF techniques are exceptionally promising and impactful, capturing the attention of major AI research labs, they still face clear limitations. Despite improvements, models can still produce harmful or factually inaccurate text without indicating uncertainty. This imperfection is a persistent challenge and a key motivator for RLHF’s ongoing development. Operating in a domain deeply intertwined with human values means there may never be a definitive endpoint for model “completeness.”

Deploying RLHF systems involves a significant cost in gathering human preference data due to the direct integration of human input outside the automated training loop. RLHF performance is fundamentally limited by the quality of human annotations, which come in two primary forms: human-generated text, as used in fine-tuning the initial LM in InstructGPT, and human preference labels between model outputs.

Generating high-quality human-written text to answer specific prompts is resource-intensive, often requiring dedicated part-time staff rather than relying on product users or crowdsourcing. Fortunately, the scale of data needed to train reward models for most RLHF applications (approximately 50,000 labeled preference samples) is less prohibitive. However, it still represents a higher cost than many academic labs can readily afford. Currently, only one large-scale dataset for RLHF on a general language model (from Anthropic) and a few smaller task-specific datasets (such as summarization data from OpenAI) exist. Another challenge in RLHF data collection is the potential for disagreement among human annotators, introducing variance into the training data without a clear ground truth.

Despite these limitations, numerous unexplored design options could significantly advance RLHF. Many of these lie in enhancing the RL optimizer. PPO, while widely used, is a relatively mature algorithm, and there’s no fundamental reason why other algorithms couldn’t offer advantages within the RLHF workflow. A significant computational cost in fine-tuning the LM policy is the need to evaluate every generated text piece from the policy using the reward model, as it acts as part of the environment in the standard RL framework. To mitigate these expensive forward passes of large models, offline RL techniques could be employed as policy optimizers. Recent algorithms like implicit language Q-learning (ILQL) [Talk on ILQL at CarperAI] are particularly well-suited for this type of optimization. Other crucial trade-offs in the RL process, such as exploration-exploitation balance, also remain under-explored in the context of RLHF. Investigating these areas promises to deepen our understanding of RLHF’s mechanisms and potentially lead to improved performance.

We hosted a lecture on Tuesday, December 13, 2022, that further expanded on this topic. You can watch it here!

Further Reading

Below is a curated list of influential papers on RLHF to date. The field gained significant momentum with the rise of DeepRL around 2017 and has evolved into a broad area of study concerning the applications of LLMs, driven by major technology companies. Here are some key RLHF papers predating the focus on language models:

[List of Pre-LM RLHF Papers – from original article]

And here is a selection of “key” papers showcasing RLHF’s performance in language models:

[List of LM-RLHF Papers – from original article]

This field is a convergence of multiple disciplines. You can also find valuable resources in related areas:

[List of Related Fields Resources – from original article]

Citation: If you find this article useful for your academic work, please cite it as follows:

Lambert, et al., "Illustrating Reinforcement Learning from Human Feedback (RLHF)", Hugging Face Blog, 2022.

BibTeX citation:

@article{lambert2022illustrating,
  author  = {Lambert, Nathan and Castricato, Louis and von Werra, Leandro and Havrilla, Alex},
  title   = {Illustrating Reinforcement Learning from Human Feedback (RLHF)},
  journal = {Hugging Face Blog},
  year    = {2022},
  note    = {https://huggingface.co/blog/rlhf},
}

Acknowledgments: Thanks to Robert Kirk for correcting factual errors, Stas Bekman for typo and clarity improvements, Peter Stone, Khanh X. Nguyen and Yoav Artzi for expanding related works, and Igor Kotenkov for pointing out a technical error in the KL-penalty term and its description.
