First, we fine-tune a pre-trained language model on a dataset of human-labeled examples. For simplicity, let’s assume we have a small dataset of prompts and responses.
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a padding token

# Example dataset (prompts and responses)
prompts = ["What is RLHF?", "Explain reinforcement learning."]
responses = ["RLHF is a technique for aligning models with human preferences.",
             "Reinforcement learning is a type of machine learning where an agent learns by interacting with an environment."]

# Tokenize the dataset: a causal LM is trained on prompt and response joined into one sequence
texts = [p + " " + r + tokenizer.eos_token for p, r in zip(prompts, responses)]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
# Labels are the input ids themselves; padding positions get -100 so the loss skips them
labels = inputs["input_ids"].clone()
labels[inputs["attention_mask"] == 0] = -100

# Fine-tune the model
optimizer = AdamW(model.parameters(), lr=5e-5)
for epoch in range(3):  # Fine-tune for 3 epochs
    outputs = model(**inputs, labels=labels)
    loss = outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
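A common refinement of this step is to compute the loss only on the response tokens, so the model is not trained to reproduce the prompt. Here is a minimal sketch of how such labels can be built, in plain Python; `build_labels` and the token ids are hypothetical and only illustrate the idea.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def build_labels(prompt_ids, response_ids):
    """Join one prompt/response pair and hide the prompt tokens from the loss."""
    input_ids = prompt_ids + response_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return input_ids, labels

# Hypothetical token ids: a 3-token prompt followed by a 2-token response
input_ids, labels = build_labels([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22]
print(labels)     # [-100, -100, -100, 21, 22]
```

Passing labels like these (as a tensor) to the model means that only the response positions contribute to `outputs.loss`.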
Next, we train a reward model to predict human preferences. We’ll use a simple neural network that takes an embedding of the model output as input and predicts a scalar reward score. Note that this model is just for demonstration. In practice, the dataset is built from human comparisons of responses, and a Bradley-Terry model or a similar approach is used to turn those pairwise preferences into reward scores.
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        return self.fc2(x)

# Example: Generate embeddings for model outputs.
# GPT2LMHeadModel only returns hidden states when asked for them; no_grad keeps
# the language model out of the reward model's training graph.
with torch.no_grad():
    hidden = model(**inputs, output_hidden_states=True).hidden_states[-1]
output_embeddings = hidden.mean(dim=1)  # Average pooling over tokens

# Initialize reward model
reward_model = RewardModel(input_size=output_embeddings.size(1), hidden_size=64)

# Example human feedback (1 = preferred, 0 = not preferred)
preferred_outputs = output_embeddings[0].unsqueeze(0)  # Preferred output
non_preferred_outputs = output_embeddings[1].unsqueeze(0)  # Non-preferred output

# Train the reward model with its own optimizer, separate from the language model's
reward_optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)
criterion = nn.MSELoss()
preferred_target = torch.ones(1, 1)  # Same shape as the reward model's output
non_preferred_target = torch.zeros(1, 1)
for epoch in range(10):  # Train for 10 epochs
    preferred_reward = reward_model(preferred_outputs)
    non_preferred_reward = reward_model(non_preferred_outputs)
    # Push the preferred reward toward 1 and the non-preferred reward toward 0
    loss = criterion(preferred_reward, preferred_target) + criterion(non_preferred_reward, non_preferred_target)
    loss.backward()
    reward_optimizer.step()
    reward_optimizer.zero_grad()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
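The Bradley-Terry approach mentioned earlier turns a pair of reward scores into a preference probability, sigmoid(r_preferred - r_rejected), and trains the reward model to minimize the negative log of that probability. Here is a minimal sketch in plain Python so the arithmetic is visible; the function names and the sample scores are made up for illustration.

```python
import math

def preference_probability(r_preferred: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the preferred response wins the comparison."""
    return 1.0 / (1.0 + math.exp(-(r_preferred - r_rejected)))

def bradley_terry_loss(r_preferred: float, r_rejected: float) -> float:
    """Negative log-likelihood of the human preference under Bradley-Terry."""
    return -math.log(preference_probability(r_preferred, r_rejected))

# Hypothetical reward scores for (preferred, rejected) response pairs
pairs = [(2.0, 0.0), (0.5, 0.5), (-1.0, 1.0)]
for r_p, r_r in pairs:
    p = preference_probability(r_p, r_r)
    loss = bradley_terry_loss(r_p, r_r)
    print(f"scores=({r_p}, {r_r})  p={p:.3f}  loss={loss:.3f}")
```

The loss is small when the reward model already ranks the preferred response higher and grows when it ranks the pair the wrong way round. With tensors, the same loss is commonly written as `-torch.nn.functional.logsigmoid(r_preferred - r_rejected).mean()`.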
Finally, we use the reward model to fine-tune the language model with reinforcement learning. We’ll use the Proximal Policy Optimization (PPO) algorithm for this step. If you don’t know what PPO is, that’s okay. Think of it like this: the model first tries a few actions sampled from its current policy (for an LLM, an action is predicting the next token) and receives a reward for them. Based on that reward, it shifts its next-token predictions, step by step, toward the outputs that the human feedback favored.
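from torch.distributions import Categorical

# PPO hyperparameters
clip_epsilon = 0.2
gamma = 0.99  # Discount factor; not used in this single-step sketch

# The policy (the language model) gets its own optimizer; the reward model stays fixed
policy_optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Rollout with the current ("old") policy; nothing in this block needs gradients
with torch.no_grad():
    rollout = model(**inputs, output_hidden_states=True)
    old_dist = Categorical(logits=rollout.logits)
    # Sample actions (tokens) from the model, one per position
    actions = old_dist.sample()
    old_log_probs = old_dist.log_prob(actions)
    # Compute rewards using the reward model.
    # Toy simplification: the reward scores the pooled hidden states of the inputs;
    # a real pipeline scores the text that the policy generated.
    output_embeddings = rollout.hidden_states[-1].mean(dim=1)
    rewards = reward_model(output_embeddings)  # Shape: (batch, 1)
    # Compute advantages, using the batch-mean reward as a simple baseline
    # in place of a learned value function
    advantages = rewards - rewards.mean()

# PPO objective: re-evaluate the sampled tokens under the policy being updated
new_dist = Categorical(logits=model(**inputs).logits)
ratio = torch.exp(new_dist.log_prob(actions) - old_log_probs)
surr1 = ratio * advantages
surr2 = torch.clamp(ratio, 1 - clip_epsilon, 1 + clip_epsilon) * advantages
ppo_loss = -torch.min(surr1, surr2).mean()

# Update the model. On this first update the ratio is 1, so the printed loss
# is close to 0 even though its gradient is not.
policy_optimizer.zero_grad()
ppo_loss.backward()
policy_optimizer.step()
print(f"PPO Loss: {ppo_loss.item()}")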
Clipping is used because overly large policy updates tend to hurt performance, so the probability ratio is kept within [1 - clip_epsilon, 1 + clip_epsilon].
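To see what clipping does to a single sample, here is a small sketch in plain Python. The function name `clipped_surrogate` and the sample ratios are illustrative; the formula is the usual clipped objective, min(ratio * A, clip(ratio) * A).

```python
def clipped_surrogate(ratio: float, advantage: float, clip_epsilon: float = 0.2) -> float:
    """PPO's per-sample objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_epsilon, min(ratio, 1.0 + clip_epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# ratio = pi_new(action) / pi_old(action); a positive advantage means
# the action turned out better than the baseline expected
for ratio in (0.5, 1.0, 1.2, 2.0):
    good = clipped_surrogate(ratio, advantage=1.0)
    bad = clipped_surrogate(ratio, advantage=-1.0)
    print(f"ratio={ratio}: objective(A=+1)={good:.2f}, objective(A=-1)={bad:.2f}")
```

For a positive advantage, the objective stops growing once the ratio passes 1 + clip_epsilon, so a single update gains nothing from pushing the probability of that action any further.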
Bonus: In practice, the updated model’s outputs are also compared with the original model’s outputs. Penalizing the KL divergence between the two keeps the updated model close to its original behavior and helps limit reward hacking.
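One common way to apply this penalty is to subtract a scaled KL estimate from the reward-model score before the PPO update. The sketch below uses plain Python; `kl_penalized_reward`, the coefficient `beta`, and the log-probabilities are assumptions chosen for illustration.

```python
def kl_penalized_reward(reward, policy_log_probs, reference_log_probs, beta=0.1):
    """Subtract beta times a sample-based KL estimate from the reward-model score.

    Both lists hold the log-probabilities that each model assigned to the tokens
    sampled from the policy, so the summed difference estimates
    KL(policy || reference) for this response.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))
    return reward - beta * kl_estimate

# Hypothetical numbers: the policy is far more confident than the reference model
drifted = kl_penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.5])
# Identical log-probabilities: no drift, so the reward is unchanged
unchanged = kl_penalized_reward(1.0, [-0.5, -0.7], [-0.5, -0.7])
print(drifted, unchanged)
```

A larger `beta` keeps the policy closer to the original model at the cost of slower reward improvement.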
While RLHF is powerful, it comes with challenges:
- Scalability: Collecting human feedback is resource-intensive.
- Bias: Human annotators may introduce biases into the reward model.
- Reward Hacking: The model might exploit the reward model to maximize rewards without truly aligning with human values.
Future research aims to address these challenges by improving reward models, reducing reliance on human feedback, and ensuring fairness and robustness.