Directional Preference Alignment

This is the repo for our ACL'2024 long paper "Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards" by Haoxiang Wang*, Yong Lin*, Wei Xiong*, Rui Yang, Shizhe Diao, Shuang Qiu, Han Zhao, Tong Zhang

arXiv: https://arxiv.org/abs/2402.18571

Code: The multi-objective reward model training code is provided at https://github.com/RLHFlow/RLHF-Reward-Modeling/tree/main/armo-rm

Model: DPA-v1-Mistral-7B

Usage

Aligned LLM

Use the code below to get started with our DPA model.

  • System Prompt:
    • Template: "You are a helpful, respectful, and honest assistant who always responds to the user in a harmless way. Your response should maximize weighted rating = helpfulness*{weight_helpfulness} + verbosity*{weight_verbosity}"
    • Value Choices: weight_helpfulness is an integer from 0 to 100, and the pair should satisfy (weight_verbosity/100)**2 + (weight_helpfulness/100)**2 == 1
      • The maximum weight_helpfulness is 100; the lowest suggested value is 71.
      • The model will generate a response that implicitly maximizes the weighted rating helpfulness*weight_helpfulness + verbosity*weight_verbosity, where helpfulness and verbosity are two reward objectives that range from 0 to 100.

We suggest choosing the ratio weight_verbosity/weight_helpfulness first. For instance, the example below sets weight_verbosity/weight_helpfulness = tan(-15°):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
import numpy as np

# Here we show how to use the DPA model to generate a response to a user prompt.
device = "cuda"
model = AutoModelForCausalLM.from_pretrained("Haoxiang-Wang/DPA-v1-Mistral-7B", torch_dtype=torch.bfloat16, device_map=device)
tokenizer = AutoTokenizer.from_pretrained("Haoxiang-Wang/DPA-v1-Mistral-7B")
degree = -15 # weight_verbosity/weight_helpfulness = tan(-15°)
rad = np.radians(degree) # convert from degree to radian
weight_helpfulness = np.round((np.cos(rad) * 100)).astype(int) # compute weight_helpfulness, scale it by 100x, and round it to an integer
weight_verbosity  = np.round((np.sin(rad) * 100)).astype(int) # compute weight_verbosity, scale it by 100x, and round it to an integer
## Now (weight_helpfulness/100)**2 + (weight_verbosity/100)**2 ≈ 1 - it is not an exact equivalence due to the round() operations above 
sys_prompt = f"You are a helpful, respectful, and honest assistant who always responds to the user in a harmless way. Your response should maximize weighted rating = helpfulness*{weight_helpfulness} + verbosity*{weight_verbosity}"
user_prompt = "Write a summary of Romeo and Juliet."
messages = [
        {"role": "system", "content": sys_prompt},
        {"role": "user", "content": user_prompt},
    ]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)
output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=True, temperature=0.7) # enable sampling so that temperature takes effect
prompt_len = input_ids.shape[-1]
generated_response = tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)
print(generated_response)
# 'Romeo and Juliet is a tragic love story written by William Shakespeare, believed to have been written between 1591 and 1595. The play is based on an Italian tale called "The Tragical History of Romeus and Juliet" by Arthur Brooke, which was published in 1562.\n\nThe story revolves around two young star-crossed lovers, Romeo Montague and Juliet Capulet, from rival families in Verona, Italy. Their love is forbidden by their families, who have a long-standing feud. Despite the obstacles, Romeo and Juliet marry in secret and spend a few blissful days together before fate intervenes.\n\nA series of misunderstandings, miscommunications, and tragic events lead to the deaths of both Romeo and Juliet. Romeo believes that Juliet is dead, and in a fit of despair, he takes his own life. Juliet, who is actually still alive, awakens to find Romeo dead and takes her own life in grief.\n\nThe play explores themes of love, hate, fate, and the consequences of actions. It is known for its iconic characters, including the passionate Romeo, the fiery Juliet, and the noble Friar Lawrence, who tries to help the young lovers.\n\nRomeo and Juliet has been adapted into numerous films, stage productions, and other media over the years, and it remains a beloved and tragic tale of forbidden love.'
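
To see the arithmetic control in action, you can sweep the preference direction and regenerate for the same user prompt. Below is a minimal sketch that reuses the model, tokenizer, and user_prompt from above; the dpa_system_prompt helper is our own shorthand, not part of the released code.

def dpa_system_prompt(degree):
    # Convert a preference direction (in degrees) into the DPA system prompt.
    rad = np.radians(degree)
    weight_helpfulness = int(np.round(np.cos(rad) * 100))
    weight_verbosity = int(np.round(np.sin(rad) * 100))
    return ("You are a helpful, respectful, and honest assistant who always responds to the user in a harmless way. "
            f"Your response should maximize weighted rating = helpfulness*{weight_helpfulness} + verbosity*{weight_verbosity}")

# Angles between -45° and 45° keep weight_helpfulness >= 71, as suggested above.
for degree in [45, 0, -45]:
    messages = [
        {"role": "system", "content": dpa_system_prompt(degree)},
        {"role": "user", "content": user_prompt},
    ]
    input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(device)
    output = model.generate(input_ids=input_ids, max_new_tokens=2048, do_sample=True, temperature=0.7)
    print(f"--- degree = {degree} ---")
    print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))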

Reward Model

If you are interested in the multi-objective reward model that we trained, it is available at RLHFlow/RewardModel-Mistral-7B-for-DPA-v1

It has a 10-dimensional output corresponding to the following attributes from HelpSteer and UltraFeedback: ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence', 'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score', 'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness', 'ultrafeedback-honesty', 'ultrafeedback-helpfulness']

Here is sample code that you can try:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
device = 'cuda'
path = "Haoxiang-Wang/RewardModel-Mistral-7B-for-DPA-v1"
rm = AutoModelForSequenceClassification.from_pretrained(path, trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained(path) 

input_template = "[INST] You must read the following conversation carefully and rate the assistant's response from score 0-100 in these aspects: helpfulness, correctness, coherence, honesty, complexity, verbosity\n\nUser: {prompt}\n\nAssistant: {response} [/INST]"

# Use a sample from HelpSteer validation set
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"

model_inputs = tokenizer(input_template.format(prompt=prompt, response=response), return_tensors="pt").to(device)
with torch.no_grad():
    score = rm(**model_inputs).logits.squeeze().cpu().float().numpy()

print(score)
# [68.99269  69.62718  76.23071  33.48785  35.853596 63.833366 55.58917 68.7175 59.552124 46.465595]

# Convert from our scale (0-100) to HelpSteer scale (0-4) 
helpsteer_rewards_pred = (score[:5]-10)/20
print(helpsteer_rewards_pred)
# [2.9496346 2.981359  3.3115356 1.1743925 1.2926798]
# The actual rewards from the HelpSteer dataset for this sample are [3,3,4,2,2]
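
Since the 10 output dimensions follow the attribute order listed above, it can be convenient to pair each score with its attribute name. A small sketch, reusing the score array from the snippet above:

attributes = ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
              'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score',
              'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
              'ultrafeedback-honesty', 'ultrafeedback-helpfulness']
for name, value in zip(attributes, score):
    print(f"{name}: {value:.2f}")
# e.g., helpsteer-helpfulness: 68.99, ..., ultrafeedback-helpfulness: 46.47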

Abstract

Fine-grained control over large language models (LLMs) remains a significant challenge, hindering their adaptability to diverse user needs. While Reinforcement Learning from Human Feedback (RLHF) shows promise in aligning LLMs, its reliance on scalar rewards often limits its ability to capture diverse user preferences in real-world applications. To address this limitation, we introduce the Directional Preference Alignment (DPA) framework. Unlike the scalar-reward RLHF, DPA incorporates multi-objective reward modeling to represent diverse preference profiles. Additionally, DPA models user preferences as directions (i.e., unit vectors) in the reward space to achieve user-dependent preference control. Our method involves training a multi-objective reward model and then fine-tuning the LLM with a preference-conditioned variant of Rejection Sampling Finetuning (RSF), an RLHF method adopted by Llama 2. This method enjoys a better performance trade-off across various reward objectives. In comparison with the scalar-reward RLHF, DPA offers users intuitive control over LLM generation: they can arithmetically specify their desired trade-offs (e.g., more helpfulness with less verbosity). We also validate the effectiveness of DPA with real-world alignment experiments on Mistral-7B. Our method provides straightforward arithmetic control over the trade-off between helpfulness and verbosity while maintaining competitive performance with strong baselines such as Direct Preference Optimization (DPO).

Arithmetic Control of LLMs

Arithmetic Prompting: Specify the desired tradeoff between reward objectives (e.g., helpfulness and verbosity) with a unit vector, such as (1,0), (0.8, -0.6), or (0,1).
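
As a concrete example, the unit vector (0.8, -0.6) corresponds to weight_helpfulness = 80 and weight_verbosity = -60 in the system prompt template shown in the Usage section. A minimal sketch of this mapping (variable names are ours):

# Map a preference direction (unit vector) to the integer weights used in the system prompt.
direction = (0.8, -0.6)  # favor helpfulness, penalize verbosity
weight_helpfulness = int(round(direction[0] * 100))  # -> 80
weight_verbosity = int(round(direction[1] * 100))    # -> -60
sys_prompt = ("You are a helpful, respectful, and honest assistant who always responds to the user in a harmless way. "
              f"Your response should maximize weighted rating = helpfulness*{weight_helpfulness} + verbosity*{weight_verbosity}")
# Note: the Usage section suggests keeping weight_helpfulness >= 71 for the released DPA model.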

Directional Preference Alignment

(Figures: Preference Conflicts; Directional Preference Alignment)

Experiment Results

Rewards on Validation Set

(Figure: Validation Rewards)

Our method yields an expanding empirical Pareto front over the rejection-sampling iterations.

AlpacaEval 2.0

(Figure: AlpacaEval 2.0 results)

With different arithmetic prompts, our model generates responses that balance helpfulness and verbosity, while remaining competitive with Zephyr-beta.

If you find this work useful to your research, please consider citing our paper:

@inproceedings{wang2024arithmetic,
      title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards}, 
      author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
      year={2024},
      booktitle={ACL},
}
