This story on HackerNoon has a decentralized backup on Sia.
Transaction ID: e5mD3YMSm35_9Zj1LuiIXELFFNWkP7hDTpRSNdiUDUg

Analyzing Reward Functions and Equivalence Classes

Written by @textmodels | Published on 2024/8/26

TL;DR
Theorem 1 establishes that reward functions can be reparameterized in terms of a policy and a reference model. Proposition 1 proves that this reparameterization is unique within each equivalence class of reward functions, so every class has exactly one representative of that form.

Authors:

(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);

(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);

(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);

(4) Stefano Ermon, CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions

A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1

B DPO Implementation Details and Hyperparameters

C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

A.6 Proof of Theorem 1

In this section, we will expand on the results of Theorem 1.
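
The derivation itself is not reproduced in this excerpt. As a rough illustration of the claim summarized in the TL;DR, the following Python sketch checks numerically, for a single prompt with three candidate responses, that subtracting β log Z(x) from a reward yields a reward of the form β log(π(y|x)/π_ref(y|x)), and that two rewards differing only by a prompt-dependent constant map to the same result. The function names (`optimal_policy`, `project`, `implicit_reward`, `close`) and all numeric values are hypothetical choices made for this sketch, and the closed-form optimal policy is assumed from the result named in A.1. It is not the authors' code.

```python
import math

def optimal_policy(reward, pi_ref, beta):
    """Assumed closed-form optimum of the KL-constrained objective for one prompt:
    pi_r(y) = pi_ref(y) * exp(r(y) / beta) / Z, where Z normalizes over y."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
    z = sum(weights)
    return [w / z for w in weights], z

def project(reward, pi_ref, beta):
    """Subtract beta * log Z from the reward (a shift that depends only on the prompt)."""
    _, z = optimal_policy(reward, pi_ref, beta)
    return [r - beta * math.log(z) for r in reward]

def implicit_reward(policy, pi_ref, beta):
    """Reward written as beta * log(pi(y) / pi_ref(y))."""
    return [beta * math.log(p / q) for p, q in zip(policy, pi_ref)]

def close(a, b, tol=1e-9):
    """Element-wise comparison of two equal-length lists of floats."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

# Hypothetical toy values: one prompt, three candidate responses.
beta = 0.5
pi_ref = [0.5, 0.3, 0.2]
reward = [1.0, -0.4, 2.0]
shifted = [r + 3.7 for r in reward]  # same equivalence class: differs by a constant

pi_r, _ = optimal_policy(reward, pi_ref, beta)
pi_shifted, _ = optimal_policy(shifted, pi_ref, beta)

projected = project(reward, pi_ref, beta)
projected_shifted = project(shifted, pi_ref, beta)
reparam = implicit_reward(pi_r, pi_ref, beta)

# The projected reward has the reparameterized form.
assert close(projected, reparam)
# Both members of the class induce the same policy and the same projected reward.
assert close(pi_r, pi_shifted)
assert close(projected, projected_shifted)
```

The three assertions correspond to the two claims: the first shows that a representative of the reparameterized form exists in the class of `reward`, and the last two show that another member of the same class is mapped to the same policy and the same representative.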

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
