Summarize from Feedback

Introduced 2020-09-02

In the Learning to Summarize from Human Feedback paper, a reward model was trained from human feedback and then used to train a summarization model to align with human preferences. This is the human feedback dataset that was released for reward modelling. It has two parts: comparisons and axis. In the comparisons part, human annotators chose the better of two summaries; in the axis part, human annotators rated the quality of a single summary on a Likert scale. The comparisons part has only train and validation splits, and the axis part has only test and validation splits.
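The comparison labels are typically used to fit a reward model with a pairwise (Bradley-Terry) loss, as in the paper: the model should assign a higher score to the summary the annotator preferred. A minimal sketch in plain Python (the function name and toy scores are illustrative, not from the released code):

```python
import math

def pairwise_reward_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen summary outranks the
    rejected one under a Bradley-Terry model:
    -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))

# Toy example: when the reward model already scores the preferred
# summary higher, the loss is small; a reversed ranking costs more.
agree = pairwise_reward_loss(2.0, 0.5)     # model agrees with annotator
disagree = pairwise_reward_loss(0.5, 2.0)  # model disagrees
```

Minimizing this loss over many annotator comparisons pushes the reward model's scores toward the human preference ordering.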

Li et al. propose a variant that restricts annotations to subsets of workers (details in their Appendix C.1):

  1. Reddit TL;DR (Seen) uses the top 10 workers from the original dataset.
  2. Reddit TL;DR (Unseen) uses unseen workers in the validation set.