MOCHA
Contains 40K human judgement scores on model outputs from 6 diverse question-answering datasets, plus an additional set of minimal pairs for evaluation.
Source: MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics