LLM evaluation scores
Scores assigned by a panel of LLM evaluators according to preassigned criteria
The dataset is a CSV file containing evaluation scores given by a panel of LLMs to responses produced by other LLMs. The responses concern a forecasting task assigned to multiple LLMs. Each individual forecast is evaluated according to 9 criteria indicated in the prompt (see https://arxiv.org/abs/2412.09385 for details).
The data are organised as follows. Each row in the dataset represents one forecast evaluation, with the following columns: the forecaster number; the mark for each of the 9 criteria; the mean of the marks; and the Arena score of the evaluator LLM (as of July 12, 2024).
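A minimal sketch of reading rows in this layout with the Python standard library. The header names (`forecaster`, `c1`–`c9`, `mean`, `arena_score`) and the sample values are assumptions for illustration only; check the actual CSV header before use.

```python
import csv
import io

# Hypothetical rows mimicking the described layout: forecaster number,
# marks for 9 criteria, mean of the marks, evaluator Arena score.
# Replace the StringIO with open("dataset.csv") for the real file.
sample = """forecaster,c1,c2,c3,c4,c5,c6,c7,c8,c9,mean,arena_score
1,7,8,6,7,9,8,7,6,8,7.33,1287
2,5,6,5,7,6,5,6,5,6,5.67,1287
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for row in rows:
    # Recompute the mean from the 9 criterion marks as a sanity check
    # against the stored "mean" column.
    marks = [float(row[f"c{i}"]) for i in range(1, 10)]
    recomputed = sum(marks) / len(marks)
    assert abs(recomputed - float(row["mean"])) < 0.01
```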
The scores can be further elaborated to evaluate the ability of LLM panels to assess forecasts. For example, you can compute Intraclass Correlation Coefficients (ICC) to evaluate the consistency and coherence of the panel's evaluations. You can also process the marks with classical statistics, ranking-comparison algorithms (e.g. Kendall distance), or the formation of optimised subsets.
We also provide apps to upload the dataset, filter the data, compute ICC values, and visualize the results through heatmaps.