TQB++
Tiny QA Benchmark++
Texts · Apache-2.0 · Introduced 2025-05-20
Ultra-lightweight, multilingual QA evaluation dataset for rapid testing of LLMs.
Dataset Characteristics:
- Multilingual: Includes packs for Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish.
- Compact: Contains a curated English gold-standard set of 52 QA pairs (under 20kB), enabling immediate and resource-friendly evaluations.
- Synthetic Generation: Features a LiteLLM-powered synthetic data generator (see the tinyqabenchmarkpp package), allowing quick creation of custom evaluation sets tailored to specific domains or languages; a minimal sketch of such generation follows this list.
- Metadata Support: Provided in Croissant-compatible formats, ready for seamless integration with modern evaluation harnesses and CI tools.
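For illustration, here is a minimal sketch of what LiteLLM-backed generation of a small custom pack can look like. The model name, prompt wording, output fields, and file name are assumptions for the example, not the tinyqabenchmarkpp API:

```python
# Hypothetical sketch of LiteLLM-backed QA-pair generation.
# Model name, prompt, and the 'question'/'answer' fields are assumptions,
# not the tinyqabenchmarkpp interface.
import json
import litellm


def generate_pack(topic: str, language: str, n: int = 10) -> list[dict]:
    """Ask an LLM for n short factual QA pairs about `topic` in `language`."""
    prompt = (
        f"Write {n} short factual question-answer pairs about {topic} "
        f"in {language}. Return only a JSON list of objects with keys "
        f"'question' and 'answer'."
    )
    response = litellm.completion(
        model="gpt-4o-mini",  # any LiteLLM-routable model works here
        messages=[{"role": "user", "content": prompt}],
    )
    # Assumes the model returns a clean JSON list; a real generator would validate.
    return json.loads(response.choices[0].message.content)


if __name__ == "__main__":
    pack = generate_pack("European capitals", "French", n=5)
    with open("custom_pack.fr.jsonl", "w", encoding="utf-8") as f:
        for item in pack:
            f.write(json.dumps(item, ensure_ascii=False) + "\n")
```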
Motivation and Content Summary:
The primary motivation behind TQB++ is to enable rapid iteration and continuous integration (CI) of language models. Existing evaluation benchmarks typically involve significant computational overhead and slow feedback loops. In contrast, TQB++ offers near-instantaneous assessments of model performance and prompt stability across multiple languages. It is particularly sensitive to issues such as prompt-template regressions, tokenizer drift, and fine-tuning side effects.
Potential Use Cases:
- Continuous Integration (CI): Immediate detection of breaking changes or regressions in LLM pipelines (see the sketch after this list).
- Multilingual Model Validation: Quickly assess model accuracy and performance across multiple languages without large compute costs.
- Prompt Optimization and Testing: Ideal for iterative prompt refinement workflows, enabling fast feedback loops and effective tuning.
- Teaching and Prototyping: Educational use in courses or workshops, showcasing multilingual LLM evaluation in real-time scenarios.
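As a concrete example of the CI use case, a smoke test might load the English gold pack and fail the build if exact-match accuracy drops below a threshold. The file name, the JSONL field names, and the `ask_model` helper below are assumptions for illustration, not part of TQB++ itself:

```python
# Hypothetical pytest-style CI smoke test. The file name "core_en.jsonl",
# the "text"/"label" field names, and ask_model() are assumptions.
import json


def ask_model(question: str) -> str:
    """Placeholder for whatever LLM call the pipeline under test makes."""
    raise NotImplementedError


def exact_match(pred: str, gold: str) -> bool:
    return pred.strip().lower() == gold.strip().lower()


def test_core_en_accuracy(threshold: float = 0.8) -> None:
    with open("core_en.jsonl", encoding="utf-8") as f:
        items = [json.loads(line) for line in f]
    correct = sum(exact_match(ask_model(it["text"]), it["label"]) for it in items)
    accuracy = correct / len(items)
    assert accuracy >= threshold, f"accuracy {accuracy:.2f} below {threshold}"
```

Because the gold set is only 52 pairs, a check like this adds seconds, not minutes, to a CI run.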