Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation

Vincent Koc

2025-05-17 · Prompt Engineering · Large Language Model · TinyQA Benchmark++ · MMLU
Paper · PDF · Code (official)

Abstract

Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual smoke-test suite designed to give large-language-model (LLM) pipelines a unit-test-style safety-net dataset that runs in seconds at minimal cost. It was born out of the tight feedback-loop demands of building the Comet Opik prompt-optimization SDK, where waiting on heavyweight benchmarks breaks developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with a tiny synthetic-data generator, distributed as a PyPI package built on the provider-agnostic LiteLLM. The generator lets practitioners mint their own tiny packs in any language, domain, or difficulty, while ten ready-made packs already cover Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian, Spanish, and Turkish. Every dataset ships with Croissant metadata and plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so teams can drop deterministic micro-benchmarks directly into pull-request gates, prompt-engineering loops, and production dashboards without touching GPU budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet reliably flags prompt-template errors, tokenizer drift, and fine-tuning side effects long before full-scale suites like MMLU or BIG-Bench would finish configuring. The entire framework is released to accelerate continuous, resource-efficient quality assurance across the generative-AI ecosystem.
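To illustrate the unit-test-style workflow the abstract describes, here is a minimal exact-match smoke-test sketch. The item field names (`question`, `answer`), the normalization rule, and the stub model are all illustrative assumptions, not the released TQB++ API; consult the official package and the core-en JSON for the actual schema.

```python
def normalize(text: str) -> str:
    """Lowercase and keep only alphanumerics/spaces for lenient matching."""
    return "".join(ch for ch in text.lower().strip()
                   if ch.isalnum() or ch.isspace())

def exact_match_score(items, predict) -> float:
    """Percentage of items whose prediction exactly matches the gold answer."""
    hits = sum(normalize(predict(it["question"])) == normalize(it["answer"])
               for it in items)
    return 100.0 * hits / len(items)

# Toy two-item pack standing in for the 52-item core-en gold set.
pack = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]

# A stub "model" for illustration; a real run would call an LLM here
# (e.g. via LiteLLM) and could gate a CI job on a score threshold.
stub = {
    "What is the capital of France?": "Paris",
    "How many days are in a week?": "seven",
}

score = exact_match_score(pack, lambda q: stub[q])
print(f"Exact Match: {score:.1f}")  # one hit of two items -> 50.0
```

Because the gold set is tiny and the check is deterministic, a run like this can sit in a pull-request gate and fail fast on prompt-template or tokenizer regressions.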

Results

Task | Dataset | Metric | Value | Model
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 90.4 | gemma-3-12b
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 86.5 | gemma-3-4b
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 84.6 | mistral-24b-instruct
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 84.6 | llama-3.2-3b-instruct
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 80.8 | ministral-8b
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 76.9 | ministral-3b
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 53.8 | llama-3.2-1b-instruct
TinyQA Benchmark++ | tinyqabenchmark_core-en | Exact Match | 50.0 | mistral-7b-instruct

Related Papers

DENSE: Longitudinal Progress Note Generation with Temporal Modeling of Heterogeneous Clinical Notes Across Hospital Visits (2025-07-18)
Leveraging Language Prior for Infrared Small Target Detection (2025-07-17)
Emotional Support with LLM-based Empathetic Dialogue Generation (2025-07-17)
GeoReg: Weight-Constrained Few-Shot Regression for Socio-Economic Estimation using LLM (2025-07-17)
The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations (2025-07-17)
Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities (2025-07-17)
Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities (2025-07-17)
Assay2Mol: large language model-based drug design using BioAssay context (2025-07-16)