3,148 machine learning datasets
ShopTC-100K Dataset The ShopTC-100K dataset is collected using TermMiner, an open-source data collection and topic modeling pipeline introduced in the accompanying paper.
CCPT is a dataset containing 12.3K triplets of noun phrases, properties, and property types for conceptual combination.
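A minimal sketch of what one such triplet might look like in practice; the field names and example values below are illustrative assumptions, not the official CCPT schema:

```python
# Hypothetical CCPT-style triplet record; keys and values are illustrative,
# not the released schema.
triplet = {
    "noun_phrase": "mountain stream",  # the conceptual combination
    "property": "cold",                # a property attributed to it
    "property_type": "emergent",       # e.g., inherited vs. emergent
}
```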
CompMix-IR Dataset Overview:
RLM25 is an evaluation benchmark containing 619 paired examples of research-level natural language mathematical statements and their corresponding Lean formalizations. The examples are drawn from six real-world formalization projects, and each entry includes essential context along with timestamp information to ensure freshness and avoid data contamination. Its primary purpose is to serve as a challenging testbed for assessing autoformalization systems, helping researchers measure improvements in translating complex, research-level mathematics into formal code.
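As a rough illustration of the pairing, a toy entry might look like the following; the field names, the statement, and the Lean snippet are assumptions (real RLM25 entries are research-level and drawn from actual formalization projects):

```python
# Illustrative RLM25-style entry; keys and content are assumptions.
entry = {
    "informal_statement": "Addition of natural numbers is commutative.",
    "lean_formalization": (
        "theorem add_comm' (a b : \u2115) : a + b = b + a := Nat.add_comm a b"
    ),
    "context": "import Mathlib.Data.Nat.Basic",  # essential context for the entry
    "timestamp": "2025-01-15",                   # guards against data contamination
    "source_project": "example-formalization-project",
}
```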
ProofNetVerif is an evaluation benchmark comprising 3,752 entries, each including an informal mathematical statement, its reference formalization, a predicted formalization, and a binary label indicating semantic equivalence. It is designed to assess autoformalization metrics by providing a challenging testbed for both reference-based and reference-free evaluation approaches.
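A hedged sketch of how such labeled entries could score an autoformalization metric; the field names and the `metric` callable are assumptions, not part of the benchmark's official tooling:

```python
# Sketch: accuracy of a reference-based metric against binary equivalence labels.
# `metric` is any callable returning a similarity score in [0, 1] (an assumption).
def metric_accuracy(entries, metric, threshold=0.5):
    correct = 0
    for e in entries:
        score = metric(e["reference_formalization"], e["predicted_formalization"])
        predicted_equivalent = score >= threshold  # the metric's verdict
        correct += predicted_equivalent == bool(e["label"])
    return correct / len(entries)
```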
TUMTraffic-VideoQA is a novel dataset designed for spatiotemporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos with 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatiotemporal object expressions, TUMTraffic-VideoQA unifies three essential tasks within a cohesive evaluation framework: multiple-choice video question answering, referred object captioning, and spatiotemporal object grounding.
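One possible shape for a grounding annotation with a tuple-based spatiotemporal object expression; the field names and tuple layout are guesses for illustration only:

```python
# Assumed TUMTraffic-VideoQA-style grounding record; layout is illustrative.
grounding = {
    "video_id": "tum_000123",
    "question": "Which vehicle enters the intersection during the red phase?",
    # (object class, start time in s, end time in s, normalized bbox [x, y, w, h])
    "object_expression": ("car", 12.4, 15.8, [0.31, 0.52, 0.09, 0.07]),
}
```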
We introduce TextAtlas5M, a dataset specifically designed for training and evaluating multimodal generation models on dense-text image generation.
Physical concept understanding benchmark.
Dataset Introduction This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It spans multiple categories—ranging from finance and legal documents to software UI elements and handwritten notes—ensuring a broad representation of real-world text appearances. Each video is annotated with frame indexes to facilitate consistent and reproducible OCR benchmarks. Currently, the dataset includes over 25 curated videos, yielding thousands of extracted frames that present a variety of text-related challenges.
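A minimal sketch of pulling annotated frames for an OCR benchmark run, assuming OpenCV; the video path and frame indexes are placeholders, since the exact annotation format is not shown here:

```python
import cv2  # OpenCV for frame-accurate seeking

def extract_frames(video_path, frame_indexes):
    """Return (index, frame) pairs for the annotated frame indexes."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for idx in frame_indexes:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the annotated frame
        ok, frame = cap.read()
        if ok:
            frames.append((idx, frame))
    cap.release()
    return frames

# Placeholder path and indexes; real values come from the dataset annotations.
frames = extract_frames("videos/finance_report.mp4", [0, 120, 240])
```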
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in information retrieval and natural language processing; however, improving multi-modal techniques and evaluating their performance require large-scale multi-modal data with sufficient diversity. Multi-lingual modeling for tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present M3LS, the largest multi-lingual multi-modal summarization dataset to date, consisting of over a million document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade and spans 20 languages, targeting diversity.
Keyword extraction is an integral task for many downstream problems like clustering, recommendation, search, and classification. Developing and evaluating keyword extraction techniques require an exhaustive dataset; however, the community currently lacks large-scale multi-lingual datasets. In this paper, we present MAKED, a large-scale multi-lingual keyword extraction dataset comprising 540K+ news articles from British Broadcasting Corporation News (BBC News) spanning 20 languages. It is the first keyword extraction dataset for 11 of these 20 languages. The quality of the dataset is examined by experimentation with several baselines. We believe that the proposed dataset will help advance the field of automatic keyword extraction given its size, its diversity in terms of languages, topics, and time periods, and its focus on under-studied languages.
Dataset Summary Speech Brown is a comprehensive, synthetic, and diverse paired speech-text dataset in 15 categories, covering a wide range of topics from fiction to religion. This dataset consists of over 55,000 sentence-level samples.
Dataset Description The dataset used in this study comprises bug reports extracted from the Visual Studio Code GitHub repository, specifically focusing on those labeled with the english-please tag. This label indicates that the original submission was written in a language other than English, providing a clear signal for multilingual content. The dataset spans a five-year period (March 2019--June 2024), ensuring a diverse representation of bug types, user environments, and technical contexts.
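One plausible way to assemble such a collection is the GitHub REST API's label filter; this sketch is an assumption about the method, not the authors' actual pipeline, and it omits pagination and rate-limit handling:

```python
import requests

# Fetch issues labeled `english-please` from the VS Code repository.
url = "https://api.github.com/repos/microsoft/vscode/issues"
params = {"labels": "english-please", "state": "all", "per_page": 100}
issues = requests.get(url, params=params, timeout=30).json()
for issue in issues:
    print(issue["number"], issue["title"])
```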
The R1-Onevision dataset is a meticulously crafted resource designed to empower models with advanced multimodal reasoning capabilities. Aimed at bridging the gap between visual and textual understanding, this dataset provides rich, context-aware reasoning tasks across diverse domains, including natural scenes, science, mathematical problems, OCR-based content, and complex charts.
VisCon-100K is a dataset specially designed to facilitate fine-tuning of vision-language models (VLMs) by leveraging interleaved image-text web documents. Derived from 45K web documents of the OBELICS dataset, this release contains 100K image conversation samples. GPT-4V is used to generate image-contextual captions, while OpenChat 3.5 converts these captions into diverse free-form and multiple-choice Q&A pairs. This approach not only focuses on fine-grained visual content but also incorporates the accompanying web context to yield superior performance. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.
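A hedged sketch of the described two-stage pipeline; `caption_image` and `generate_qa` are hypothetical stand-ins for the GPT-4V and OpenChat 3.5 calls, not the authors' implementation:

```python
def caption_image(image, web_context):
    """Stand-in for GPT-4V: caption an image using its surrounding web text."""
    return f"Caption of {image} informed by context: {web_context[:40]}..."

def generate_qa(caption):
    """Stand-in for OpenChat 3.5: turn a caption into Q&A pairs."""
    return [{"question": "What does the image show?", "answer": caption}]

def build_samples(documents):
    """documents: interleaved image-text web documents (e.g., from OBELICS)."""
    samples = []
    for doc in documents:
        for image in doc["images"]:
            caption = caption_image(image, doc["text"])
            samples.append({"image": image, "qa_pairs": generate_qa(caption)})
    return samples
```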
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g., "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."; "after all , [NAME] was the same as [PRONOUN] 'd always been ."; "there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been ." (the tokenized spacing is preserved from the source corpus).
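A minimal sketch of instantiating one of these templates; the substitution logic is an assumption, since the dataset itself only supplies the templates:

```python
# Fill a template with a first name and a pronoun; real usage would also
# handle pronoun case forms (he/him/his, she/her/hers) as needed.
def instantiate(template, name, pronoun):
    return template.replace("[NAME]", name).replace("[PRONOUN]", pronoun)

template = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."
print(instantiate(template, "Alex", "she"))
# -> Alex asked , not sounding as if she cared about the answer .
```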
This dataset extends NAMEXACT by including words that can be used as names, but may not exclusively be used as names in every context.
This dataset contains names that are exclusively associated with a single gender and that have no ambiguous meanings, therefore being exact with respect to both gender and meaning.