3,148 machine learning datasets
ShopTC-100K Dataset The ShopTC-100K dataset is collected using TermMiner, an open-source data collection and topic modeling pipeline introduced in the accompanying paper.
CCPT is a dataset containing 12.3K triplets of noun phrases, properties, and property types for conceptual combination.
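A minimal sketch of what one such triplet might look like in practice; the field names and example values below are illustrative assumptions, not the official CCPT schema:

```python
# Hypothetical CCPT-style triplet record; keys and values are illustrative,
# not the released schema.
triplet = {
    "noun_phrase": "mountain stream",  # the conceptual combination
    "property": "cold",                # a property attributed to it
    "property_type": "emergent",       # e.g., inherited vs. emergent
}
```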
CompMix-IR Dataset Overview:
RLM25 is an evaluation benchmark containing 619 paired examples of research-level natural language mathematical statements and their corresponding Lean formalizations. The examples are drawn from six real-world formalization projects, and each entry includes essential context along with timestamp information to ensure freshness and avoid data contamination. Its primary purpose is to serve as a challenging testbed for assessing autoformalization systems, helping researchers measure improvements in translating complex, research-level mathematics into formal code.
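As a rough illustration of the pairing, a toy entry might look like the following; the field names, the statement, and the Lean snippet are assumptions (real RLM25 entries are research-level and drawn from actual formalization projects):

```python
# Illustrative RLM25-style entry; keys and content are assumptions.
entry = {
    "informal_statement": "Addition of natural numbers is commutative.",
    "lean_formalization": (
        "theorem add_comm' (a b : \u2115) : a + b = b + a := Nat.add_comm a b"
    ),
    "context": "import Mathlib.Data.Nat.Basic",  # essential context for the entry
    "timestamp": "2025-01-15",                   # guards against data contamination
    "source_project": "example-formalization-project",
}
```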
ProofNetVerif is an evaluation benchmark comprising 3,752 entries, each including an informal mathematical statement, its reference formalization, a predicted formalization, and a binary label indicating semantic equivalence. It is designed to assess autoformalization metrics by providing a challenging testbed for both reference-based and reference-free evaluation approaches.
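A hedged sketch of how such labeled entries could score an autoformalization metric; the field names and the `metric` callable are assumptions, not part of the benchmark's official tooling:

```python
# Sketch: accuracy of a reference-based metric against binary equivalence labels.
# `metric` is any callable returning a similarity score in [0, 1] (an assumption).
def metric_accuracy(entries, metric, threshold=0.5):
    correct = 0
    for e in entries:
        score = metric(e["reference_formalization"], e["predicted_formalization"])
        predicted_equivalent = score >= threshold  # the metric's verdict
        correct += predicted_equivalent == bool(e["label"])
    return correct / len(entries)
```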
TUMTraffic-VideoQA is a novel dataset designed for spatiotemporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos with 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatiotemporal object expressions, TUMTraffic-VideoQA unifies three essential tasks within a cohesive evaluation framework: multiple-choice video question answering, referred object captioning, and spatiotemporal object grounding.
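One possible shape for a grounding annotation with a tuple-based spatiotemporal object expression; the field names and tuple layout are guesses for illustration only:

```python
# Assumed TUMTraffic-VideoQA-style grounding record; layout is illustrative.
grounding = {
    "video_id": "tum_000123",
    "question": "Which vehicle enters the intersection during the red phase?",
    # (object class, start time in s, end time in s, normalized bbox [x, y, w, h])
    "object_expression": ("car", 12.4, 15.8, [0.31, 0.52, 0.09, 0.07]),
}
```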
We introduce TextAtlas5M, a dataset specifically designed for training and evaluating multimodal generation models on dense-text image generation.
Physical concept understanding benchmark.
Dataset Introduction This dataset leverages VideoDB's Public Collection to offer a diverse range of videos featuring text-containing scenes. It spans multiple categories—ranging from finance and legal documents to software UI elements and handwritten notes—ensuring a broad representation of real-world text appearances. Each video is annotated with frame indexes to facilitate consistent and reproducible OCR benchmarks. Currently, the dataset includes over 25 curated videos, yielding thousands of extracted frames that present a variety of text-related challenges.
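A minimal sketch of pulling annotated frames for an OCR benchmark run, assuming OpenCV; the video path and frame indexes are placeholders, since the exact annotation format is not shown here:

```python
import cv2  # OpenCV for frame-accurate seeking

def extract_frames(video_path, frame_indexes):
    """Return (index, frame) pairs for the annotated frame indexes."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    for idx in frame_indexes:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)  # seek to the annotated frame
        ok, frame = cap.read()
        if ok:
            frames.append((idx, frame))
    cap.release()
    return frames

# Placeholder path and indexes; real values come from the dataset annotations.
frames = extract_frames("videos/finance_report.mp4", [0, 120, 240])
```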
Significant developments in techniques such as encoder-decoder models have enabled us to represent information comprising multiple modalities. This information can further enhance many downstream tasks in information retrieval and natural language processing; however, improving multi-modal techniques and evaluating their performance require large-scale multi-modal data with sufficient diversity. Multi-lingual modeling for tasks like multi-modal summarization, text generation, and translation leverages information derived from high-quality multi-lingual annotated data. In this work, we present M3LS, the largest multi-lingual multi-modal summarization dataset to date, consisting of over a million document-image pairs along with a professionally annotated multi-modal summary for each pair. It is derived from news articles published by the British Broadcasting Corporation (BBC) over a decade and spans 20 languages, targeting diversity.
Keyword extraction is an integral task for many downstream problems like clustering, recommendation, search, and classification. Developing and evaluating keyword extraction techniques require an exhaustive dataset; however, the community currently lacks large-scale multi-lingual datasets. In this paper, we present MAKED, a large-scale multi-lingual keyword extraction dataset comprising 540K+ news articles from British Broadcasting Corporation News (BBC News) spanning 20 languages. It is the first keyword extraction dataset for 11 of these 20 languages. The quality of the dataset is examined by experimentation with several baselines. We believe that the proposed dataset will help advance the field of automatic keyword extraction given its size, its diversity in terms of languages, topics, and time periods, and its focus on under-studied languages.
Dataset Summary Speech Brown is a comprehensive, synthetic, and diverse paired speech-text dataset in 15 categories, covering a wide range of topics from fiction to religion. This dataset consists of over 55,000 sentence-level samples.
Dataset Description The dataset used in this study comprises bug reports extracted from the Visual Studio Code GitHub repository, specifically focusing on those labeled with the english-please tag. This label indicates that the original submission was written in a language other than English, providing a clear signal for multilingual content. The dataset spans a five-year period (March 2019--June 2024), ensuring a diverse representation of bug types, user environments, and technical contexts.
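One plausible way to assemble such a collection is the GitHub REST API's label filter; this sketch is an assumption about the method, not the authors' actual pipeline, and it omits pagination and rate-limit handling:

```python
import requests

# Fetch issues labeled `english-please` from the VS Code repository.
url = "https://api.github.com/repos/microsoft/vscode/issues"
params = {"labels": "english-please", "state": "all", "per_page": 100}
issues = requests.get(url, params=params, timeout=30).json()
for issue in issues:
    print(issue["number"], issue["title"])
```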
The R1-Onevision dataset is a meticulously crafted resource designed to empower models with advanced multimodal reasoning capabilities. Aimed at bridging the gap between visual and textual understanding, this dataset provides rich, context-aware reasoning tasks across diverse domains, including natural scenes, science, mathematical problems, OCR-based content, and complex charts.
VisCon-100K is a dataset specially designed to facilitate fine-tuning of vision-language models (VLMs) by leveraging interleaved image-text web documents. Derived from 45K web documents of the OBELICS dataset, this release contains 100K image conversation samples. GPT-4V is used to generate image-contextual captions, while OpenChat 3.5 converts these captions into diverse free-form and multiple-choice Q&A pairs. This approach not only focuses on fine-grained visual content but also incorporates the accompanying web context to yield superior performance. Using the same pipeline, but substituting our trained contextual captioner for GPT-4V, we also release the larger VisCon-1M dataset.
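A hedged sketch of the described two-stage pipeline; `caption_image` and `generate_qa` are hypothetical stand-ins for the GPT-4V and OpenChat 3.5 calls, not the authors' implementation:

```python
def caption_image(image, web_context):
    """Stand-in for GPT-4V: caption an image using its surrounding web text."""
    return f"Caption of {image} informed by context: {web_context[:40]}..."

def generate_qa(caption):
    """Stand-in for OpenChat 3.5: turn a caption into Q&A pairs."""
    return [{"question": "What does the image show?", "answer": caption}]

def build_samples(documents):
    """documents: interleaved image-text web documents (e.g., from OBELICS)."""
    samples = []
    for doc in documents:
        for image in doc["images"]:
            caption = caption_image(image, doc["text"])
            samples.append({"image": image, "qa_pairs": generate_qa(caption)})
    return samples
```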
This dataset consists of template sentences associating first names ([NAME]) with third-person singular pronouns ([PRONOUN]), e.g., "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."; "after all , [NAME] was the same as [PRONOUN] 'd always been ."; "there were moments when [NAME] was soft , when [PRONOUN] seemed more like the person [PRONOUN] had been ." (the tokenized spacing is preserved from the source corpus).
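A minimal sketch of instantiating one of these templates; the substitution logic is an assumption, since the dataset itself only supplies the templates:

```python
# Fill a template with a first name and a pronoun; real usage would also
# handle pronoun case forms (he/him/his, she/her/hers) as needed.
def instantiate(template, name, pronoun):
    return template.replace("[NAME]", name).replace("[PRONOUN]", pronoun)

template = "[NAME] asked , not sounding as if [PRONOUN] cared about the answer ."
print(instantiate(template, "Alex", "she"))
# -> Alex asked , not sounding as if she cared about the answer .
```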
This dataset extends NAMEXACT by including words that can be used as names, but may not exclusively be used as names in every context.
This dataset contains names that are exclusively associated with a single gender and that have no ambiguous meanings, therefore being exact with respect to both gender and meaning.