3,148 machine learning datasets
The CATT benchmark dataset comprises 742 sentences scraped from an internet news source in 2023, covering topics including science and technology, economics, politics, sports, arts, and culture. The sentences were manually diacritized by two expert native Arabic speakers and then validated by a third expert. The dataset contains names of people and places in both Arabic and English; the English names are written in Arabic letters and diacritized according to their pronunciation. In addition, numbers in the sentences are written in textual rather than numeric form, which allows models to be evaluated without a text normalizer (TN).
The dataset contains Amazon products from 10 product categories with full human annotations. The dataset was collected in 2021. The products may have been taken down from Amazon since the collection of the dataset.
The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out of a total of 464, designed to better mimic real-world user interactions.
The Kvasir-VQA dataset is an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations. It is designed to facilitate advanced machine learning tasks in gastrointestinal (GI) diagnostics, including image captioning, Visual Question Answering (VQA), and text-based generation of synthetic medical images.
A benchmark designed to evaluate MLLMs’ proficiency in understanding inter-object relationships and textual content.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
A test-driven benchmark that challenges LLMs to write JavaScript React applications.
EC-FUNSD is introduced in [arXiv:2402.02379] as a benchmark of semantic entity recognition (SER) and entity linking (EL), designed for the entity-centric robustness evaluation of pre-trained text-and-layout models (PTLMs).
ROOR is a reading order prediction (ROP) benchmark which annotates layout reading order as ordering relations.
The dataset was proposed in LLaVA-CoT: Let Vision Language Models Reason Step-by-Step.
The TriBERT dataset consists of 12,049 training, 2,527 validation, and 2,560 test human–machine collaborative texts. Each text contains both human-written and LLM-generated parts, which can appear in different orders (human → AI, AI → human). Each sample therefore has between 1 and 3 boundaries, indicating the sentences where authorship changes. The texts were created from human-written essays with LLM-generated sections added using ChatGPT.
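A boundary-annotated collaborative text of this kind can be represented as a small data structure. The sketch below is illustrative only; the class, field names, and example text are assumptions, not the official TriBERT schema.

```python
# Illustrative sketch (not the official TriBERT schema): a sample holds
# sentence-tokenized text plus the sentence indices where authorship
# switches between human and LLM.
from dataclasses import dataclass


@dataclass
class CollabSample:
    sentences: list[str]   # the text, split into sentences
    boundaries: list[int]  # indices of sentences where authorship changes
    starts_with: str       # "human" or "ai"

    def author_of(self, i: int) -> str:
        """Derive the author of sentence i from the boundary list."""
        flips = sum(1 for b in self.boundaries if b <= i)
        order = ["human", "ai"] if self.starts_with == "human" else ["ai", "human"]
        return order[flips % 2]


sample = CollabSample(
    sentences=["Intro written by a person.",
               "Continuation generated by a model.",
               "More model-generated text."],
    boundaries=[1],  # authorship changes at sentence index 1
    starts_with="human",
)
# sample.author_of(0) -> "human"; sample.author_of(2) -> "ai"
```

With 1 to 3 boundaries per sample, the same scheme covers all the human/AI orderings the dataset describes.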
Operating rooms (ORs) are complex, high-stakes environments that require a precise understanding of interactions among medical staff, tools, and equipment to enhance surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a foundation for multimodal modeling of OR scenes.
Collected by cleaning data from knowledge-intensive websites such as Wikipedia and from science and technology reports, then processing it using reverse-engineering techniques.
We propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. As far as we know, this is the largest underwater instance segmentation dataset available and can be used as a benchmark for evaluating underwater segmentation methods.
Multilingual text collection extracted from the Internet Archive and Common Crawl archives. Intended to train large language models.
Antonio Gulli’s corpus of news articles is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead in more than one year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided to the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.
The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007 and 2020. The dataset includes around 94K papers (for which LaTeX source code is available) in a structured form in which each paper is split into a title, abstract, sections, paragraphs, and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.
The SegmentedTables dataset is a collection of almost 2,000 tables extracted from 352 machine learning papers. Each table consists of rich text content, layout and caption. Tables are annotated with types (leaderboard, ablation, irrelevant) and cells of relevant tables are annotated with semantic roles (such as “paper model”, “competing model”, “dataset”, “metric”).
The Spades dataset contains 93,319 questions derived from ClueWeb09 sentences. Specifically, the questions were created by randomly removing an entity from each sentence, thus producing sentence–denotation pairs.
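The construction described above amounts to turning a sentence into a cloze-style question whose answer (denotation) is the removed entity. The sketch below illustrates the idea; the function name, blank token, and example sentence are assumptions for illustration, not taken from the dataset release.

```python
# Hypothetical sketch of Spades-style question construction: remove a
# known entity from a sentence to form a (question, denotation) pair.
def make_cloze(sentence: str, entity: str, blank: str = "_blank_"):
    """Replace the first occurrence of `entity` with a blank token,
    yielding a cloze question whose denotation is the removed entity."""
    if entity not in sentence:
        raise ValueError("entity not found in sentence")
    question = sentence.replace(entity, blank, 1)
    return question, entity


question, denotation = make_cloze(
    "Barack Obama was born in Honolulu.", "Honolulu"
)
# question   -> "Barack Obama was born in _blank_."
# denotation -> "Honolulu"
```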