3,148 machine learning datasets
The CATT benchmark dataset comprises 742 sentences scraped from an internet news source in 2023, covering topics including science and technology, economics, politics, sports, arts, and culture. The sentences were manually diacritized by two expert native Arabic speakers and then validated by a third expert. The dataset contains names of people and places in both Arabic and English; the English names are written in Arabic letters and diacritized according to their pronunciation. In addition, numbers in the sentences are written in textual rather than numeric form, which allows models to be evaluated without a text normalizer (TN).
The dataset contains Amazon products from 10 product categories with full human annotations. The dataset was collected in 2021. The products may have been taken down from Amazon since the collection of the dataset.
The ToolLens dataset consists of 18,770 concise yet intentionally multifaceted queries, each associated with 1 to 3 verified tools out of a total of 464, designed to better mimic real-world user interactions.
The Kvasir-VQA dataset is an extended dataset derived from the HyperKvasir and Kvasir-Instrument datasets, augmented with question-and-answer annotations. It is designed to facilitate advanced machine learning tasks in gastrointestinal (GI) diagnostics, including image captioning, Visual Question Answering (VQA), and text-based generation of synthetic medical images.
A benchmark designed to evaluate MLLMs’ proficiency in understanding inter-object relationships and textual content.
This dataset includes reviews (ratings, text, helpfulness votes), product metadata (descriptions, category information, price, brand, and image features), and links (also viewed/also bought graphs).
A test-driven benchmark that challenges LLMs to write JavaScript React applications.
EC-FUNSD is introduced in [arXiv:2402.02379] as a benchmark of semantic entity recognition (SER) and entity linking (EL), designed for the entity-centric robustness evaluation of pre-trained text-and-layout models (PTLMs).
ROOR is a reading order prediction (ROP) benchmark which annotates layout reading order as ordering relations.
The dataset was proposed in LLaVA-CoT: Let Vision Language Models Reason Step-by-Step.
The TriBERT dataset consists of 12,049 training, 2,527 validation, and 2,560 test human–machine collaborative texts. Each text contains both human-written and LLM-generated parts, which can appear in different orders (human → AI, AI → human). Each sample therefore has between 1 and 3 boundaries, indicating the sentences where authorship changes. The texts were created from human-written essays with LLM-generated sections added using ChatGPT.
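A boundary-annotated collaborative text of this kind can be represented as a small data structure. The sketch below is illustrative only; the class, field names, and example text are assumptions, not the official TriBERT schema.

```python
# Illustrative sketch (not the official TriBERT schema): a sample holds
# sentence-tokenized text plus the sentence indices where authorship
# switches between human and LLM.
from dataclasses import dataclass


@dataclass
class CollabSample:
    sentences: list[str]   # the text, split into sentences
    boundaries: list[int]  # indices of sentences where authorship changes
    starts_with: str       # "human" or "ai"

    def author_of(self, i: int) -> str:
        """Derive the author of sentence i from the boundary list."""
        flips = sum(1 for b in self.boundaries if b <= i)
        order = ["human", "ai"] if self.starts_with == "human" else ["ai", "human"]
        return order[flips % 2]


sample = CollabSample(
    sentences=["Intro written by a person.",
               "Continuation generated by a model.",
               "More model-generated text."],
    boundaries=[1],  # authorship changes at sentence index 1
    starts_with="human",
)
# sample.author_of(0) -> "human"; sample.author_of(2) -> "ai"
```

With 1 to 3 boundaries per sample, the same scheme covers all the human/AI orderings the dataset describes.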
Operating rooms (ORs) are complex, high-stakes environments that require a precise understanding of interactions among medical staff, tools, and equipment to enhance surgical assistance, situational awareness, and patient safety. Current datasets fall short in scale and realism and do not capture the multimodal nature of OR scenes, limiting progress in OR modeling. To this end, we introduce MM-OR, a realistic and large-scale multimodal spatiotemporal OR dataset, and the first dataset to enable multimodal scene graph generation. MM-OR captures comprehensive OR scenes containing RGB-D data, detail views, audio, speech transcripts, robotic logs, and tracking data, and is annotated with panoptic segmentations, semantic scene graphs, and downstream task labels. Further, we propose MM2SG, the first multimodal large vision-language model for scene graph generation, and through extensive experiments demonstrate its ability to effectively leverage multimodal inputs. Together, MM-OR and MM2SG establish a foundation for multimodal modeling of OR scenes.
Collected by cleaning data from knowledge-intensive websites such as Wikipedia and from science and technology reports, then processing it using reverse-engineering techniques.
We propose a large-scale underwater instance segmentation dataset, UIIS10K, which includes 10,048 images with pixel-level annotations for 10 categories. As far as we know, this is the largest underwater instance segmentation dataset available and can be used as a benchmark for evaluating underwater segmentation methods.
Multilingual text collection extracted from the Internet Archive and Common Crawl archives. Intended to train large language models.
Antonio Gulli’s corpus of news articles is a collection of more than 1 million news articles. The articles were gathered from more than 2,000 news sources by ComeToMyHead in more than one year of activity. ComeToMyHead is an academic news search engine that has been running since July 2004. The dataset is provided to the academic community for research purposes in data mining (clustering, classification, etc.), information retrieval (ranking, search, etc.), XML, data compression, data streaming, and any other non-commercial activity.
The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007 and 2020. The dataset includes around 94K papers (for which LaTeX source code is available) in a structured form in which each paper is split into a title, abstract, sections, paragraphs, and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.
The SegmentedTables dataset is a collection of almost 2,000 tables extracted from 352 machine learning papers. Each table consists of rich text content, layout and caption. Tables are annotated with types (leaderboard, ablation, irrelevant) and cells of relevant tables are annotated with semantic roles (such as “paper model”, “competing model”, “dataset”, “metric”).
The Spades dataset contains 93,319 questions derived from ClueWeb09 sentences. Specifically, the questions were created by randomly removing an entity from each sentence, thus producing sentence–denotation pairs.
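The construction described above amounts to turning a sentence into a cloze-style question whose answer (denotation) is the removed entity. The sketch below illustrates the idea; the function name, blank token, and example sentence are assumptions for illustration, not taken from the dataset release.

```python
# Hypothetical sketch of Spades-style question construction: remove a
# known entity from a sentence to form a (question, denotation) pair.
def make_cloze(sentence: str, entity: str, blank: str = "_blank_"):
    """Replace the first occurrence of `entity` with a blank token,
    yielding a cloze question whose denotation is the removed entity."""
    if entity not in sentence:
        raise ValueError("entity not found in sentence")
    question = sentence.replace(entity, blank, 1)
    return question, entity


question, denotation = make_cloze(
    "Barack Obama was born in Honolulu.", "Honolulu"
)
# question   -> "Barack Obama was born in _blank_."
# denotation -> "Honolulu"
```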