19,997 machine learning datasets
ClueWeb22 is the newest iteration of the ClueWeb line of datasets, providing 10 billion web pages accompanied by rich page-level information. Its design was driven by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, ClueWeb22 is larger, more varied, of higher quality, and better aligned with the document distributions of commercial web search. Besides raw HTML, the dataset includes rich information about the web pages produced by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, HTML structure parsed by a neural network parser, and pre-processed, cleaned document text.
VOT2020 is a Visual Object Tracking benchmark for short-term tracking in RGB.
RadQA is a radiology question answering dataset with 3,074 questions posed against radiology reports and annotated with their corresponding answer spans (6,148 question-answer evidence pairs in total) by physicians. The questions were created manually from the clinical referral sections of the reports, so they reflect the actual information needs of ordering physicians, eliminate bias from seeing the answer context, and organically yield unanswerable questions. The answer spans are marked within the Findings and Impressions sections of a report. To satisfy complex clinical requirements, the dataset includes complete yet concise answer phrases (not just entities) that can span multiple lines.
PSI-AVA is a dataset designed for holistic surgical scene understanding. It contains approximately 20.45 hours of surgical procedures performed by three expert surgeons, with annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos.
Dataset for document shadow removal
The dataset contains single-shot videos taken from moving cameras in underwater environments. The first shard of the new Marine Video Kit dataset is presented to serve video retrieval and other computer vision challenges. In addition to basic metadata statistics, we present several insights based on low-level features as well as semantic annotations of selected keyframes. The dataset comprises 1,379 videos ranging from 2 s to 4.95 min in length, with mean and median durations of 29.9 s and 25.4 s, respectively. Data were captured in 11 different regions and countries between 2011 and 2022.
We propose a test to measure the multitask accuracy of large Chinese language models. We constructed a large-scale, multi-task test consisting of single- and multiple-choice questions from various branches of knowledge. The test encompasses the fields of medicine, law, psychology, and education, with medicine divided into 15 sub-tasks and education into 8 sub-tasks. The questions in the dataset were manually collected by professionals from freely available online resources, including university medical examinations, national unified legal professional qualification examinations, psychological counselor exams, graduate entrance examinations for psychology majors, and the Chinese National College Entrance Examination. In total, we collected 11,900 questions, which we divided into a few-shot development set and a test set. The few-shot development set contains 5 questions per topic, amounting to 55 questions in total. The test set comprises 11,845 questions.
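The few-shot development set is meant to seed in-context prompts for evaluation. A minimal sketch of assembling such a prompt is shown below; the prompt template, field names, and example questions are illustrative assumptions, not the benchmark's official format:

```python
def build_few_shot_prompt(dev_examples, test_question):
    """Assemble a k-shot multiple-choice prompt from development examples."""
    parts = []
    for ex in dev_examples:
        # Each development example contributes a solved question-answer pair.
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    # The test question is appended unanswered for the model to complete.
    parts.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical 2-shot example in the medicine topic.
dev = [
    {"question": "Which organ secretes insulin? (A) liver (B) pancreas", "answer": "B"},
    {"question": "Normal adult resting heart rate? (A) 60-100 bpm (B) 120-160 bpm", "answer": "A"},
]
prompt = build_few_shot_prompt(
    dev, "Deficiency of which vitamin causes scurvy? (A) vitamin C (B) vitamin D"
)
```

The model's continuation after the final "Answer:" is then matched against the gold choice letter.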
The Zenseact Open Dataset (ZOD) is a large-scale and diverse multi-modal autonomous driving (AD) dataset, created by researchers at Zenseact. It was collected over a 2-year period in 14 different European countries, using a fleet of vehicles equipped with a full sensor suite. The dataset consists of three subsets: Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatiotemporal learning, sensor fusion, localization, and mapping.
SAMRS is a remote sensing segmentation dataset which provides object category, location, and instance information that can be used for semantic segmentation, instance segmentation, and object detection, either individually or in combination.
PerSeg is a dataset for personalized segmentation. The raw images are collected from the training data of subject-driven diffusion models: DreamBooth, Textual Inversion, and Custom Diffusion. PerSeg contains 40 objects of various categories in total, including daily necessities, animals, and buildings. Each object appears in 5∼7 images, contextualized in different poses or scenes, with annotated masks.
WebCPM is a Chinese LFQA dataset. It contains 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.
Bactrian-X is a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. The instructions were obtained from alpaca-52k and dolly-15k and translated into 52 languages (52 languages × 67k instances = 3.4M instances).
FinRED is a relation extraction dataset curated from financial news and earnings call transcripts, containing relations from the finance domain. FinRED was created by mapping Wikidata triplets onto the text using the distant supervision method.
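The distant supervision step can be sketched as follows: a sentence that mentions both entities of a known (head, relation, tail) triplet is labelled with that relation. This is a deliberately simplified illustration with made-up triplets and sentences; the actual FinRED pipeline and its matching rules are not shown in the description above:

```python
# Distant supervision sketch: label a sentence with a relation if it
# mentions both the head and tail entity of a known knowledge-base triplet.
triplets = [
    ("Apple", "subsidiary", "Beats Electronics"),
    ("Tim Cook", "employer", "Apple"),
]

def label_sentences(sentences, triplets):
    labelled = []
    for sent in sentences:
        for head, relation, tail in triplets:
            # Naive substring matching; real pipelines use entity linking.
            if head in sent and tail in sent:
                labelled.append((sent, head, relation, tail))
    return labelled

sentences = [
    "Apple acquired Beats Electronics in 2014.",
    "Tim Cook spoke at the Apple earnings call.",
    "Revenue grew this quarter.",
]
pairs = label_sentences(sentences, triplets)
```

The main known weakness of this approach is noise: a sentence may mention both entities without expressing the relation, which is why distantly supervised datasets are typically filtered or denoised afterwards.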
Defects4J is a collection of reproducible bugs and a supporting infrastructure with the goal of advancing software engineering research.
This is the first general Underwater Image Instance Segmentation (UIIS) dataset, containing 4,628 images across 7 categories with pixel-level annotations for the underwater instance segmentation task.
This is an academic intrusion detection dataset. All the credit goes to the original authors: Dr. Iman Sharafaldin, Dr. Saqib Hakak, Dr. Arash Habibi Lashkari, and Dr. Ali Ghorbani. Please cite their original paper.
To make synthetic images match the properties of real dark photography, we analyze the illumination distribution of low-light images. We collect 270 low-light images from the public MEF [42], NPE [6], LIME [8], DICM [43], VV, and Fusion [44] datasets, transform the images into the YCbCr color space, and calculate the histogram of the Y channel. We also collect 1,000 raw images from RAISE [45] as normal-light images and calculate the histogram of their Y channel in YCbCr.
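The Y-channel histogram computation can be sketched in pure NumPy. This is a minimal illustration assuming 8-bit RGB input and the standard BT.601 luma weights, which may differ in detail from the exact conversion the authors used:

```python
import numpy as np

def y_channel_histogram(rgb: np.ndarray) -> np.ndarray:
    """Compute the 256-bin histogram of the luma (Y) channel of an 8-bit RGB image."""
    # BT.601 luma: Y = 0.299 R + 0.587 G + 0.114 B
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    hist, _ = np.histogram(y, bins=256, range=(0, 256))
    return hist

# A uniformly dark image concentrates all its mass in the low bins,
# which is the kind of illumination statistic the analysis above collects.
dark = np.full((64, 64, 3), 10, dtype=np.uint8)
hist = y_channel_histogram(dark)
```

Aggregating such histograms over the 270 low-light images (and separately over the 1,000 normal-light images) yields the illumination distributions being compared.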
In this work, we propose a general dataset for Color-Event camera based Single Object Tracking, termed COESOT. It contains 1,354 color-event videos with 478,721 RGB frames, split into training and testing subsets of 827 and 527 videos, respectively. The videos are collected from both outdoor and indoor scenarios (such as streets, zoos, and homes) using the DAVIS346 event camera with a zoom lens. Therefore, our videos reflect variation in distance and depth, which other datasets fail to capture. Unlike existing benchmarks with limited categories, COESOT covers a wider range of object categories (90 classes), as shown in Fig. 3 (a), spanning four main groups: persons, animals, electronics, and other goods.