19,997 machine learning datasets
ClueWeb22 is the newest iteration of the ClueWeb line of datasets, providing 10 billion web pages accompanied by rich page-level information. Its design was driven by the need for a high-quality, large-scale web corpus to support a range of academic and industry research, for example in information systems, retrieval-augmented AI systems, and model pretraining. Compared with earlier ClueWeb corpora, ClueWeb22 is larger, more varied, of higher quality, and better aligned with the document distributions of commercial web search. Besides raw HTML, the dataset includes rich information about the web pages produced by industry-standard document understanding systems, including the visual representation of pages rendered by a web browser, HTML structure parsed by a neural network parser, and pre-processed, cleaned document text.
VOT2020 is a Visual Object Tracking benchmark for short-term tracking in RGB.
RadQA is a radiology question answering dataset with 3,074 questions posed against radiology reports and annotated with their corresponding answer spans (6,148 question-answer evidence pairs in total) by physicians. The questions were created manually from the clinical referral sections of the reports, so they reflect the actual information needs of ordering physicians, eliminate bias from seeing the answer context, and organically yield unanswerable questions. The answer spans are marked within the Findings and Impressions sections of a report. To satisfy complex clinical requirements, the dataset includes complete yet concise answer phrases (not just entities) that can span multiple lines.
PSI-AVA is a dataset designed for holistic surgical scene understanding. It contains approximately 20.45 hours of surgical procedures performed by three expert surgeons, with annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos.
Dataset for document shadow removal
The dataset contains single-shot videos taken from moving cameras in underwater environments. The first shard of the new Marine Video Kit dataset is presented to serve video retrieval and other computer vision challenges. In addition to basic metadata statistics, we present several insights based on low-level features as well as semantic annotations of selected keyframes. The dataset comprises 1,379 videos ranging from 2 s to 4.95 min in length, with mean and median durations of 29.9 s and 25.4 s, respectively. Data were captured in 11 different regions and countries between 2011 and 2022.
We propose a test to measure the multitask accuracy of large Chinese language models. We constructed a large-scale, multi-task test consisting of single- and multiple-choice questions from various branches of knowledge. The test encompasses the fields of medicine, law, psychology, and education, with medicine divided into 15 sub-tasks and education into 8 sub-tasks. The questions in the dataset were manually collected by professionals from freely available online resources, including university medical examinations, national unified legal professional qualification examinations, psychological counselor exams, graduate entrance examinations for psychology majors, and the Chinese National College Entrance Examination. In total, we collected 11,900 questions, which we divided into a few-shot development set and a test set. The few-shot development set contains 5 questions per topic, amounting to 55 questions in total. The test set comprises 11,845 questions.
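The few-shot development set is meant to seed in-context prompts for evaluation. A minimal sketch of assembling such a prompt is shown below; the prompt template, field names, and example questions are illustrative assumptions, not the benchmark's official format:

```python
def build_few_shot_prompt(dev_examples, test_question):
    """Assemble a k-shot multiple-choice prompt from development examples."""
    parts = []
    for ex in dev_examples:
        # Each development example contributes a solved question-answer pair.
        parts.append(f"Question: {ex['question']}\nAnswer: {ex['answer']}")
    # The test question is appended unanswered for the model to complete.
    parts.append(f"Question: {test_question}\nAnswer:")
    return "\n\n".join(parts)

# Hypothetical 2-shot example in the medicine topic.
dev = [
    {"question": "Which organ secretes insulin? (A) liver (B) pancreas", "answer": "B"},
    {"question": "Normal adult resting heart rate? (A) 60-100 bpm (B) 120-160 bpm", "answer": "A"},
]
prompt = build_few_shot_prompt(
    dev, "Deficiency of which vitamin causes scurvy? (A) vitamin C (B) vitamin D"
)
```

The model's continuation after the final "Answer:" is then matched against the gold choice letter.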
The Zenseact Open Dataset (ZOD) is a large-scale and diverse multi-modal autonomous driving (AD) dataset, created by researchers at Zenseact. It was collected over a 2-year period in 14 different European countries, using a fleet of vehicles equipped with a full sensor suite. The dataset consists of three subsets: Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatiotemporal learning, sensor fusion, localization, and mapping.
SAMRS is a remote sensing segmentation dataset which provides object category, location, and instance information that can be used for semantic segmentation, instance segmentation, and object detection, either individually or in combination.
PerSeg is a dataset for personalized segmentation. The raw images are collected from the training data of subject-driven diffusion models: DreamBooth, Textual Inversion, and Custom Diffusion. PerSeg contains 40 objects of various categories in total, including daily necessities, animals, and buildings. Each object appears in 5∼7 images, contextualized in different poses or scenes, with annotated masks.
WebCPM is a Chinese LFQA dataset. It contains 5,500 high-quality question-answer pairs, together with 14,315 supporting facts and 121,330 web search actions.
Bactrian-X is a comprehensive multilingual parallel dataset of 3.4 million instruction-response pairs across 52 languages. The instructions were obtained from alpaca-52k and dolly-15k and translated into 52 languages (52 languages × 67k instances = 3.4M instances).
FinRED is a relation extraction dataset curated from financial news and earnings call transcripts, containing relations from the finance domain. FinRED was created by mapping Wikidata triplets onto the text using the distant supervision method.
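The distant supervision step can be sketched as follows: a sentence that mentions both entities of a known (head, relation, tail) triplet is labelled with that relation. This is a deliberately simplified illustration with made-up triplets and sentences; the actual FinRED pipeline and its matching rules are not shown in the description above:

```python
# Distant supervision sketch: label a sentence with a relation if it
# mentions both the head and tail entity of a known knowledge-base triplet.
triplets = [
    ("Apple", "subsidiary", "Beats Electronics"),
    ("Tim Cook", "employer", "Apple"),
]

def label_sentences(sentences, triplets):
    labelled = []
    for sent in sentences:
        for head, relation, tail in triplets:
            # Naive substring matching; real pipelines use entity linking.
            if head in sent and tail in sent:
                labelled.append((sent, head, relation, tail))
    return labelled

sentences = [
    "Apple acquired Beats Electronics in 2014.",
    "Tim Cook spoke at the Apple earnings call.",
    "Revenue grew this quarter.",
]
pairs = label_sentences(sentences, triplets)
```

The main known weakness of this approach is noise: a sentence may mention both entities without expressing the relation, which is why distantly supervised datasets are typically filtered or denoised afterwards.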
Defects4J is a collection of reproducible bugs and a supporting infrastructure with the goal of advancing software engineering research.
This is the first general Underwater Image Instance Segmentation (UIIS) dataset, containing 4,628 images across 7 categories with pixel-level annotations for the underwater instance segmentation task.
This is an academic intrusion detection dataset. All the credit goes to the original authors: Dr. Iman Sharafaldin, Dr. Saqib Hakak, Dr. Arash Habibi Lashkari, and Dr. Ali Ghorbani. Please cite their original paper.
To make synthetic images match the properties of real dark photography, we analyze the illumination distribution of low-light images. We collect 270 low-light images from the public MEF [42], NPE [6], LIME [8], DICM [43], VV, and Fusion [44] datasets, transform the images into the YCbCr color space, and calculate the histogram of the Y channel. We also collect 1,000 raw images from RAISE [45] as normal-light images and calculate the histogram of their Y channel in YCbCr.
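The Y-channel histogram computation can be sketched in pure NumPy. This is a minimal illustration assuming 8-bit RGB input and the standard BT.601 luma weights, which may differ in detail from the exact conversion the authors used:

```python
import numpy as np

def y_channel_histogram(rgb: np.ndarray) -> np.ndarray:
    """Compute the 256-bin histogram of the luma (Y) channel of an 8-bit RGB image."""
    # BT.601 luma: Y = 0.299 R + 0.587 G + 0.114 B
    r = rgb[..., 0].astype(np.float64)
    g = rgb[..., 1].astype(np.float64)
    b = rgb[..., 2].astype(np.float64)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    hist, _ = np.histogram(y, bins=256, range=(0, 256))
    return hist

# A uniformly dark image concentrates all its mass in the low bins,
# which is the kind of illumination statistic the analysis above collects.
dark = np.full((64, 64, 3), 10, dtype=np.uint8)
hist = y_channel_histogram(dark)
```

Aggregating such histograms over the 270 low-light images (and separately over the 1,000 normal-light images) yields the illumination distributions being compared.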
In this work, we propose a general dataset for Color-Event camera based Single Object Tracking, termed COESOT. It contains 1,354 color-event videos with 478,721 RGB frames, split into training and testing subsets of 827 and 527 videos, respectively. The videos are collected from both outdoor and indoor scenarios (such as streets, zoos, and homes) using the DAVIS346 event camera with a zoom lens. Therefore, our videos reflect variation in distance and depth, which other datasets fail to capture. Unlike existing benchmarks with limited categories, COESOT covers a wider range of object categories (90 classes), as shown in Fig. 3 (a), spanning four main groups: persons, animals, electronics, and other goods.