19,997 machine learning datasets
19,997 dataset results
Grep-BiasIR is a novel thoroughly-audited dataset which aim to facilitate the studies of gender bias in the retrieved results of IR systems.
A corpus of Offensive Language and Hate Speech Detection for Danish
A dataset for evaluating text classification, domain adaptation, and active learning models. The dataset consists of 22,660 documents (tweets) collected in 2018 and 2019. It spans across four domains: Alzheimer's, Parkinson's, Cancer, and Diabetes.
MuMiN is a misinformation graph dataset containing rich social media data (tweets, replies, users, images, articles, hashtags), spanning 21 million tweets belonging to 26 thousand Twitter threads, each of which have been semantically linked to 13 thousand fact-checked claims across dozens of topics, events and domains, in 41 different languages, spanning more than a decade.
This dataset is an extension of MASAC, a multimodal, multi-party, Hindi-English code-mixed dialogue dataset compiled from the popular Indian TV show, ‘Sarabhai v/s Sarabhai’. WITS was created by augmenting MASAC with natural language explanations for each sarcastic dialogue. The dataset consists of the transcribed sarcastic dialogues from 55 episodes of the TV show, along with audio and video multimodal signals. It was designed to facilitate Sarcasm Explanation in Dialogue (SED), a novel task aimed at generating a natural language explanation for a given sarcastic dialogue, that spells out the intended irony. Each data instance in WITS is associated with a corresponding video, audio, and textual transcript where the last utterance is sarcastic in nature. All the final selected explanations contain the following attributes:
ImageNet-Patch: A Dataset for Benchmarking Machine Learning Robustness against Adversarial Patches
X-ray images in this data set have been collected by Shenzhen No.3 Hospital in Shenzhen, Guangdong providence,China. The x-rays were acquired as part of the routine care at Shenzhen Hospital. The set contains images in JPEG format. There are 326 normal x-raysand 336 abnormal x-rays showing various manifestations of tuberculosis.
Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).
Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).
Internet Archive videos (IACC.3) under Creative Commons licenses. The test video collection for TRECVID-AVS2016-TRECVID-AVS2018 contains 335,944 web video clips (600hr).
ActorShift is a dataset where the domain shift comes from the change in actor species: we use humans in the source domain and animals in the target domain. This causes large variances in the appearance and motion of activities. For the corresponding dataset we select 1,305 videos of 7 human activity classes from Kinetics-700 as the source domain: sleeping, watching tv, eating, drinking, swimming, running and opening a door. For the target domain we collect 200 videos from YouTube of animals performing the same activities. We divide them into 35 videos for training (5 per class) and 165 for evaluation. The target domain data is scarce, meaning there is the additional challenge of adapting to the target domain with few unlabeled examples.
The semantic line (SEL) dataset contains 1,750 outdoor images in total, which are split into 1,575 training and 175 testing images. Each semantic line is annotated by the coordinates of the two end-points on an image boundary. If an image has a single dominant line, it is set as the ground truth primary semantic line. If an image has multiple semantic lines, the line with the best rank by human annotators is set as the ground-truth primary line, and the others as additional ground-truth semantic lines. In SEL, 61% of the images contain multiple semantic lines.
A database of over 1.4 billion 3x3 convolution filters extracted from hundreds of diverse CNN models with relevant meta information.
We have added data and cleaned the labels in HC-STVG to build the HC-STVG2.0. While the original database contained 5660 videos, the new database has been re-annotated and modified and now contains 16,000 + videos for this challenge.
A collection of 840 pretrained VQA models which may be regular “clean” models or malicious “backdoored” models which have been trained to include a secret backdoor trigger and behavior. This collection includes models with traditional single-key backdoors as well as Dual-Key Multimodal Backdoors.
CLEVR-X is a dataset that extends the CLEVR dataset with natural language explanations in the context of VQA. It consists of 3.6 million natural language explanations for 850k question-image pairs.
RELiC is a large-scale dataset of 79k excerpts of literary scholarship, each containing a quotation from a primary source and the surrounding critical analysis. 79 public domain primary sources and over 8,836 secondary sources are represented in RELiC.
Abstract Lobachevsky University Electrocardiography Database (LUDB) is an ECG signal database with marked boundaries and peaks of P, T waves and QRS complexes. The database consists of 200 10-second 12-lead ECG signal records representing different morphologies of the ECG signal. The ECGs were collected from healthy volunteers and patients of the Nizhny Novgorod City Hospital No 5 in 2017–2018. The patients had various cardiovascular diseases while some of them had pacemakers. The boundaries of P, T waves and QRS complexes were manually annotated by cardiologists for all 200 records. Also, each record is annotated with the corresponding diagnosis. The database can be used for educational purposes as well as for training and testing algorithms for ECG delineation, i.e. for automatic detection of boundaries and peaks of P, T waves and QRS complexes.
test
A large scale Chinese multi-modal dialogue corpus (120.84K dialogues and 198.82 K images). MMCHAT contains image-grounded dialogues collected from real conversations on social media. We manually annotate 100K dialogues from MMCHAT with the dialogue quality and whether the dialogues are related to the given image. We provide the rule-filtered raw dialogues that are used to create MMChat (Rule Filtered Raw MMChat). It contains 4.257 M dialogue sessions and 4.874 M images We provide a version of MMChat that is filtered based on LCCC (LCCC Filtered MMChat). This version contain much cleaner dialogues (492.6 K dialogue sessions and 1.066 M images)