19,997 machine learning datasets
The Robo-VLN dataset is a continuous control formulation of the VLN-CE dataset by Krantz et al., which was itself ported from the Room-to-Room (R2R) dataset created by Anderson et al. Details on converting the discrete VLN dataset into a continuous control formulation can be found in our paper.
The Signal Media One-Million News Articles Dataset by Signal Media was released to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.
The Mobile Turkish Scene Text (MTST 200) dataset consists of 200 indoor and outdoor Turkish scene text images.
Comparative Question Completion is a dataset for evaluating what large language models learn.
AM2iCo is a wide-coverage and carefully designed cross-lingual and multilingual evaluation set. It aims to assess the ability of state-of-the-art representation models to reason over cross-lingual lexical-level concept alignment in context for 14 language pairs.
This is a dataset of Stack Overflow code snippets that have been reused in GitHub repositories, where they were extended and adapted. The dataset links SO posts to their GitHub counterparts based on clone detection, timestamp analysis, and explicit URL references.
GermanDPR is a dataset for passage retrieval in German. GermanDPR comprises 8,245 question/answer pairs in the training set, 1,030 pairs in the development set, and 1,025 pairs in the test set. For each pair, there is one positive context and three hard negative contexts.
LoED (LoRaWAN at the Edge Dataset) is a dataset from nine LoRaWAN gateways collected in an urban environment. The dataset contains raw payload information, packet header information, and other gateway metadata, including all physical layer properties reported by the gateways, such as the CRC, RSSI, SNR, and spreading factor. Files are provided to analyse the data and compute aggregated statistics.
VOICe is a dataset for the development and evaluation of domain adaptation methods for sound event detection. VOICe consists of mixtures with three different sound events ("baby crying", "glass breaking", and "gunshot"), which are superimposed on three different categories of acoustic scenes: vehicle, outdoors, and indoors. The mixtures are also offered without any background noise.
Weibo-COV is a large-scale COVID-19 social media dataset from Weibo, covering more than 30 million posts from 1 November 2019 to 30 April 2020. The dataset's fields include basic post information, interaction information, location information, and the retweet network.
The Pushshift Telegram dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users. The Pushshift Telegram dataset can help researchers from a variety of disciplines interested in studying online social movements, protests, political extremism, and disinformation.
EDNA-Covid is a multilingual, large-scale dataset of coronavirus-related tweets collected since January 25, 2020. At the time of this publication, EDNA-Covid includes over 600M tweets from around the world in over 10 languages.
The BugHunter dataset is an automatically constructed and freely available bug dataset containing code elements (files, classes, methods) with a wide set of code metrics and bug information.
iBugMask is an in-the-wild face parsing dataset that contains 1,000 challenging face images and manually annotated labels for 11 semantic classes: background, facial skin, left/right brow, left/right eye, nose, upper/lower lip, inner mouth, and hair. The images are curated from challenging in-the-wild face alignment datasets, including 300W and Menpo. Compared with existing face parsing datasets, iBugMask contains in-the-wild scenarios such as “party” and “conference”, which include more challenging appearance variations or multiple faces. It has a larger number of profile faces, and it includes expressions other than “neutral” and “smile” (e.g. “surprise” and “scream”).
JVS-MuSiC is a Japanese multispeaker singing-voice corpus created to analyze and synthesize a variety of voices. The corpus consists of 100 singers' recordings of the same song, Katatsumuri, which is a Japanese children's song. It also includes another song that is different for each singer.
This is a multi-codec DASH dataset comprising AVC, HEVC, VP9, and AV1 content, intended to enable interoperability testing and streaming experiments on the efficient use of these codecs under various conditions.
face dataset
The PolitiFact variant of the UPFD dataset for benchmarking.
FSVOD-500 is a large-scale video dataset comprising 500 classes with class-balanced videos in each category for few-shot learning. FSVOD-500 is the first benchmark specifically designed for few-shot video object detection, evaluating the performance of a given model on novel classes.
Tracking the Trackers is a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus, and aggregate those to a dataset containing more than 140 million third-party embeddings in over 41 million domains.