Datasets

19,997 machine learning datasets

robo-vln (Robotics Vision-and-Language Navigation)

The Robo-VLN dataset is a continuous control formulation of the VLN-CE dataset by Krantz et al., which was ported over from the Room-to-Room (R2R) dataset created by Anderson et al. Details on converting the discrete VLN dataset into a continuous control formulation can be found in our paper.

2 papers · 1 benchmark · Images, RGB-D, Texts, Time series

Signal-1M

The Signal Media One-Million News Articles Dataset by Signal Media was released to facilitate research on news articles. It can be used for submissions to the NewsIR'16 workshop, but it is intended to serve the community for research on news retrieval in general.

2 papers · 0 benchmarks · Texts

MTST (Mobile Turkish Scene Text)

The Mobile Turkish Scene Text (MTST 200) dataset consists of 200 indoor and outdoor Turkish scene text images.

2 papers · 0 benchmarks · Images

Comparative Question Completion

Comparative Question Completion is a dataset for evaluating what large language models learn.

2 papers · 0 benchmarks · Texts

AM2iCo (Adversarial and Multilingual Meaning in Context)

AM2iCo is a wide-coverage and carefully designed cross-lingual and multilingual evaluation set. It aims to assess the ability of state-of-the-art representation models to reason over cross-lingual lexical-level concept alignment in context for 14 language pairs.

2 papers · 0 benchmarks · Texts

ExampleStack

This is a dataset of Stack Overflow code snippets that have been reused in GitHub repositories, where they were extended and adapted. The dataset links Stack Overflow posts to their GitHub counterparts based on clone detection, timestamp analysis, and explicit URL references.

2 papers · 0 benchmarks

GermanDPR

GermanDPR is a dataset for passage retrieval in German. It comprises 8,245 question/answer pairs in the training set, 1,030 pairs in the development set, and 1,025 pairs in the test set. For each pair, there is one positive context and three hard negative contexts.

2 papers · 0 benchmarks · Texts
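
As a rough illustration of the GermanDPR layout described above, the sketch below models one record with a single positive context and three hard negatives, and computes the split proportions from the stated sizes. The class name, field names, and helper function are hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RetrievalExample:
    """One question/answer pair with its contexts (hypothetical field names)."""
    question: str
    answer: str
    positive_context: str
    hard_negative_contexts: List[str] = field(default_factory=list)

    def is_well_formed(self) -> bool:
        # The description states three hard negative contexts per pair.
        return len(self.hard_negative_contexts) == 3


# Split sizes as stated in the dataset description.
SPLIT_SIZES: Dict[str, int] = {"train": 8245, "dev": 1030, "test": 1025}


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of pairs."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


if __name__ == "__main__":
    print(sum(SPLIT_SIZES.values()))
    print(split_fractions(SPLIT_SIZES))
```

The stated sizes add up to 10,300 pairs, so the splits are roughly 80% training, 10% development, and 10% test.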

LoED

LoED (LoRaWAN at the Edge Dataset) is a dataset collected from nine LoRaWAN gateways in an urban environment. It contains raw payload information along with other gateway metadata, including packet header information and all physical-layer properties reported by the gateways, such as the CRC, RSSI, SNR, and spreading factor. Files are provided to analyse the data and compute aggregated statistics.

2 papers · 0 benchmarks

VOICe

VOICe is a dataset for the development and evaluation of domain adaptation methods for sound event detection. VOICe consists of mixtures with three different sound events ("baby crying", "glass breaking", and "gunshot"), which are superimposed on three different categories of acoustic scenes: vehicle, outdoors, and indoors. The mixtures are also offered without any background noise.

2 papers · 0 benchmarks · Audio

Weibo-COV

Weibo-COV is a large-scale COVID-19 social media dataset from Weibo, covering more than 30 million posts from 1 November 2019 to 30 April 2020. The dataset's fields are rich, including basic post information, interaction information, location information, and the retweet network.

2 papers · 0 benchmarks · Texts

Pushshift Telegram

The Pushshift Telegram dataset is made up of over 27.8K channels and 317M messages from 2.2M unique users. The Pushshift Telegram dataset can help researchers from a variety of disciplines interested in studying online social movements, protests, political extremism, and disinformation.

2 papers · 0 benchmarks

EDNA-Covid

EDNA-Covid is a multilingual, large-scale dataset of coronavirus-related tweets collected since January 25, 2020. At the time of publication, EDNA-Covid includes over 600M tweets from around the world in over 10 languages.

2 papers · 0 benchmarks · Texts

BugHunter

The BugHunter dataset is an automatically constructed and freely available bug dataset containing code elements (files, classes, methods) with a wide set of code metrics and bug information.

2 papers · 0 benchmarks

iBugMask

iBugMask is an in-the-wild face parsing dataset that contains 1,000 challenging face images and manually annotated labels for 11 semantic classes: background, facial skin, left/right brow, left/right eye, nose, upper/lower lip, inner mouth, and hair. The images are curated from challenging in-the-wild face alignment datasets, including 300W and Menpo. Compared with existing face parsing datasets, iBugMask contains in-the-wild scenarios such as “party” and “conference”, which include more challenging appearance variations or multiple faces. It also has a larger number of profile faces, and it includes expressions other than “neutral” and “smile” (e.g. “surprise” and “scream”). The dataset is available for download.

2 papers · 2 benchmarks · Images

JVS-MuSiC

JVS-MuSiC is a Japanese multispeaker singing-voice corpus built to analyze and synthesize a variety of voices. The corpus consists of recordings of 100 singers performing the same song, Katatsumuri, a Japanese children's song. It also includes a second song that is different for each singer.

2 papers · 0 benchmarks · Audio, Speech

Multi-Codec DASH

This is a multi-codec DASH dataset comprising AVC, HEVC, VP9, and AV1 in order to enable interoperability testing and streaming experiments for the efficient usage of these codecs under various conditions.

2 papers · 0 benchmarks

warpPIE10P

A face dataset.

2 papers · 1 benchmark

UPFD-POL (User Preference-aware Fake News Detection)

The PolitiFact variant of the UPFD dataset for benchmarking.

2 papers · 2 benchmarks · Graphs, Texts

FSVOD-500

FSVOD-500 is a large-scale video dataset comprising 500 classes with class-balanced videos in each category for few-shot learning. It is the first benchmark specifically designed for few-shot video object detection, evaluating the performance of a given model on novel classes.

2 papers · 0 benchmarks · Images

Tracking the Trackers

Tracking the Trackers is a large-scale analysis of third-party trackers on the World Wide Web. We extract third-party embeddings from more than 3.5 billion web pages of the CommonCrawl 2012 corpus and aggregate them into a dataset containing more than 140 million third-party embeddings in over 41 million domains.

2 papers · 0 benchmarks
