Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

MSU Deinterlacer Benchmark

This is a dataset for the video deinterlacing problem. It contains 28 video sequences, each 60 frames long at a resolution of 1920x1080. TFF (top field first) interlacing was used to derive the interlaced data from the ground truth (GT).

4 papers · 5 benchmarks · RGB Video, Videos
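TFF (top field first) interlacing weaves the even (top) scan lines of one progressive frame with the odd scan lines of the next. As a rough illustration of how interlaced data could be derived from progressive ground truth (a minimal NumPy sketch; the benchmark's exact pipeline may differ):

```python
import numpy as np

def tff_interlace(frame_t, frame_t1):
    """Weave two consecutive progressive frames into one interlaced frame,
    top field first: even rows come from frame_t, odd rows from frame_t1.
    Frames are (H, W[, C]) arrays with an even number of rows."""
    interlaced = np.empty_like(frame_t)
    interlaced[0::2] = frame_t[0::2]   # top field from frame t
    interlaced[1::2] = frame_t1[1::2]  # bottom field from frame t+1
    return interlaced

# Tiny demo on 4x4 "frames": rows of the result alternate between sources.
a = np.full((4, 4), 1)
b = np.full((4, 4), 2)
out = tff_interlace(a, b)
```

A deinterlacer is then trained or evaluated on recovering the progressive frames from such woven fields.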

K-Hairstyle

K-hairstyle is a large-scale Korean hairstyle dataset with 256,679 high-resolution images. It also provides hair segmentation masks and a range of hair attributes annotated by expert Korean hair stylists.

4 papers · 0 benchmarks · Images

3D Vehicle Tracking Simulation Dataset

To collect the 3D Vehicle Tracking Simulation Dataset, a driving simulation is used to obtain accurate 3D bounding box annotations at no cost in human effort. The data collection and annotation pipeline extends previous work such as VIPER and FSV, especially in linking identities across frames. The simulation is based on Grand Theft Auto V, a modern game that simulates a functioning city and its surroundings in a photo-realistic three-dimensional world. Unlike VIPER, which requires expensive offline processing, this pipeline runs in real time, giving it the potential for large-scale data collection.

4 papers · 0 benchmarks · Environment

F-SIOL-310 (Few-Shot Incremental Object Learning)

F-SIOL-310 is a robotic dataset and benchmark for Few-Shot Incremental Object Learning, which is used to test incremental learning capabilities for robotic vision from a few examples.

4 papers · 0 benchmarks · Images

Us Vs. Them (Us vs. Them: A Dataset of Populist Attitudes, News Bias and Emotions)

The Us vs. Them dataset consists of 6,861 Reddit comments annotated for populist attitudes, and accompanies the first large-scale computational models of this phenomenon. It covers the relationship between populist mindsets and social groups, as well as a range of emotions typically associated with them.

4 papers · 0 benchmarks

Finnish Paraphrase Corpus

Finnish Paraphrase Corpus is a fully manually annotated paraphrase corpus for Finnish, containing 53,572 paraphrase pairs harvested from alternative subtitles and news headlines. 98% of the pairs in the corpus are manually classified as paraphrases at least in their given context, if not in all contexts.

4 papers · 0 benchmarks · Texts

LemgoRL

LemgoRL is an open-source benchmark tool for traffic signal control, designed to train reinforcement learning agents in a highly realistic simulation scenario with the aim of reducing the Sim2Real gap. In addition to the realistic simulation model, LemgoRL includes a traffic signal logic unit that ensures compliance with all regulatory and safety requirements. LemgoRL offers the same interface as the well-known OpenAI Gym toolkit, enabling easy adoption in existing research work.

4 papers · 0 benchmarks · Environment
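Because LemgoRL follows the Gym interface, an agent interacts with it through the usual reset/step loop. A self-contained sketch of that interface using a hypothetical stand-in environment (the real observation space, action space, and reward of LemgoRL are defined by the tool itself):

```python
import random

class GymStyleEnv:
    """Illustrative stand-in for a Gym-compatible environment such as
    LemgoRL. The dynamics here are invented purely to show the API shape."""

    def reset(self):
        # Return the initial observation, as Gym's Env.reset() does.
        self.t = 0
        return 0.0

    def step(self, action):
        # Return (observation, reward, done, info), as Gym's Env.step() does.
        self.t += 1
        obs = float(self.t)
        reward = -abs(action)   # e.g. a penalty such as queue length or delay
        done = self.t >= 5      # short episode for the demo
        return obs, reward, done, {}

env = GymStyleEnv()
obs, done, total = env.reset(), False, 0.0
while not done:
    action = random.choice([0, 1])           # random policy placeholder
    obs, reward, done, info = env.step(action)
    total += reward
```

Any RL library that speaks this interface can therefore be pointed at LemgoRL with minimal glue code.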

OCD (Out-of-Context Dataset)

OCD (Out-of-Context Dataset) is a synthetic dataset with fine-grained control over scene context. The images are generated with a 3D simulation engine in the VirtualHome environment, which makes it possible to control gravity, object co-occurrences, and relative sizes across 36 object categories in a virtual household environment.

4 papers · 0 benchmarks · Images

WEC-Eng

WEC-Eng is a cross-document event coreference resolution dataset extracted from English Wikipedia. Coreference links are not restricted to predefined topics. The training set includes 40,529 mentions distributed across 7,042 coreference clusters.

4 papers · 0 benchmarks · Texts

ATIS (vi) (Vietnamese Intent Detection and Slot Filling)

This is a dataset for intent detection and slot filling in Vietnamese. It consists of 5,871 gold-annotated utterances with 28 intent labels and 82 slot types.

4 papers · 2 benchmarks · Texts

MS^2 (Multi-Document Summarization of Medical Studies)

MS^2 (Multi-Document Summarization of Medical Studies) is a dataset of over 470k documents and 20k summaries derived from the scientific literature. It facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is one of the first large-scale, publicly available multi-document summarization datasets in the biomedical domain.

4 papers · 2 benchmarks · Texts

DiS-ReX

DiS-ReX is a multilingual dataset for distantly supervised (DS) relation extraction (RE). It has over 1.5 million instances spanning four languages (English, Spanish, German, and French), with 36 positive relation types plus a no-relation (NA) class.

4 papers · 0 benchmarks · Texts

Concadia

Concadia is a publicly available Wikipedia-based corpus, which consists of 96,918 images with corresponding English-language descriptions, captions, and surrounding context.

4 papers · 0 benchmarks · Images, Texts

WikiCLIR

WikiCLIR is a large-scale German-English retrieval dataset for Cross-Language Information Retrieval (CLIR). It contains 245,294 German single-sentence queries with 3,200,393 automatically extracted relevance judgments over 1,226,741 English Wikipedia articles as documents. Queries are well-formed natural-language sentences suitable for large-scale training of (translation-based) ranking models.

4 papers · 0 benchmarks · Texts

ImageNet-50 (TEMI Split)

The ImageNet-50 dataset split as introduced in TEMI: Adaloglou, Nikolas, Felix Michels, Hamza Kalisch, and Markus Kollmann. "Exploring the Limits of Deep Image Clustering using Pretrained Models." In BMVC, 2023.

4 papers · 3 benchmarks

KazakhTTS

KazakhTTS is an open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 91 hours of transcribed audio recordings spoken by two professional speakers (female and male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry.

4 papers · 0 benchmarks · Speech, Texts

Election2020

Election2020 is a Twitter dataset on the 2020 US presidential elections. To facilitate the understanding of political discourse and to empower the Computational Social Science research community, the authors publicly released this massive-scale, longitudinal dataset of U.S. politics- and election-related tweets. The multilingual dataset encompasses hundreds of millions of tweets and tracks all salient U.S. politics trends, actors, and events between 2019 and 2020. It predates and spans the whole period of the Republican and Democratic primaries, with real-time tracking of all presidential contenders on both sides of the aisle, and thereafter focuses on the presidential and vice-presidential candidates. The release is curated, documented, and updated on a weekly basis, through the November 3, 2020 election and beyond.

4 papers · 0 benchmarks

Wikipedia Citations

Wikipedia Citations is a comprehensive dataset of citations extracted from Wikipedia. A total of 29.3M citations were extracted from 6.1M English Wikipedia articles as of May 2020 and classified as references to books, journal articles, or Web content. This yields 4.0M citations to scholarly publications with known identifiers -- including DOI, PMC, PMID, and ISBN -- plus an additional 261K citations equipped with DOIs from Crossref. Overall, 6.7% of Wikipedia articles cite at least one journal article with an associated DOI, and Wikipedia cites just 2% of all articles with a DOI currently indexed in the Web of Science.

4 papers · 0 benchmarks

OMD (Oxford Multimotion Dataset)

The Oxford Multimotion Dataset (OMD) provides a number of multimotion estimation problems of varying complexity. It includes both complex problems that challenge existing algorithms as well as a number of simpler problems to support development. These include observations from both static and dynamic sensors, a varying number of moving bodies, and a variety of different 3D motions. It also provides a number of experiments designed to isolate specific challenges of the multimotion problem, including rotation about the optical axis and occlusion. In total, the Oxford Multimotion Dataset contains over 110 minutes of multimotion data consisting of stereo and RGB-D camera images, IMU data, and Vicon ground-truth trajectories. The dataset culminates in a complex toy car segment representative of many challenging real-world scenarios.

4 papers · 0 benchmarks

Warblr

Warblr is a dataset for the acoustic detection of birds, collected by a UK bird-sound crowdsourcing research spin-out of the same name. Through this initiative the authors gathered over 10,000 ten-second smartphone audio recordings from around the UK, totalling around 28 hours of audio.

4 papers · 0 benchmarks · Audio
Page 237 of 1000