Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

VisPro

The VisPro dataset contains coreference annotations for 29,722 pronouns drawn from 5,000 dialogues.

6 papers · 0 benchmarks · Images, Texts

iQIYI-VID

The iQIYI-VID dataset comprises video clips from iQIYI variety shows, films, and television dramas. The whole dataset contains 500,000 video clips of 5,000 celebrities, each 1–30 seconds long.

6 papers · 0 benchmarks · Videos

GICoref (Gender Inclusive Coreference)

GICoref is a fully annotated coreference resolution dataset written by and about trans people.

6 papers · 0 benchmarks · Texts

EHR-Rel

EHR-RelB is a benchmark dataset for biomedical concept relatedness, consisting of 3,630 concept pairs sampled from electronic health records (EHRs). EHR-RelA is a smaller set of 111 concept pairs, most of which are unrelated.

6 papers · 0 benchmarks · Biomedical, Texts

ADVANCE (AuDio Visual Aerial sceNe reCognition datasEt)

The AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) is a multimodal learning dataset that explores the contribution of both audio and conventional visual signals to scene recognition. It contains 5,075 pairs of geotagged aerial images and sounds, classified into 13 scene classes: airport, sports land, beach, bridge, farmland, forest, grassland, harbor, lake, orchard, residential area, shrub land, and train station.

6 papers · 0 benchmarks · Audio, Images

Lytro Illum

Lytro Illum is a light field dataset captured with a Lytro Illum camera. It comprises 640 light fields with significant variation in size, texture, background clutter, and illumination. Micro-lens image arrays and central view images are generated, and corresponding ground-truth maps are provided.

6 papers · 0 benchmarks · Images

3DPeople Dataset

A large-scale synthetic dataset of 2.5 million photo-realistic images of 80 subjects performing 70 activities in diverse outfits.

6 papers · 0 benchmarks

3DSeg-8

3DSeg-8 is a collection of several publicly available 3D segmentation datasets spanning different medical imaging modalities, e.g. magnetic resonance imaging (MRI) and computed tomography (CT), with various scan regions, target organs, and pathologies.

6 papers · 0 benchmarks · Medical

AO-CLEVr

AO-CLEVr is a synthetic-image dataset of "easy" attribute-object categories, based on CLEVR. Its attribute-object pairs are built from 8 attributes {red, purple, yellow, blue, green, cyan, gray, brown} and 3 object shapes {sphere, cube, cylinder}, yielding 24 attribute-object pairs with 7,500 images each. Every image shows a single object realizing one attribute-object pair; the object is randomly assigned one of two sizes (small/large), one of two materials (rubber/metallic), a random position, and random lighting following CLEVR defaults.

6 papers · 0 benchmarks · Images
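The AO-CLEVr counts above are purely combinatorial, so they can be sanity-checked in a few lines of Python (attribute and shape names taken verbatim from the description; the 7,500 images-per-pair figure is from the text):

```python
from itertools import product

# Enumerate AO-CLEVr's attribute-object pairs as stated in the description.
attributes = ["red", "purple", "yellow", "blue", "green", "cyan", "gray", "brown"]
shapes = ["sphere", "cube", "cylinder"]

pairs = list(product(attributes, shapes))
print(len(pairs))         # 24 attribute-object pairs (8 x 3)
print(len(pairs) * 7500)  # 180000 images in total
```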

ArCOV-19

ArCOV-19 is an Arabic COVID-19 Twitter dataset covering the period from 27 January to 30 April 2020. It is the first publicly available Arabic Twitter dataset on the COVID-19 pandemic, and includes over 1M tweets along with the propagation networks of the most popular subset of them (i.e., the most retweeted and liked).

6 papers · 0 benchmarks · Texts

Bengali Hate Speech

Introduces three Bengali datasets covering expressions of hate, commonly discussed topics, and opinions, for hate speech detection, document classification, and sentiment analysis, respectively.

6 papers · 0 benchmarks

BreizhCrops

BreizhCrops is a satellite image time series dataset for crop type classification. It consists of aggregated label data together with Sentinel-2 top-of-atmosphere and bottom-of-atmosphere time series for the region of Brittany (Breizh in the local language) in north-west France.

6 papers · 0 benchmarks

ChineseFoodNet

ChineseFoodNet targets automatic recognition of pictured Chinese dishes. Most existing food image datasets collected images either from recipe pictures or selfies; in ChineseFoodNet, each food category includes not only web recipe and menu pictures but also photos taken of real dishes, recipes, and menus. The dataset contains over 180,000 food photos across 208 categories, with each category covering large variations in presentation of the same dish.

6 papers · 0 benchmarks · Images

CITE

CITE is a crowd-sourced resource for multimodal discourse: this resource characterises inferences in image-text contexts in the domain of cooking recipes in the form of coherence relations.

6 papers · 0 benchmarks · Images, Texts

ClipShots

ClipShots is a large-scale dataset for shot boundary detection collected from YouTube and Weibo, covering more than 20 categories, including sports, TV shows, animals, etc. In contrast to previous shot boundary detection datasets, e.g. TRECVID and RAI, which only consist of documentaries or talk shows where the frames are relatively static, ClipShots mostly contains short videos from YouTube and Weibo. Many are home-made and pose additional challenges, e.g. hand-held vibration and large occlusion. The videos vary in type, including movie spotlights, competition highlights, family videos recorded by mobile phones, etc., and each is 1–20 minutes long. The gradual transitions in the dataset include dissolves, fade in/out, and slide in/out.

6 papers · 1 benchmark

CLUECorpus2020

CLUECorpus2020 is a large-scale corpus that can be used directly for self-supervised learning, such as pre-training a language model, or for language generation. It comprises 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl.

6 papers · 0 benchmarks · Texts

CoarseWSD-20

The CoarseWSD-20 dataset is a coarse-grained sense disambiguation dataset built from Wikipedia (nouns only), targeting 2 to 5 senses of 20 ambiguous words. It was specifically designed to provide an ideal setting for evaluating Word Sense Disambiguation (WSD) models (e.g. no senses in test sets missing from training), both quantitatively and qualitatively.

6 papers · 0 benchmarks · Texts

COCO-Tasks

COCO-Tasks comprises about 40,000 images in which the most suitable objects for 14 tasks have been annotated.

6 papers · 0 benchmarks

COUGH

COUGH is a large, challenging dataset for COVID-19 FAQ retrieval. Like a standard FAQ dataset, it consists of three parts: an FAQ Bank, a User Query Bank, and an Annotated Relevance Set. The FAQ Bank contains ~16K FAQ items scraped from 55 credible websites (e.g., CDC and WHO).

6 papers · 0 benchmarks

DaNE (Danish Dependency Treebank)

Danish Dependency Treebank (DaNE) is a named entity annotation for the Danish Universal Dependencies treebank using the CoNLL-2003 annotation scheme.

6 papers · 4 benchmarks · Texts
Page 194 of 1000