Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3d meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • Midi: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • Cad: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2


EDUB-Seg (Egocentric Dataset of the University of Barcelona – Segmentation)

Egocentric Dataset of the University of Barcelona – Segmentation (EDUB-Seg) is a dataset for egocentric event segmentation acquired with the Narrative Clip, which takes a picture every 30 seconds. The dataset contains a total of 18,735 images captured by 7 different users over a total of 20 days. To ensure diversity, the users wore the camera in different contexts: while attending a conference, on holiday, during the weekend, and during the week.

4 papers | 0 benchmarks | Images

EMU (Edited Media Understanding)

EMU contains 48k question-answer pairs written in rich natural language.

4 papers | 0 benchmarks | Images, Texts

EndoSLAM (Endoscopic SLAM dataset)

The endoscopic SLAM dataset (EndoSLAM) is a dataset for depth estimation in endoscopic videos. It consists of both ex-vivo and synthetically generated data. The ex-vivo part of the dataset includes standard as well as capsule endoscopy recordings. The dataset is divided into 35 sub-datasets: 18 for the colon, 5 for the small intestine, and 12 for the stomach.

4 papers | 0 benchmarks | Images

ESAD (SARAS Endoscopic Surgeon Action Detection)

ESAD is a large-scale dataset designed to tackle the problem of surgeon action detection in endoscopic minimally invasive surgery. ESAD aims to help increase the effectiveness and reliability of surgical assistant robots by realistically testing their awareness of the actions performed by a surgeon. The dataset provides bounding box annotations for 21 action classes on real endoscopic video frames captured during prostatectomy, and was used as the basis of a MIDL 2020 challenge.

4 papers | 0 benchmarks | Images, Medical

eSports Sensors Dataset

The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor data collected includes:

4 papers | 6 benchmarks | 6D, Actions, Biomedical, EEG, Environment, Replay data, Tabular, Time series, Tracking

EventKG+Click

EventKG+Click builds upon the event-centric EventKG knowledge graph and adds language-specific information on user interactions with events, entities, and their relations, derived from the Wikipedia clickstream.

4 papers | 0 benchmarks

Finer (Finnish News Corpus for Named Entity Recognition)

Finnish News Corpus for Named Entity Recognition (Finer) is a corpus of 953 articles (193,742 word tokens) annotated with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source.

4 papers | 0 benchmarks | Texts

Gazeta

Gazeta is a dataset for automatic summarization of Russian news. The dataset consists of 63,435 text-summary pairs. To form the training, validation, and test sets, these pairs were sorted by time: the first 52,400 pairs are used as the training set, the following 5,265 pairs as the validation set, and the remaining 5,770 pairs as the test set.

4 papers | 5 benchmarks | Texts
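
The time-ordered split described for Gazeta can be sketched as follows. This is only an illustration under stated assumptions: the record layout (a dict with a `date` key) and the function name `chronological_split` are hypothetical and do not come from the dataset's own tooling; only the three split sizes are taken from the description.

```python
from datetime import date

# Split sizes as stated in the Gazeta description.
TRAIN_SIZE, VAL_SIZE, TEST_SIZE = 52_400, 5_265, 5_770


def chronological_split(pairs, train_size=TRAIN_SIZE, val_size=VAL_SIZE):
    """Sort records by their 'date' key (assumed field name) and cut them
    into three contiguous blocks: oldest -> train, next -> val, rest -> test."""
    ordered = sorted(pairs, key=lambda p: p["date"])
    train = ordered[:train_size]
    val = ordered[train_size:train_size + val_size]
    test = ordered[train_size + val_size:]
    return train, val, test


# Toy demonstration with made-up records (not real Gazeta data).
toy = [
    {"date": date(2020, 1, d), "text": f"text {d}", "summary": f"summary {d}"}
    for d in (5, 1, 4, 2, 3)
]
train, val, test = chronological_split(toy, train_size=3, val_size=1)
```

Splitting by time rather than at random keeps every validation and test article later than the training articles, which is the property the description implies.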

GeoCoV19

GeoCoV19 is a large-scale Twitter dataset containing more than 524 million multilingual tweets. The dataset contains around 378K geotagged tweets and 5.4 million tweets with Place information. The annotations take toponyms from the user location field and the tweet content and resolve them to geolocations at the country, state, or city level. Of these tweets, 297 million are annotated with a geolocation using the user location field and 452 million using the tweet content.

4 papers | 0 benchmarks | Texts

GGPONC (German Guideline Program in Oncology NLP Corpus)

German Guideline Program in Oncology NLP Corpus (GGPONC) is a German-language corpus based on clinical practice guidelines for oncology. It is one of the largest corpora ever built from German medical documents. Unlike clinical documents, clinical guidelines do not contain any patient-related information and can therefore be used without data protection restrictions.

4 papers | 0 benchmarks | Texts

HarperValleyBank

The data simulate simple consumer banking interactions, containing about 23 hours of audio from 1,446 human-human conversations between 59 unique speakers.

4 papers | 0 benchmarks

HLA-Chat

HLA-Chat models character profiles and gives dialogue agents the ability to learn characters' language styles through their HLAs.

4 papers | 0 benchmarks | Texts

Houses3K

Houses3K is a dataset of 3,000 textured 3D house models. Houses3K is divided into twelve batches, each containing 50 unique house geometries. For each batch, five different textures were applied, forming the sets A, B, C, D, and E.

4 papers | 0 benchmarks
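
The Houses3K total follows from the three counts in its description: 12 batches, 50 geometries per batch, and 5 texture sets. A minimal sketch that enumerates the combinations is below; the `(batch, geometry, texture)` tuple layout and the 1-based numbering are assumptions made for illustration, not the dataset's actual file naming.

```python
from itertools import product

NUM_BATCHES = 12                           # "twelve batches"
GEOMETRIES_PER_BATCH = 50                  # "50 unique house geometries" per batch
TEXTURE_SETS = ("A", "B", "C", "D", "E")   # five texture sets

# Hypothetical index: one (batch, geometry, texture) tuple per textured model.
models = list(
    product(
        range(1, NUM_BATCHES + 1),
        range(1, GEOMETRIES_PER_BATCH + 1),
        TEXTURE_SETS,
    )
)
total_models = len(models)  # 12 * 50 * 5
```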

Hyperspectral City

Hyperspectral City is a dataset that adopts multi-channel visual input.

4 papers | 8 benchmarks | Images

Icentia11K

Icentia11K is a public ECG dataset of continuous raw signals for representation learning, containing 11 thousand patients and 2 billion labelled beats.

4 papers | 0 benchmarks

Image Editing Request Dataset

A new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.

4 papers | 0 benchmarks

IPRE

IPRE is a dataset for inter-personal relationship extraction, which aims to facilitate information extraction and knowledge graph construction research. In total, IPRE has over 41,000 labeled sentences for 34 types of relations, including about 9,000 sentences annotated by workers.

4 papers | 0 benchmarks

irc-disentanglement

This is a dataset for disentangling conversations on IRC, which is the task of identifying separate conversations in a single stream of messages. It contains disentanglement information for 77,563 IRC messages.

4 papers | 5 benchmarks | Texts

IS-A

The IS-A dataset is a dataset of relations extracted from a medical ontology. The different entities in the ontology are related by the “is a” relation. For example, ‘acute leukemia’ is a ‘leukemia’. The dataset has 294,693 nodes with 356,541 edges between them.

4 papers | 0 benchmarks | Graphs, Medical, Texts

Kitchen Scenes

Kitchen Scenes is a multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments, including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled, and objects in the scenes are annotated both with bounding boxes and in the 3D point cloud.

4 papers | 0 benchmarks | 3D, Images, Videos