Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3d meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • Midi: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • Cad: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

PreSIL (Precise Synthetic Image and LiDAR)

Consists of over 50,000 frames and includes high-definition images with full resolution depth information, semantic segmentation (images), point-wise segmentation (point clouds), and detailed annotations for all vehicles and people.

13 papers · 0 benchmarks

SARA (StAtutory Reasoning Assessment)

A dataset for statutory reasoning in tax law entailment and question answering.

13 papers · 0 benchmarks

SciTLDR

A new multi-target dataset of 5.4K TLDRs over 3.2K papers. SciTLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries while minimizing annotation burden.

13 papers · 0 benchmarks · Texts

Story Commonsense

Story Commonsense is a new large-scale dataset with rich low-level annotations and establishes baseline performance on several new tasks, suggesting avenues for future research.

13 papers · 0 benchmarks

TTPLA (Transmission Towers and Power Lines)

TTPLA is a public dataset of aerial images of Transmission Towers (TTs) and Power Lines (PLs). It can be used for detection and segmentation of transmission towers and power lines. It consists of 1,100 images at a resolution of 3,840×2,160 pixels, with 8,987 manually labelled instances of TTs and PLs.

13 papers · 0 benchmarks · Images

UFDD (Unconstrained Face Detection Dataset)

Unconstrained Face Detection Dataset (UFDD) aims to fuel further research in unconstrained face detection.

13 papers · 0 benchmarks · Images

WeatherBench

A benchmark dataset for data-driven medium-range weather forecasting, a topic of high scientific interest for atmospheric and computer scientists alike.

13 papers · 0 benchmarks

WSVD (Web Stereo Video Dataset)

The Web Stereo Video Dataset consists of 553 stereoscopic videos from YouTube. This dataset has a wide variety of scene types, and features many nonrigid objects.

13 papers · 0 benchmarks · Stereo, Videos

YUP++ (YUP++ Dynamic Scenes dataset)

A new and challenging video database of dynamic scenes that more than doubles the size of those previously available. This dataset is explicitly split into two subsets of equal size that contain videos with and without camera motion to allow for systematic study of how this variable interacts with the defining dynamics of the scene per se.

13 papers · 4 benchmarks

BrixIA (BrixIA Covid-19)

BrixIA Covid-19 is a large dataset of CXR images comprising all images taken for both triage and patient monitoring in sub-intensive and intensive care units during one month of the pandemic peak (between March 4th and April 4th 2020) at the ASST Spedali Civili di Brescia. It contains all the variability originating from a real clinical scenario. It includes 4,707 CXR images of COVID-19 subjects, acquired with both CR and DX modalities, in AP or PA projection, and retrieved from the facility RIS-PACS system.

13 papers · 0 benchmarks · Images, Medical

KorSTS

KorSTS is a dataset for semantic textual similarity (STS) in Korean. The dataset is constructed by automatically translating the STS-B dataset. To ensure translation quality, two professional translators with at least seven years of experience, who specialize in academic papers/books as well as business contracts, each post-edited half of the dataset and cross-checked each other’s translations afterward. The KorSTS dataset comprises 5,749 training examples translated automatically and 2,879 evaluation examples translated manually.

13 papers · 0 benchmarks · Texts

PersonalDialog

PersonalDialog is a large-scale multi-turn dialogue dataset containing various traits from a large number of speakers. The dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers. Each utterance is associated with a speaker who is marked with traits like Age, Gender, Location, Interest Tags, etc. Several anonymization schemes are designed to protect the privacy of each speaker.

13 papers · 0 benchmarks · Texts

PISC (People in Social Context)

The People in Social Context (PISC) dataset is a dataset that focuses on social relationships. It consists of 22,670 images of 9 types of social relationships. It has annotations for the bounding boxes of all people, as well as the social relationship between all pairs of people in the images. In addition, it also contains occupation annotation.

13 papers · 2 benchmarks · Images

KorQuAD (The Korean Question Answering Dataset)

KorQuAD is a large-scale question-and-answer dataset constructed for Korean machine reading comprehension. Its authors investigate the dataset to understand the distribution of answers and the types of reasoning required to answer the questions. The dataset follows the data-generation process of SQuAD to meet the same standard.

13 papers · 0 benchmarks

ecoset

Ecoset, an ecologically motivated image dataset, is a large-scale image dataset designed for human visual neuroscience, which consists of over 1.5 million images from 565 basic-level categories. Category selection was based on English nouns that most frequently occur in spoken language (estimated on a set of 51 million words obtained from American television and film subtitles) and concreteness ratings from human observers. Ecoset consists of basic-level categories (including human categories man, woman, and child) that describe physical things in the world (rather than abstract concepts) that are important to humans.

13 papers · 0 benchmarks · Images

BC2GM

Created by Smith et al. in 2008, the BioCreative II Gene Mention Recognition (BC2GM) Dataset contains data where participants are asked to identify a gene mention in a sentence by giving its start and end characters. The training set consists of a set of sentences and, for each sentence, a set of gene mentions (GENE annotations). Registration is required for access. The data is in English. Containing 20 in n/a file format.

13 papers · 2 benchmarks · Texts

NaturalProofs

The NaturalProofs Dataset is a large-scale dataset for studying mathematical reasoning in natural language. NaturalProofs consists of roughly 20,000 theorem statements and proofs, 12,500 definitions, and 1,000 additional pages (e.g. axioms, corollaries) derived from ProofWiki, an online compendium of mathematical proofs written by a community of contributors.

13 papers · 0 benchmarks · Texts

HopeEDI (HopeEDI: A Multilingual Hope Speech Detection Dataset for Equality, Diversity, and Inclusion)

Over the past few years, systems have been developed to control online content and eliminate abusive, offensive or hate speech content. However, people in power sometimes misuse this form of censorship to obstruct the democratic right of freedom of speech. Therefore, it is imperative that research take a positive reinforcement approach towards online content that is encouraging, positive and supportive. Until now, most studies have focused on solving this problem of negativity in the English language, though the problem is much more than just harmful content. Furthermore, it is multilingual as well. Thus, we have constructed a Hope Speech dataset for Equality, Diversity and Inclusion (HopeEDI) containing user-generated comments from the social media platform YouTube with 28,451, 20,198 and 10,705 comments in English, Tamil and Malayalam, respectively, manually labelled as containing hope speech or not. To our knowledge, this is the first research of its kind to annotate

13 papers · 3 benchmarks

GooAQ

GooAQ is a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GooAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GooAQ answers are mined from Google's responses to the collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections.

13 papers · 0 benchmarks · Texts

UPFD (User Preference-aware Fake News Detection)

For benchmarking, please refer to its variants UPFD-POL and UPFD-GOS.

13 papers · 0 benchmarks · Graphs, Texts