Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

1,019 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

1,019 dataset results (modality filter: Videos)

ORGaze

A new video dataset for the operating room (OR), with 30,000 objects across 5,000 stereo video sequences annotated with descriptions and gaze.

2 papers · 0 benchmarks · Videos

Parkinson's Pose Estimation Dataset

The dataset includes all movement trajectories extracted from videos of Parkinson's assessments using Convolutional Pose Machines (CPM), along with the corresponding CPM confidence values. It also includes ground-truth ratings of parkinsonism and dyskinesia severity using the UDysRS, UPDRS, and CAPSIT.

2 papers · 0 benchmarks · Images, Videos

Twitch-FIFA

Twitch-FIFA is a video-context, many-speaker dialogue dataset based on live-broadcast soccer game videos and their accompanying chats from Twitch.tv. It can be used to train visually grounded dialogue models that generate language relevant to temporal and spatial events in the live video while also staying relevant to the chat history.

2 papers · 0 benchmarks · Texts, Videos

Skeletics 152

A curated and 3D pose-annotated subset of RGB videos sourced from Kinetics-700, a large-scale action dataset.

2 papers · 0 benchmarks · Videos

ASD (Annotated Semantic Dataset)

The Annotated Semantic Dataset is composed of 11 videos divided into 3 activity categories (Biking, Driving, and Walking) and classed according to their amount of semantic information: 0p represents videos with approximately no semantic information; 25p, videos containing relevant semantic information in roughly 25% of their frames; and likewise for the 50p and 75p classes. The videos were recorded with a GoPro Hero 3 camera mounted on a helmet for the Biking and Walking videos and attached to a head strap for the Driving videos.

2 papers · 0 benchmarks · Videos
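The four classes can be read as binning each video by the fraction of frames that carry relevant semantic information. A minimal sketch of that reading, assuming a nearest-target rule (the dataset description states only the approximate targets, not exact boundaries):

```python
def asd_class(semantic_frame_fraction: float) -> str:
    """Map the fraction of semantically relevant frames in a video to an
    ASD class label. NOTE: the nearest-target rule here is an assumed
    illustration; the dataset states only the approximate targets."""
    targets = [(0.00, "0p"), (0.25, "25p"), (0.50, "50p"), (0.75, "75p")]
    return min(targets, key=lambda t: abs(t[0] - semantic_frame_fraction))[1]

print(asd_class(0.30))  # -> 25p
```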

PESMOD (PExels Small Moving Object Detection)

The PESMOD (PExels Small Moving Object Detection) dataset consists of high-resolution aerial images in which moving objects are manually labelled. It was created from videos selected from the Pexels website, and its aim is to provide a different and challenging benchmark for evaluating moving object detection methods. Each moving object is labelled in every frame in PASCAL VOC format in an XML file. The dataset consists of 8 different video sequences.

2 papers · 0 benchmarks · Videos
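Since the per-frame annotations follow the PASCAL VOC XML convention, they can be read with the Python standard library; a minimal sketch (the file name below is hypothetical):

```python
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path: str):
    """Parse a PASCAL VOC annotation file into
    (label, xmin, ymin, xmax, ymax) tuples."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        bb = obj.find("bndbox")
        boxes.append((
            name,
            int(float(bb.findtext("xmin"))),
            int(float(bb.findtext("ymin"))),
            int(float(bb.findtext("xmax"))),
            int(float(bb.findtext("ymax"))),
        ))
    return boxes

# Hypothetical frame annotation from one of the 8 PESMOD sequences.
for label, x1, y1, x2, y2 in load_voc_boxes("frame_000001.xml"):
    print(label, (x1, y1), (x2, y2))
```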

R2VQ (Recipe-to-Video Questions)

R2VQ is a dataset designed for testing competence-based machine comprehension over a multimodal recipe collection containing text-video aligned recipes.

2 papers · 0 benchmarks · Texts, Videos

Fetoscopy Placenta Data

The fetoscopy placenta dataset accompanies the MICCAI 2020 publication “Deep Placental Vessel Segmentation for Fetoscopic Mosaicking”. The dataset contains 483 frames with ground-truth vessel segmentation annotations taken from six different in vivo fetoscopic procedure videos. It also includes six unannotated in vivo continuous fetoscopic video clips (950 frames) with predicted vessel segmentation maps obtained from the leave-one-out cross-validation of the authors' method.

2 papers · 0 benchmarks · Images, Videos
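The leave-one-out protocol above (train on five procedures, evaluate on the sixth) can be sketched with scikit-learn's LeaveOneGroupOut, grouping frames by their source procedure video; the features and labels below are toy stand-ins, not the dataset's actual format:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Toy stand-ins: one feature row per annotated frame, grouped by the
# id (1..6) of the in vivo procedure video the frame came from.
rng = np.random.default_rng(0)
X = rng.random((483, 16))              # 483 annotated frames (toy features)
y = rng.integers(0, 2, size=483)       # toy per-frame labels
groups = rng.integers(1, 7, size=483)  # source procedure id

logo = LeaveOneGroupOut()
for fold, (train_idx, test_idx) in enumerate(logo.split(X, y, groups)):
    held_out = np.unique(groups[test_idx])[0]
    print(f"fold {fold}: train on {len(train_idx)} frames, "
          f"evaluate on procedure {held_out}")
```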

RISEdb (Robust Indoor Localization in Complex Scenarios (RISE) database)

The RISE (Robust Indoor Localization in Complex Scenarios) dataset is meant to train and evaluate visual indoor place recognizers. It contains more than 1 million geo-referenced images spread over 30 sequences, covering 5 heterogeneous buildings. For each building it provides:

  • A high-resolution (1 cm) 3D point cloud that defines the localization reference frame, generated with a mobile laser scanner and an inertial system.
  • Several image sequences spread over time, with accurate ground-truth poses retrieved by the laser scanner. Each sequence contains both stereo pairs and spherical images.
  • Geo-referenced smartphone data retrieved from the standard sensors of such devices.

2 papers · 0 benchmarks · 3D, Images, LiDAR, Videos

Large-scale Anomaly Detection

Large-scale Anomaly Detection (LAD) is a database for benchmarking anomaly detection in video sequences, featured in two aspects: 1) it contains 2,000 video sequences, including normal and abnormal clips across 14 anomaly categories (crash, fire, violence, etc.) with large scene variety, making it the largest anomaly analysis database to date; 2) it provides annotation data at both the video level (abnormal/normal video, anomaly type) and the frame level (abnormal/normal frame) to facilitate anomaly detection.

2 papers · 0 benchmarks · Videos

Hockey Fight Detection Dataset

Whereas the action recognition community has focused mostly on detecting simple actions like clapping, walking, or jogging, the detection of fights, or aggressive behavior in general, has been comparatively less studied. Such a capability may be extremely useful in video surveillance scenarios such as prisons, psychiatric or elderly care centers, or even camera phones. After an analysis of previous approaches, the authors test the well-known Bag-of-Words framework used for action recognition on the specific problem of fight detection, along with two of the best action descriptors available at the time: STIP and MoSIFT. For the purpose of evaluation, and to foster research on violence detection in video, they introduce a new video database containing 1,000 sequences divided into two groups: fights and non-fights. Experiments on this database and another with fights from action movies show that fights can be detected with near 90% accuracy.

2 papers · 4 benchmarks · Videos
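A minimal sketch of the Bag-of-Words pipeline described above, assuming local spatio-temporal descriptors (e.g., STIP or MoSIFT) have already been extracted per video; the vocabulary size, descriptor dimensionality, and classifier settings here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def bow_histograms(video_descriptors, k=32, seed=0):
    """Cluster all local descriptors into a k-word visual vocabulary,
    then represent each video as a normalized word-count histogram."""
    vocab = KMeans(n_clusters=k, random_state=seed, n_init=10)
    vocab.fit(np.vstack(video_descriptors))
    hists = []
    for desc in video_descriptors:
        h = np.bincount(vocab.predict(desc), minlength=k).astype(float)
        hists.append(h / h.sum())
    return np.array(hists), vocab

# Toy stand-ins for per-video STIP/MoSIFT descriptors (162-D is STIP-like).
rng = np.random.default_rng(0)
videos = [rng.random((int(rng.integers(50, 200)), 162)) for _ in range(20)]
labels = rng.integers(0, 2, size=20)    # 1 = fight, 0 = non-fight (toy)

X, vocab = bow_histograms(videos)
clf = SVC(kernel="rbf").fit(X, labels)  # RBF-SVM over BoW histograms
print(f"training accuracy: {clf.score(X, labels):.2f}")
```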

MuCo-VQA

MuCo-VQA consists of large-scale (3.7M) multilingual and code-mixed VQA datasets in multiple languages: Hindi (hi), Bengali (bn), Spanish (es), German (de), and French (fr), plus the code-mixed language pairs en-hi, en-bn, en-fr, en-de, and en-es.

2 papers · 0 benchmarks · Images, Texts, Videos

ERATO

ERATO is a large-scale multi-modal dataset for Pairwise Emotional Relationship Recognition (PERR). It comprises 31,182 video clips totalling about 203 hours. Unlike existing datasets, ERATO contains interaction-centric videos with multiple shots, varied video lengths, and multiple modalities, including visual, audio, and text.

2 papers · 0 benchmarks · Videos

MAVS (Multilingual Audio-Visual Smartphone dataset)

MAVS is an audio-visual smartphone dataset captured with five different recent smartphones. It contains 103 subjects recorded in three different sessions covering different real-world scenarios. Three different languages are included in the dataset to address the language dependency of speaker recognition systems.

2 papers · 0 benchmarks · Speech, Videos

MAAD

The Model for Attended Awareness in Driving (MAAD) is a dataset of third-person estimates of a driver’s attended awareness. It consists of videos of a scene, as seen by a person performing a task in the scene, along with noisily registered ego-centric gaze sequences from that person.

2 papers · 0 benchmarks · Videos

Kinetics-Sound

This is a subset of Kinetics-400, introduced in “Look, Listen and Learn” by Relja Arandjelovic and Andrew Zisserman.

2 papers · 0 benchmarks · Audio, Videos

VGG-Sound Sync

VGG-Sound Sync is an audio-visual synchronisation benchmark based on videos collected from YouTube. It contains over 100k video clips spanning 160 classes.

2 papers · 0 benchmarks · Videos

MetaVD (Meta Video Dataset)

MetaVD is a Meta Video Dataset for enhancing human action recognition datasets. It provides human-annotated relationship labels between action classes across human action recognition datasets. MetaVD was proposed in: Yuya Yoshikawa, Yutaro Shigeto, and Akikazu Takeuchi, “MetaVD: A Meta Video Dataset for enhancing human action recognition datasets,” Computer Vision and Image Understanding 212 (2021): 103276.

2 papers · 0 benchmarks · Graphs, Videos
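As a sketch of how such cross-dataset relation labels might be consumed, one can expand a class into its related classes from other datasets. The relation names and triple layout below are hypothetical illustrations, not MetaVD's actual schema:

```python
# Hypothetical relation triples: (src_dataset:class, relation, dst_dataset:class).
relations = [
    ("UCF101:Basketball", "equal", "Kinetics700:playing_basketball"),
    ("UCF101:Basketball", "similar", "HMDB51:shoot_ball"),
    ("ActivityNet:Dodgeball", "is-a", "Kinetics700:playing_ball"),
]

def matching_classes(query, allowed=("equal", "similar")):
    """Collect classes from other datasets related to `query`
    via the allowed relation types, in either direction."""
    out = set()
    for src, rel, dst in relations:
        if rel in allowed:
            if src == query:
                out.add(dst)
            elif dst == query:
                out.add(src)
    return out

print(matching_classes("UCF101:Basketball"))
```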

RLD (Responsive Listener Dataset)

RLD (Responsive Listener Dataset) is a conversation video corpus collected from public resources, featuring 67 speakers and 76 listeners with three different attitudes. Listeners show how engaged they are in the dialogue through non-verbal signals that respond in real time to the speakers' words, intonations, or behaviors.

2 papers · 0 benchmarks · Videos

ITB (Informative Tracking Benchmark)

The Informative Tracking Benchmark (ITB) is a small but informative tracking benchmark comprising 7% of the 1.2M frames of existing and newly collected datasets, which enables efficient evaluation while preserving effectiveness. Specifically, the authors designed a quality assessment mechanism to select the most informative sequences from existing benchmarks, taking into account 1) challenge level, 2) discriminative strength, and 3) density of appearance variations. They also collected additional sequences to ensure the diversity and balance of tracking scenarios, leading to a total of 20 sequences for each scenario.

2 papers · 2 benchmarks · Videos
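A toy sketch of such a selection mechanism: score each candidate sequence on the three criteria above and keep the top scorers. How the original work normalizes and weights the criteria is not stated here, so equal weights are assumed:

```python
from dataclasses import dataclass

@dataclass
class Sequence:
    name: str
    challenge_level: float               # in [0, 1]; higher = harder
    discriminative_strength: float       # in [0, 1]; higher = better separates trackers
    appearance_variation_density: float  # in [0, 1]

def informativeness(seq: Sequence, weights=(1/3, 1/3, 1/3)) -> float:
    """Equal-weight combination of the three ITB criteria (weights assumed)."""
    w1, w2, w3 = weights
    return (w1 * seq.challenge_level
            + w2 * seq.discriminative_strength
            + w3 * seq.appearance_variation_density)

def select_informative(pool, budget):
    """Keep the `budget` highest-scoring sequences from a candidate pool."""
    return sorted(pool, key=informativeness, reverse=True)[:budget]

pool = [Sequence("seq_a", 0.9, 0.7, 0.4),
        Sequence("seq_b", 0.3, 0.9, 0.8),
        Sequence("seq_c", 0.5, 0.2, 0.1)]
for s in select_informative(pool, budget=2):
    print(s.name, round(informativeness(s), 3))
```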
Page 33 of 51