SceneNN is an RGB-D scene dataset consisting of more than 100 indoor scenes. The scenes were captured at various places, such as offices, dormitories, classrooms, and pantries, at the University of Massachusetts Boston and the Singapore University of Technology and Design. All scenes are reconstructed into triangle meshes and have per-vertex and per-pixel annotations. The dataset is additionally enriched with fine-grained information such as axis-aligned bounding boxes, oriented bounding boxes, and object poses.
Sentence Compression is a dataset in which the syntactic tree of each compression is a subtree of its uncompressed counterpart, so supervised systems that require a structural alignment between input and output can be successfully trained on it.
LRS3-TED is a multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The dataset is substantially larger than other public datasets available for general research.
This dataset supports large-scale modelling and evaluation with 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups.
ScanRefer contains 51,583 descriptions of 11,046 objects from 800 ScanNet scenes. It is the first large-scale effort to perform object localization via natural language expressions directly in 3D.
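As a rough illustration of what a single ScanRefer annotation looks like, the record below pairs one free-form description with one object instance in a ScanNet scene. The field names are assumptions based on the public release, not an authoritative schema.

# Hypothetical ScanRefer-style annotation record (field names are assumed).
annotation = {
    "scene_id": "scene0000_00",      # ScanNet scene identifier
    "object_id": 12,                 # instance id of the referred object
    "object_name": "office chair",   # category label of that instance
    "ann_id": 3,                     # index of this description for the object
    "description": "the black office chair closest to the window",
}
# A localization system maps (scene point cloud, description) to the 3D
# bounding box of the referred object instance.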
We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
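As a sketch of how such densely grounded annotations could be represented, the structure below pairs each spoken phrase with its time span and a mouse-trace segment; the field names and nesting are assumptions for illustration, not the exact released format.

# Hypothetical Localized Narratives-style annotation (illustrative only).
narrative = {
    "image_id": "139",
    "caption": "A wooden table with a red mug on top of it.",
    "timed_caption": [
        # spoken phrases with start/end times (seconds) in the voice recording
        {"utterance": "a wooden table", "start_time": 0.0, "end_time": 1.1},
        {"utterance": "with a red mug", "start_time": 1.1, "end_time": 2.0},
    ],
    "traces": [
        # one mouse-trace segment per phrase: normalized (x, y) plus time t
        [{"x": 0.42, "y": 0.61, "t": 0.05}, {"x": 0.45, "y": 0.60, "t": 0.20}],
        [{"x": 0.51, "y": 0.33, "t": 1.15}, {"x": 0.52, "y": 0.31, "t": 1.40}],
    ],
}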
ESD is an Emotional Speech Database for voice conversion research. The ESD database consists of 350 parallel utterances spoken by 10 native English and 10 native Chinese speakers and covers 5 emotion categories (neutral, happy, angry, sad and surprise). More than 29 hours of speech data were recorded in a controlled acoustic environment. The database is suitable for multi-speaker and cross-lingual emotional voice conversion studies.
CSL-Daily (Chinese Sign Language Corpus) is a large-scale continuous sign language translation (SLT) dataset. It provides both spoken language translations and gloss-level annotations. The topics revolve around people's daily lives (e.g., travel, shopping, medical care), the most likely SLT application scenario.
SLAKE is an English-Chinese bilingual dataset consisting of 642 images and 14,028 question-answer pairs for training and testing Med-VQA systems.
LFWA is a popular unconstrained facial attribute dataset, which consists of 13,143 facial images of 5,749 identities. Each facial image has 40 attribute annotations.
AgentBench is a comprehensive benchmark designed to evaluate Large Language Models (LLMs) as agents in interactive environments, reflecting the shift of increasingly capable and autonomous LLMs beyond traditional natural language processing tasks toward real-world, pragmatic missions. It spans eight distinct environments, ranging from operating systems, databases, and knowledge graphs to games and web-based tasks.
The UTD-MHAD dataset consists of 27 different actions performed by 8 subjects. Each subject repeated each action 4 times; after removing three corrupted sequences, this yields 861 action sequences in total. RGB, depth, skeleton, and inertial sensor signals were recorded for each sequence.
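A minimal sketch of where the 861 figure comes from, enumerating the (action, subject, trial) grid; the per-modality naming mentioned in the final comment is an assumption about the release layout.

# Enumerate the nominal (action, subject, trial) grid of UTD-MHAD.
from itertools import product

actions, subjects, trials = range(1, 28), range(1, 9), range(1, 5)
combinations = list(product(actions, subjects, trials))

print(len(combinations))       # 864 nominal sequences (27 * 8 * 4)
print(len(combinations) - 3)   # 861 after dropping 3 corrupted sequences
# Each sequence is stored per modality (RGB video, depth, skeleton, inertial);
# assumed naming pattern: a{action}_s{subject}_t{trial}_<modality>.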
This dataset contains 21,889 outfits from polyvore.com, of which 17,316 are for training, 1,497 for validation, and 3,076 for testing.
ComplexWebQuestions is a dataset for answering complex questions that require reasoning over multiple web snippets. It contains a large set of complex questions in natural language and can be used in multiple ways: by querying a search engine directly, as a reading comprehension task over retrieved web snippets, or as a semantic parsing task, since each question is paired with a SPARQL query.
The FaceScape dataset provides large-scale, high-quality 3D face models, parametric models, and multi-view images. Camera parameters and the age and gender of the subjects are also included. The data have been released to the public for non-commercial research purposes.
COCO-QA is a dataset for visual question answering based on images from MS COCO. Question-answer pairs are automatically generated from the image captions, and answers are single words falling into four categories: object, number, color, and location.
Tracking by Natural Language (TNL2K) is constructed for the evaluation of tracking by natural language specification. It contains 2,000 video sequences, each annotated with a natural language description of the target object together with per-frame bounding boxes.
Understanding where people are looking is an informative social cue. In this work, we present Gaze360, a large-scale gaze-tracking dataset and method for robust 3D gaze estimation in unconstrained images. Our dataset consists of 238 subjects in indoor and outdoor environments with labelled 3D gaze across a wide range of head poses and distances. It is the largest publicly available dataset of its kind by both subject and variety, made possible by a simple and efficient collection method. Our proposed 3D gaze model extends existing models to include temporal information and to directly output an estimate of gaze uncertainty. We demonstrate the benefits of our model via an ablation study, and show its generalization performance via a cross-dataset evaluation against other recent gaze benchmark datasets. We furthermore propose a simple self-supervised approach to improve cross-dataset domain adaptation. Finally, we demonstrate an application of our model for estimating customer attention.
TaxiBJ consists of taxicab GPS trajectory data and meteorology data in Beijing from four time intervals: 1st Jul. 2013 - 30th Oct. 2013, 1st Mar. 2014 - 30th Jun. 2014, 1st Mar. 2015 - 30th Jun. 2015, and 1st Nov. 2015 - 10th Apr. 2016.
DialogSum is a large-scale dialogue summarization dataset, consisting of 13,460 dialogues with corresponding manually labeled summaries and topics.