Datasets

19,997 machine learning datasets

19,997 dataset results

Doc2Dial (Doc2Dial: Document-grounded Dialogue)

For goal-oriented document-grounded dialogs, it often involves complex contexts for identifying the most relevant information, which requires better understanding of the inter-relations between conversations and documents. Meanwhile, many online user-oriented documents use both semi-structured and unstructured contents for guiding users to access information of different contexts. Thus, we create a new goal-oriented document-grounded dialogue dataset that captures more diverse scenarios derived from various document contents from multiple domains such ssa.gov and studentaid.gov. For data collection, we propose a novel pipeline approach for dialogue data construction, which has been adapted and evaluated for several domains.

36 papers0 benchmarksTexts

STPLS3D

Our project (STPLS3D) aims to provide a large-scale aerial photogrammetry dataset with synthetic and real annotated 3D point clouds for semantic and instance segmentation tasks.

36 papers10 benchmarks3D, Images, Point cloud

AdvGLUE (Adversarial GLUE)

Adversarial GLUE (AdvGLUE) is a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations.

36 papers1 benchmarksTexts

LSUI (Large Scale Underwater Image Dataset)

We released a large-scale underwater image (LSUI) dataset including 5004 image pairs, which involve richer underwater scenes (lighting conditions, water types and target categories) and better visual quality reference images than the existing ones.

36 papers2 benchmarksImages

MAD

MAD (Movie Audio Descriptions) is an automatically curated large-scale dataset for the task of natural language grounding in videos or natural language moment retrieval. MAD exploits available audio descriptions of mainstream movies. Such audio descriptions are redacted for visually impaired audiences and are therefore highly descriptive of the visual content being displayed. MAD contains over 384,000 natural language sentences grounded in over 1,200 hours of video, and provides a unique setup for video grounding as the visual stream is truly untrimmed with an average video duration of 110 minutes. 2 orders of magnitude longer than legacy datasets.

36 papers29 benchmarksTexts, Videos

V2XSet

A large-scale V2X perception dataset using CARLA and OpenCDA

36 papers24 benchmarks3D, Images, LiDAR

CelebV-HQ

CelebV-HQ is a large-scale video facial attributes dataset with annotations. CelebV-HQ contains 35,666 video clips involving 15,653 identities and 83 manually labeled facial attributes covering appearance, action, and emotion.

36 papers16 benchmarksVideos

TEACh (Task-driven Embodied Agents that Chat)

Robots operating in human spaces must be able to engage in natural language interaction with people, both understanding and executing instructions, and using conversation to resolve ambiguity and recover from mistakes. To study this, we introduce TEACh, a dataset of over 3,000 human--human, interactive dialogues to complete household tasks in simulation. A Commander with access to oracle information about a task communicates in natural language with a Follower. The Follower navigates through and interacts with the environment to complete tasks varying in complexity from "Make Coffee" to "Prepare Breakfast", asking questions and getting additional information from the Commander. We propose three benchmarks using TEACh to study embodied intelligence challenges, and we evaluate initial models' abilities in dialogue understanding, language grounding, and task execution.

36 papers0 benchmarksDialog, Environment, Images

InfoSeek (Visual Information Seeking)

In this project, we introduce InfoSeek, a visual question answering dataset tailored for information-seeking questions that cannot be answered with only common sense knowledge. Using InfoSeek, we analyze various pre-trained visual question answering models and gain insights into their characteristics. Our findings reveal that state-of-the-art pre-trained multi-modal models (e.g., PaLI-X, BLIP2, etc.) face challenges in answering visual information-seeking questions, but fine-tuning on the InfoSeek dataset elicits models to use fine-grained knowledge that was learned during their pre-training.

36 papers2 benchmarksImages, Texts

VOTChallenge (Visual Object Tracking)

The Visual Object Tracking (VOT) dataset is a collection of video sequences used for evaluating and benchmarking visual object tracking algorithms. It provides a standardized platform for researchers and practitioners to assess the performance of different tracking methods.

36 papers0 benchmarks

MM-SafetyBench

The MultiModal Safety Benchmark (MM-SafetyBench) is a comprehensive framework designed for conducting safety-critical evaluations of Multimodal Large Language Models (MLLMs). It addresses the security concerns surrounding MLLMs, which can be compromised by query-relevant images, as if the text query itself were malicious¹².

36 papers0 benchmarks

50 Salads

Activity recognition research has shifted focus from distinguishing full-body motion patterns to recognizing complex interactions of multiple entities. Manipulative gestures – characterized by interactions between hands, tools, and manipulable objects – frequently occur in food preparation, manufacturing, and assembly tasks, and have a variety of applications including situational support, automated supervision, and skill assessment. With the aim to stimulate research on recognizing manipulative gestures we introduce the 50 Salads dataset. It captures 25 people preparing 2 mixed salads each and contains over 4h of annotated accelerometer and RGB-D video data. Including detailed annotations, multiple sensor types, and two sequences per participant, the 50 Salads dataset may be used for research in areas such as activity recognition, activity spotting, sequence analysis, progress tracking, sensor fusion, transfer learning, and user-adaptation.

35 papers12 benchmarksVideos

mebeblurf

Matanga Darknet — 2025 Access Guide

35 papers25 benchmarksImages

Arxiv HEP-TH citation graph

Arxiv HEP-TH (high energy physics theory) citation graph is from the e-print arXiv and covers all the citations within a dataset of 27,770 papers with 352,807 edges. If a paper i cites paper j, the graph contains a directed edge from i to j. If a paper cites, or is cited by, a paper outside the dataset, the graph does not contain any information about this. The data covers papers in the period from January 1993 to April 2003 (124 months).

35 papers16 benchmarksGraphs

Open Entity

The Open Entity dataset is a collection of about 6,000 sentences with fine-grained entity types annotations. The entity types are free-form noun phrases that describe appropriate types for the role the target entity plays in the sentence. Sentences were sampled from Gigaword, OntoNotes and web articles. On average each sentence has 5 labels.

35 papers1 benchmarksTexts

STRING

STRING is a collection of protein-protein interaction (PPI) networks.

35 papers0 benchmarksGraphs

BirdSong

The BirdSong dataset consists of audio recordings of bird songs at the H. J. Andrews (HJA) Experimental Forest, using unattended microphones. The goal of the dataset is to provide data to automatically identify the species of bird responsible for each utterance in these recordings. The dataset contains 548 10-seconds audio recordings.

35 papers0 benchmarksImages, Videos

COVID-19 Image Data Collection

Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19.

35 papers1 benchmarksBiology, Biomedical, Images, Medical

doc2dial

A new dataset of goal-oriented dialogues that are grounded in the associated documents.

35 papers0 benchmarksTexts

Google Landmarks Dataset v2

This is the second version of the Google Landmarks dataset (GLDv2), which contains images annotated with labels representing human-made and natural landmarks. The dataset can be used for landmark recognition and retrieval experiments. This version of the dataset contains approximately 5 million images, split into 3 sets of images: train, index and test

35 papers0 benchmarksImages

PreviousPage 71 of 1000Next