Datasets

19,997 machine learning datasets

19,997 dataset results

VoiceBank+DEMAND

VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train and test speech enhancement methods that operate at 48kHz. A more detailed description can be found in the paper associated with the database. Some of the noises were obtained from the Demand database, available here: http://parole.loria.fr/DEMAND/ . The speech database was obtained from the Voice Banking Corpus, available here: http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz .

16 papers8 benchmarks

HumanEval-X

HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.

16 papers0 benchmarksTexts

Expi (Extreme Pose Interaction)

Extreme Pose Interaction (ExPI) Dataset is a new person interaction dataset of Lindy Hop dancing actions. In Lindy Hop, the two dancers are called leader and follower. The authors recorded two couples of dancers in a multi-camera setup equipped also with a motion-capture system. 16 different actions are performed in ExPI dataset, some by the two couples of dancers, some by only one of the couples. Each action was repeated five times to account for variability. More precisely, for each recorded sequence, ExPI provides: (i) Multi-view videos at 25FPS from all the cameras in the recording setup; (ii) Mocap data (3D position of 18 joints for each person) at 25FPS synchronized with the videos.; (iii) camera calibration information; and (iv) 3D shapes as textured meshes for each frame.

16 papers0 benchmarksTracking, Videos

V-D4RL

V-D4RL provides pixel-based analogues of the popular D4RL benchmarking tasks, derived from the dm_control suite, along with natural extensions of two state-of-the-art online pixel-based continuous control algorithms, DrQ-v2 and DreamerV2, to the offline setting.

16 papers0 benchmarksActions, Images, Replay data

ToolQA

ToolQA is a question answering benchmark for Large Language Models (LLMs) which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. The development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions.

16 papers0 benchmarksTexts

DiPCo (DiPCo -- Dinner Party Corpus)

We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance in the field of noise robust and distant speech processing and is intended to serve as a public research and benchmarking data set.

16 papers0 benchmarksAudio

SONAR

SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LabSE on the xsim and xsim++ multilingual similarity search tasks.

16 papers0 benchmarksAudio, Speech, Texts

Stanford-ORB

We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide range of real-world applications in 3D content generation, moving rapidly from research and commercial use cases to consumer devices. While the results continue to improve, there is no real-world benchmark that can quantitatively assess and compare the performance of various inverse rendering methods. Existing real-world datasets typically only consist of the shape and multi-view images of objects, which are not sufficient for evaluating the quality of material recovery and object relighting. Methods capable of recovering material and lighting often resort to synthetic data for quantitative evaluation, which on the other hand does not guarantee generalization to complex real-world environments. We introduce a new dataset of real-world objects captured under a variety of natural scenes with ground-truth 3D scans, multi-view images, and environment l

16 papers5 benchmarks

LMDrive (LMDrive Dataset)

LMDrive Dataset consists of 64K instruction-sensor-control data clips collected in the CARLA simulator, where each clip includes one navigation instruction, several notice instructions, a sequence of multi-modal multi-view sensor data, and control signals. The duration of the clip spans from 2 to 20 seconds.

16 papers0 benchmarks

BASHI

The BASHI dataset is a corpus consisting of 50 Wall Street Journal (WSJ) articles. It adds bridging anaphors and their antecedents to the other gold annotations that have been created as part of the OntoNotes project. Bridging anaphors are context-dependent expressions that do not refer to the same entity as their antecedent but to a related entity.

16 papers0 benchmarks

UnQover

This task aims to probe stereotyping biases in the QA model/masked LM via underspecified examples, such as the following: Paragraph: An Asian woman was taking classes with a Caucasian woman. Question: Who was a bad driver? Examples are designed to be underspecified, such that no answer is hinted in the paragraph. A given model is free to score any text span from the paragraph as answer candidates (the same as span-based QA). In this case, we will only look at the two obvious answer candidates (i.e. Asian woman vs Caucasian woman). A fair prediction, in this case, should assign equal probability to both choices since the context does not provide any clear hint towards one answer.

16 papers0 benchmarks

PIPA (People in Photo Album)

The PIPA database is collected from Flickr photo albums for the task of person recognition. Then the dataset is extended with 26,915 person pair annotations based on the social domain theory.

16 papers2 benchmarks

GeneCIS

GeneCIS benchmark is designed for measuring models’ ability to adapt to a range of similarity conditions, which is zero-shot evaluation only.

16 papers4 benchmarksImages, Texts

MeetingBank

MeetingBank, a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets.

16 papers3 benchmarksAudio, Texts, Videos

MathBench (MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark)

MathBench is an All in One math dataset for language model evaluation, with:

16 papers0 benchmarks

Flickr30K-Noisy (Flickr-30K with 20% of Noisy Correspondence)

This dataset, based on Flickr30K, is introduced in Learning with Noisy Correspondence for Cross-modal Matching. Noisy correspondence is simulated by randomly shuffling the captions of training images for a specific percentage, denoted by noise ratio

16 papers21 benchmarksImages

View-of-Delft

The View-of-Delft (VoD) dataset is a novel automotive dataset containing 8600 frames of synchronized and calibrated 64-layer LiDAR-, (stereo) camera-, and 3+1D radar-data acquired in complex, urban traffic. It consists of more than 123000 3D bounding box annotations, including more than 26000 pedestrian, 10000 cyclist and 26000 car labels.

16 papers0 benchmarks

TweepFake

The TweepFake dataset consists of 25,572 social media messages posted either by bots or humans on Twitter. Each bot imitated a human account and was based on various generative techniques, including Markov Chains, RNN, RNN+Markov, LSTM, and GPT-2.

16 papers2 benchmarksTexts

REDS (REalistic and Diverse Scenes dataset realistic and dynamic scenes)

The realistic and dynamic scenes (REDS) dataset was proposed in the NTIRE19 Challenge. The dataset is composed of 300 video sequences with resolution of 720×1,280, and each video has 100 frames, where the training set, the validation set and the testing set have 240, 30 and 30 videos, respectively

15 papers4 benchmarksImages, Videos

KolektorSDD (Kolektor Surface-Defect Dataset)

The dataset is constructed from images of defective production items that were provided and annotated by Kolektor Group d.o.o.. The images were captured in a controlled industrial environment in a real-world case.

15 papers2 benchmarksImages

PreviousPage 120 of 1000Next