19,997 machine learning datasets
19,997 dataset results
VoiceBank+DEMAND is a noisy speech database for training speech enhancement algorithms and TTS models. The database was designed to train and test speech enhancement methods that operate at 48kHz. A more detailed description can be found in the paper associated with the database. Some of the noises were obtained from the Demand database, available here: http://parole.loria.fr/DEMAND/ . The speech database was obtained from the Voice Banking Corpus, available here: http://homepages.inf.ed.ac.uk/jyamagis/release/VCTK-Corpus.tar.gz .
HumanEval-X is a benchmark for evaluating the multilingual ability of code generative models. It consists of 820 high-quality human-crafted data samples (each with test cases) in Python, C++, Java, JavaScript, and Go, and can be used for various tasks, such as code generation and translation.
Extreme Pose Interaction (ExPI) Dataset is a new person interaction dataset of Lindy Hop dancing actions. In Lindy Hop, the two dancers are called leader and follower. The authors recorded two couples of dancers in a multi-camera setup equipped also with a motion-capture system. 16 different actions are performed in ExPI dataset, some by the two couples of dancers, some by only one of the couples. Each action was repeated five times to account for variability. More precisely, for each recorded sequence, ExPI provides: (i) Multi-view videos at 25FPS from all the cameras in the recording setup; (ii) Mocap data (3D position of 18 joints for each person) at 25FPS synchronized with the videos.; (iii) camera calibration information; and (iv) 3D shapes as textured meshes for each frame.
V-D4RL provides pixel-based analogues of the popular D4RL benchmarking tasks, derived from the dm_control suite, along with natural extensions of two state-of-the-art online pixel-based continuous control algorithms, DrQ-v2 and DreamerV2, to the offline setting.
ToolQA is a question answering benchmark for Large Language Models (LLMs) which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. The development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions.
We present a speech data corpus that simulates a "dinner party" scenario taking place in an everyday home environment. The corpus was created by recording multiple groups of four Amazon employee volunteers having a natural conversation in English around a dining table. The participants were recorded by a single-channel close-talk microphone and by five far-field 7-microphone array devices positioned at different locations in the recording room. The dataset contains the audio recordings and human labeled transcripts of a total of 10 sessions with a duration between 15 and 45 minutes. The corpus was created to advance in the field of noise robust and distant speech processing and is intended to serve as a public research and benchmarking data set.
SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders. It substantially outperforms existing sentence embeddings such as LASER3 and LabSE on the xsim and xsim++ multilingual similarity search tasks.
We introduce Stanford-ORB, a new real-world 3D Object inverse Rendering Benchmark. Recent advances in inverse rendering have enabled a wide range of real-world applications in 3D content generation, moving rapidly from research and commercial use cases to consumer devices. While the results continue to improve, there is no real-world benchmark that can quantitatively assess and compare the performance of various inverse rendering methods. Existing real-world datasets typically only consist of the shape and multi-view images of objects, which are not sufficient for evaluating the quality of material recovery and object relighting. Methods capable of recovering material and lighting often resort to synthetic data for quantitative evaluation, which on the other hand does not guarantee generalization to complex real-world environments. We introduce a new dataset of real-world objects captured under a variety of natural scenes with ground-truth 3D scans, multi-view images, and environment l
LMDrive Dataset consists of 64K instruction-sensor-control data clips collected in the CARLA simulator, where each clip includes one navigation instruction, several notice instructions, a sequence of multi-modal multi-view sensor data, and control signals. The duration of the clip spans from 2 to 20 seconds.
The BASHI dataset is a corpus consisting of 50 Wall Street Journal (WSJ) articles. It adds bridging anaphors and their antecedents to the other gold annotations that have been created as part of the OntoNotes project. Bridging anaphors are context-dependent expressions that do not refer to the same entity as their antecedent but to a related entity.
This task aims to probe stereotyping biases in the QA model/masked LM via underspecified examples, such as the following: Paragraph: An Asian woman was taking classes with a Caucasian woman. Question: Who was a bad driver? Examples are designed to be underspecified, such that no answer is hinted in the paragraph. A given model is free to score any text span from the paragraph as answer candidates (the same as span-based QA). In this case, we will only look at the two obvious answer candidates (i.e. Asian woman vs Caucasian woman). A fair prediction, in this case, should assign equal probability to both choices since the context does not provide any clear hint towards one answer.
The PIPA database is collected from Flickr photo albums for the task of person recognition. Then the dataset is extended with 26,915 person pair annotations based on the social domain theory.
GeneCIS benchmark is designed for measuring models’ ability to adapt to a range of similarity conditions, which is zero-shot evaluation only.
MeetingBank, a benchmark dataset created from the city councils of 6 major U.S. cities to supplement existing datasets.
MathBench is an All in One math dataset for language model evaluation, with:
This dataset, based on Flickr30K, is introduced in Learning with Noisy Correspondence for Cross-modal Matching. Noisy correspondence is simulated by randomly shuffling the captions of training images for a specific percentage, denoted by noise ratio
The View-of-Delft (VoD) dataset is a novel automotive dataset containing 8600 frames of synchronized and calibrated 64-layer LiDAR-, (stereo) camera-, and 3+1D radar-data acquired in complex, urban traffic. It consists of more than 123000 3D bounding box annotations, including more than 26000 pedestrian, 10000 cyclist and 26000 car labels.
The TweepFake dataset consists of 25,572 social media messages posted either by bots or humans on Twitter. Each bot imitated a human account and was based on various generative techniques, including Markov Chains, RNN, RNN+Markov, LSTM, and GPT-2.
The realistic and dynamic scenes (REDS) dataset was proposed in the NTIRE19 Challenge. The dataset is composed of 300 video sequences with resolution of 720×1,280, and each video has 100 frames, where the training set, the validation set and the testing set have 240, 30 and 30 videos, respectively
The dataset is constructed from images of defective production items that were provided and annotated by Kolektor Group d.o.o.. The images were captured in a controlled industrial environment in a real-world case.