FLIP includes several benchmark datasets that contain a variety of protein sequences, each with a real-valued label indicating its "fitness" (how well the protein performs some particular function). The goal is to predict a protein's fitness from its sequence alone. Different representations of protein sequences (e.g. learned embeddings from large language models) may prove helpful here.
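As an illustration of turning a protein sequence into a fixed-length feature vector for fitness regression, here is a minimal one-hot encoding sketch. The amino-acid alphabet ordering and zero-padding scheme are assumptions for illustration, not FLIP's prescribed representation:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumed ordering)
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot(seq, max_len):
    """Flat one-hot vector of shape (max_len * 20,), zero-padded past len(seq)."""
    vec = [0.0] * (max_len * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq[:max_len]):
        vec[pos * len(AMINO_ACIDS) + AA_INDEX[aa]] = 1.0
    return vec
```

Such vectors can feed any standard regressor as a baseline before moving to learned embeddings.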
In this paper, we introduce the MoisesDB dataset for musical source separation. It consists of 240 tracks from 45 artists, covering twelve musical genres. For each song, we provide its individual audio sources, organized in a two-level hierarchical taxonomy of stems. This will facilitate building and evaluating fine-grained source separation systems that go beyond the limitation of using four stems (drums, bass, other, and vocals) due to lack of data. To facilitate the adoption of this dataset, we publish an easy-to-use Python library to download, process, and use MoisesDB. Alongside thorough documentation and analysis of the dataset contents, this work provides baseline results for open-source separation models at varying separation granularities (four, five, and six stems) and discusses their results.
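To illustrate how a two-level stem taxonomy collapses fine-grained sources into the classic four stems, here is a small sketch; the stem names and groupings below are hypothetical placeholders, not MoisesDB's actual taxonomy or library API:

```python
# Hypothetical two-level taxonomy: coarse stem -> fine-grained sources
FOUR_STEM_MAP = {
    "drums": ["kick", "snare", "cymbals"],
    "bass": ["bass_guitar", "bass_synth"],
    "vocals": ["lead_vocals", "backing_vocals"],
    "other": ["guitar", "piano", "strings"],
}

def to_four_stems(sources):
    """Sum fine-grained source signals (equal-length lists) into four coarse stems."""
    inverse = {fine: coarse for coarse, fines in FOUR_STEM_MAP.items() for fine in fines}
    stems = {coarse: None for coarse in FOUR_STEM_MAP}
    for name, signal in sources.items():
        coarse = inverse[name]
        if stems[coarse] is None:
            stems[coarse] = list(signal)
        else:
            stems[coarse] = [a + b for a, b in zip(stems[coarse], signal)]
    return stems
```

The same pattern extends to five- or six-stem groupings by changing the mapping.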
From Fidelity to Perceptual Quality: A Semi-Supervised Approach for Low-Light Image Enhancement
We scrape data from GooBix, which contains 156 games of 5 × 5 mini crosswords. The goal is not just to solve the task, as more general crosswords can be readily solved with specialized NLP pipelines that leverage large-scale retrieval instead of LMs. Rather, we aim to explore the limits of LMs as general problem solvers that explore their own thoughts and guide their own exploration with deliberate reasoning as a heuristic.
Casia V1 is a dataset for forgery classification. Casia V1+ is a modification of the Casia V1 dataset proposed by Chen et al. that replaces authentic images that also exist in Casia V2 with images from the COREL dataset to avoid data contamination.
The ConceptARC dataset is a benchmark for evaluating understanding and generalization in the Abstraction and Reasoning Corpus (ARC) domain. It was developed by Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The ability to form and abstract concepts is key to human intelligence, but such abilities remain lacking in state-of-the-art AI systems. There has been substantial research on conceptual abstraction in AI, particularly using idealized domains such as Raven's Progressive Matrices and Bongard problems.
A total of 170 videos for training and 30 videos for testing, each of which has 60 frames, amounting to 12,000 paired frames. (Note that the first and last 30 frames of each video are NOT consecutive, and their darkness levels are simulated differently as well.)
The CALLHOME English Corpus is a collection of unscripted telephone conversations between native speakers of English.
NPHardEval is a dynamic benchmark designed to assess the reasoning abilities of Large Language Models (LLMs) across a broad spectrum of algorithmic questions.
The AmbigNQ dataset is a resource for exploring ambiguity in open-domain question answering.
Object HalBench is a benchmark used to evaluate the performance of language models, particularly multimodal ones (i.e., those that can process and generate both text and images). It's designed to test how well these models can avoid "hallucinations": generating text that is not factually grounded in the images being processed.
The EARS-WHAM dataset mixes speech from the EARS dataset with real noise recordings from the WHAM! dataset. Speech and noise files are mixed at signal-to-noise ratios (SNRs) randomly sampled in the range [−2.5, 17.5] dB, where the SNR is computed using loudness K-weighted relative to full scale (LKFS), standardized in ITU-R BS.1770, to obtain a more perceptually meaningful scaling and to exclude silent regions from the SNR computation.
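The mixing step can be sketched as follows. For simplicity this uses a plain energy-based SNR rather than the LKFS loudness of ITU-R BS.1770 that EARS-WHAM actually uses, and the function name is illustrative:

```python
import math
import random

def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the pair has the requested energy SNR, then add."""
    ps = sum(s * s for s in speech) / len(speech)  # mean speech power
    pn = sum(n * n for n in noise) / len(noise)    # mean noise power
    # gain such that 10*log10(ps / (gain^2 * pn)) == snr_db
    gain = math.sqrt(ps / (pn * 10 ** (snr_db / 10)))
    scaled_noise = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled_noise)]
    return mixture, scaled_noise

# draw an SNR uniformly from the EARS-WHAM range
snr_db = random.uniform(-2.5, 17.5)
```

A production pipeline would replace the mean-power estimates with BS.1770 loudness, which also gates out silent regions before measuring.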
VBench is a comprehensive benchmark suite for video generative models, which evaluates video generation quality across specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods. The suite includes 16 dimensions for evaluating Text-to-Video (T2V) models, such as subject consistency, motion smoothness, and overall consistency. VBench also supports evaluating Image-to-Video (I2V) models and has recently introduced VBench-Long for evaluating longer videos. It's designed to align with human perceptions and provide valuable insights for future developments in video generation.
A popular dataset for node classification on heterogeneous graphs.
Source: HME100K
Multi-Modal Reading (MMR) Benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, spatial relations, and grounding, with carefully designed evaluation metrics.
Cityscapes-Seq is a standard dataset for semantic urban scene understanding, featuring real-world videos from 50 cities in Germany and neighboring countries. It comprises 2,975 training video clips and 500 validation video clips, and each clip contains 30 consecutive frames.
Forgery Diversity: DF40 comprises 40 distinct deepfake techniques (both representative and SOTA methods are included), facilitating the detection of today's SOTA deepfakes and AIGC content. We provide 10 face-swapping methods, 13 face-reenactment methods, 12 entire-face-synthesis methods, and 5 face-editing methods.