TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images3,275
  • Texts3,148
  • Videos1,019
  • Audio486
  • Medical395
  • 3D383
  • Time series298
  • Graphs285
  • Tabular271
  • Speech199
  • RGB-D192
  • Environment148
  • Point cloud135
  • Biomedical123
  • LiDAR95
  • RGB Video87
  • Tracking78
  • Biology71
  • Actions68
  • 3d meshes65
  • Tables52
  • Music48
  • EEG45
  • Hyperspectral images45
  • Stereo44
  • MRI39
  • Physics32
  • Interactive29
  • Dialog25
  • Midi22
  • 6D17
  • Replay data11
  • Financial10
  • Ranking10
  • Cad9
  • fMRI7
  • Parallel6
  • Lyrics2
  • PSG2

19,997 dataset results

ChiMed-VL

ChiMed-VL Dataset ChiMed-VL-Alignment dataset ChiMed-VL-Alignment consists of 580,014 image-text couplings, each pair falling into one of two categories: context information of an image or descriptions of an image. The context category contains 167M tokens, presenting a median text length of 435 (Q1: 211, Q3: 757). Conversely, descriptions, more concise and image-specific, contain inline descriptions and captions. They comprise 63M tokens, with median lengths settling at 59 (Q1: 45, Q3: 83).

6 papers0 benchmarks

WSJ0-2mix-extr

WSJ0-2mix-extr is a speech extraction dataset

6 papers1 benchmarksAudio

WaterBench

Multi-level Benchmark of Watermarks for Large Language Models

6 papers0 benchmarksTexts

ARCADE (Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs Dataset)

ARCADE: Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs Dataset Phase 2 consist of two folders with 300 images in each of them as well as annotations.

6 papers0 benchmarksImages, Medical

PhoCAL

Object pose estimation is crucial for robotic applications and augmented reality. To provide a benchmark with high-quality ground truth annotations to the community, we introduce a multimodal dataset for category-level object pose estimation with photometrically challenging objects termed PhoCaL. PhoCaL comprises 60 high quality 3D models of household objects over 8 categories including highly reflective, transparent and symmetric objects. We developed a novel robot-supported multi-modal (RGB, depth, polarisation) data acquisition and annotation process. It ensures sub-millimeter accuracy of the pose for opaque textured, shiny and transparent objects, no motion blur and perfect camera synchronisation.

6 papers0 benchmarks

MusicBench

The MusicBench dataset is a music audio-text pair dataset that was designed for text-to-music generation purpose and released along with Mustango text-to-music model. MusicBench is based on the MusicCaps dataset, which it expands from 5,521 samples to 52,768 training and 400 test samples!

6 papers1 benchmarksAudio, Music, Texts

CGIQA-6K (Computer Graphics Image Quality Assessment)

CGIQA-6k database is a large-scale, in-the-wild CGIQA database consisting of 6,000 CGIs. The CGIQA-6k dataset is a large-scale, in-the-wild database for Computer Graphics Image Quality Assessment (CGIQA). It consists of 6,000 Computer Graphics Images (CGIs). These CGIs are artificially generated visuals created using computer programs and are prevalent across various platforms, from video games to streaming media.

6 papers0 benchmarks

classification benchmark

This benchmark includes 11 image classification datasets that were used to evaluate the transferability of metrics. Datasets include FGVC Aircraft, Caltech101, Stanford Cars, CIFAR-10, CIFAR-100, DTD, Oxford-102, Flowers, Food-101, Oxford-IIIT Pets, SUN397, and VOC2007 . Please refer to SFDA (https://github.com/TencentARC/SFDA) or ETran (https://github.com/mgholamikn/ETran/tree/main) for further details about the benchmark.

6 papers1 benchmarks

MegaVeridicality

The MegaVeridicality Dataset is a collection of ordinal veridicality judgments as well as ordinal acceptability judgments for 773 clause-embedding verbs of English. It was created by Aaron Steven White and Kyle Rawlins. The dataset is used to study the complex array of inferences that different open-class lexical items trigger. For example, it examines why certain sentences give rise to specific inferences while structurally identical sentences trigger different inferences. The dataset also investigates how lexically triggered inferences are conditioned by surprising aspects of the syntactic context in which a word occurs. It provides a detailed description of item construction, and collection methods, and discusses how to use a dataset on this scale to address questions in linguistic theory.

6 papers0 benchmarks

sStoryCloze (Spoken Story Cloze)

The sStoryCloze refers to the Spoken StoryCloze benchmark, which is a spoken version of the StoryCloze dataset. The StoryCloze dataset consists of five-sentence commonsense stories, where the task is to predict the ending of a story given the first four sentences and two possible endings, one of which is the correct ending and the other is a distractor. The sStoryCloze evaluates the model's capabilities to capture fine-grained causal and temporal commonsense relations in spoken language. It assesses the model's ability to generate coherent and contextually appropriate continuations given a spoken prompt. The dataset is used to evaluate the performance of SpeechLMs in understanding and generating spoken narratives.

6 papers0 benchmarks

ShapeTalk (The ShapeTalk Dataset)

ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. The dataset provides discriminative utterances for a total of 36,391 shapes, across 30 object classes. Overall, ShapeTalk contains 73,799 distinct contexts, and a total of 536,596 utterances

6 papers0 benchmarksImages, Texts

GlobalEgoMocap Test Dataset

Egocentric motion capture dataset

6 papers8 benchmarks

WebLINX (Real-World Website Navigation with Multi-Turn)

WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.

6 papers4 benchmarksActions, Images, RGB Video, Ranking, Texts, Videos

Full-body Parkinson’s disease dataset

A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease

6 papers1 benchmarks

LastLetterConcat

The LastLetterConcat dataset is a collection of word concatenations formed by taking the last letters of individual words and joining them together. Each entry in the dataset consists of a question and an answer, where the answer is the result of concatenating the last letters of specific words. Here are some examples:

6 papers0 benchmarks

HumanEval-XL

We introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.

6 papers0 benchmarksTexts

BIPIA (Benchmark of Indirect Prompt Injection Attacks)

Recent advancements in large language models (LLMs) have led to their adoption across various applications, notably in combining LLMs with external content to generate responses. These applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise LLM's output, causing their responses to deviate from user expectations. Despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different LLMs is available due to the lack of a benchmark. Furthermore, no effective defense has been proposed. We introduce the first benchmark of indirect prompt injection attack, BIPIA, to measure the robustness of various LLMs and defenses against indirect prompt injection attacks. We hope that our benchmark and defenses can inspire future work in this important area.

6 papers0 benchmarks

OOD-CV (Out Of Distribution Generalization in Computer Vision)

Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context and the weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1. Some nuisance factors have a much stronger negative effect on the performance compared to others, also depending on the vision task. 2. Current approaches to enhance robustness have only marginal effects, and can even reduce robustness. 3. We do not observe significant differences between convolutional and transformer architectures. We believe our datase

6 papers4 benchmarks3D, Images

VDD (Varied Drone Dataset for Semantic Segmentation)

Semantic segmentation of drone images is critical for various aerial vision tasks as it provides essential seman- tic details to understand scenes on the ground. Ensuring high accuracy of semantic segmentation models for drones requires access to diverse, large-scale, and high-resolution datasets, which are often scarce in the field of aerial image processing. While existing datasets typically focus on urban scenes and are relatively small, our Varied Drone Dataset (VDD) addresses these limitations by offering a large-scale, densely labeled collection of 400 high-resolution images spanning 7 classes. This dataset features various scenes in urban, industrial, rural, and natural areas, captured from different camera angles and under diverse lighting conditions.

6 papers2 benchmarks

BraTS2018

Multimodal Brain Tumor Segmentation Challenge 2018

6 papers0 benchmarks
PreviousPage 206 of 1000Next