19,997 machine learning datasets
19,997 dataset results
ChiMed-VL Dataset ChiMed-VL-Alignment dataset ChiMed-VL-Alignment consists of 580,014 image-text couplings, each pair falling into one of two categories: context information of an image or descriptions of an image. The context category contains 167M tokens, presenting a median text length of 435 (Q1: 211, Q3: 757). Conversely, descriptions, more concise and image-specific, contain inline descriptions and captions. They comprise 63M tokens, with median lengths settling at 59 (Q1: 45, Q3: 83).
WSJ0-2mix-extr is a speech extraction dataset
Multi-level Benchmark of Watermarks for Large Language Models
ARCADE: Automatic Region-based Coronary Artery Disease diagnostics using x-ray angiography imagEs Dataset Phase 2 consist of two folders with 300 images in each of them as well as annotations.
Object pose estimation is crucial for robotic applications and augmented reality. To provide a benchmark with high-quality ground truth annotations to the community, we introduce a multimodal dataset for category-level object pose estimation with photometrically challenging objects termed PhoCaL. PhoCaL comprises 60 high quality 3D models of household objects over 8 categories including highly reflective, transparent and symmetric objects. We developed a novel robot-supported multi-modal (RGB, depth, polarisation) data acquisition and annotation process. It ensures sub-millimeter accuracy of the pose for opaque textured, shiny and transparent objects, no motion blur and perfect camera synchronisation.
The MusicBench dataset is a music audio-text pair dataset that was designed for text-to-music generation purpose and released along with Mustango text-to-music model. MusicBench is based on the MusicCaps dataset, which it expands from 5,521 samples to 52,768 training and 400 test samples!
CGIQA-6k database is a large-scale, in-the-wild CGIQA database consisting of 6,000 CGIs. The CGIQA-6k dataset is a large-scale, in-the-wild database for Computer Graphics Image Quality Assessment (CGIQA). It consists of 6,000 Computer Graphics Images (CGIs). These CGIs are artificially generated visuals created using computer programs and are prevalent across various platforms, from video games to streaming media.
This benchmark includes 11 image classification datasets that were used to evaluate the transferability of metrics. Datasets include FGVC Aircraft, Caltech101, Stanford Cars, CIFAR-10, CIFAR-100, DTD, Oxford-102, Flowers, Food-101, Oxford-IIIT Pets, SUN397, and VOC2007 . Please refer to SFDA (https://github.com/TencentARC/SFDA) or ETran (https://github.com/mgholamikn/ETran/tree/main) for further details about the benchmark.
The MegaVeridicality Dataset is a collection of ordinal veridicality judgments as well as ordinal acceptability judgments for 773 clause-embedding verbs of English. It was created by Aaron Steven White and Kyle Rawlins. The dataset is used to study the complex array of inferences that different open-class lexical items trigger. For example, it examines why certain sentences give rise to specific inferences while structurally identical sentences trigger different inferences. The dataset also investigates how lexically triggered inferences are conditioned by surprising aspects of the syntactic context in which a word occurs. It provides a detailed description of item construction, and collection methods, and discusses how to use a dataset on this scale to address questions in linguistic theory.
The sStoryCloze refers to the Spoken StoryCloze benchmark, which is a spoken version of the StoryCloze dataset. The StoryCloze dataset consists of five-sentence commonsense stories, where the task is to predict the ending of a story given the first four sentences and two possible endings, one of which is the correct ending and the other is a distractor. The sStoryCloze evaluates the model's capabilities to capture fine-grained causal and temporal commonsense relations in spoken language. It assesses the model's ability to generate coherent and contextually appropriate continuations given a spoken prompt. The dataset is used to evaluate the performance of SpeechLMs in understanding and generating spoken narratives.
ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. The dataset provides discriminative utterances for a total of 36,391 shapes, across 30 object classes. Overall, ShapeTalk contains 73,799 distinct contexts, and a total of 536,596 utterances
Egocentric motion capture dataset
WebLINX is a large-scale benchmark of 100K interactions across 2300 expert demonstrations of conversational web navigation. It covers a broad range of patterns on over 150 real-world websites and can be used to train and evaluate agents in diverse scenarios.
A public data set of walking full-body kinematics and kinetics in individuals with Parkinson’s disease
The LastLetterConcat dataset is a collection of word concatenations formed by taking the last letters of individual words and joining them together. Each entry in the dataset consists of a question and an answer, where the answer is the result of concatenating the last letters of specific words. Here are some examples:
We introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises of a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
Recent advancements in large language models (LLMs) have led to their adoption across various applications, notably in combining LLMs with external content to generate responses. These applications, however, are vulnerable to indirect prompt injection attacks, where malicious instructions embedded within external content compromise LLM's output, causing their responses to deviate from user expectations. Despite the discovery of this security issue, no comprehensive analysis of indirect prompt injection attacks on different LLMs is available due to the lack of a benchmark. Furthermore, no effective defense has been proposed. We introduce the first benchmark of indirect prompt injection attack, BIPIA, to measure the robustness of various LLMs and defenses against indirect prompt injection attacks. We hope that our benchmark and defenses can inspire future work in this important area.
Enhancing the robustness of vision algorithms in real-world scenarios is challenging. One reason is that existing robustness benchmarks are limited, as they either rely on synthetic data or ignore the effects of individual nuisance factors. We introduce OOD-CV, a benchmark dataset that includes out-of-distribution examples of 10 object categories in terms of pose, shape, texture, context and the weather conditions, and enables benchmarking models for image classification, object detection, and 3D pose estimation. In addition to this novel dataset, we contribute extensive experiments using popular baseline methods, which reveal that: 1. Some nuisance factors have a much stronger negative effect on the performance compared to others, also depending on the vision task. 2. Current approaches to enhance robustness have only marginal effects, and can even reduce robustness. 3. We do not observe significant differences between convolutional and transformer architectures. We believe our datase
Semantic segmentation of drone images is critical for various aerial vision tasks as it provides essential seman- tic details to understand scenes on the ground. Ensuring high accuracy of semantic segmentation models for drones requires access to diverse, large-scale, and high-resolution datasets, which are often scarce in the field of aerial image processing. While existing datasets typically focus on urban scenes and are relatively small, our Varied Drone Dataset (VDD) addresses these limitations by offering a large-scale, densely labeled collection of 400 high-resolution images spanning 7 classes. This dataset features various scenes in urban, industrial, rural, and natural areas, captured from different camera angles and under diverse lighting conditions.
Multimodal Brain Tumor Segmentation Challenge 2018