3,148 machine learning datasets
BUSTER: the BUSiness Transaction Entity Recognition dataset.
We propose a novel task, termed Relation Conversation (ReC), which unifies the formulation of text generation, object localization, and relation comprehension. Based on the unified formulation, we construct the AS-V2 dataset, which consists of 127K high-quality relation conversation samples, to unlock the ReC capability for Multi-modal Large Language Models (MLLMs).
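Purely for illustration, the sketch below shows one way a relation conversation sample might be represented in code, with phrases in the generated answer grounded to image regions; the field names (text, spans, box) and the example values are assumptions for this sketch, not the actual AS-V2 schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class GroundedSpan:
    start: int                               # character offset into `text`
    end: int
    box: Tuple[float, float, float, float]   # normalized (x1, y1, x2, y2)

@dataclass
class RelationConversationSample:
    image_path: str
    question: str
    text: str                                # generated answer text
    spans: List[GroundedSpan]                # phrases in `text` grounded to image regions

    def grounded_phrases(self) -> List[Tuple[str, Tuple[float, float, float, float]]]:
        """Return each grounded phrase together with its bounding box."""
        return [(self.text[s.start:s.end], s.box) for s in self.spans]

# Toy example: "the man" and "the bicycle" are localized; the relation links them.
sample = RelationConversationSample(
    image_path="example.jpg",
    question="What is the man doing?",
    text="the man is riding the bicycle",
    spans=[GroundedSpan(0, 7, (0.10, 0.20, 0.45, 0.90)),
           GroundedSpan(18, 29, (0.30, 0.55, 0.80, 0.95))],
)
print(sample.grounded_phrases())
```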
IllusionVQA is a Visual Question Answering (VQA) dataset with two sub-tasks. The first task tests comprehension on 435 instances in 12 optical illusion categories. Each instance consists of an image with an optical illusion, a question, and 3 to 6 options, one of which is the correct answer. We refer to this task as IllusionVQA-Comprehension. The second task tests how well VLMs can differentiate geometrically impossible objects from ordinary objects when two objects are presented side by side. The task consists of 1,000 instances following a similar format to the first task. We refer to this task as IllusionVQA-Soft-Localization.
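As a rough sketch of how such multiple-choice instances are typically scored, the snippet below computes accuracy over a handful of records; the field names (image, question, options, answer) and the toy instances are assumptions for illustration and may differ from the released format.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class IllusionVQAInstance:
    image: str            # path to the illusion image
    question: str
    options: List[str]    # 3 to 6 candidate answers
    answer: str           # the single correct option

def accuracy(instances: List[IllusionVQAInstance], predictions: List[str]) -> float:
    """Fraction of instances where the predicted option matches the gold answer."""
    correct = sum(pred == inst.answer for inst, pred in zip(instances, predictions))
    return correct / len(instances)

# Toy usage with two dummy instances and model predictions.
data = [
    IllusionVQAInstance("impossible_cube.png", "Is this object physically realizable?",
                        ["Yes", "No", "Cannot tell"], "No"),
    IllusionVQAInstance("grid.png", "Which line is longer?",
                        ["Top", "Bottom", "Equal"], "Equal"),
]
print(accuracy(data, ["No", "Top"]))  # 0.5
```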
xMIND is an open, large-scale multilingual news dataset for multi- and cross-lingual news recommendation. xMIND is derived from the English MIND dataset using open-source neural machine translation (i.e., NLLB 3.3B).
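Since xMIND was produced by machine-translating MIND with NLLB, the sketch below shows how one might translate an English news title with the open NLLB 3.3B checkpoint via Hugging Face transformers; the exact model id, language codes, and generation settings here are illustrative assumptions, not necessarily the pipeline used to build xMIND.

```python
from transformers import pipeline

# NLLB uses FLORES-200 language codes, e.g. eng_Latn -> swh_Latn (Swahili).
translator = pipeline(
    "translation",
    model="facebook/nllb-200-3.3B",   # large checkpoint; a distilled 600M variant also exists
    src_lang="eng_Latn",
    tgt_lang="swh_Latn",
)

title = "Scientists discover a new exoplanet in a nearby star system"
print(translator(title, max_length=128)[0]["translation_text"])
```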
The dataset comprises 4,500 question-answer pairs collected from trusted medical sources, with at least one and at most four unique paraphrased answers per question.
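A minimal sketch of how a record with one to four paraphrased reference answers could be checked against a model output; the record layout, field names, and the exact-match scoring are assumptions for illustration only.

```python
from typing import Dict, List

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings still match."""
    return " ".join(text.lower().split())

def matches_any_reference(prediction: str, record: Dict) -> bool:
    """True if the prediction exactly matches any of the 1-4 paraphrased answers."""
    refs: List[str] = record["answers"]
    assert 1 <= len(refs) <= 4, "each question has at least one and at most four answers"
    return normalize(prediction) in {normalize(r) for r in refs}

record = {
    "question": "What is the recommended first-line treatment for mild hypertension?",
    "answers": ["Lifestyle changes", "Lifestyle modification such as diet and exercise"],
}
print(matches_any_reference("lifestyle changes", record))  # True
```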
MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts (i.e. images). It contains 3,548 questions paired with 6,620 images, derived from real-world programming challenges across 10 code competition websites, with Python solutions and tests provided. The dataset emphasizes the extreme demand for reasoning abilities, the interwoven nature of textual and visual contents, and the occurrence of questions containing multiple images.
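Because MMCode pairs problems with Python solutions and tests, evaluation usually amounts to running a generated program against the provided test cases. The sketch below runs a candidate solution on stdin/stdout-style tests in a subprocess; the test-case format and pass-rate scoring are assumptions for illustration, not MMCode's exact harness.

```python
import subprocess
import sys
from typing import List, Tuple

def run_tests(solution_code: str, tests: List[Tuple[str, str]], timeout: float = 5.0) -> float:
    """Run `solution_code` on each (stdin, expected_stdout) pair and return the pass rate."""
    passed = 0
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, "-c", solution_code],
                input=stdin_text, capture_output=True, text=True, timeout=timeout,
            )
            if result.stdout.strip() == expected.strip():
                passed += 1
        except subprocess.TimeoutExpired:
            pass  # count timeouts as failures
    return passed / len(tests)

candidate = "a, b = map(int, input().split())\nprint(a + b)"
print(run_tests(candidate, [("1 2", "3"), ("10 -4", "6")]))  # 1.0
```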
Hugging Face Datasets (New!) | Website | Github Repository | arXiv e-Print
This dataset was proposed in the ICLR 2024 paper Measuring Vision-Language STEM Skills of Neural Models. Real-world problems often require solutions that combine knowledge from STEM (science, technology, engineering, and math). Unlike existing datasets, ours requires understanding multimodal vision-language information about STEM. It is one of the largest and most comprehensive datasets for this challenge, including 448 skills and 1,073,146 questions spanning all STEM subjects. Compared to existing datasets that often focus on examining expert-level ability, our dataset includes fundamental skills and questions designed based on the K-12 curriculum. We also add state-of-the-art foundation models such as CLIP and GPT-3.5-Turbo to our benchmark.
We present XHate-999, a multi-domain and multilingual evaluation dataset for abusive language detection. By aligning test instances across six typologically diverse languages, XHate-999 for the first time allows disentangling domain-transfer and language-transfer effects in abusive language detection. We conduct a series of domain- and language-transfer experiments with state-of-the-art monolingual and multilingual transformer models, setting strong baseline results and profiling XHate-999 as a comprehensive evaluation resource for abusive language detection. Finally, we show that domain and language adaptation, via intermediate masked language modeling on abusive corpora in the target language, can lead to substantially improved abusive language detection in the target language in zero-shot transfer setups.
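The intermediate masked language modeling step mentioned above can be sketched with Hugging Face transformers: continue MLM training of a multilingual encoder on unlabeled abusive-domain text in the target language, then fine-tune the adapted encoder on the labeled detection task. The model choice, placeholder corpus, and hyperparameters below are illustrative assumptions, not the paper's exact setup.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Placeholder corpus; in practice, unlabeled abusive-language text in the target language.
corpus = Dataset.from_dict({"text": ["example sentence one", "example sentence two"]})

model_name = "xlm-roberta-base"  # any multilingual encoder works for this sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-adapted", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()  # afterwards, fine-tune the adapted encoder on labeled abuse detection data
```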
Individuals increasingly engage in dialogues with Large Language Models to seek answers to their questions. When such answers are readily accessible to anyone, stimulating and preserving humans' cognitive abilities, and ensuring that humans maintain good reasoning skills, becomes crucial. This study addresses these needs by proposing hints (instead of final answers, or before giving answers) as a viable solution. We introduce a framework for automatic hint generation for factoid questions, employing it to construct TriviaHG, a novel large-scale dataset featuring 160,230 hints corresponding to 16,645 questions from the TriviaQA dataset. Additionally, we present an automatic evaluation method that measures the Convergence and Familiarity quality attributes of hints. To evaluate the TriviaHG dataset and the proposed evaluation method, we enlisted 10 individuals to annotate 2,791 hints and tasked 6 humans with answering questions using the provided hints.
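The Convergence attribute is meant to capture how well a hint narrows down the space of candidate answers. The snippet below computes only a crude proxy for that idea, the share of plausible candidates a hint rules out, and is not the evaluation method proposed with TriviaHG; the function name and candidate lists are assumptions for this sketch.

```python
from typing import List

def convergence_proxy(candidates: List[str], survivors_after_hint: List[str]) -> float:
    """Crude proxy: share of candidate answers eliminated once the hint is known.
    1.0 means the hint leaves a single candidate; 0.0 means it rules nothing out."""
    assert survivors_after_hint and set(survivors_after_hint) <= set(candidates)
    eliminated = len(candidates) - len(survivors_after_hint)
    return eliminated / max(len(candidates) - 1, 1)

candidates = ["Paris", "Lyon", "Marseille", "Nice"]
# Hint: "It is the capital and sits on the Seine." -> only one candidate survives.
print(convergence_proxy(candidates, ["Paris"]))  # 1.0
```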
Large multimodal models extend the impressive capabilities of large language models by integrating multimodal understanding abilities. However, it is not clear how well they can emulate the general intelligence and reasoning ability of humans. As recognizing patterns and abstracting concepts are key to general intelligence, we introduce PuzzleVQA, a collection of puzzles based on abstract patterns. With this dataset, we evaluate large multimodal models on abstract patterns built from fundamental concepts, including colors, numbers, sizes, and shapes. Through our experiments on state-of-the-art large multimodal models, we find that they do not generalize well to simple abstract patterns. Notably, even GPT-4V cannot solve more than half of the puzzles. To diagnose the reasoning challenges in large multimodal models, we progressively guide the models with our ground-truth reasoning explanations for visual perception, inductive reasoning, and deductive reasoning. Our systematic analysis finds that the main bottlenecks of GPT-4V are weaker visual perception and inductive reasoning abilities.
The dermatology differential diagnoses (ddx) dataset for skin condition classification includes expert annotations and model predictions for 1,947 cases. Note that no images or metadata are provided. The expert annotations come in the form of differential diagnoses, i.e., partial rankings of conditions, and there is a high level of disagreement among experts, making this a well-suited benchmark for dealing with disagreement. The data were introduced in [1] and [2].
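One simple way to quantify inter-expert disagreement in such partial rankings is to measure how much two experts' top-k differential diagnoses overlap. The sketch below does exactly that and is only an illustrative measure, not the analysis used in [1] or [2]; the example condition lists are made up.

```python
from typing import List

def topk_overlap(ranking_a: List[str], ranking_b: List[str], k: int = 3) -> float:
    """Jaccard overlap between the top-k conditions of two experts' partial rankings."""
    top_a, top_b = set(ranking_a[:k]), set(ranking_b[:k])
    return len(top_a & top_b) / len(top_a | top_b)

expert_1 = ["eczema", "psoriasis", "contact dermatitis"]
expert_2 = ["psoriasis", "tinea corporis", "eczema", "lichen planus"]
print(topk_overlap(expert_1, expert_2))  # 0.5
```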
Please refer to https://github.com/google/imageinwords/blob/main/datasets/IIW-400/README.md
CPsyCounD is a high-quality multi-turn dialogue dataset with a total of 3,134 multi-turn consultation dialogues. It covers nine representative topics and seven classic schools of psychological counseling.
Mono3DRefer samples 2,025 image frames from the original KITTI dataset and contains 41,140 expressions in total, with a vocabulary of 5,271 words.
MULTI-Benchmark is a cutting-edge benchmark for evaluating Multimodal Large Language Models (MLLMs). It is designed to test the understanding of complex tables and images, and reasoning with long context.
MUTE is the first open-source Bengali hateful meme dataset, consisting of around 4,200 memes annotated with two labels: hate and not-hate.
Ego4D-HCap is a hierarchical video captioning dataset comprising a three-tier hierarchy of captions: short clip-level captions, medium-length video-segment descriptions, and long-range video-level summaries. To construct Ego4D-HCap, we leverage Ego4D, the largest publicly available egocentric video dataset. While Ego4D comes with time-stamped atomic captions and video-segment descriptions spanning up to 5 minutes, it lacks video-level summaries for longer durations. To address this, we annotate a subset of 8,267 Ego4D videos with long-range video summaries, each spanning up to two hours. This enhancement yields a three-level hierarchy of captions.
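A minimal sketch of the three-tier caption hierarchy as nested Python types, assuming clip captions aggregate into segment descriptions, which in turn aggregate into a single video-level summary; the type and field names are illustrative and not the released annotation schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ClipCaption:
    start_s: float      # clip start time in seconds
    end_s: float
    text: str           # short, atomic caption

@dataclass
class SegmentDescription:
    start_s: float
    end_s: float        # segments span up to ~5 minutes
    text: str
    clips: List[ClipCaption]

@dataclass
class VideoSummary:
    video_id: str
    summary: str        # long-range summary; a video may span up to two hours
    segments: List[SegmentDescription]

video = VideoSummary(
    video_id="ego4d_000123",
    summary="The camera wearer prepares a meal, eats with friends, and cleans the kitchen.",
    segments=[SegmentDescription(0.0, 300.0, "The camera wearer chops vegetables and cooks.",
                                 clips=[ClipCaption(0.0, 8.0, "C picks up a knife.")])],
)
print(len(video.segments), video.segments[0].clips[0].text)
```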
A screenshot-caption dataset containing 135k pairs of screenshots and captions extracted from Google Play.
A dataset dedicated to multi-object, multi-actor activity parsing.