Datasets

3,148 machine learning datasets

USCOCO (Unexpected Situations of Common Objects in Context)

A test set of grammatically correct sentences and layouts (visual “imagined” situations), called Unexpected Situations of Common Objects in Context (USCOCO), describing compositions of entities and relations that are unlikely to be found in MS COCO.

1 paper · 0 benchmarks · Texts

UK Key Stage Readability (UK Key Stage Readability for English Texts)

Education is increasingly data-driven, and the ability to analyse and adapt educational materials quickly and effectively is important for keeping materials contemporary and interesting. These approaches also have the potential to personalise learning experiences. One of the challenges in this domain is aligning new literature with the appropriate educational stages. This dataset aims to contribute to alleviating this knowledge gap.

1 paper · 2 benchmarks · Texts

WOS Hierarchical Text Classification

The WOS Hierarchical Text Classification datasets are three variants created from Web of Science (WOS) title and abstract data, categorised into a hierarchical, multi-label class structure. The sampling and filtering methodology aimed to create well-balanced class distributions (at chosen hierarchical levels). In addition, the WOS_JTF variant was created to contain only publication data whose class assignments result in class instances that are semantically more similar.

1 paper · 0 benchmarks · Texts

BN-AuthProf (Bangla Author Profiling Dataset)

Although research on author profiling has progressed considerably for high-resource languages, it is still in its infancy for low-resource languages such as Bengali. This repository contains our Bangla Author Profiling Dataset (BN-AuthProf). The primary objective is to introduce and benchmark the performance of machine learning approaches on age and gender classification tasks using people's social media statuses.

1 paper · 6 benchmarks · Texts

MIPD (Manipulation and Intention In a Novel Corpus of Polish Disinformation)

A novel corpus of 15,356 Polish web articles, including articles identified as containing disinformation. Our dataset enables a multifaceted understanding of disinformation. We present a distinctive multilayered methodology for annotating disinformation in texts. What sets our corpus apart is its focus on uncovering hidden intent and manipulation in disinformative content. A team of experts annotated each article with multiple labels indicating both disinformation creators’ intents and the manipulation techniques employed.

1 paper · 0 benchmarks · Texts

SpaceSGG

Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs' inherent capabilities to the SGG task, we introduce a two-stage training paradigm. Experiments show that

1 paper · 0 benchmarks · Images, Texts

CURE (A dataset for Clinical Understanding & Retrieval Evaluation)

CURE is a retrieval dataset with one monolingual and two cross-lingual conditions, with splits spanning ten medical domains. Queries in CURE are natural language questions formulated by healthcare providers. They express the information needs of practitioners consulting academic literature in the course of their duties. Queries are available in English, French and Spanish. The corpus is constructed by mining an index of English passages extracted from biomedical academic articles.

1 paper · 0 benchmarks · Texts

IllusionMNIST_test

IllusionMNIST_test is a generated dataset derived from the MNIST dataset. It introduces a novel element of pareidolia, a phenomenon where patterns, often faces, are perceived in random or abstract stimuli. The dataset contains 11 classes: the original 10 digits from MNIST, and an additional "No Illusion" class. It includes 1,219 samples, all synthetically created rather than real-world images.

1 paper · 0 benchmarks · Images, Texts

IllusionFashionMNIST_test

IllusionFashionMNIST_test is a generated dataset derived from the FashionMNIST dataset. It incorporates the concept of pareidolia, a phenomenon where patterns, often faces, are perceived in random or abstract stimuli. The dataset contains 11 classes: the original 10 classes from FashionMNIST, and an additional "No Illusion" class. It includes 1,267 samples, all synthetically created rather than real-world images.

1 paper · 0 benchmarks · Images, Texts

IllusionAnimals_test

IllusionAnimals_test is a generated dataset based on a synthetic collection of animal images, including 10 animal classes: cat, dog, pigeon, butterfly, elephant, horse, deer, snake, fish, and rooster. Additionally, it includes a "No Illusion" class, bringing the total number of classes to 11. The dataset contains 1,100 samples, all created synthetically rather than derived from real-world images.

1 paper · 0 benchmarks · Images, Texts

IllusionChar_test

IllusionChar_test is a generated dataset containing 3,300 samples of images that feature sequences of 3 to 5 random characters. Unlike classification-focused datasets, this dataset is designed for tasks that require reasoning about patterns, sequences, or illusions within the character sequences. All images are synthetically generated, and no real-world data is included.

1 paper · 0 benchmarks · Images, Texts

Cleaned_Lang8

Lang-8 Preprocessed Dataset (for GED)

1 paper · 0 benchmarks · Texts

Pentachromatic Cultural Palette Dataset

The Pentachromatic Cultural Palette Dataset is characterized by unique cultural semantics and values. It is constructed through a carefully designed multi-step data synthesis process based on the PRISM dataset, focusing on the cultural perspectives of different continents (Africa, Asia, Europe, America, and Oceania).

1 paper · 0 benchmarks · Texts

BlendNet

The BlendNet dataset contains 12k samples. To balance cost savings with data quality and scale, we manually annotated 2k samples and used GPT-4o to annotate the remaining 10k samples.

1 paper · 0 benchmarks · 3D, 3D meshes, CAD, Texts

CADBench

CADBench is a comprehensive benchmark to evaluate the ability of LLMs to generate CAD scripts. It contains 500 simulated data samples and 200 data samples collected from online forums.

1 paper · 0 benchmarks · 3D, 3D meshes, CAD, Texts

MMSQL (Multi-Turn Multi-Type Text-to-SQL test suite)

A dataset for training and testing on various problem types and multi-turn Q&A scenarios, including a training set, a test set, and test scripts.

1 paper · 2 benchmarks · Texts

ConsisID-preview-Data

1 paper · 0 benchmarks · Texts, Videos

ChronoMagic-ProH

1 paper · 0 benchmarks · Texts, Videos

MMComposition

MMCOMPOSITION is a high-quality benchmark specifically designed to comprehensively evaluate the compositionality of pre-trained Vision-Language Models (VLMs) across three main dimensions—VL compositional perception, reasoning, and probing—which are further divided into 13 distinct categories of questions. While previous benchmarks have mainly focused on text-to-image retrieval, single-choice questions, and open-ended text generation, MMCOMPOSITION introduces a more diverse and challenging set of 4,342 tasks covering both single-image and multi-image scenarios, as well as single-choice and indefinite-choice formats. This expanded range of tasks aims to capture the complex interplay between vision and language more effectively, surpassing earlier benchmarks such as ARO and Winoground by providing a more comprehensive and in-depth assessment of models’ cross-modal compositional capabilities.

1 paper · 0 benchmarks · Images, Texts

Financial Dynamic Knowledge Graph

FinDKG (The Global Financial Dynamic Knowledge Graph Dataset) is an open-source dataset focused on creating a temporally resolved Financial Dynamic Knowledge Graph. Designed to bridge the gap in industry-specific knowledge graphs, particularly in the financial sector, FinDKG provides a high-touch, temporally aware representation of global economic and market dynamics. This repository includes comprehensive details about the dataset, methodology, and schema, aiming to facilitate academic research and actionable insights in global financial markets.

1 paper · 0 benchmarks · Financial, Graphs, Texts
