3,148 machine learning datasets
A test set of grammatically correct sentences and layouts (visual “imagined” situations), called Unexpected Situations of Common Objects in Context (USCOCO), describing compositions of entities and relations that are unlikely to be found in MS COCO.
Education is increasingly data-driven, and the ability to analyse and adapt educational materials quickly and effectively is important for keeping materials contemporary and interesting. These approaches also have the potential to personalise learning experiences. One of the challenges in this domain is aligning new literature with the appropriate educational stages. This dataset aims to contribute to alleviating this knowledge gap.
The WOS Hierarchical Text Classification datasets are three variants created from Web of Science (WOS) title and abstract data, categorised into a hierarchical, multi-label class structure. The sampling and filtering methodology was designed to create well-balanced class distributions (at chosen hierarchical levels). In addition, the WOS_JTF variant was created to contain only publication data whose class assignments result in class instances that are semantically more similar.
Although research on author profiling has progressed considerably for high-resource languages, it is still in its infancy for low-resource languages such as Bengali. This repository contains our Bangla Author Profiling Dataset (BN-AuthProf). The primary objective is to introduce and benchmark the performance of machine learning approaches on age and gender classification tasks from people's social media statuses.
A novel corpus of 15,356 Polish web articles, including articles identified as containing disinformation. Our dataset enables a multifaceted understanding of disinformation. We present a distinctive multilayered methodology for annotating disinformation in texts. What sets our corpus apart is its focus on uncovering hidden intent and manipulation in disinformative content. A team of experts annotated each article with multiple labels indicating both disinformation creators’ intents and the manipulation techniques employed.
Scene Graph Generation (SGG) converts visual scenes into structured graph representations, providing deeper scene understanding for complex vision tasks. However, existing SGG models often overlook essential spatial relationships and struggle with generalization in open-vocabulary contexts. To address these limitations, we propose LLaVA-SpaceSGG, a multimodal large language model (MLLM) designed for open-vocabulary SGG with enhanced spatial relation modeling. To train it, we collect the SGG instruction-tuning dataset, named SpaceSGG. This dataset is constructed by combining publicly available datasets and synthesizing data using open-source models within our data construction pipeline. It combines object locations, object relations, and depth information, resulting in three data formats: spatial SGG description, question-answering, and conversation. To enhance the transfer of MLLMs' inherent capabilities to the SGG task, we introduce a two-stage training paradigm. Experiments show that
CURE is a retrieval dataset with one monolingual and two cross-lingual conditions, with splits spanning ten medical domains. Queries in CURE are natural language questions formulated by healthcare providers. They express the information needs of practitioners consulting academic literature in the course of their duties. Queries are available in English, French and Spanish. The corpus is constructed by mining an index of English passages extracted from biomedical academic articles.
IllusionMNIST_test Dataset Characteristics IllusionMNIST_test is a generated dataset derived from the MNIST dataset. It introduces a novel element of pareidolia—a phenomenon where patterns, often faces, are perceived in random or abstract stimuli. The dataset contains 11 classes: the original 10 digits from MNIST, and an additional "No Illusion" class. It includes 1,219 samples, all synthetically created rather than real-world images.
IllusionFashionMNIST_test Dataset Characteristics IllusionFashionMNIST_test is a generated dataset derived from the FashionMNIST dataset. It incorporates the concept of pareidolia—a phenomenon where patterns, often faces, are perceived in random or abstract stimuli. The dataset contains 11 classes: the original 10 classes from FashionMNIST, and an additional "No Illusion" class. It includes 1,267 samples, all synthetically created rather than real-world images.
IllusionAnimals_test Dataset Characteristics IllusionAnimals_test is a generated dataset based on a synthetic collection of animal images, including 10 animal classes: cat, dog, pigeon, butterfly, elephant, horse, deer, snake, fish, and rooster. Additionally, it includes a "No Illusion" class, bringing the total number of classes to 11. The dataset contains 1,100 samples, all created synthetically rather than derived from real-world images.
IllusionChar_test Dataset Characteristics IllusionChar_test is a generated dataset containing 3,300 samples of images that feature sequences of 3 to 5 random characters. Unlike classification-focused datasets, this dataset is designed for tasks that require reasoning about patterns, sequences, or illusions within the character sequences. All images are synthetically generated, and no real-world data is included.
Lang-8 Preprocessed Dataset (for GED):
The Pentachromatic Cultural Palette Dataset is characterized by unique cultural semantics and values. It is constructed through a carefully designed multi-step data synthesis process based on the PRISM dataset, focusing on the cultural perspectives of different continents (Africa, Asia, Europe, America, and Oceania).
📚 BlendNet The dataset contains 12k samples. To balance cost savings with data quality and scale, we manually annotated 2k samples and used GPT-4o to annotate the remaining 10k samples.
📚 CADBench CADBench is a comprehensive benchmark to evaluate the ability of LLMs to generate CAD scripts. It contains 500 simulated data samples and 200 data samples collected from online forums.
A dataset for training and testing in various problem types and multi-turn Q&A scenarios, including a training set, test set, and test scripts.
MMCOMPOSITION is a high-quality benchmark specifically designed to comprehensively evaluate the compositionality of pre-trained Vision-Language Models (VLMs) across three main dimensions—VL compositional perception, reasoning, and probing—which are further divided into 13 distinct categories of questions. While previous benchmarks have mainly focused on text-to-image retrieval, single-choice questions, and open-ended text generation, MMCOMPOSITION introduces a more diverse and challenging set of 4,342 tasks covering both single-image and multi-image scenarios, as well as single-choice and indefinite-choice formats. This expanded range of tasks aims to capture the complex interplay between vision and language more effectively, surpassing earlier benchmarks such as ARO and Winoground by providing a more comprehensive and in-depth assessment of models’ cross-modal compositional capabilities.
FinDKG: The Global Financial Dynamic Knowledge Graph Dataset FinDKG is an open-source dataset focused on creating a temporally-resolved Financial Dynamic Knowledge Graph. Designed to bridge the gap in industry-specific knowledge graphs, particularly in the financial sector, FinDKG provides a high-touch, temporally-aware representation of global economic and market dynamics. This repository includes comprehensive details about the dataset, methodology, and schema, aiming to facilitate academic research and actionable insights in global financial markets.