3,148 machine learning datasets
GQNLI-FR is a manually translated French version of the GQNLI challenge dataset, originally written in English.
PatternCom is a composed image retrieval benchmark based on PatternNet, a large-scale high-resolution remote sensing image retrieval dataset with 38 classes, each containing 800 images of 256×256 pixels. In PatternCom, selected classes are depicted in query images, and each query image is paired with a query text that defines an attribute relevant to that class. For instance, query images of "swimming pools" are combined with text queries defining "shape" as "rectangular", "oval", or "kidney-shaped". In total, PatternCom covers six attributes, each spanning up to four different classes, with two to five values per class. The number of positives per query ranges from 2 to 1,345, and there are more than 21k queries in total.
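For orientation, here is a minimal sketch of how one such composed query could be represented in Python; the field names and file path are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class PatternComQuery:
    """Illustrative structure for a PatternCom composed query (assumed schema)."""
    query_image: str   # path to a PatternNet image, e.g. a "swimming pool" tile
    attribute: str     # attribute named by the query text, e.g. "shape"
    value: str         # attribute value to retrieve, e.g. "kidney-shaped"

# Example: retrieve swimming pools whose shape is kidney-shaped.
q = PatternComQuery(
    query_image="patternnet/swimming_pool/swimmingpool042.jpg",  # hypothetical path
    attribute="shape",
    value="kidney-shaped",
)
print(f"Find images of the same class as {q.query_image} "
      f"with {q.attribute} = {q.value}")
```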
This dataset was gathered during the Vid2RealHRI study of humans' perception of robot intelligence in the context of an incidental human-robot encounter. It contains participants' questionnaire responses to four video study conditions: Baseline, Verbal, Body language, and Body language + Verbal. The videos depict a pedestrian incidentally encountering a quadruped robot trying to enter a building; depending on the condition, the robot uses verbal commands, body language, or both to ask the pedestrian for help. The differences between conditions were manipulated through the robot's verbal and expressive-movement functionalities.
The Linguistic Benchmark (JSON), consisting of 30 questions, was developed to be easy for human adults to answer but challenging for LLMs. It assesses the well-documented limitations of LLMs across domains such as spatial reasoning, linguistic understanding, relational thinking, mathematical reasoning, knowledge of basic scientific concepts, and common sense. The benchmark is a useful tool for gauging the current capabilities of LLMs, with questions probing model performance in several key domains where they have known limitations.
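Since the benchmark ships as JSON, a minimal loading-and-scoring sketch might look as follows; the file name and the "question"/"answer" keys are assumptions to be checked against the dataset's actual schema:

```python
import json

def load_benchmark(path: str) -> list[dict]:
    """Load the benchmark's question list from a JSON file (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def evaluate(items: list[dict], ask_llm) -> float:
    """Fraction of questions answered exactly right (naive string matching)."""
    correct = sum(
        ask_llm(item["question"]).strip().lower() == item["answer"].strip().lower()
        for item in items
    )
    return correct / len(items)

# Usage with any callable mapping a question string to an answer string:
# score = evaluate(load_benchmark("linguistic_benchmark.json"), my_model)
```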
The MoToMQA (Multi-Order Theory of Mind Question & Answer) benchmark is a test suite introduced to examine the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner.
ACCORD CSQA is an extension of the popular CommonsenseQA (CSQA) dataset using ACCORD, a scalable framework for disentangling the commonsense grounding and reasoning abilities of large language models (LLMs) through controlled, multi-hop counterfactuals. ACCORD closes the measurability gap between commonsense and formal reasoning tasks for LLMs: a detailed understanding of LLMs' commonsense reasoning abilities lags far behind our understanding of their formal reasoning abilities, because commonsense benchmarks are difficult to construct in a rigorously quantifiable manner. Specifically, prior commonsense reasoning benchmarks and datasets are limited to one- or two-hop reasoning, or include an unknown (i.e., non-measurable) number of reasoning hops and/or distractors. Arbitrary scalability via compositional construction is also typical of formal reasoning tasks but lacking in commonsense reasoning. Finally, most prior commonsense benchmarks either are limited to a si…
This dataset contains prompts designed to evaluate and challenge the safety mechanisms of generative text-to-image models, with a particular focus on identifying prompts likely to produce images containing nudity. Introduced in the ICML 2024 paper Prompting4Debugging: Red-Teaming Text-to-Image Diffusion Models by Finding Problematic Prompts, the dataset is not specific to any single approach or model; it is intended to test various mitigation measures against inappropriate content generation in models such as Stable Diffusion. The dataset is intended for research purposes only.
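A minimal red-teaming loop over such prompts might look as follows, assuming the diffusers library and a Stable Diffusion 1.x checkpoint with its safety checker enabled; the model id and prompt list are placeholders, not the dataset's actual loading procedure:

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint; substitute the model under test.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompts = ["<prompt from the dataset>"]  # placeholder for the real prompt list
for prompt in prompts:
    out = pipe(prompt)
    # nsfw_content_detected is populated by the pipeline's built-in safety
    # checker (one boolean per generated image) when it is enabled.
    if any(out.nsfw_content_detected):
        print(f"Safety checker flagged: {prompt!r}")
```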
The BS-Objaverse 660k dataset is a set of GPT4-Vision-generated multi-modal captions. It is constructed to enhance modality alignment and fine-grained visual concept perception by describing detailed information about the shape and texture of Objaverse 3D objects.
Dataset composed of two main parts:
1. Material characterization of a metal (from lab tests)
A collection of free fonts in the DaFont style, spanning bold, italic, cursive, futuristic, and other typefaces.
A human-refined dataset of OpenAPI definitions based on the APIs.guru OpenAPI directory.
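A minimal sketch of inspecting one definition from such a dataset, assuming a standard OpenAPI JSON document (APIs.guru also serves YAML); the file name is a placeholder:

```python
import json

# HTTP verbs allowed as keys of an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

with open("openapi.json", encoding="utf-8") as f:
    spec = json.load(f)

# Print the API title, then every operation as "METHOD /path".
print(spec.get("info", {}).get("title"))
for path, item in spec.get("paths", {}).items():
    for method in item:
        if method in HTTP_METHODS:
            print(f"{method.upper():7s} {path}")
```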
A dataset for evaluating the physical reasoning capabilities of household agents, built from human annotations of which object configurations people prefer for accomplishing a household task.
RoomSpace is a benchmark designed to evaluate language models on spatial reasoning tasks that demand spatial-relation knowledge and multi-hop reasoning. It encompasses a comprehensive range of qualitative spatial relationships, including topological, directional, and distance relations. These relationships are presented from various viewpoints, with differing levels of granularity and density of relational constraints to simulate real-world complexity, promoting a more accurate assessment of language models' spatial reasoning capabilities.
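As a toy illustration of the kind of qualitative directional relation such questions target, here is a sketch that derives a relation from 2-D object centers; the coordinates and the four-sector relation model are illustrative assumptions, not the benchmark's actual representation:

```python
def directional_relation(a: tuple[float, float], b: tuple[float, float]) -> str:
    """Return the dominant compass direction of object a relative to object b."""
    dx, dy = a[0] - b[0], a[1] - b[1]
    if abs(dx) >= abs(dy):
        return "east of" if dx > 0 else "west of"
    return "north of" if dy > 0 else "south of"

# Example: a chair at (2, 1) relative to a table at (0, 0).
print("chair is", directional_relation((2.0, 1.0), (0.0, 0.0)), "the table")
```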
Large Language Models (LLMs) have the potential to enhance Agent-Based Modeling by better representing complex interdependent cybersecurity systems, improving cybersecurity threat modeling and risk management. Evaluating LLMs in this context is crucial for legal compliance and effective application development. Existing LLM evaluation frameworks often overlook the human factor and cognitive computing capabilities essential for interdependent cybersecurity. To address this gap, I propose OllaBench, a novel evaluation framework that assesses LLMs' accuracy, wastefulness, and consistency in answering scenario-based information security compliance and non-compliance questions.
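A minimal sketch of two of the metrics this evaluation implies, accuracy over scenario questions and consistency across repeated runs of the same question; the record layout is an assumption, not OllaBench's actual schema:

```python
from collections import defaultdict

def score(records: list[dict]) -> tuple[float, float]:
    """records: dicts with "question_id", "gold", "answer" (assumed layout);
    repeated question_ids represent repeated runs of the same question."""
    runs = defaultdict(list)
    correct = 0
    for r in records:
        runs[r["question_id"]].append(r["answer"])
        correct += r["answer"] == r["gold"]
    accuracy = correct / len(records)
    # A question is consistent when all runs produce one unique answer.
    consistency = sum(len(set(a)) == 1 for a in runs.values()) / len(runs)
    return accuracy, consistency

acc, cons = score([
    {"question_id": 1, "gold": "comply", "answer": "comply"},
    {"question_id": 1, "gold": "comply", "answer": "not comply"},
])
print(f"accuracy={acc:.2f} consistency={cons:.2f}")
```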
Mediapi-RGB is a bilingual corpus of French Sign Language (LSF) and written French in the form of subtitled videos, accompanied by complementary data (various representations, segmentation, vocabulary, etc.). It can be used in academic research for a wide range of tasks, such as training or evaluating sign language (SL) extraction, recognition or translation models.
EUROPA is a dataset designed for training and evaluating multilingual keyphrase generation models in the legal domain. It consists of legal judgments from the Court of Justice of the European Union (CJEU) and includes instances in all 24 official EU languages.
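A minimal sketch of scoring keyphrase generation on a EUROPA-style instance with exact-match F1 after simple normalisation; the example keyphrases are invented for illustration:

```python
def keyphrase_f1(predicted: list[str], gold: list[str]) -> float:
    """Exact-match F1 between predicted and gold keyphrase sets."""
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    if not pred or not ref:
        return 0.0
    tp = len(pred & ref)
    precision, recall = tp / len(pred), tp / len(ref)
    return 2 * precision * recall / (precision + recall) if tp else 0.0

# One true positive out of two predictions and two references -> F1 = 0.5.
print(keyphrase_f1(["state aid", "free movement"], ["State aid", "customs union"]))
```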
Video-ChatGPT introduced the VideoInstruct100K dataset, which employs a semi-automatic annotation pipeline to generate 75K instruction-tuning QA pairs. To address the limitations of this annotation process, we present a new dataset developed through an improved annotation pipeline. Our approach improves the accuracy and quality of instruction-tuning pairs by improving keyframe extraction, leveraging SoTA large multimodal models (LMMs) for detailed descriptions, and refining the instruction generation strategy.
We present the World Wide Dishes dataset, which seeks to assess disparities in the representation of food through a decentralised data-collection effort. Perspectives were gathered directly from people with a wide variety of backgrounds around the globe, with the aim of creating a dataset of their insights into their own experiences of foods relevant to their cultural, regional, national, or ethnic lives.
This task stems from the observation that text embedded in images is intrinsically different from both common visual elements and natural language, due to the need to align three modalities: vision, text, and text embedded in images.