19,997 machine learning datasets
Text-to-Image (T2I) generation models have recently advanced significantly. Correspondingly, many automated metrics have emerged to evaluate the image-text alignment capabilities of generative models. However, performance comparisons among these automated metrics are limited by the small size of existing datasets. Additionally, these datasets cannot assess the performance of automated metrics at a fine-grained level. In this study, we contribute EvalMuse-40K, a benchmark of 40K image-text pairs with fine-grained human annotations for image-text alignment-related tasks. In the construction process, we employ strategies such as balanced prompt sampling and data re-annotation to ensure the diversity and reliability of our benchmark.
Data from the popular Chinese online shopping platform Taobao includes behaviors like buy, add-to-cart, add-to-favorite, and pageview. The buying behavior is considered the target behavior.
Event cameras are sensors that are inspired by biological systems and specialize in capturing changes in brightness. These emerging cameras offer numerous advantages over conventional frame-based cameras, including high dynamic range, high frame rates, and extremely low power consumption. As a result, event cameras are increasingly being used in various fields, such as object detection and tracking, autonomous driving, 3D reconstruction, visual odometry, and SLAM.
A test dataset of articles from 2020, annotated following the CoNLL-2003 NER task.
RoFT-chatgpt is a variation of the RoFT dataset in which the same human prompts are continued with the gpt-3.5-turbo model. Each dataset sample consists of ten sentences, with the first part written by a human and the remainder completed by an LLM. Consequently, every sample has a boundary indicating the index of the sentence where authorship changes.
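As a rough illustration of the sample structure described above, the sketch below models one sample as a list of sentences plus a boundary index. The class name `BoundarySample`, its field names, and the convention that the boundary is the 0-based index of the first LLM-written sentence are assumptions made for this example; they are not taken from the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class BoundarySample:
    """Hypothetical container for one human-then-LLM text sample."""
    sentences: List[str]
    # Assumption: 0-based index of the first sentence written by the LLM.
    boundary: int

    def split(self) -> Tuple[List[str], List[str]]:
        """Return (human-written sentences, LLM-written sentences)."""
        return self.sentences[:self.boundary], self.sentences[self.boundary:]


# Toy sample with ten placeholder sentences; authorship changes at index 4.
sample = BoundarySample(
    sentences=[f"Sentence {i}." for i in range(10)],
    boundary=4,
)
human_part, llm_part = sample.split()
print(len(human_part), len(llm_part))
```

Under this convention, a boundary-detection model would be scored on how close its predicted index is to `sample.boundary`.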
The Insider Threat Test Dataset is a collection of synthetic insider threat test datasets that provide both background and malicious actor synthetic data.
Replication Data for: Integrating Earth Observation Data into Causal Inference: Challenges and Opportunities
MaNGA is a component of the Fourth-Generation Sloan Digital Sky Survey whose goal is to map the detailed composition and kinematic structure of nearby galaxies. MaNGA uses integral field unit (IFU) spectroscopy to measure spectra for hundreds of points within each galaxy. MaNGA’s goal is to understand the “life history” of present-day galaxies from imprinted clues of their birth and assembly, through their ongoing growth via star formation and merging, to their death from quenching at late times.
A benchmark for Human-Human Interaction (HHI) recognition as free text.
U-DIADS-Bib is a proprietary dataset developed through the collaboration of computer scientists and humanists at the University of Udine. It is composed of 200 images, 50 for each of the 4 manuscripts it covers. These handwritten books were selected in collaboration with humanist partners, considering both the complexity of their layout and the presence of significant and semantically distinguishable elements. The images of the four manuscripts were collected from the digital library Gallica. All manuscripts are Latin and Syriac Bibles published between the 6th and 12th centuries A.D.
The ADP dataset consists of over 200,000 experimental crystal structures curated from the Cambridge Structural Database (CSD). It focuses on Anisotropic Displacement Parameters (ADPs), which describe atomic thermal vibrations within crystal lattices. ADPs provide insights into material properties such as thermal motion, heat capacity, vibrational entropy, and thermal expansion.
The dataset includes time-stamped user product review behavior from January 2008 to October 2018. Each user has a sequence of product review events, and each event contains the timestamp and category of the reviewed product, with each category corresponding to an event type.
The dataset has two years of user awards on a question-answering website: each user received a sequence of badges and there are 22 different kinds of badges in total.
The dataset contains historical financial transactions, including time, category and cost fields. There are 50000 clients, 205 categories and 43.7M events. The original goal was to predict the age group of the client. In this variant of the dataset, the goal is to forecast multiple future events.
The CPU dataset was first introduced by Rahimi and Recht (2007) and later used by Balog et al. (2016).
Step right up to our AI data collection company, where we’ve got something special just for you: a unique set of American Sign Language datasets! These datasets are carefully curated to give your AI projects a powerful boost.
This repository contains a database of FEM simulations of various square crash box configurations under axial impact. The database records the effect of the structural and crash test parameters on the various crashworthiness objectives.
145k natural language and PDDL problem pairs from the Blocks World, Gripper, and Floor Tile domains.
To take advantage of the ever-increasing amount of structural data now available, we also trained Paragraph on a larger dataset. This new dataset was extracted from the Structural Antibody Database (SAbDab, Schneider et al., 2022) on March 31, 2022 and includes 1086 complexes which we divide into train, validation and test sets using a 60-20-20 split. Full details of both datasets are given in the Supplementary Information.
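To make the 60-20-20 division concrete, here is a minimal sketch of one way to split 1086 items into train, validation and test sets. The description does not say how complexes were assigned to each set, so the seeded random shuffle, the function name `split_60_20_20`, and the choice to give the rounding remainder to the test set are all assumptions for illustration only.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle items and split them 60/20/20 (hypothetical helper)."""
    shuffled = list(items)
    # Assumption: a seeded random shuffle; the source does not state the method.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 60 // 100  # integer arithmetic avoids float rounding
    n_val = n * 20 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


# Integer IDs stand in for the 1086 complexes.
train, val, test = split_60_20_20(range(1086))
print(len(train), len(val), len(test))
```

Because 1086 is not divisible by 5, the three sets cannot be exactly 60%, 20% and 20%; with this rounding choice the sizes are 651, 217 and 218.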