The GoogleEarth dataset is collected from Google Earth Studio and comprises 400 orbit trajectories over Manhattan and Brooklyn. Each trajectory consists of 60 images, with orbit radii ranging from 125 to 813 meters and altitudes varying from 112 to 884 meters. In addition to the images, Google Earth Studio provides camera intrinsic and extrinsic parameters, making it possible to create automated annotations for semantic and building instance segmentation.
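Because intrinsics and extrinsics are supplied per frame, automated annotation essentially reduces to projecting known 3D geometry into each image. Below is a minimal sketch of that projection with NumPy; the matrices and the point are illustrative placeholders, not values taken from the dataset.

```python
import numpy as np

# Minimal sketch: project a world point into an image using camera
# intrinsics K and extrinsics [R | t], the kind of parameters Google
# Earth Studio exports. All values are illustrative placeholders.
K = np.array([[1000.0,    0.0, 960.0],    # fx, skew, cx
              [   0.0, 1000.0, 540.0],    # fy, cy
              [   0.0,    0.0,   1.0]])
R = np.eye(3)                             # world-to-camera rotation
t = np.array([0.0, 0.0, 500.0])           # world-to-camera translation (meters)

X_world = np.array([120.0, -35.0, 80.0])  # a 3D point, e.g. a building corner

X_cam = R @ X_world + t                   # transform into the camera frame
uvw = K @ X_cam                           # apply the pinhole intrinsics
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]   # perspective divide -> pixel coordinates
print(f"pixel: ({u:.1f}, {v:.1f})")
```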
The dataset offers tag and mask annotations for image-text pairs from the CC3M validation set. Tag annotations denote words that aptly describe the relationship between the image and the corresponding text. These annotations provide valuable insights into the semantic connection between each pair's visual and textual elements.
MMFlood is a remote sensing dataset derived from Sentinel-1 (VV-VH), MapZen (DEM) and OpenStreetMap (hydrography). It provides a complete and well-rounded set of data specifically designed for flood events, focusing on three main features: worldwide distribution, manually validated annotations, and multiple modalities.
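As a rough illustration of how the three modalities could be combined into a single model input, here is a minimal sketch; the array shapes and channel order are assumptions for illustration, not the dataset's official format.

```python
import numpy as np

# Minimal sketch: stack the MMFlood modalities into one input tensor for a
# flood segmentation model. Shapes and channel order are assumptions.
H, W = 512, 512
sar_vv = np.zeros((H, W), dtype=np.float32)  # Sentinel-1 VV backscatter
sar_vh = np.zeros((H, W), dtype=np.float32)  # Sentinel-1 VH backscatter
dem    = np.zeros((H, W), dtype=np.float32)  # MapZen digital elevation model
hydro  = np.zeros((H, W), dtype=np.float32)  # OpenStreetMap hydrography mask

x = np.stack([sar_vv, sar_vh, dem, hydro], axis=0)  # (4, H, W) model input
```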
AART serves as an automated alternative to the current manual red-teaming efforts. The primary goal is to evaluate the safety of LLM generations in various application contexts.
ChronoMagic is a dataset of 2,265 metamorphic time-lapse videos, each accompanied by a detailed caption.
FAUST-partial is a 3D registration benchmark dataset created to provide a more informative evaluation of 3D registration methods. The dataset addresses two main limitations of current 3D registration benchmarks.
We introduce a new dataset for form structure understanding and key information extraction. This repository provides detailed baseline model descriptions and experimental setups to ensure our model and experiments are reproducible. We will also offer a Colab link showing how to download and use our dataset for the corresponding tasks.
Aria Synthetic Environments is a large-scale, fully simulated dataset created by Project Aria. It consists of procedurally generated interior layouts filled with 3D objects, simulated with the sensor characteristics of Aria glasses.
The BEHAVIOR-1K dataset is a comprehensive simulation benchmark for human-centered robotics. It is more grounded in actual human needs than its predecessor, BEHAVIOR-100. The 1,000 activities in the dataset come from the results of an extensive survey on "what do you want robots to do for you?"
The FineWeb dataset consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. The data processing pipeline is optimized for LLM performance and runs on datatrove, our large-scale data processing library.
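At this scale, working with FineWeb usually means streaming rather than downloading. The following is a minimal sketch using the Hugging Face datasets library; the hub identifier HuggingFaceFW/fineweb, the sample-10BT configuration, and the text field are assumptions about how the release is hosted, so check the dataset card before use.

```python
from datasets import load_dataset

# Minimal sketch: stream a small slice of FineWeb instead of downloading
# 15T tokens at once. Hub id, config name, and field names are assumptions.
fw = load_dataset("HuggingFaceFW/fineweb",
                  name="sample-10BT",
                  split="train",
                  streaming=True)

for i, doc in enumerate(fw):
    print(doc["text"][:200])   # each record is assumed to carry cleaned web text
    if i == 2:
        break
```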
Large, multimodal biometric dataset: It contains still images and videos of over 1,000 people captured at various ranges (up to 1,000 meters) and elevations (up to 400 meters) using a diverse set of cameras (commercial, military-grade, specialized).
Manual crown delineation of individual trees in two countries: Denmark and Finland.
RTL-Repo is a benchmark for evaluating LLMs' effectiveness in generating Verilog code autocompletions within large, complex codebases. It assesses a model's ability to understand and remember the entire Verilog repository context and to generate new code that is correct, relevant, logically consistent, and adherent to coding conventions and guidelines, while being aware of all components and modules in the project. This provides a realistic evaluation of a model's performance in real-world RTL design scenarios. RTL-Repo comprises over 4,000 code samples from GitHub repositories, each containing the context of all Verilog code in the repository. It offers a valuable resource for the hardware design community to assess and train LLMs for Verilog code generation in complex, multi-file RTL projects.
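To make the per-sample structure concrete, here is a hedged sketch of what one evaluation sample could look like; the field names and contents are invented for illustration and are not the benchmark's actual schema.

```python
# Minimal sketch of an RTL-Repo-style evaluation sample: repository-wide
# Verilog context plus a target completion. All names are hypothetical.
sample = {
    "repo": "github.com/example/riscv-core",              # hypothetical repository
    "context_files": {                                     # all other Verilog in the repo
        "alu.v": "module alu(...); ... endmodule",
        "regfile.v": "module regfile(...); ... endmodule",
    },
    "target_file": "core.v",
    "prefix": "module core(...);\n  // instantiate ALU\n", # code before the cursor
    "ground_truth_next_line": "  alu u_alu(.a(a), .b(b), .op(op), .y(y));",
}
```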
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
OoDIS is a benchmark dataset for anomaly instance segmentation, crucial for autonomous vehicle safety. It extends existing anomaly segmentation benchmarks to focus on the segmentation of individual out-of-distribution (OOD) objects.
The SUGARCREPE++ dataset evaluates the sensitivity of vision-language models (VLMs) and unimodal language models (ULMs) to semantic and lexical alterations. Each sample in SUGARCREPE++ consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. The original SUGARCREPE dataset provides (only) one positive and one hard negative caption for each image. Relative to the negative caption, a single positive caption can have either low or high lexical overlap, and SUGARCREPE only captures the high-overlap case. To evaluate the sensitivity of encoded semantics to lexical alteration, we require an additional positive caption with a different lexical composition. SUGARCREPE++ fills this gap by adding an additional positive caption, enabling a more thorough assessment of the sensitivity of encoded semantics to lexical alteration.
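A hedged sketch of the triplet structure and the check it implies follows; the sample values are invented for illustration, and score stands in for any image-text similarity model rather than an API shipped with the dataset.

```python
from typing import Callable

# Minimal sketch of the 3-way structure in a SUGARCREPE++-style sample.
# Values are illustrative, not real data from the benchmark.
sample = {
    "image": "cc3m_000123.jpg",                                   # hypothetical filename
    "positive_1": "A dog chases a ball across the lawn.",
    "positive_2": "Across the lawn, a ball is chased by a dog.",  # same meaning, different wording
    "hard_negative": "A ball chases a dog across the lawn.",      # lexically close, different meaning
}

def passes(sample: dict, score: Callable[[str, str], float]) -> bool:
    """A model handles this sample if both semantically equivalent captions
    outrank the hard negative for the same image."""
    img = sample["image"]
    neg = score(img, sample["hard_negative"])
    return score(img, sample["positive_1"]) > neg and score(img, sample["positive_2"]) > neg
```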
OlympicArena is a benchmark to evaluate advanced capabilities of language models across a broad spectrum of Olympic-level challenges.
This is the dataset used by the automatic sparse attention compression method MoA. It enhances the calibration dataset by integrating long-range dependencies and model alignment. MoA utilizes long-contextual datasets, which include question-answer pairs heavily dependent on long-range content.
NaturalCodeBench (NCB) is a comprehensive code benchmark designed to mirror the complexity and variety of scenarios in real coding tasks. It comprises 402 high-quality problems in Python and Java, meticulously selected from an online coding service, covering 6 different domains.
The WiGesture dataset contains data for gesture recognition and person identification in a meeting-room scenario. The dataset provides synchronised CSI (channel state information), RSSI (received signal strength indicator), and a timestamp for each sample. It can be used for research on WiFi-based human gesture recognition and person identification.
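As a rough illustration of what one synchronised sample might hold, here is a minimal sketch; the field names and array shapes are assumptions for illustration only.

```python
import numpy as np

# Minimal sketch of a single synchronised WiGesture-style record.
# Field names and shapes are assumptions, not the dataset's actual layout.
sample = {
    "timestamp": 1718000000.123,                    # capture time (seconds)
    "csi": np.zeros((3, 30), dtype=np.complex64),   # complex CSI, e.g. 3 antennas x 30 subcarriers
    "rssi": -42,                                    # received signal strength (dBm)
    "gesture_label": "push",                        # gesture class
    "person_id": 7,                                 # identity label
}
```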