Datasets

19,997 machine learning datasets

StreaksYoloDataset: labeled raw astronomical images for streaks detection

StreaksYoloDataset is a set of raw astronomical images captured with smart telescopes and annotated with the positions of the streaks actually present in the images. Images were captured between March 2022 and February 2023 from the Luxembourg Greater Region using the built-in alignment and stacking features of a Stellina smart telescope, which is based on an Extra Low Dispersion doublet with an aperture of 80 mm and a focal length of 400 mm (focal ratio of f/5) and is equipped with a Sony IMX178 CMOS sensor with a resolution of 6.4 million pixels.

2 papers | 0 benchmarks

NeuroVoz (NeuroVoz: a Castillian Spanish corpus of parkinsonian speech)

The NeuroVoz dataset is a resource for computational linguistics and biomedical research, designed to support the diagnosis and understanding of Parkinson's Disease (PD) through speech analysis. It is the first publicly available dataset of its kind in Castilian Spanish, addressing a gap in the linguistic and dialectal diversity of PD research data.

2 papers | 0 benchmarks | Audio, Speech

BASEPROD (The Bardenas Semi-Desert Planetary Rover Dataset)

BASEPROD provides comprehensive rover sensor data collected over a 1.7 km traverse, accompanied by high-resolution 2D and 3D drone maps of the terrain. The dataset also includes laser-induced breakdown spectroscopy (LIBS) measurements from key sampling sites along the rover's path, as well as weather station data to contextualize environmental conditions.

2 papers | 0 benchmarks | 3D, Environment, Images, Point cloud, RGB-D, Stereo, Tabular, Time series

DenseUAV

DenseUAV is a dataset of drone and satellite perspectives collected from 14 universities in low-altitude urban scenes. Its main features are real-scene sampling, a sampling perspective perpendicular to the ground, and dense sampling. It contains a total of 3033 sampling points, comprising 9099 drone-perspective images and 18198 satellite-perspective images.

2 papers | 0 benchmarks | Images

WildDESED (Wild Domestic Environment Sound Event Detection)

WildDESED is an extension of the original DESED dataset, created to reflect various domestic scenarios by incorporating complex and unpredictable background noises. These enhancements make WildDESED a powerful resource for developing and evaluating noise-robust SED systems.

2 papers | 5 benchmarks | Audio, Texts

In-house

A point cloud dataset for place recognition, provided by PointNetVLAD. Please refer to the URL.

2 papers | 0 benchmarks

FFHNet (FFHNet Dexterous Grasping Dataset)

https://syncandshare.lrz.de/getlink/fi9EZb33KiSAJ5rLHAkhg7/ffhnet-data.zip

2 papers | 0 benchmarks

CARWC (Consolidated and refined world cup dataset)

Consolidates the World Cup 2014 (WC14) and Time-Series World Cup (TSWC) datasets and refines their homography annotations.

2 papers | 0 benchmarks

SQL-Eval

SQL-Eval is an open-source PostgreSQL evaluation dataset released by Defog, constructed based on Spider. The original link can be found at https://github.com/defog-ai/sql-eval. Our evaluation methodology is more stringent, as it compares the execution accuracy of the predicted SQL queries against the sole ground truth SQL query.
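The execution-accuracy comparison described above can be sketched roughly as follows. This is a minimal illustration, not the SQL-Eval harness itself: it uses Python's built-in sqlite3 as a stand-in for PostgreSQL, the table and queries are hypothetical toy data, and comparing result rows as an order-insensitive multiset is an assumption about what counts as a match.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Return the result rows as an order-insensitive multiset, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return Counter(rows)


def execution_match(conn, predicted_sql, gold_sql):
    """True when the predicted query returns the same rows as the single gold query."""
    gold = run_query(conn, gold_sql)
    pred = run_query(conn, predicted_sql)
    return gold is not None and pred == gold


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose execution results match."""
    if not pairs:
        return 0.0
    return sum(execution_match(conn, p, g) for p, g in pairs) / len(pairs)


# Hypothetical toy schema and queries, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE players (name TEXT, goals INTEGER);
    INSERT INTO players VALUES ('a', 3), ('b', 5), ('c', 1);
    """
)
pairs = [
    # Different SQL text, same rows: counted as correct.
    ("SELECT name FROM players WHERE goals >= 3",
     "SELECT name FROM players WHERE goals > 2"),
    # Returns an extra row: counted as wrong.
    ("SELECT name FROM players",
     "SELECT name FROM players WHERE goals > 2"),
    # Fails to execute (misspelled column): counted as wrong.
    ("SELECT nme FROM players",
     "SELECT name FROM players"),
]
accuracy = execution_accuracy(conn, pairs)
```

Comparing against a single gold query is what makes the metric strict: a prediction is only credited when its result rows equal those of that one reference query.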

2 papers | 2 benchmarks | Texts

ROPE (Recognition-based Object Probing Evaluation)

We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. ROPE has four instruction settings. In a single turn of prompting without format enforcement, we probe the model to recognize the 5 objects referred to by the visual prompts (a) one at a time in the single-object setting and (b) concurrently in the multi-object setting. We further force the model to follow the format template and decode only the object tokens for each of the five objects, (c) without output manipulation in student forcing and (d) with all previously generated object tokens replaced by the ground-truth classes in teacher forcing.

2 papers | 0 benchmarks | Images, Texts

RePAIR Dataset

Our dataset consists of over 1000 fractured frescoes. RePAIR poses a realistic computational challenge for 2D and 3D puzzle-solving methods, and serves as a benchmark that enables the study of fractured object reassembly and presents new challenges for geometric shape understanding. Please visit our website for more information about the dataset, access to source code scripts, and an interactive gallery of dataset samples.

2 papers | 0 benchmarks | 3D, Images

https://github.com/GenImage-Dataset/GenImage

2 papers | 0 benchmarks

CLCIFAR10

2 papers | 0 benchmarks

MolParser-7M

A large-scale optical chemical structure recognition (OCSR) dataset proposed in the paper “MolParser: End-to-end Visual Recognition of Molecule Structures in the Wild”. MolParser-7M contains nearly 8 million image-SMILES pairs. Note that each image caption uses the extended-SMILES format proposed in the paper.

2 papers | 0 benchmarks | Images, Texts

RHM (Rhm: Robot house multi-view human activity recognition dataset)

The Robot House Multi-View dataset (RHM) contains four views: Front, Back, Ceiling, and Robot. There are 14 classes with 6701 video clips for each view, making a total of 26804 video clips across the four views. The video clips are between 1 and 5 seconds long. Clips with the same number and the same class are synchronized across the different views.

2 papers | 3 benchmarks | Actions, Images, RGB Video, Videos

DISC-Law-SFT

DISC-Law-SFT comprises two subsets, DISC-Law-SFT-Pair and DISC-Law-SFT-Triplet. The former aims to introduce legal reasoning abilities to the LLM, while the latter helps enhance the model's capability to utilize external legal knowledge.

2 papers | 0 benchmarks | Texts

Yo'LLaVA

40 personalized concepts

2 papers | 0 benchmarks | Images, Texts

Wallhack1.8k

The Wallhack1.8k dataset comprises 1,806 CSI amplitude spectrograms (and raw WiFi packet time series) corresponding to three activity classes: "no presence," "walking," and "walking + arm-waving." WiFi packets were transmitted at a frequency of 100 Hz, and each spectrogram captures a temporal context of approximately 4 seconds (400 WiFi packets).
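The relationship between packet rate and spectrogram length can be checked with a small sketch. The two constants come from the description above; the helper names and the choice of non-overlapping windows are assumptions made for illustration and are not taken from the dataset's own tooling.

```python
PACKET_RATE_HZ = 100      # transmission rate stated in the description
PACKETS_PER_WINDOW = 400  # packets per spectrogram stated in the description


def window_duration_seconds(n_packets, rate_hz):
    """Time span covered by n_packets sent at rate_hz packets per second."""
    return n_packets / rate_hz


def split_into_windows(packets, size):
    """Cut a packet sequence into non-overlapping windows of `size` packets.

    A trailing partial window is dropped (an assumption of this sketch).
    """
    return [packets[i:i + size] for i in range(0, len(packets) - size + 1, size)]


duration = window_duration_seconds(PACKETS_PER_WINDOW, PACKET_RATE_HZ)

# Hypothetical stream of 1000 packet indices: two full windows fit, 200 packets are left over.
windows = split_into_windows(list(range(1000)), PACKETS_PER_WINDOW)
```

At 100 packets per second, 400 packets span 4 seconds, which matches the stated temporal context of each spectrogram.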

2 papers | 0 benchmarks | Time series

ADORE (A benchmark dataset for machine learning in ecotoxicology)

ADORE is a benchmark dataset for machine learning in ecotoxicology, covering acute aquatic toxicity in three relevant taxonomic groups (fish, crustaceans, and algae). The core dataset describes ecotoxicological experiments and is expanded with phylogenetic and species-specific data on the species, as well as chemical properties and molecular representations. Apart from challenging other researchers to achieve the best model performance across the whole dataset, we propose specific challenges on subsets of the data and include datasets and splits corresponding to each of these challenges, together with an in-depth characterization and discussion of train-test splitting approaches.

2 papers | 2 benchmarks | Biology, Environment

NBA

NBA: This dataset is extended from a Kaggle dataset containing around 400 NBA basketball players. It provides the players' performance statistics for the 2016-2017 season along with other information, e.g., nationality, age, and salary. To obtain the graph that links the NBA players together, we collect the relationships between the players on Twitter using its official crawling API. We binarize nationality into two categories, i.e., U.S. players and overseas players, which is used as the sensitive attribute. The classification task is to predict whether a player's salary is above the median.
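The binarization and labeling steps described above might look like the following sketch. The record fields (`name`, `country`, `salary`), the `"USA"` country code, and the 1/0 encoding are all hypothetical choices made for illustration; the actual dataset may store and encode these values differently.

```python
from statistics import median


def encode_players(players, home_country="USA"):
    """Attach a binary sensitive attribute and a binary salary label to each player.

    sensitive: 1 for players from home_country, 0 for all others (assumed encoding).
    label:     1 if the salary is strictly above the median salary, else 0.
    """
    median_salary = median(p["salary"] for p in players)
    return [
        {
            "name": p["name"],
            "sensitive": int(p["country"] == home_country),
            "label": int(p["salary"] > median_salary),
        }
        for p in players
    ]


# Hypothetical toy records; salaries are in arbitrary units.
toy_players = [
    {"name": "p1", "country": "USA", "salary": 1.0},
    {"name": "p2", "country": "France", "salary": 2.0},
    {"name": "p3", "country": "USA", "salary": 3.0},
    {"name": "p4", "country": "Spain", "salary": 4.0},
]
encoded = encode_players(toy_players)
```

Thresholding at the median yields a roughly balanced binary label, which is convenient for a classification benchmark.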

2 papers | 1 benchmark | Graphs
