Datasets

19,997 machine learning datasets

Szeged Corpus

The Szeged Treebank is the largest fully manually annotated treebank of the Hungarian language. It contains 82,000 sentences, 1.2 million words and 250,000 punctuation marks. Texts were selected from six different domains, ~200,000 words in size from each. The domains are the following:

4 papers | 0 benchmarks | Texts

CompMix

CompMix is a crowdsourced QA benchmark which naturally demands the integration of a mixture of input sources. CompMix has a total of 9,410 questions, and features several complex intents like joins and temporal conditions.

4 papers | 0 benchmarks | Texts

DISCO-10M

DISCO-10M is a novel and extensive music dataset that surpasses the largest previously available music dataset by an order of magnitude.

4 papers | 0 benchmarks | Audio

ClimaBench

The topic of Climate Change (CC) has received limited attention in NLP despite its real-world urgency. Activists and policy-makers need NLP tools in order to effectively process the vast and rapidly growing textual data produced on CC. Their utility, however, primarily depends on whether the current state-of-the-art models can generalize across various tasks in the CC domain. In order to address this gap, we introduce the Climate Change Benchmark (ClimaBench), a benchmark collection of existing disparate datasets for systematically evaluating model performance across a diverse set of CC NLU tasks. Further, we enhance the benchmark by releasing two large-scale labelled text classification and question-answering datasets curated from publicly available environmental disclosures. Lastly, we provide an analysis of several generic and CC-oriented models answering whether fine-tuning on domain text offers any improvements across these tasks. We hope this work provides a standard assessment tool

4 papers | 2 benchmarks

TRansPose

TRansPose is a large-scale multispectral dataset that combines stereo RGB-D, thermal infrared (TIR) images, and object poses to promote transparent object research. The dataset includes 99 transparent objects (43 household items, 27 recyclable trash items, and 29 chemical laboratory equivalents) as well as 12 non-transparent objects. It comprises 333,819 images and 4,000,056 annotations, providing instance-level segmentation masks, ground-truth poses, and completed depth information.

4 papers | 0 benchmarks | RGB-D

RidgeBase (RidgeBase: A Cross-Sensor Multi-Finger Contactless Fingerprint Dataset)

Contactless fingerprint matching using smartphone cameras can alleviate major challenges of traditional fingerprint systems including hygienic acquisition, portability and presentation attacks. However, development of practical and robust contactless fingerprint matching techniques is constrained by the limited availability of large-scale real-world datasets. To motivate further advances in contactless fingerprint matching across sensors, we introduce the RidgeBase benchmark dataset. RidgeBase consists of more than 15,000 contactless and contact-based fingerprint image pairs acquired from 88 individuals under different background and lighting conditions using two smartphone cameras and one flatbed contact sensor. Unlike existing datasets, RidgeBase is designed to promote research under different matching scenarios that include Single Finger Matching and Multi-Finger Matching for both contactless-to-contactless (CL2CL) and contact-to-contactless (C2CL) verification and identification.

4 papers | 0 benchmarks | Images

SOD4SB (Small Object Detection for Spotting Birds)

The Small Object Detection for Spotting Birds (SOD4SB) dataset consists of 39,070 images containing 137,121 bird instances. The SOD4SB dataset covers a wide variety of small bird types and scenes.

4 papers | 0 benchmarks | Images

grobid-quantities-holdout (grobid-quantities holdout dataset)

The dataset is described here:

4 papers | 0 benchmarks

Human-M3

Human-M3 is an outdoor multi-modal, multi-view, multi-person human pose database that includes not only multi-view RGB videos of outdoor scenes but also the corresponding point clouds.

4 papers | 0 benchmarks | LiDAR, Point cloud

ETHEC (ETH Entomological Collection (ETHEC) Dataset)

The dataset includes 47,978 butterfly images with a 4-level label hierarchy: family, sub-family, genus, and species (6 families -> 21 sub-families -> 135 genera -> 561 species).

4 papers | 0 benchmarks | Images

MDOT

The dataset consists of 92 groups of video clips with 113,918 high-resolution frames taken by two drones and 63 groups of video clips with 145,875 high-resolution frames taken by three drones.

4 papers | 0 benchmarks | Images

ALTA 2021 Shared Task (Automatic Grading of Evidence, 10 years later)

This dataset is described in the ALTA 2021 Shared Task website and associated CodaLab competition.

4 papers | 0 benchmarks | Texts

GVLM (Global Very-High-Resolution Landslide Mapping)

For change detection tasks, current open-source datasets mainly focus on building extraction (e.g., WHU building dataset and LEVIR-CD dataset) (Chen and Shi, 2020; Ji et al., 2018) and urban development monitoring (e.g., SECOND dataset, Google dataset and CDD dataset) (Yang et al., 2022; Peng et al., 2021; Lebedev et al., 2018), whereas datasets for natural disaster monitoring have been seldom investigated.

4 papers | 1 benchmark | Images

DeepPatent

The dataset consists of over 350,000 public-domain patent drawings collected from the United States Patent and Trademark Office (USPTO). The whole collection spans a total of 45,000 design patents published between January 2018 and June 2019.

4 papers | 1 benchmark | Images

CommitChronicle

CommitChronicle is a dataset for commit message generation (and/or completion).

4 papers | 0 benchmarks | Texts

CrashD

CrashD is a test benchmark for the robustness and generalization of 3D object detection models. It contains a wide range of out-of-distribution vehicles, including damaged, classic, and sports cars.

4 papers | 0 benchmarks

WorldView-3 PAirMax

The PAirMax dataset is a collection of images for evaluating the performance of pansharpening algorithms. This data collection includes nine test cases at full resolution, acquired by different sensors belonging to Maxar's constellation of high-resolution satellites. Nine related test cases at reduced resolution, simulated according to Wald’s protocol, are also included. In particular, this dataset refers to the three images acquired by the WorldView-3 satellite, representing Munich.

4 papers | 4 benchmarks

UniProtQA

UniProtQA consists of proteins and textual queries about their functions and properties. The dataset is constructed from UniProt and consists of four types of questions, covering functions, official names, protein families, and sub-cellular locations. We collect a total of 569,516 proteins and 1,891,506 question-answering samples.

4 papers | 6 benchmarks

OVDEval

OVDEval includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input.

4 papers | 0 benchmarks | Images, Texts

DiaASQ (Conversational Aspect-based Sentiment Quadruple Extraction)

DiaASQ is a fine-grained Aspect-based Sentiment Analysis (ABSA) benchmark for the conversation scenario. It challenges existing ABSA methods by requiring them to 1) extract target-aspect-opinion-sentiment quadruples from a dialogue, and 2) model the dialogue discourse structures. The dataset is constructed by systematically crawling tweets from digital bloggers, followed by a series of preprocessing steps including filtering, normalizing, pruning, and annotating the collected dialogues, resulting in a final corpus of 1,000 dialogues. To enhance multilingual usability, DiaASQ is available in both English and Chinese versions.

4 papers | 0 benchmarks | Texts
