Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

BUG

BUG is a large-scale gender-bias dataset of 108K diverse real-world English sentences, sampled semi-automatically from large corpora using lexical-syntactic pattern matching.

17 papers · 0 benchmarks · Texts

Molecule3D

Molecule3D is a benchmark comprising a dataset of precise ground-state geometries for approximately 4 million molecules derived from density functional theory (DFT), together with software tools for data processing, splitting, training, and evaluation.

17 papers · 0 benchmarks

SOSD (Searching on Sorted Data)

SOSD is a collection of datasets for benchmarking the lookup performance of learned indexes.

17 papers · 0 benchmarks
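The learned-index idea that SOSD benchmarks can be sketched as a model predicting a key's position in the sorted array, followed by a bounded binary search to correct the prediction. The sketch below is illustrative only (the linear model and function names are not SOSD's actual API):

```python
import bisect

def fit_linear_index(keys):
    """Fit position ≈ slope * key + intercept over a sorted key array,
    and record the worst-case prediction error for the search bound."""
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(keys))
    var = sum((k - mean_k) ** 2 for k in keys)
    slope = cov / var if var else 0.0
    intercept = mean_p - slope * mean_k
    # The max error bound guarantees the local search window is correct.
    max_err = max(abs(i - (slope * k + intercept)) for i, k in enumerate(keys))
    return slope, intercept, int(max_err) + 1

def lookup(keys, model, key):
    """Predict a position, then binary-search only inside the error window."""
    slope, intercept, err = model
    guess = int(slope * key + intercept)
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = sorted([3, 8, 15, 21, 42, 57, 63, 90])
model = fit_linear_index(keys)
print(lookup(keys, model, 42))  # index of key 42 in the sorted array
```

SOSD's point is that the model and the search bound together replace a full B-tree traversal; the benchmark measures how well different models trade prediction accuracy against lookup cost.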

CUTE80

The CUTE80 dataset is a small collection of 80 natural-scene images featuring curved text, designed for evaluating curved text detection and recognition in natural scene images.

17 papers · 3 benchmarks

Sachs (Sachs Protein Dataset)

The Sachs dataset measures the expression levels of different proteins and phospholipids in human cells. It includes simultaneous measurements of 11 phosphorylated proteins and phospholipids derived from thousands of individual primary immune system cells, subjected to both general and specific molecular interventions.

17 papers · 0 benchmarks

SUN-SEG-Easy (Unseen)

The SUN-SEG dataset is a high-quality, per-frame annotated video polyp segmentation (VPS) dataset that includes 158,690 frames from the SUN dataset. It extends the labels with diverse annotation types, i.e., object mask, boundary, scribble, polygon, and visual attributes. It also carries over pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.

17 papers · 7 benchmarks · Medical, RGB Video, Videos

SUN-SEG-Hard (Unseen)

The SUN-SEG dataset is a high-quality, per-frame annotated video polyp segmentation (VPS) dataset that includes 158,690 frames from the SUN dataset. It extends the labels with diverse annotation types, i.e., object mask, boundary, scribble, polygon, and visual attributes. It also carries over pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.

17 papers · 7 benchmarks · Medical, RGB Video, Videos
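Per-frame mask annotations of the kind SUN-SEG provides are typically scored with region metrics such as the Dice coefficient. A minimal illustrative sketch on flat binary masks (not SUN-SEG's actual evaluation code):

```python
def dice(pred, gt):
    """Dice coefficient between two binary masks given as flat 0/1 lists:
    2 * |intersection| / (|pred| + |gt|)."""
    inter = sum(p & g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    # Convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * inter / total

pred = [0, 1, 1, 1, 0, 0]  # toy predicted mask
gt   = [0, 0, 1, 1, 1, 0]  # toy ground-truth mask
print(round(dice(pred, gt), 3))  # 2*2 / (3+3)
```

In a video setting the metric is computed per frame and averaged over the sequence.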

ViViD++ (Vision for Visibility Dataset)

ViViD++ captures diverse visual data formats targeting varying luminance conditions, recorded with alternative vision sensors either handheld or mounted on a car, traversing the same spaces repeatedly under different conditions.

17 papers · 0 benchmarks · 3D, Images, LiDAR, RGB Video, RGB-D

PCQM4Mv2-LSC

PCQM4Mv2 is a quantum chemistry dataset originally curated under the PubChemQC project. Based on PubChemQC, it defines a meaningful ML task: predicting the DFT-calculated HOMO-LUMO energy gap of molecules given their 2D molecular graphs. The HOMO-LUMO gap is one of the most practically relevant quantum chemical properties of a molecule, since it relates to reactivity, photoexcitation, and charge transport. Moreover, predicting this property from 2D molecular graphs alone, without 3D equilibrium structures, is practically favorable, because obtaining 3D equilibrium structures requires DFT-based geometry optimization, which is expensive in its own right.

17 papers · 2 benchmarks

Memento10k

Memorability dataset with 10,000 3-second videos. Each video has upwards of 90 human annotations, and the split-half consistency of the dataset is 0.73 (best in class among video memorability datasets).

17 papers · 0 benchmarks · Images, RGB Video, Videos
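Split-half consistency of the kind reported above is commonly estimated by randomly splitting each video's annotators into two halves, averaging scores within each half, and correlating the two resulting per-video vectors. A sketch with hypothetical annotation data (this is a generic procedure, not Memento10k's exact protocol):

```python
import random

def split_half_consistency(annotations, trials=100, seed=0):
    """annotations: one list of per-annotator scores per video.
    Returns the mean Pearson correlation between half-averages
    over repeated random splits."""
    rng = random.Random(seed)

    def pearson(x, y):
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        vx = sum((a - mx) ** 2 for a in x) ** 0.5
        vy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (vx * vy)

    corrs = []
    for _ in range(trials):
        h1, h2 = [], []
        for scores in annotations:
            s = scores[:]
            rng.shuffle(s)
            mid = len(s) // 2
            h1.append(sum(s[:mid]) / mid)
            h2.append(sum(s[mid:]) / (len(s) - mid))
        corrs.append(pearson(h1, h2))
    return sum(corrs) / trials

# Hypothetical: 4 videos, 6 annotators each, memorability scores in [0, 1].
data = [[0.9, 0.8, 0.85, 0.9, 0.95, 0.8],
        [0.2, 0.3, 0.25, 0.2, 0.35, 0.3],
        [0.6, 0.5, 0.55, 0.6, 0.5, 0.65],
        [0.4, 0.45, 0.5, 0.4, 0.35, 0.45]]
print(round(split_half_consistency(data), 2))
```

A high value indicates that independent groups of annotators rank the videos consistently, which is what the 0.73 figure summarizes at dataset scale.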

OpenXAI

OpenXAI is the first general-purpose lightweight library that provides a comprehensive list of functions to systematically evaluate the quality of explanations generated by attribution-based explanation methods. OpenXAI supports the development of new datasets (both synthetic and real-world) and explanation methods, with a strong focus on promoting systematic, reproducible, and transparent evaluation of explanation methods.

17 papers · 0 benchmarks · Tabular

WHU Building Dataset

The WHU building dataset is a manually edited collection of aerial and satellite imagery building samples. The aerial subset consists of more than 220,000 independent buildings extracted from aerial images with 0.075 m spatial resolution, covering 450 km² in Christchurch, New Zealand. The satellite subset has two parts: one collected from cities around the world and from various remote sensing sources, including QuickBird, the Worldview series, IKONOS, and ZY-3; the other consisting of six neighboring satellite images covering 550 km² in East Asia at 2.7 m ground resolution.

17 papers · 4 benchmarks

SPICE (Small-Molecule/Protein Interaction Chemical Energies)

SPICE is a collection of quantum mechanical data for training potential functions, with particular emphasis on simulating drug-like small molecules interacting with proteins.

17 papers · 0 benchmarks

EgoTaskQA

The EgoTaskQA benchmark contains 40K balanced question-answer pairs selected from 368K programmatically generated questions over 2K egocentric videos. It provides a single home for the crucial dimensions of task understanding through question answering on real-world egocentric videos.

17 papers · 1 benchmark · Videos

Chameleon (60%/20%/20% random splits)

Node classification on Chameleon with 60%/20%/20% random splits for training/validation/test.

17 papers · 2 benchmarks · Graphs
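A 60%/20%/20% random-split protocol of this kind can be sketched by permuting node indices and slicing. This is a generic illustration, not the exact split files any particular paper uses (the node count 2,277 is the commonly cited size of the Chameleon graph):

```python
import random

def random_splits(num_nodes, train=0.6, val=0.2, seed=0):
    """Return disjoint train/val/test index lists covering all nodes."""
    idx = list(range(num_nodes))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    n_train = int(train * num_nodes)
    n_val = int(val * num_nodes)
    return (idx[:n_train],
            idx[n_train:n_train + n_val],
            idx[n_train + n_val:])

tr, va, te = random_splits(2277)
print(len(tr), len(va), len(te))
```

Because the splits are random rather than fixed, papers usually average node-classification accuracy over several seeds.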

Laval Indoor HDR Dataset

This dataset contains 2,100+ high-resolution indoor panoramas, captured using a Canon 5D Mark III and a robotic panoramic tripod head. Each capture was multi-exposed (22 f-stops) and is fully HDR, without any saturation. Panoramas were stitched from 6 captures (60-degree azimuth increments) and were taken in a wide variety of indoor environments.

17 papers · 0 benchmarks · Images

MMDialog

MMDialog is a large-scale multi-turn dialogue dataset containing multi-modal open-domain conversations derived from real human-human chat content on social media. MMDialog contains 1.08M dialogue sessions and 1.53M associated images. On average, a dialogue session has 2.59 images, which can appear at any conversation turn.

17 papers · 2 benchmarks · Texts

PeMS08

PeMS08 is a traffic forecasting dataset collected by the Caltrans Performance Measurement System (PeMS).

17 papers · 6 benchmarks
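Traffic forecasting benchmarks like PeMS08 are usually framed as sliding-window prediction: given the past T steps of sensor readings, predict the next T' steps. A generic window-construction sketch (the scalar series here is a hypothetical stand-in, not PeMS08's actual multi-sensor array):

```python
def make_windows(series, history=12, horizon=12):
    """Slice a time series (list of per-step readings) into
    (input window, target window) pairs for forecasting."""
    pairs = []
    for t in range(len(series) - history - horizon + 1):
        x = series[t:t + history]                       # past `history` steps
        y = series[t + history:t + history + horizon]   # next `horizon` steps
        pairs.append((x, y))
    return pairs

# Hypothetical scalar readings from a single sensor.
series = list(range(100))
pairs = make_windows(series)
print(len(pairs), pairs[0][0][0], pairs[0][1][0])
```

With 5-minute aggregation, history=12 and horizon=12 correspond to the common "use one hour to predict the next hour" setting.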

SODA

SODA is a high-quality social dialogue dataset. In contrast to most existing crowdsourced, small-scale dialogue corpora, SODA distills 1.5M socially grounded dialogues from a pre-trained language model (InstructGPT; Ouyang et al.). Dialogues are distilled by contextualizing social commonsense knowledge from a knowledge graph (ATOMIC10x).

17 papers · 0 benchmarks · Dialog

SecurityEval

Automated source code generation is a popular machine learning task that can help software developers write functionally correct code from a given context. However, just like human developers, a code generation model can produce vulnerable code, which developers may mistakenly use. Evaluating the security of code generation models is therefore essential. SecurityEval is an evaluation dataset for this purpose: it contains 130 samples covering 75 vulnerability types, mapped to the Common Weakness Enumeration (CWE). The authors also demonstrate using the dataset to evaluate one open-source (InCoder) and one closed-source (GitHub Copilot) code generation model.

17 papers · 0 benchmarks
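An evaluation with a dataset of this shape typically iterates over prompt samples, generates completions, and tallies vulnerable outputs per CWE. A schematic harness where the sample schema, `generate`, and `is_vulnerable` are hypothetical placeholders (not SecurityEval's actual format or tooling):

```python
from collections import Counter

def evaluate(samples, generate, is_vulnerable):
    """Return the fraction of vulnerable completions per CWE identifier.
    samples: list of {"cwe": ..., "prompt": ...} dicts (hypothetical schema)."""
    hits = Counter()
    totals = Counter()
    for s in samples:
        code = generate(s["prompt"])          # model under test
        totals[s["cwe"]] += 1
        if is_vulnerable(code, s["cwe"]):     # e.g. a static analyzer
            hits[s["cwe"]] += 1
    return {cwe: hits[cwe] / totals[cwe] for cwe in totals}

# Toy stand-ins for a model and an analyzer.
samples = [{"cwe": "CWE-89", "prompt": "build a SQL query"},
           {"cwe": "CWE-89", "prompt": "filter users by name"},
           {"cwe": "CWE-798", "prompt": "connect to the database"}]
generate = lambda p: "q = 'SELECT * FROM t WHERE x=' + x" if "SQL" in p else "ok"
is_vulnerable = lambda code, cwe: "+" in code  # naive string-concat check
rates = evaluate(samples, generate, is_vulnerable)
print(rates["CWE-89"], rates["CWE-798"])
```

Real harnesses plug in an actual code model and a security analyzer; the per-CWE breakdown is what makes comparisons like InCoder vs. Copilot interpretable.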
Page 115 of 1000