Datasets

19,997 machine learning datasets

FairPrism

FairPrism is a dataset of 5,000 examples of AI-generated English text with detailed human annotations covering a diverse set of harms relating to gender and sexuality. FairPrism aims to address several limitations of existing datasets for measuring and mitigating fairness-related harms by improving transparency, specifying dataset coverage more clearly, and accounting for annotator disagreement and context-dependent harms. FairPrism’s annotations include the extent of stereotyping and demeaning harms, the demographic groups targeted, and appropriateness for different applications. The annotations also include specific harms that occur in interactive contexts and harms that raise normative concerns when the “speaker” is an AI system. Due to its precision and granularity, FairPrism can be used to diagnose (1) the types of fairness-related harms that AI text generation systems cause, and (2) the potential limitations of mitigation methods.

2 papers | 0 benchmarks | Texts

PFD (Playing for Data: Ground Truth from Computer Games)

Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this paper, we present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data ...

2 papers | 0 benchmarks

BiGe (Bielefeld Gesture Corpus)

The BiGe corpus comprises 54,360 shots of interest extracted from TED and TEDx talks. All shots are tracked with full 3D landmarks.

2 papers | 0 benchmarks | Audio, Point cloud, Texts

Celeb-HQ Facial Identity Recognition Dataset

2 papers | 0 benchmarks | Images

Celeb-HQ Face Gender Recognition Dataset

2 papers | 0 benchmarks | Images

Vibrating Plates (Vibrating Plates Dataset for Vibroacoustic Frequency Response Prediction)

We present a structured benchmark dataset for a representative vibroacoustic problem: predicting the frequency response of vibrating plates. The vibrating plates benchmark dataset consists of 12,000 varied plate designs in total, with accompanying vibration patterns for plates excited by a harmonic force. These vibration patterns give the vibration velocity, orthogonal to the plate surface, at every location of the plate. The plate designs incorporate randomly placed beadings (indentations in the plate surface). The beadings stiffen the plates and completely change the resulting vibration patterns. Additionally, the size, thickness, and damping loss factor of the plates are varied.

2 papers | 0 benchmarks | Physics

XImageNet (XIMAGENET-12: An Explainable AI Benchmark Dataset for Model Robustness Evaluation)

We introduce XIMAGENET-12, an explainable benchmark dataset with over 200K images and 15,600 manual semantic annotations. It covers 12 categories from ImageNet that represent objects commonly encountered in practical life, and it simulates six diverse scenarios, including overexposure, blurring, and color changing.

2 papers | 0 benchmarks

iFF (Intrinsic Forward Facing)

Real-world dataset of forward-facing scenes with different camera intrinsic parameters.

2 papers | 3 benchmarks

PAD Dataset (Pose-agnostic/Multi-pose Anomaly Detection Dataset)

The Multi-pose Anomaly Detection (MAD) dataset represents the first attempt to evaluate the performance of pose-agnostic anomaly detection. The MAD dataset contains 4,000+ high-resolution multi-pose RGB images, with camera/pose information, of 20 complex-shaped LEGO animal toys for training, as well as 7,000+ simulated and real-world RGB images (without camera/pose information) with pixel-precise ground truth annotations for three types of anomalies in the test sets. Note that MAD has been further divided into MAD-Sim and MAD-Real for simulation-to-reality studies, to bridge the gap between academic research and the demands of industrial manufacturing.

2 papers | 2 benchmarks | Images

3DYoga90 (3DYoga90: A Hierarchical Video Dataset for Yoga Pose Understanding)

3DYoga90 is organized within a three-level label hierarchy. It stands out as one of the most comprehensive open datasets, featuring the largest collection of RGB videos and 3D skeleton sequences among publicly available resources.

2 papers | 0 benchmarks | 3D, Actions, RGB Video, Videos

CORE (Company Relation Extraction)

2 papers | 0 benchmarks | Texts

FDCompCN

FDCompCN is a new fraud detection dataset for detecting financial statement fraud of companies in China. We construct a multi-relation graph based on the supplier, customer, shareholder, and financial information disclosed in the financial statements of Chinese companies. These data are obtained from the China Stock Market and Accounting Research (CSMAR) database. We select samples between 2020 and 2023, including 5,317 publicly listed Chinese companies traded on the Shanghai, Shenzhen, and Beijing Stock Exchanges.

2 papers | 2 benchmarks

AIDA/testc

AIDA/testc is a new challenging test set for entity linking systems containing 131 Reuters news articles published between December 5th and 7th, 2020. It links the named entity mentions in this test set to their corresponding Wikipedia pages, using the same linking procedure employed in the original AIDA CoNLL-YAGO dataset. AIDA/testc has 1,160 unique Wikipedia identifiers, spanning over 3,777 mentions and encompassing a total of 46,456 words.

2 papers | 1 benchmark | Texts

udhr-lid

Clean version of UDHR (Universal Declaration of Human Rights), at the long sentence level.

2 papers | 0 benchmarks | Texts

CHAMMI (CHAMMI: A benchmark for channel-adaptive models in microscopy imaging)

We present a cellular microscopic image dataset for investigating channel-adaptive models. We collected and pre-processed images from three publicly available sources: 1) the WTC-11 hiPSC dataset from the Allen Institute (Viana et al., 2023), 2) the Human Protein Atlas dataset (Thul et al., 2017), and 3) a combined Cell Painting dataset from the Broad Institute (Gustafsdottir et al., 2013; Bray et al., 2017; Way et al., 2021). These images contain 3, 4, or 5 channels with different cellular structures highlighted in each channel. The goal of this dataset is to facilitate the creation and evaluation of novel computer vision models that are invariant to channel numbers.

2 papers | 0 benchmarks | Images

TTE-A&O (Travel Time Estimation: Abakan and Omsk)

The dataset includes two parts corresponding to the cities of Abakan (65524 nodes, 340012 edges) and Omsk (231688 nodes, 1149492 edges). Along with the road network graph, it includes trip records represented as sequences of visited nodes (making the dataset suitable both for path-blind and path-aware settings). There are two types of target values for a regression task: real travel time and real length of a trip.

2 papers | 2 benchmarks | Graphs, Images

NLP Taxonomy Classification Data

The dataset consists of titles and abstracts from NLP-related papers. Each paper is annotated with multiple fields of study from an NLP taxonomy. The training dataset contains 178,521 weakly annotated samples. The test dataset consists of 828 manually annotated samples from the EMNLP22 conference. The manually labeled test dataset might not contain all possible classes since it consists of EMNLP22 papers only, and some rarer classes haven’t been published there. Therefore, we advise creating an additional test or validation set from the train data that includes all the possible classes.

2 papers | 0 benchmarks | Texts
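
The NLP Taxonomy Classification Data entry advises building an additional validation or test set from the training data that covers every class. One way this might be done is sketched below. It is only an illustration under stated assumptions: the `(text, labels)` sample layout, the function name `split_with_full_label_coverage`, and the greedy coverage strategy are all hypothetical and are not taken from the dataset authors.

```python
import random


def split_with_full_label_coverage(samples, val_fraction=0.1, seed=0):
    """Split multi-label samples into (train, val) so that every label
    present in `samples` appears at least once in `val`.

    `samples` is assumed to be a list of (text, labels) pairs; this
    layout is hypothetical and not defined by the dataset itself.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)

    val_idx = set()
    covered = set()

    # Pass 1: greedily take any sample that adds a label not yet covered.
    for i in indices:
        labels = set(samples[i][1])
        if not labels <= covered:
            val_idx.add(i)
            covered |= labels

    # Pass 2: top up with further shuffled samples until the target size
    # is reached (never shrinking below what coverage required).
    target = max(len(val_idx), int(round(val_fraction * len(samples))))
    for i in indices:
        if len(val_idx) >= target:
            break
        val_idx.add(i)

    train = [samples[i] for i in indices if i not in val_idx]
    val = [samples[i] for i in indices if i in val_idx]
    return train, val
```

A caveat of this sketch: a class that occurs in only one training sample ends up solely in the validation set, so very rare classes may need to be handled separately.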

3D-Point Cloud dataset of various geometrical terrains (3D-Point Cloud dataset of various geometrical terrains in urban environments recorded during human locomotion)

Depth vision has recently been used in many locomotion devices with the aim of easing the lives of disabled people and helping them reach a more ecological lifestyle. This is because such cameras are cheap and compact and can provide rich information about the environment. Our dataset provides many recordings of point clouds and other types of data during different locomotion modes in urban contexts. If you use this data, please cite the following papers: (1) Depth Vision based Terrain Detection Algorithm during Human Locomotion; (2) Using Depth Vision for Terrain Detection during Active Locomotion.

2 papers | 0 benchmarks | 3D, Images, Point cloud, RGB-D

PragmaticCode

PragmaticCode is a dataset of real-world open-source Java projects complete with their development environments and dependencies (through their respective build systems). The authors tried to ensure that all the repositories in PragmaticCode were released publicly only after the determined training dataset cutoff date (31 March 2022) for the CodeGen, SantaCoder and text-davinci-003 family of models, which were used to evaluate MGD.

2 papers | 0 benchmarks

HC3 Plus

To fill the gap left by HC3 on semantic-invariant tasks, we extend HC3 and propose a larger ChatGPT-generated text dataset, called HC3 Plus, covering translation, summarization, and paraphrasing tasks.

2 papers | 0 benchmarks
