19,997 machine learning datasets
19,997 dataset results
FairPrism is a dataset of 5,000 examples of AI-generated English text with detailed human annotations covering a diverse set of harms relating to gender and sexuality. FairPrism aims to address several limitations of existing datasets for measuring and mitigating fairness-related harms, including improved transparency, clearer specification of dataset coverage, and accounting for annotator disagreement and harms that are context-dependent. FairPrism’s annotations include the extent of stereotyping and demeaning harms, the demographic groups targeted, and appropriateness for different applications. The annotations also include specific harms that occur in interactive contexts and harms that raise normative concerns when the “speaker” is an AI system. Due to its precision and granularity, FairPrism can be used to diagnose (1) the types of fairness-related harms that AI text generation systems cause, and (2) the potential limitations of mitigation methods.
Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this paper, we present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data
The BiGe corpus is comprised of 54.360 shots of interest extracted from TED and TEDx talks. All shots are tracked with fully 3d landmarks.
Celeb-HQ Facial Identity Recognition Dataset
Celeb-HQ Face Gender Recognition Dataset
We present a structured benchmark dataset for a representative vibroacoustic problem: Predicting the frequency response for vibrating plates. The vibrating plates benchmark dataset consists of in total 12,000 varied plate designs and accompanying vibration patterns, when the plates are excited by a harmonic force. These vibration platterns give the vibration velocity at every location of the plate orthogonal to its surface. The plate designs incorporate randomly placed beadings, indentations in the plate surface. The beadings stiffen the plates and completely change the resulting vibration patterns. Additionally, the size, thickness and damping loss factor of the plates are varied.
we introduce XIMAGENET-12, an explainable benchmark dataset with over 200K images and 15,600 manual semantic annotations. Covering 12 categories from ImageNet to represent objects commonly encountered in practical life and simulating six diverse scenarios, including overexposure, blurring, color changing, etc.,
Real-world dataset on forward facing scenes with different camera intrinisc parameters.
Multi-pose Anomaly Detection (MAD) dataset, which represents the first attempt to evaluate the performance of pose-agnostic anomaly detection. The MAD dataset containing 4,000+ highresolution multi-pose views RGB images with camera/pose information of 20 shape-complexed LEGO animal toys for training, as well as 7,000+ simulation and real-world collected RGB images (without camera/pose information) with pixel-precise ground truth annotations for three types of anomalies in test sets. Note that MAD has been further divided into MAD-Sim and MAD-Real for simulation-to-reality studies to bridge the gap between academic research and the demands of industrial manufacturing.
3DYoga90 is organized within a three-level label hierarchy. It stands out as one of the most comprehensive open datasets, featuring the largest collection of RGB videos and 3D skeleton sequences among publicly available resources.
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
A new fraud detection dataset FDCompCN for detecting financial statement fraud of companies in China. We construct a multi-relation graph based on the supplier, customer, shareholder, and financial information disclosed in the financial statements of Chinese companies. These data are obtained from the China Stock Market and Accounting Research (CSMAR) database. We select samples between 2020 and 2023, including 5,317 publicly listed Chinese companies traded on the Shanghai, Shenzhen, and Beijing Stock Exchanges.
AIDA/testc is a new challenging test set for entity linking systems containing 131 Reuters news articles published between December 5th and 7th, 2020. It links the named entity mentions in this test set to their corresponding Wikipedia pages, using the same linking procedure employed in the original AIDA CoNLL-YAGO dataset. AIDA/testc has 1,160 unique Wikipedia identifiers, spanning over 3,777 mentions and encompassing a total of 46,456 words.
Clean version of UDHR (Universal Declaration of Human Rights), at the long sentence level.
We present a cellular microscopic image dataset for investigating channel-adaptive models. We collected and pre-processed images from three publicly available sources: 1) the WTC-11 hiPSC dataset from the Allen Institute (Viana et al., 2023), 2) the Human Protein Atlas dataset (Thul et al., 2017), and 3) a combined Cell Painting dataset from the Broad Institute (Gustafsdottir et al., 2013; Bray et al., 2017; Way et al., 2021). These images contain 3, 4, or 5 channels with different cellular structures highlighted in each channel. The goal of this dataset is to facilitate the creation and evaluation of novel computer vision models that are invariant to channel numbers.
The dataset includes two parts corresponding to the cities of Abakan (65524 nodes, 340012 edges) and Omsk (231688 nodes, 1149492 edges). Along with the road network graph, it includes trip records represented as sequences of visited nodes (making the dataset suitable both for path-blind and path-aware settings). There are two types of target values for a regression task: real travel time and real length of a trip.
The dataset consists of titles and abstracts from NLP-related papers. Each paper is annotated with multiple fields of study from an NLP taxonomy. The training dataset contains 178,521 weakly annotated samples. The test dataset consists of 828 manually annotated samples from the EMNLP22 conference. The manually labeled test dataset might not contain all possible classes since it consists of EMNLP22 papers only, and some rarer classes haven’t been published there. Therefore, we advise creating an additional test or validation set from the train data that includes all the possible classes.
Depth vision has been recently used in many locomotion devices with the objective to ease the life of disabled people toward reaching more ecological lifestyle. This is due to the fact that such cameras are cheap, compact and can provide rich information about the environment. Our dataset provides many recordings of point cloud and other types of data during different locomotion modes in urban context. If you used this data, please cite the following papers below: 1-Depth Vision based Terrain Detection Algorithm during Human Locomotion 2-Using Depth Vision for Terrain Detection during Active Locomotion
PragmaticCode is a dataset of real-world open-source Java projects complete with their development environments and dependencies (through their respective build systems). The authors tried to ensure that all the repositories in PragmaticCode were released publicly only after the determined training dataset cutoff date (31 March 2022) for the CodeGen, SantaCoder and text-davinci-003 family of models, which were used to evaluate MGD.
In order to fill the gap of HC3 under semanticinvariant tasks, we extend HC3 and propose a larger ChatGPT-generated text dataset covering translation, summarization, and paraphrasing tasks, called HC3 Plus.