Datasets

271 machine learning datasets

Abt-Buy

The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from Abt.com and 1092 entities from Buy.com, as well as a gold standard (perfect mapping) with 1097 matching record pairs between the two data sources. The attributes common to both data sources are product name, product description and product price.
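A gold standard of matching pairs is typically used to score a matcher by precision, recall and F1. The following is a minimal sketch of that scoring, assuming record pairs are represented as `(abt_id, buy_id)` tuples; the function name `score_matches` and the toy identifiers are hypothetical and not part of the dataset.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (abt_id, buy_id); this representation is an assumption


def score_matches(predicted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted pairs against gold pairs."""
    pred_set: Set[Pair] = set(predicted)
    gold_set: Set[Pair] = set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy identifiers for illustration only, not real Abt-Buy records.
gold_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")]
predicted_pairs = [("a1", "b1"), ("a2", "b2"), ("a3", "b9")]
precision, recall, f1 = score_matches(predicted_pairs, gold_pairs)
```

Using sets makes the comparison independent of pair order and ignores duplicate predictions.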

19 papers · 6 benchmarks · Tabular

OpenXAI

OpenXAI is the first general-purpose lightweight library that provides a comprehensive list of functions to systematically evaluate the quality of explanations generated by attribute-based explanation methods. OpenXAI supports the development of new datasets (both synthetic and real-world) and explanation methods, with a strong bent towards promoting systematic, reproducible, and transparent evaluation of explanation methods.

17 papers · 0 benchmarks · Tabular

T2Dv2

The T2Dv2 dataset consists of 779 tables originating from the English-language subset of the WebTables corpus. 237 tables are annotated for the Table Type Detection task, 236 for the Columns Property Annotation (CPA) task and 235 for the Row Annotation task. The annotations that are used are DBpedia types, properties and entities.

14 papers · 4 benchmarks · Tabular

ACS PUMS

ACS PUMS stands for American Community Survey (ACS) Public Use Microdata Sample (PUMS) and has been used to construct several tabular datasets for studying fairness in machine learning:

11 papers · 0 benchmarks · Tabular

Tough Tables

The ToughTables (2T) dataset was created for the SemTab challenge and includes 180 tables in total. The tables fall into two groups: control (CTRL) tables and tough (TOUGH) tables.

11 papers · 0 benchmarks · Tabular

BAF (Bank Account Fraud)

Bank Account Fraud (BAF) is a large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized, real-world bank account opening fraud detection dataset.

10 papers · 0 benchmarks · Tabular

Kaggle-Credit Card Fraud Dataset

The dataset contains credit card transactions made by European cardholders in September 2013. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly imbalanced: the positive class (frauds) accounts for 0.172% of all transactions.
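The imbalance can be checked directly from the two counts given in the description. The short sketch below uses only those counts; the variable names are illustrative, and the all-negative baseline is added here to show why plain accuracy is a weak metric on data this skewed.

```python
# Counts taken from the dataset description above.
n_transactions = 284_807
n_frauds = 492

fraud_rate = n_frauds / n_transactions
legit_per_fraud = (n_transactions - n_frauds) / n_frauds

# A classifier that always predicts "not fraud" is right on every
# legitimate transaction and wrong on every fraud.
all_negative_accuracy = 1 - fraud_rate

print(f"fraud rate: {fraud_rate:.4%}")
print(f"legitimate transactions per fraud: {legit_per_fraud:.0f}")
print(f"accuracy of an all-negative baseline: {all_negative_accuracy:.4%}")
```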

8 papers · 7 benchmarks · Tabular

WDC LSPM

Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with the corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.

8 papers · 0 benchmarks · Tabular

HANNA (HANNA, a large annotated dataset of Human-ANnotated NArratives for ASG evaluation.)

HANNA, a large annotated dataset of Human-ANnotated NArratives for Automatic Story Generation (ASG) evaluation, has been designed for the benchmarking of automatic metrics for ASG. HANNA contains 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each prompt is linked to a human story and to 10 stories generated by different ASG systems. Each story was annotated on six human criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity) by three raters. HANNA also contains the scores produced by 72 automatic metrics on each story.
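The story count follows from the per-prompt figures in the description, as the short check below shows. The total number of individual human ratings is derived here under the assumption that each of the three raters scored every criterion for every story; that total is not stated in the description.

```python
# Figures taken from the description above.
n_prompts = 96
human_stories_per_prompt = 1
generated_stories_per_prompt = 10
n_criteria = 6
n_raters = 3

n_stories = n_prompts * (human_stories_per_prompt + generated_stories_per_prompt)

# Assumption: every rater scores every criterion of every story.
n_human_ratings = n_stories * n_criteria * n_raters

print(n_stories)
print(n_human_ratings)
```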

8 papers · 0 benchmarks · Tabular

WikiTables-TURL

The WikiTables-TURL dataset was constructed by the authors of TURL and is based on the WikiTable corpus, which is a large collection of Wikipedia tables. The dataset consists of 580,171 tables divided into fixed training, validation and testing splits. Additionally, the dataset contains metadata about each table, such as the table name, table caption and column headers.

7 papers · 0 benchmarks · Tabular

WDC SOTAB V2

SOTAB V2 features two annotation tasks: Column Type Annotation (CTA) and Columns Property Annotation (CPA). The goal of the CTA task is to annotate the columns of a table using 82 types from the Schema.org vocabulary, such as telephone, Duration, Mass, or Organization. The goal of the CPA task is to annotate pairs of table columns with one of 108 Schema.org properties, such as gtin, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 45,834 tables annotated for CTA and 30,220 tables annotated for CPA, originating from 55,511 different websites. The tables are split into training, validation and test sets for both tasks. The tables cover 17 popular Schema.org types including Product, LocalBusiness, Event, and JobPosting.

7 papers · 2 benchmarks · Tabular

Concepticon (Concepticon. A Resource for the Linking of Concept Lists)

This resource, our Concepticon, links concept labels from different concept lists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts, for example the relations between the concept sets linked to the concept set SIBLING. The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it gives researchers quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations.

6 papers · 0 benchmarks · Tabular

AnoShift (AnoShift: A Distribution Shift Benchmark for Unsupervised Anomaly Detection)

AnoShift is a large-scale anomaly detection benchmark that splits the test data by its temporal distance from the training set, introducing three test splits: IID, NEAR, and FAR. This testing scenario captures the performance degradation over time of anomaly detection methods ranging from classical approaches to masked language models.
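Splitting by temporal distance can be sketched as a simple rule on the gap between a test sample's timestamp and the end of the training period. The function below is an illustration only: `assign_split`, the yearly granularity, and the `train_end_year` and `near_max_gap` values are hypothetical and are not taken from the benchmark's definition.

```python
def assign_split(sample_year: int, train_end_year: int, near_max_gap: int) -> str:
    """Label a test sample by how far it lies after the training period."""
    gap = sample_year - train_end_year
    if gap <= 0:
        return "IID"   # drawn from the same period as the training data
    if gap <= near_max_gap:
        return "NEAR"  # shortly after the training period
    return "FAR"       # long after the training period


# Hypothetical cutoffs and years, chosen only to exercise each branch.
labels = [assign_split(year, train_end_year=2010, near_max_gap=3)
          for year in (2009, 2012, 2015)]
```

Evaluating a detector separately on each label then shows how its scores change as the gap grows.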

6 papers · 8 benchmarks · Tabular, Time series

WDC Products

WDC Products is an entity matching benchmark that supports the systematic evaluation of matching systems along combinations of three dimensions while relying on real-world data. The three dimensions are

6 papers · 2 benchmarks · Tabular, Texts

Adult Census Income (adult_census_income)

This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
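The extraction condition quoted above translates directly into a record filter. The sketch below keeps the field names exactly as they appear in the condition; reading them as age, adjusted gross income, final weight and hours worked per week is an assumption, and the function name and toy records are hypothetical.

```python
from typing import List, Mapping


def is_reasonably_clean(record: Mapping[str, float]) -> bool:
    """Apply ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)) to one record."""
    return (
        record["AAGE"] > 16
        and record["AGI"] > 100
        and record["AFNLWGT"] > 1
        and record["HRSWK"] > 0
    )


# Toy records for illustration only, not real census rows.
records: List[Mapping[str, float]] = [
    {"AAGE": 39, "AGI": 42_000, "AFNLWGT": 77_516, "HRSWK": 40},
    {"AAGE": 15, "AGI": 5_000, "AFNLWGT": 50_000, "HRSWK": 10},   # fails AAGE > 16
    {"AAGE": 52, "AGI": 60_000, "AFNLWGT": 90_000, "HRSWK": 0},   # fails HRSWK > 0
]
clean_records = [r for r in records if is_reasonably_clean(r)]
```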

6 papers · 4 benchmarks · Tabular

California Housing Prices

Median house prices for California districts derived from the 1990 census.

6 papers · 6 benchmarks · Tabular

Diabetes (Diabetes 130-US Hospitals for Years 1999-2008)

What do the instances in this dataset represent?

5 papers · 6 benchmarks · Tabular

Musk v1

The Musk dataset describes a set of molecules, and the objective is to distinguish musks from non-musks. The dataset covers 92 molecules, of which 47 are judged by human experts to be musks and the remaining 45 are judged to be non-musks. There are 166 features that describe each molecule based on its shape.

4 papers · 3 benchmarks · Tabular

Musk v2

The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. Each instance corresponds to a possible configuration of a molecule. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule.

4 papers · 2 benchmarks · Tabular

eSports Sensors Dataset

The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor data collected includes:

4 papers · 6 benchmarks · 6D, Actions, Biomedical, EEG, Environment, Replay data, Tabular, Time series, Tracking
