271 machine learning datasets
271 dataset results
The Abt-Buy dataset for entity resolution derives from the online retailers Abt.com and Buy.com. The dataset contains 1081 entities from abt.com and 1092 entities from buy.com as well as a gold standard (perfect mapping) with 1097 matching record pairs between the two data sources. The common attributes between the two data sources are: product name, product description and product price.
OpenXAI is the first general-purpose lightweight library that provides a comprehensive list of functions to systematically evaluate the quality of explanations generated by attribute-based explanation methods. OpenXAI supports the development of new datasets (both synthetic and real-world) and explanation methods, with a strong bent towards promoting systematic, reproducible, and transparent evaluation of explanation methods.
The T2Dv2 dataset consists of 779 tables originating from the English-language subset of the WebTables corpus. 237 tables are annotated for the Table Type Detection task, 236 for the Columns Property Annotation (CPA) task and 235 for the Row Annotation task. The annotations that are used are DBpedia types, properties and entities.
ACS PUMS stands for American Community Survey (ACS) Public Use Microdata Sample (PUMS) and has been used to construct several tabular datasets for studying fairness in machine learning:
The ToughTables (2T) dataset was created for the SemTab challenge and includes 180 tables in total. The tables in this dataset can be categorized in two groups: the control (CTRL) group tables and tough (TOUGH) group tables.
Bank Account Fraud (BAF) is a large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques on an anonymized, real-world bank account opening fraud detection dataset.
The dataset contains transactions made by credit cards in September 2013 by European cardholders. This dataset presents transactions that occurred in two days, where we have 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions.
Many e-shops have started to mark-up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories, computers, cameras, watches and shoes.
HANNA, a large annotated dataset of Human-ANnotated NArratives for Automatic Story Generation (ASG) evaluation, has been designed for the benchmarking of automatic metrics for ASG. HANNA contains 1,056 stories generated from 96 prompts from the WritingPrompts dataset. Each prompt is linked to a human story and to 10 stories generated by different ASG systems. Each story was annotated on six human criteria (Relevance, Coherence, Empathy, Surprise, Engagement and Complexity) by three raters. HANNA also contains the scores produced by 72 automatic metrics on each story.
The WikiTables-TURL dataset was constructed by the authors of TURL and is based on the WikiTable corpus, which is a large collection of Wikipedia tables. The dataset consists of 580,171 tables divided into fixed training, validation and testing splits. Additionally, the dataset contains metadata about each table, such as the table name, table caption and column headers.
SOTAB V2 features two annotation tasks: Column Type Annotation (CTA) and Columns Property Annotation (CPA). The goal of the Column Type Annotation (CTA) task is to annotate the columns of a table using 82 types from the Schema.org vocabulary, such as telephone, Duration, Mass, or Organization. The goal of the Columns Property Annotation (CPA) task is to annotate pairs of table columns with one out of 108 Schema.org properties, such as gtin, startDate, priceValidUntil, or recipeIngredient. The benchmark consists of 45,834 tables annotated for CTA and 30,220 tables annotated for CPA originating from 55,511 different websites. The tables are split into training-, validation- and test sets for both tasks. The tables cover 17 popular Schema.org types including Product, LocalBusiness, Event, and JobPosting.
This resource, our Concepticon, links concept labels from different conceptlists to concept sets. Each concept set is given a unique identifier, a unique label, and a human-readable definition. Concept sets are further structured by defining different relations between the concepts, as you can see in the graphic to the right, which displays the relations between concept sets linked to the concept set SIBLING. The resource can be used for various purposes. Serving as a rich reference for new and existing databases in diachronic and synchronic linguistics, it allows researchers a quick access to studies on semantic change, cross-linguistic polysemies, and semantic associations.
AnoShift is a large-scale anomaly detection benchmark, which focuses on splitting the test data based on its temporal distance to the training set, introducing three testing splits: IID, NEAR, and FAR. This testing scenario proves to capture the in-time performance degradation of anomaly detection methods for classical to masked language models.
WDC Products is an entity matching benchmark which provides for the systematic evaluation of matching systems along combinations of three dimensions while relying on real-word data. The three dimensions are
This data was extracted from the 1994 Census bureau database by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). A set of reasonably clean records was extracted using the following conditions: ((AAGE>16) && (AGI>100) && (AFNLWGT>1) && (HRSWK>0)). The prediction task is to determine whether a person makes over $50K a year.
Median house prices for California districts derived from the 1990 census.
What do the instances in this dataset represent?
The Musk dataset describes a set of molecules, and the objective is to detect musks from non-musks. This dataset describes a set of 92 molecules of which 47 are judged by human experts to be musks and the remaining 45 molecules are judged to be non-musks. There are 166 features available that describe the molecules based on the shape of the molecule.
The Musk2 dataset is a set of 102 molecules of which 39 are judged by human experts to be musks and the remaining 63 molecules are judged to be non-musks. Each instance corresponds to a possible configuration of a molecule. The 166 features that describe these molecules depend upon the exact shape, or conformation, of the molecule.
The eSports Sensors dataset contains sensor data collected from 10 players in 22 matches in League of Legends. The sensor data collected includes: