Datasets

52 machine learning datasets

TAT-QA

TAT-QA (Tabular And Textual dataset for Question Answering) is a large-scale QA dataset that aims to stimulate progress in QA research over more complex and realistic tabular and textual data, especially data requiring numerical reasoning.

76 papers · 1 benchmark · Tables, Texts

WebSRC (WebSRC: A Dataset for Web-Based Structural Reading Comprehension)

WebSRC is a novel Web-based Structural Reading Comprehension dataset. It consists of 0.44M question-answer pairs, which are collected from 6.5K web pages with corresponding HTML source code, screenshots and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no.

22 papers · 2 benchmarks · Images, Tables, Texts

GitTables

GitTables is a corpus of currently 1M relational tables extracted from CSV files in GitHub covering 96 topics. Table columns in GitTables have been annotated with more than 2K different semantic types from Schema.org and DBpedia. The column annotations consist of semantic types, hierarchical relations, range types, table domain and descriptions.

16 papers · 0 benchmarks · Tables

SinD (A Drone Dataset at Signalized Intersection in China)

The SinD dataset is based on 4K video captured by drones and provides traffic participant trajectories, traffic light status, and high-definition maps.

14 papers · 0 benchmarks · Tables

VNAT (VPN/NonVPN Network Application Traffic Dataset)

This dataset is a collection of labelled PCAP files, both encrypted and unencrypted, across 10 applications, as well as a pandas dataframe in HDF5 format containing detailed metadata summarizing the connections from those files. It was created to assist the development of machine learning tools that would allow operators to see the traffic categories of both encrypted and unencrypted traffic flows. In particular, features of the network packet traffic timing and size information (both inside of and outside of the VPN) can be leveraged to predict the application category that generated the traffic.

5 papers · 0 benchmarks · Tables, Time series

M5Product

The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.

4 papers · 0 benchmarks · Audio, Images, Tables, Texts, Videos

eICU-CRD (eICU Collaborative Research Database)

The eICU Collaborative Research Database is a large multi-center critical care database made available by Philips Healthcare in partnership with the MIT Laboratory for Computational Physiology.

4 papers · 0 benchmarks · Medical, Tables, Tabular, Time series

SKAB (Skoltech Anomaly Benchmark)

SKAB is designed for evaluating algorithms for anomaly detection. The benchmark currently includes 30+ datasets plus Python modules for algorithms’ evaluation. Each dataset represents a multivariate time series collected from the sensors installed on the testbed. All instances are labeled for evaluating the results of solving outlier detection and changepoint detection problems.

3 papers · 4 benchmarks · Tables, Time series

MMCode

MMCode is a multi-modal code generation dataset designed to evaluate the problem-solving skills of code language models in visually rich contexts (i.e. images). It contains 3,548 questions paired with 6,620 images, derived from real-world programming challenges across 10 code competition websites, with Python solutions and tests provided. The dataset emphasizes the extreme demand for reasoning abilities, the interwoven nature of textual and visual contents, and the occurrence of questions containing multiple images.

3 papers · 0 benchmarks · Images, Tables, Texts

ArxivPapers

The ArxivPapers dataset is an unlabelled collection of over 104K papers related to machine learning and published on arXiv.org between 2007 and 2020. The dataset includes around 94K papers (those for which LaTeX source code is available) in a structured form in which each paper is split into a title, abstract, sections, paragraphs and references. Additionally, the dataset contains over 277K tables extracted from the LaTeX papers.

2 papers · 0 benchmarks · Tables, Texts

SegmentedTables

The SegmentedTables dataset is a collection of almost 2,000 tables extracted from 352 machine learning papers. Each table consists of rich text content, layout and caption. Tables are annotated with types (leaderboard, ablation, irrelevant) and cells of relevant tables are annotated with semantic roles (such as “paper model”, “competing model”, “dataset”, “metric”).

2 papers · 0 benchmarks · Tables, Texts

Multivariate-Mobility-Paris

The original dataset, provided by Orange telecom in France, contains anonymized and aggregated human mobility data. The Multivariate-Mobility-Paris dataset comprises information from 2020-08-24 to 2020-11-04 (72 days during the COVID-19 pandemic), with a time granularity of 30 minutes and a spatial granularity of 6 coarse regions in Paris, France. In other words, it is a multivariate time series dataset.

2 papers · 0 benchmarks · Tables, Tabular
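
For the Multivariate-Mobility-Paris entry above, the stated dates and granularities imply a rough size for the time series. The sketch below works that out; it assumes the 72-day span is the plain difference between the two dates and that the data are laid out as one row per time step and one column per region, which is a hypothetical layout rather than the dataset's documented format.

```python
from datetime import date

# Values taken from the dataset description.
start = date(2020, 8, 24)
end = date(2020, 11, 4)
step_minutes = 30
n_regions = 6

# Assumption: the span is counted as the difference between the two dates.
n_days = (end - start).days
steps_per_day = 24 * 60 // step_minutes
n_steps = n_days * steps_per_day

# Hypothetical layout: rows are time steps, columns are regions.
expected_shape = (n_steps, n_regions)
print(n_days, steps_per_day, expected_shape)
```

If the final day is counted in full, the number of time steps would be one day's worth larger, so the shape should be checked against the actual files.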

GIRT-Data (GitHub Issue Report Template Dataset)

GIRT-Data is the first and largest dataset of issue report templates (IRTs) in both YAML and Markdown format. This dataset and its corresponding open-source crawler tool are intended to support research in this area and to encourage more developers to use IRTs in their repositories. The stable version of the dataset contains 1,084,300 repositories, and 50,032 of them support IRTs.

2 papers · 0 benchmarks · Tables, Tabular, Texts
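
For the GIRT-Data entry above, the two repository counts give the share of repositories that support IRTs. A minimal sketch of that arithmetic, using only the figures quoted in the description (the variable names are illustrative):

```python
# Counts quoted in the GIRT-Data description.
total_repos = 1_084_300
repos_with_irts = 50_032

# Fraction of crawled repositories that support issue report templates.
share = repos_with_irts / total_repos
print(f"{share:.2%}")
```

The result shows that IRT adoption is a small minority of the crawled repositories, which is worth keeping in mind when sampling from the dataset.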

SheetCopilot

The SheetCopilot dataset contains 28 evaluation workbooks and 221 spreadsheet manipulation tasks that are applied to these workbooks. These tasks involve diverse atomic actions related to six task categories (i.e. Entry and manipulation, Formatting, Management, Charts, Pivot Table, and Formula).

2 papers · 1 benchmark · Tables

Large-scale Ridesharing DARP Instances Based on Real Travel Demand

This dataset presents a set of large-scale ridesharing Dial-a-Ride Problem (DARP) instances. The instances were created as a standardized set of ridesharing DARP problems for the purpose of benchmarking and comparing different solution methods.

2 papers · 0 benchmarks · Graphs, Tables, Tabular, Time series

Reddit Ideological and Extreme Bias Dataset

Articles originating from subreddits with explicitly stated ideologies are categorized into three groups: 72,488 articles in the Liberal class, 79,573 articles in the Conservative class, and 225,083 articles in the Restricted class.

2 papers · 2 benchmarks · Tables, Tabular, Texts
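
The class counts in the Reddit Ideological and Extreme Bias Dataset entry above are noticeably imbalanced. The sketch below computes each class's share and a set of inverse-frequency class weights; the weighting formula is a common general-purpose heuristic assumed here for illustration, not something the dataset authors prescribe.

```python
# Article counts per class, as quoted in the dataset description.
counts = {"Liberal": 72_488, "Conservative": 79_573, "Restricted": 225_083}

total = sum(counts.values())

# Share of the corpus that each class accounts for.
shares = {label: n / total for label, n in counts.items()}

# Assumed heuristic: weight = total / (number of classes * class count),
# so rarer classes receive proportionally larger weights.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

for label in counts:
    print(f"{label}: share={shares[label]:.3f}, weight={weights[label]:.3f}")
```

Weights like these could be passed to a loss function or a sampler when training a classifier, so that the Restricted class does not dominate.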

Database of axial impact simulations of the crash box (Database for crashworthiness optimisation)

This repository contains a database of FEM simulations of axial impacts on various configurations of square crash boxes. The database records the effect of structural and crash-test parameters on various crashworthiness objectives.

2 papers · 0 benchmarks · Tables, Tabular

PEM Fuel Cell Dataset (Proton Exchange Membrane (PEM) Fuel Cell Dataset)

This dataset covers standard tests of a Nafion 112 membrane and MEA activation tests of a PEM fuel cell under various operating conditions. It includes two general electrochemical analysis methods: polarization and impedance curves. The effects of different H2/O2 gas pressures, different voltages, and various humidity conditions are considered in several steps. The data show the behavior of the PEM fuel cell during tests under distinct operating conditions, during the activation procedure, and under different operating conditions before and after activation. In the polarization curves, voltage and power density change as a function of H2/O2 flow and relative humidity. The resistances of the equivalent circuit used for the fuel cell can be calculated from the impedance data. The data thus capture the experimental response of the cell and are useful for in-depth analysis, simulation, and material performance investigation in PEM fuel cell research.

1 paper · 0 benchmarks · Tables, Tabular

DBFC Dataset (Single Direct Borohydride Fuel Cell Dataset)

This dataset includes impedance and polarization tests of a Direct Borohydride Fuel Cell (DBFC) with Pd/C, Pt/C, and Pd-decorated Ni–Co/rGO anode catalysts. The data cover different concentrations of sodium borohydride (SBH), applied voltages, and anode catalyst loadings, together with the experimental details of the electrochemical analysis. Voltage, power density, and resistance of the DBFC change as a function of SBH weight percent, applied voltage, and anode catalyst loading; these are evaluated from polarization and impedance curves using an appropriate equivalent circuit of the fuel cell. The data support interpretation of changes in the cell's electrochemical behavior and can be useful for simulation, power source investigation, and in-depth analysis in DBFC research.

1 paper · 0 benchmarks · Tables, Tabular

Nelson-Plosser (Nelson-Plosser US Macroeconomic Time Series)

A US macroeconomic dataset containing 14 time series of monthly observations. The series have various lengths, but all end in 1988. The variables are: consumer price index, industrial production, nominal GNP, velocity, employment, interest rate, nominal wages, GNP deflator, money stock, real GNP, stock prices (S&P500), GNP per capita, real wages, and unemployment.

1 paper · 0 benchmarks · Tables, Time series
