Datasets

10 machine learning datasets

FinTabNet

This dataset contains complex tables from the annual reports of S&P 500 companies, with detailed table structure annotations to support table structure recognition and table data extraction. The dataset, from IBM Research, consists of 89,646 pages comprising 112,887 tables with annotated cell structure.

35 papers | 0 benchmarks | Financial, Images, Tabular

Earnings Call

The Earnings Call dataset consists of processed earnings conference call data (text and audio). It can be used to predict financial risk from both the textual and vocal features of conference calls.

9 papers | 0 benchmarks | Financial, Texts

EDT

The EDT dataset is a benchmark designed for corporate event detection and text-based stock prediction (trading strategy).

5 papers | 0 benchmarks | Financial, Texts

default of credit card clients Data Set

This research examines customers' default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods. From a risk management perspective, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result (credible or not credible clients). Because the real probability of default is unknown, this study presents the novel Sorting Smoothing Method to estimate it. With the real probability of default as the response variable (Y) and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by the artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and its regression coefficient (B) is close to one. Therefore, among the six data mining techniques, artificial neural networ

3 papers | 0 benchmarks | Financial
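
The regression check described in the entry above (fit Y = A + BX, then look for A near zero, B near one, and a high coefficient of determination) can be sketched in a few lines of Python. This is a minimal illustration using ordinary least squares, not the study's own code; the function name `fit_line` and the sample numbers are hypothetical and are not taken from the dataset.

```python
from typing import Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    b = sxy / sxx                      # regression coefficient (B)
    a = mean_y - b * mean_x            # regression intercept (A)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / syy     # coefficient of determination
    return a, b, r_squared


# Made-up numbers for illustration only; not taken from the dataset.
predicted = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80]       # X: model's predicted probability
estimated_real = [0.06, 0.09, 0.22, 0.38, 0.61, 0.79]  # Y: estimated real probability
a, b, r2 = fit_line(predicted, estimated_real)
print(f"A={a:.3f}, B={b:.3f}, R^2={r2:.3f}")
```

A model whose predicted probabilities track the estimated real probabilities closely yields A near 0, B near 1, and R^2 near 1, which is the criterion the description uses to compare the six methods.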

SupplyGraph (SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks)

Graph Neural Networks (GNNs) have gained traction across different domains such as transportation, bio-informatics, language processing, and computer vision. However, there is a noticeable absence of research on applying GNNs to supply chain networks. Supply chain networks are inherently graph-like in structure, making them prime candidates for applying GNN methodologies. This opens up a world of possibilities for optimizing, predicting, and solving even the most complex supply chain problems. A major setback in this approach lies in the absence of real-world benchmark datasets to facilitate the research and resolution of supply chain problems using GNNs. To address the issue, we present a real-world benchmark dataset for temporal tasks, obtained from one of the leading FMCG companies in Bangladesh, focusing on supply chain planning for production purposes. The dataset includes temporal data as node features to enable sales predictions, production planning, and the identification of fact

3 papers | 0 benchmarks | Financial, Graphs, Tabular, Time series

KPI-EDGAR

We introduce KPI-EDGAR, a novel dataset for Joint Named Entity Recognition and Relation Extraction building on financial reports uploaded to the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) system, where the main objective is to extract Key Performance Indicators (KPIs) from financial documents (the named entity recognition part) and link them to their numerical values (the relation extraction part).

2 papers | 2 benchmarks | Financial, Texts

Car_Price_Prediction (Second_Hand-Car_Price_Prediction)

This dataset includes the following fields: Company Name, Car Model, Car Type, Fuel Type, Transmission, Engine (cc), Mileage, Kms_driven, Buyers, Horsepower (kw), Year, Price (Lakhs).

1 paper | 1 benchmark | Financial, Texts

Financial Dynamic Knowledge Graph

FinDKG: The Global Financial Dynamic Knowledge Graph Dataset. FinDKG is an open-source dataset focused on creating a temporally-resolved Financial Dynamic Knowledge Graph. Designed to bridge the gap in industry-specific knowledge graphs, particularly in the financial sector, FinDKG provides a high-touch, temporally-aware representation of global economic and market dynamics. This repository includes comprehensive details about the dataset, methodology, and schema, aiming to facilitate academic research and actionable insights in global financial markets.

1 paper | 0 benchmarks | Financial, Graphs, Texts

TRADES-LOB

TRADES-LOB comprises simulated TRADES market data for Tesla and Intel, for 29/01 and 30/01. Specifically, the dataset is structured into four CSV files, each containing 50 columns. The initial six columns delineate the order features, followed by 40 columns that represent a snapshot of the LOB across the top 10 levels. The concluding four columns provide key financial metrics: mid-price, spread, order volume imbalance, and Volume-Weighted Average Price (VWAP), which can be useful for downstream financial tasks, such as stock price prediction. In total, the dataset is composed of 265,986 rows and 13,299,300 cells, which is similar in size to the benchmark FI-2010 dataset.

1 paper | 0 benchmarks | Financial, Time series
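
The TRADES-LOB column layout described above (6 order features, 40 limit order book columns, 4 metrics) can be sketched as a simple row splitter. This is a hedged illustration only: the helper names `split_row`, `mid_price`, and `spread` are hypothetical, the ordering of values inside the 40 LOB columns is not stated in the description and is not assumed here, and the mid-price and spread formulas are the standard textbook definitions rather than anything confirmed for this dataset.

```python
from typing import List, Sequence, Tuple

ORDER_FEATURE_COLS = 6   # initial columns describing the order
LOB_COLS = 40            # snapshot of the top 10 levels (40 / 10 = 4 values per level)
METRIC_COLS = 4          # mid-price, spread, order volume imbalance, VWAP
TOTAL_COLS = ORDER_FEATURE_COLS + LOB_COLS + METRIC_COLS


def split_row(row: Sequence[float]) -> Tuple[List[float], List[float], List[float]]:
    """Split one 50-value row into (order features, LOB snapshot, metrics)."""
    if len(row) != TOTAL_COLS:
        raise ValueError(f"expected {TOTAL_COLS} values, got {len(row)}")
    lob_end = ORDER_FEATURE_COLS + LOB_COLS
    return (
        list(row[:ORDER_FEATURE_COLS]),
        list(row[ORDER_FEATURE_COLS:lob_end]),
        list(row[lob_end:]),
    )


def mid_price(best_bid: float, best_ask: float) -> float:
    """Standard definition: midpoint of the best bid and best ask."""
    return (best_bid + best_ask) / 2.0


def spread(best_bid: float, best_ask: float) -> float:
    """Standard definition: best ask minus best bid."""
    return best_ask - best_bid


# Sanity check on the size quoted in the description.
rows = 265_986
print(rows * TOTAL_COLS)  # 13299300
```

Keeping the three column groups separate makes it straightforward to feed only the 40 LOB columns to a model while treating the four metric columns as targets or auxiliary features.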

MAEC (Multimodal Aligned Earnings Conference Call Dataset)

MAEC is a new, large-scale, multimodal earnings-call dataset of paired text and audio, based on S&P 1500 companies.

0 papers | 0 benchmarks | Audio, Financial