Datasets

271 machine learning datasets

GenoTEX (An LLM Agent Benchmark for Automated Gene Expression Data Analysis)

GenoTEX (Genomics Data Automatic Exploration Benchmark) is a benchmark dataset for the automated analysis of gene expression data to identify disease-associated genes while considering the influence of other biological factors. It provides analysis code and results for solving a wide range of gene-trait association (GTA) analysis problems, encompassing dataset selection, preprocessing, and statistical analysis, in a pipeline that follows computational genomics standards. The benchmark includes expert-curated annotations from bioinformaticians to ensure accuracy and reliability.

1 paper · 0 benchmarks · Tabular, Texts

BASIR (Budget-Assisted Sectoral Impact Ranking)

Government fiscal policies, particularly annual union budgets, exert significant influence on financial markets. However, real-time analysis of budgetary impacts on sector-specific equity performance remains methodologically challenging and largely unexplored. This study proposes a framework to systematically identify and rank sectors poised to benefit from India's Union Budget announcements. The framework addresses two core tasks: (1) multi-label classification of excerpts from budget transcripts into 81 predefined economic sectors, and (2) performance ranking of these sectors. Leveraging a comprehensive corpus of Indian Union Budget transcripts from 1947 to 2025, we introduce BASIR (Budget-Assisted Sectoral Impact Ranking), an annotated dataset mapping excerpts from budgetary transcripts to sectoral impacts.

1 paper · 0 benchmarks · Tabular, Texts

MiMIC (Multi-Modal Indian Earnings Calls Dataset)

Predicting stock market prices following corporate earnings calls remains a significant challenge for investors and researchers alike, requiring innovative approaches that can process diverse information sources. This study investigates the impact of corporate earnings calls on stock prices by introducing a multi-modal predictive model. We leverage textual data from earnings call transcripts, along with images and tables from accompanying presentations, to forecast stock price movements on the trading day immediately following these calls. To facilitate this research, we developed the MiMIC (Multi-Modal Indian Earnings Calls) dataset, encompassing companies representing the Nifty 50, Nifty MidCap 50, and Nifty Small 50 indices. The dataset includes earnings call transcripts, presentations, fundamentals, technical indicators, and subsequent stock prices. We present a multimodal analytical framework that integrates quantitative variables with predictive signals derived from textual and v

1 paper · 0 benchmarks · Images, Tabular, Texts

Indic IPO Success

We present two multi-modal datasets, one for Main Board IPOs and the other for Small and Medium Enterprises (SME) IPOs. They consist of various features relating to the company going public, along with other macroeconomic factors. The objective is to estimate the direction and underpricing with respect to the opening, high, and closing prices of stocks on the IPO listing day.

1 paper · 0 benchmarks · Images, Tabular, Texts

PreRAID (Prescreening Rheumatoid Arthritis Information Database)

PreRAID is a structured dataset designed to evaluate the diagnostic capabilities of Large Language Models (LLMs) in Rheumatoid Arthritis (RA) diagnosis. This dataset provides real-world patient data, offering insights into RA prediction and reasoning accuracy.

1 paper · 0 benchmarks · Medical, Tabular, Texts

Federal and State-Level Election Results since 1955

DOI: https://doi.org/10.7910/DVN/O4CRXK

1 paper · 0 benchmarks · Tabular

PsOCR (Pashto OCR Dataset)

PsOCR is a large-scale synthetic dataset for Optical Character Recognition in the low-resource Pashto language.

1 paper · 0 benchmarks · Images, Tabular, Texts

CSTS (Correlation Structures in Time Series)

CSTS is a comprehensive synthetic benchmarking dataset designed specifically for evaluating correlation structure discovery in time series data. The dataset systematically models known correlation structures between time series variables and enables rigorous assessment of clustering algorithms and validation methods.

1 paper · 0 benchmarks · Tabular, Time series
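
To make "known correlation structures" in the CSTS entry concrete, here is a minimal Python sketch of how two series with a chosen correlation could be generated and then checked. It is an illustration only, not the CSTS generation procedure; the function names `correlated_pair` and `pearson`, the Gaussian noise, and the target value 0.8 are all assumptions made for the example.

```python
import math
import random


def correlated_pair(n, rho, seed=0):
    """Return two length-n Gaussian series whose population correlation is rho.

    Hypothetical helper for illustration; not part of the CSTS dataset.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        z = rng.gauss(0.0, 1.0)  # independent noise
        xs.append(x)
        # Mixing x with independent noise yields correlation rho by construction.
        ys.append(rho * x + math.sqrt(1.0 - rho * rho) * z)
    return xs, ys


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy)


xs, ys = correlated_pair(5000, 0.8, seed=42)
print(round(pearson(xs, ys), 2))  # should land close to the target of 0.8
```

Because the target correlation is fixed in advance, a discovery or clustering method can be scored against a ground truth, which is the kind of assessment the entry describes.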

TimeGraph (Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery)

TimeGraph is a comprehensive suite of synthetic datasets designed to benchmark causal discovery algorithms on time-series data. The dataset captures real-world complexities by incorporating temporal dynamics such as trends, seasonality, and nonstationarity, as well as sampling challenges including irregular time intervals and structured missingness. It features diverse noise types, including Gaussian, heavy-tailed, and heteroskedastic variations, and supports scenarios with latent confounding to enable evaluation under partially observed systems. The underlying causal structures span both linear and nonlinear relationships, including polynomial and trigonometric forms.

1 paper · 0 benchmarks · Tabular, Time series
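
As a rough illustration of the ingredients the TimeGraph entry lists (trend, seasonality, heteroskedastic noise, a nonlinear lagged cause, and structured missingness), the following Python sketch simulates a two-variable system. It is not TimeGraph's generator; the function `simulate`, its parameters, and every numeric constant are hypothetical choices made for this example.

```python
import math
import random


def simulate(n=200, lag=2, coef=0.9, missing_block=(50, 60), seed=0):
    """Simulate x -> y with a lagged trigonometric effect (illustrative only)."""
    rng = random.Random(seed)
    x, y = [], []
    for t in range(n):
        trend = 0.01 * t                         # linear trend
        season = math.sin(2 * math.pi * t / 12)  # period-12 seasonality
        # Heteroskedastic noise: the standard deviation grows with t.
        x.append(trend + season + rng.gauss(0.0, 0.1 + 0.002 * t))
        # y depends nonlinearly on x at time t - lag (zero before enough history).
        cause = x[t - lag] if t >= lag else 0.0
        y.append(coef * math.sin(cause) + rng.gauss(0.0, 0.1))
    # Structured missingness: one contiguous block of y is unobserved.
    lo, hi = missing_block
    y_obs = [None if lo <= t < hi else v for t, v in enumerate(y)]
    return x, y_obs


x, y_obs = simulate()
print(len(x), sum(v is None for v in y_obs))
```

Since the true lag and functional form are set by the caller, the output of a causal discovery algorithm run on `x` and `y_obs` can be compared against them.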

Upper body thermal images and associated clinical data from a pilot cohort study of COVID-19

The prospective upper body thermal images SARS-CoV-2 association study was designed to test the hypothesis that thermal videos may aid in the early diagnosis of COVID-19. The study recorded a set of measurements from 252 participants regarding PCR results, demographics, vital signs, participant activities, medications, respiratory symptoms, and a thermal video session in which the volunteers performed a simple breath-hold in four different positions. The acquired data may be used to test clinical association questions regarding temperature patterns, demographics, and vital signs. Furthermore, it could be valuable for developing new computer algorithms that extract useful scientific information from thermal videos.

1 paper · 0 benchmarks · Images, Tabular

Survey of Active Learning Hyperparameters

Annotating data is a time-consuming and costly task, but it is inherently required for supervised machine learning. Active Learning (AL) is an established method that minimizes human labeling effort by iteratively selecting the most informative unlabeled samples for expert annotation, thereby improving the overall classification performance. Even though AL has been known for decades [1], it is still rarely used in real-world applications. As two web surveys of the NLP community about AL indicate [2], [3], two main reasons continue to hold practitioners back from using AL: first, the complexity of setting AL up, and second, a lack of trust in its effectiveness. We hypothesize that both reasons share the same culprit: the large hyperparameter space of AL. This mostly unexplored hyperparameter space often leads to misleading and irreproducible AL experiment results. In this study, we first compiled a large hyperparameter grid of over 4.6 million hyperparameter combina

1 paper · 0 benchmarks · Tabular

Concrete Compressive Strength

Concrete is the most important material in civil engineering. The concrete compressive strength is a highly nonlinear function of age and ingredients.

1 paper · 2 benchmarks · Tabular

Medical Cost Personal Dataset

This dataset contains demographic and personal health information for individuals, along with the corresponding medical insurance charges billed to them. It is commonly used to build predictive models for insurance costs and to explore how factors such as age, BMI, smoking status, and region relate to medical expenses.

1 paper · 2 benchmarks · Tabular

ai4st SLR (Research on AI for Software Testing Research, 2020-2025)

To check the validity of the ai4st ontology, an adapted, lightweight systematic literature review (SLR) was conducted to analyse related research. This SLR protocol was followed:

1 paper · 0 benchmarks · Tabular

Hitchhiking Rides Dataset

This is the dataset described in "Hitchhiking Rides Dataset: Two decades of crowd-sourced records on stochastic traveling" (https://arxiv.org/abs/2506.21946).

1 paper · 0 benchmarks · Tabular, Texts, Time series

TeleSim (A Network-Aware Testbed and Benchmark Dataset for Telerobotic Applications)

TeleSim is a network-aware hardware-in-the-loop dataset designed to evaluate the performance of telerobotic systems under varying network conditions. It includes 300 fine-manipulation trials using a 6-DoF robotic arm and simulated networks in OMNeT++. Each trial captures:

1 paper · 0 benchmarks · Tabular

7-digit Product-level Supply-Use and Input-Output Tables Using ASI Data

This paper constructs 7-digit product-level Supply-Use Tables (SUTs) and symmetric Input-Output Tables (IOTs) for the Indian economy using microdata from the Annual Survey of Industries (ASI) for the period 2016-2021. We outline the methodology for generating input flows and reconciling registered and unregistered sector data via NPCMS-NIC concordance. The transition from SUTs to IOTs is explained using the Industry Technology Assumption. We apply this framework to analyse economic impact, specifically the Domestic Value Added (DVA) and employment influenced by production and exports. A case study of India's mobile phone sector reveals significant output growth, import substitution, an increase in exports, a shift in DVA/FVA shares, and notable employment growth, with a leaning towards contractual labour and increased female participation. These tables are valuable for analysing sectoral interdependencies and industrial policy effectiveness in India.

1 paper · 0 benchmarks · Graphs, Images, Tabular, Texts, Time series

The Reddit COVID Dataset

The Reddit COVID Dataset is a dataset of 4.51M Reddit posts and 17.8M comments - all mentions of COVID until 2021-10-25 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.

0 papers · 0 benchmarks · Tabular, Texts

SMCOVID19-CT (Contact Tracing Data (from Italian SM-COVID-19 App))

We present a real-data analysis of a contact tracing (CT) experiment that was conducted in Italy for 8 months and involved more than 100,000 CT app users.

0 papers · 0 benchmarks · Tabular, Texts
