271 machine learning datasets
Natural Vertical Partitioned CVR Dataset for Vertical Federated Learning
Our SRSD (Feynman) datasets are designed to evaluate the performance of Symbolic Regression for Scientific Discovery (SRSD). We carefully reviewed the properties of each formula and its variables in the Feynman Symbolic Regression Database to design reasonably realistic sampling ranges of values, so that our SRSD datasets can be used to evaluate the potential of SRSD methods, e.g., whether or not an SR method can (re)discover physical laws from such datasets.
Main Dataset city_pollution_data.csv
Teaching assistants (TAs) are widely employed in computer science courses to handle high enrollment while still offering students individual tutoring and detailed assessments. This dataset is the result of a multi-institutional, multi-national study of the challenges that TAs in computer science face. 180 reflective essays written by TAs at three institutions across Europe were analyzed and coded. The thematic analysis yielded five main challenges: becoming a professional TA, student-focused challenges, assessment, defining and using best practice, and threats to best practice. All five challenges were identified in the essays from all three institutions, indicating that the identified challenges are not particularly context-dependent. (2021-04-11)
This dataset was introduced by Jones Granatyr in his book (https://iaexpert.academy/2016/10/25/review-de-livro-programando-a-inteligencia-coletiva), for which he scraped flight schedules.
To explore the nascent area of sustainable venture capital, a review of related research was conducted and social entrepreneurs and investors were interviewed to construct a questionnaire assessing the interests and intentions of current and future ecosystem participants. Analysis of 114 responses, received via several sampling methods, revealed statistically significant relationships between investing preferences and gender, generation, sophistication, and other variables, down to the level of individual UN Sustainable Development Goals (SDGs).
This dataset contains a two-column CSV file, where the first column ("Valid_citing_DOI") contains the DOI of a citing entity retrieved in Crossref, while the second column ("Invalid_cited_DOI") contains the invalid DOI of a cited entity identified by looking at the field "reference" in the JSON document returned by querying the Crossref API with the citing DOI.
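Given the two-column layout described above, the file can be read with Python's standard `csv` module. A minimal sketch (the in-memory sample below is hypothetical and only mirrors the documented column names; the actual file name is not specified):

```python
import csv
import io

# Hypothetical sample mirroring the dataset's documented two-column schema.
sample = io.StringIO(
    '"Valid_citing_DOI","Invalid_cited_DOI"\n'
    '"10.1000/example.123","10.9999/not-a-real-doi"\n'
)

# DictReader exposes each row keyed by the header names from the first line.
for row in csv.DictReader(sample):
    print(row["Valid_citing_DOI"], "->", row["Invalid_cited_DOI"])
```

To process the real file, replace the `StringIO` sample with `open(path, newline="")`.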
Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to gain an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, we collected data from various sources, processed it, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming to be useful for a wide range
The Reddit Climate Change Dataset is a dataset of 620K Reddit posts and 4.6M comments - all mentions of the terms "climate" and "change" until 2022-09-01 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.
The data used in:
- "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al., submitted)
- "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al. 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.
The SNS data (Valente et al., 2013) come from a four-wave survey conducted in Los Angeles County, United States, featuring a sample of 1,795 high-school students in grades 10 to 12, a majority of whom self-identified as Hispanic. The collected information includes socio-economic status, demographics, social networks, and substance use (consumption of alcohol, tobacco, and marijuana).
A maintained database tracks ICLR submissions and reviews, augmented with author profiles and higher-level textual features.
The dataset contains the standard contexts of all atomic lattices in the Concept Explorer format.
Outliers, or anomalies, are instances that do not conform to the norm of a dataset. Outlier detection is an important data mining problem that has been researched within diverse research areas and application domains, such as intrusion detection, fraud detection, unusual event detection, and disease condition detection.
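The notion of "not conforming to the norm" can be illustrated with one of the simplest detection rules, a z-score threshold. This is only a sketch of the general idea, not a method used by any particular dataset listed here; the data and threshold below are invented for illustration:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]  # 95 does not conform to the norm
print(zscore_outliers(data, threshold=2.0))  # → [95]
```

Real-world detectors (density-based, isolation-based, etc.) refine this same idea of scoring how far an instance deviates from the bulk of the data.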
The dataset contains 30 million cryptocurrency-related tweets from 10.10.2020 to 3.3.2021. See https://github.com/meakbiyik/ask-who-not-what for more details.
The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposures of the non-institutionalized US population. Such data have considerable potential for understanding how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as the prevalence of disease. However, these data must first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies requires systematic cross-examination and considerable effort, but is necessary for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets with accompanying code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-20
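The harmonization step described above can be sketched in miniature: the same variable may appear under different names across survey cycles, so files are renamed to a shared schema before being concatenated. The column names below are hypothetical stand-ins, not actual NHANES variable codes:

```python
import pandas as pd

# Two toy "cycles" recording the same measurement under different column names.
cycle_a = pd.DataFrame({"SEQN": [1, 2], "bmi": [24.1, 27.3]})
cycle_b = pd.DataFrame({"SEQN": [3, 4], "BMXBMI": [22.0, 30.5]})

# Rename to a shared schema, then stack the cycles into one unified table.
harmonized = pd.concat(
    [cycle_a.rename(columns={"bmi": "BMXBMI"}), cycle_b],
    ignore_index=True,
)
print(harmonized)
```

Scaled up to hundreds of files, the same rename-then-concatenate pattern (plus unit and coding checks) is what makes cross-cycle analyses reproducible.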