271 machine learning datasets
Natural Vertical Partitioned CVR Dataset for Vertical Federated Learning
Our SRSD (Feynman) datasets are designed to evaluate the performance of Symbolic Regression for Scientific Discovery (SRSD). We carefully reviewed the properties of each formula and its variables in the Feynman Symbolic Regression Database to design reasonably realistic sampling ranges of values, so that our SRSD datasets can be used to evaluate the potential of SRSD methods, e.g., whether or not an SR method can (re)discover physical laws from such datasets.
Main Dataset city_pollution_data.csv
Teaching assistants (TAs) are widely employed in computer science courses to handle high enrollment while still offering students individual tutoring and detailed assessments. This dataset is the result of a multi-institutional, multi-national study of the challenges that TAs in computer science face. 180 reflective essays written by TAs at three institutions across Europe were analyzed and coded. The thematic analysis yielded five main challenges: becoming a professional TA, student-focused challenges, assessment, defining and using best practice, and threats to best practice. All five challenges were identified in the essays from all three institutions, indicating that the identified challenges are not particularly context-dependent. (2021-04-11)
This dataset was introduced by Jones Granatyr in his book (https://iaexpert.academy/2016/10/25/review-de-livro-programando-a-inteligencia-coletiva), for which he scraped flight schedules.
To explore the nascent area of sustainable venture capital, a review of related research was conducted and social entrepreneurs and investors were interviewed to construct a questionnaire assessing the interests and intentions of current and future ecosystem participants. Analysis of 114 responses, received via several sampling methods, revealed statistically significant relationships between investing preferences and gender, generation, sophistication, and other variables, down to the level of individual UN Sustainable Development Goals (SDGs).
This dataset contains a two-column CSV file, where the first column ("Valid_citing_DOI") contains the DOI of a citing entity retrieved in Crossref, while the second column ("Invalid_cited_DOI") contains the invalid DOI of a cited entity identified by looking at the field "reference" in the JSON document returned by querying the Crossref API with the citing DOI.
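Given the two-column layout described above, the file can be read with Python's standard `csv` module. A minimal sketch (the in-memory sample below is hypothetical and only mirrors the documented column names; the actual file name is not specified):

```python
import csv
import io

# Hypothetical sample mirroring the dataset's documented two-column schema.
sample = io.StringIO(
    '"Valid_citing_DOI","Invalid_cited_DOI"\n'
    '"10.1000/example.123","10.9999/not-a-real-doi"\n'
)

# DictReader exposes each row keyed by the header names from the first line.
for row in csv.DictReader(sample):
    print(row["Valid_citing_DOI"], "->", row["Invalid_cited_DOI"])
```

To process the real file, replace the `StringIO` sample with `open(path, newline="")`.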
Wikipedia is the largest and most read online free encyclopedia currently existing. As such, Wikipedia offers a large amount of data on all its own contents and interactions around them, as well as different types of open data sources. This makes Wikipedia a unique data source that can be analyzed with quantitative data science techniques. However, the enormous amount of data makes it difficult to gain an overview, and many of the analytical possibilities that Wikipedia offers remain unknown. To reduce the complexity of identifying and collecting data on Wikipedia and to expand its analytical potential, we collected data from various sources, processed it, and generated a dedicated Wikipedia Knowledge Graph aimed at facilitating the analysis and contextualization of the activity and relations of Wikipedia pages, in this case limited to the English edition. We share this Knowledge Graph dataset openly, aiming to be useful for a wide range
The Reddit Climate Change Dataset is a dataset of 620K Reddit posts and 4.6M comments - all mentions of the terms "climate" and "change" until 2022-09-01 across the entire Reddit social network. Both were procured with SocialGrep's export feature and released as part of SocialGrep Reddit datasets. The posts are labeled with their subreddit, title, creation date, domain, selftext, and score. The comments are labeled with their subreddit, body, creation date, sentiment (calculated for you using a VADER pipeline), and score.
The data used in:
- "Radio Galaxy Zoo EMU: Towards a Semantic Radio Galaxy Morphology Taxonomy" (Bowles et al., submitted)
- "A New Task: Deriving Semantic Class Targets for the Physical Sciences" (Bowles et al. 2022: https://arxiv.org/abs/2210.14760), accepted at the Fifth Workshop on Machine Learning and the Physical Sciences, Neural Information Processing Systems 2022.
The SNS data (Valente et al., 2013) come from a four-wave survey conducted in Los Angeles County, United States, featuring a sample of 1,795 high-school students in grades 10 to 12, a majority of whom self-identified as Hispanic. The collected information includes socio-economic status, demographics, social networks, and substance use (consumption of alcohol, tobacco, and marijuana).
A maintained database tracks ICLR submissions and reviews, augmented with author profiles and higher-level textual features.
The dataset contains the standard contexts of all atomic lattices in the Concept Explorer format.
Outliers, or anomalies, are instances that do not conform to the norm of a dataset. Outlier detection is an important data mining problem that has been researched within diverse research areas and application domains, such as intrusion detection, fraud detection, unusual event detection, and disease condition detection.
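The notion of "not conforming to the norm" can be illustrated with one of the simplest detection rules, a z-score threshold. This is only a sketch of the general idea, not a method used by any particular dataset listed here; the data and threshold below are invented for illustration:

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 11, 9, 10, 12, 10, 11, 95]  # 95 does not conform to the norm
print(zscore_outliers(data, threshold=2.0))  # → [95]
```

Real-world detectors (density-based, isolation-based, etc.) refine this same idea of scoring how far an instance deviates from the bulk of the data.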
The dataset contains 30 million cryptocurrency-related tweets from 10.10.2020 to 3.3.2021. See https://github.com/meakbiyik/ask-who-not-what for more details.
The National Health and Nutrition Examination Survey (NHANES) provides data on the health and environmental exposures of the non-institutionalized US population. Such data have considerable potential for understanding how the environment and behaviors impact human health. These data are also currently leveraged to answer public health questions such as the prevalence of disease. However, these data must first be processed before new insights can be derived through large-scale analyses. NHANES data are stored across hundreds of files with multiple inconsistencies. Correcting such inconsistencies requires systematic cross-examination and considerable effort, but is necessary for accurately and reproducibly characterizing the associations between the exposome and diseases (e.g., cancer mortality outcomes). Thus, we developed a set of curated and unified datasets with accompanying code by merging 614 separate files and harmonizing unrestricted data across NHANES III (1988-1994) and Continuous (1999-20
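The harmonization step described above can be sketched in miniature: the same variable may appear under different names across survey cycles, so files are renamed to a shared schema before being concatenated. The column names below are hypothetical stand-ins, not actual NHANES variable codes:

```python
import pandas as pd

# Two toy "cycles" recording the same measurement under different column names.
cycle_a = pd.DataFrame({"SEQN": [1, 2], "bmi": [24.1, 27.3]})
cycle_b = pd.DataFrame({"SEQN": [3, 4], "BMXBMI": [22.0, 30.5]})

# Rename to a shared schema, then stack the cycles into one unified table.
harmonized = pd.concat(
    [cycle_a.rename(columns={"bmi": "BMXBMI"}), cycle_b],
    ignore_index=True,
)
print(harmonized)
```

Scaled up to hundreds of files, the same rename-then-concatenate pattern (plus unit and coding checks) is what makes cross-cycle analyses reproducible.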