3,148 machine learning datasets
A blackout poetry dataset constructed from publicly available short stories and large poems. The dataset consists of two variants: 8K and 16K examples of passages along with a poem generated from the passage and the indices of the words in the passage from which words in the poem have been selected. The dataset also contains perplexity scores for each of the poems indicating the language quality of the poems.
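The relationship between a passage, its word indices, and the resulting poem can be sketched as follows. This is a minimal illustration, not the dataset's actual loader: the field names `passage` and `indices`, the whitespace tokenization, and the toy example are all assumptions made for this sketch.

```python
def reconstruct_poem(passage: str, indices: list[int]) -> str:
    """Rebuild a blackout poem by keeping only the passage words at `indices`.

    Assumes whitespace tokenization and 0-based indices (both hypothetical).
    """
    words = passage.split()
    if any(i < 0 or i >= len(words) for i in indices):
        raise IndexError("index outside the passage")
    return " ".join(words[i] for i in indices)


def black_out(passage: str, indices: list[int], mask: str = "█") -> str:
    """Render the passage with every unselected word masked out."""
    keep = set(indices)
    return " ".join(
        word if i in keep else mask * len(word)
        for i, word in enumerate(passage.split())
    )


# Made-up example record; not taken from the dataset.
example = {
    "passage": "the quiet river carries old light toward the sea",
    "indices": [1, 4, 5, 8],
}
poem = reconstruct_poem(example["passage"], example["indices"])
print(poem)  # quiet old light sea
print(black_out(example["passage"], example["indices"]))
```

Storing indices rather than only the poem text keeps the link to the source passage explicit, so the blacked-out rendering can always be regenerated.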
ERD (Educational Resource Discovery) is a corpus of 39,728 manually labeled web resources and 659 queries from NLP, Computer Vision (CV), and Statistics (STATS) for educational resource discovery.
PhysNLU is a collection of 4 core datasets for tasks related to sentence classification, ordering, and coherence of physics explanations. Each dataset comprises explanations extracted from Wikipedia, including derivations and mathematical language.
MetaEval is a benchmark collecting 101 NLP tasks that can be used for future probing and transfer learning.
This is a dataset for generating Bengali captions from images.
PaSa is a dataset for training machine learning algorithms to automate the highlighting of patent paragraphs with semantic annotations. It consists of 150k samples obtained by traversing USPTO patents over a decade.
BanglaEmotion is a manually annotated Bangla emotion corpus that captures the diversity of fine-grained emotion expressions in social-media text. The labels considered are Sadness, Happiness, Disgust, Surprise, Fear and Anger, which are, according to Paul Ekman (1999), the six basic emotion categories. For this task, a large amount of raw text was collected from user comments on two different Facebook groups (Ekattor TV and Airport Magistrates) and from the public posts of a popular blogger and activist, Dr. Imran H Sarker. These comments are mostly reactions to ongoing socio-political issues and to the economic success and failure of Bangladesh. A total of 32,923 comments were scraped from the three sources. Of these, 6,314 comments were annotated into the six categories. The distribution of the annotated corpus is as follows:
This repository contains the detailed results used for the simulated evaluation of an STC-based multi-robot Coverage Path Planning (mCPP) algorithm optimized for real-life use. For this evaluation, 20 ROIs of different shapes and areas, some of which include obstacles, were introduced in "Apostolidis, S. D., Kapoutsis, P. C., Kapoutsis, A. C., & Kosmatopoulos, E. B. (2022). Cooperative multi-UAV coverage mission planning platform for remote sensing applications. Autonomous Robots, 1-28." These ROIs, along with some benchmark results, can be found here: https://github.com/savvas-ap/cpp-simulated-evaluations
FIG-Loneliness (FIne-Grained Loneliness) is a dataset collected from Reddit posts in two young adult-focused forums and two loneliness-related forums whose users span a diverse range of ages. It provides annotations by trained human annotators for binary and fine-grained loneliness classification of the posts.
HS-BAN is a binary-class hate speech (HS) dataset in the Bangla language consisting of more than 50,000 labeled comments, of which 40.17% are hate speech and the rest are non-hate speech.
About the study: This study explored the landscape of interpersonal conflicts during code review in the following areas:
- what these conflicts look like
- what role they play in software development
- what their consequences are
- what factors play a role in their appearance and severity
- what strategies can be used to prevent and manage conflicts
The MedVidCL dataset contains a collection of 6,617 videos annotated into ‘medical instructional’, ‘medical non-instructional’ and ‘non-medical’ classes. A two-step approach is used to construct the MedVidCL dataset. In the first step, the videos annotated by health informatics experts are used to train a machine learning model that assigns a given video to one of the three aforementioned classes. In the second step, only the high-confidence videos are used, and health informatics experts assess the model’s predicted video category and update the category wherever needed.
A prevalent use case of topic models is that of topic discovery. However, most of the topic model evaluation methods rely on abstract metrics such as perplexity or topic coherence. The topic coverage approach is to measure the models' performance by matching model-generated topics to topics discovered by humans. This way, the models are evaluated in the context of their use, by essentially simulating topic modeling in a fixed setting defined by a text collection and a set of reference topics.
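The idea of matching model-generated topics to reference topics can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the matching procedure used by the dataset's authors: topics are represented as lists of top words, the matcher is a plain Jaccard overlap, and the `threshold` value is an arbitrary choice for the sketch.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def topic_coverage(
    model_topics: list[list[str]],
    reference_topics: list[list[str]],
    threshold: float = 0.5,  # assumed cut-off for calling two topics a match
) -> float:
    """Fraction of reference topics matched by at least one model topic."""
    if not reference_topics:
        return 0.0
    covered = sum(
        1
        for ref in reference_topics
        if any(jaccard(set(ref), set(topic)) >= threshold for topic in model_topics)
    )
    return covered / len(reference_topics)


# Made-up toy topics, each given as a list of top words.
reference = [["tax", "budget", "economy"], ["match", "goal", "team"]]
model = [["tax", "budget", "economy", "vote"], ["music", "song", "band"]]
print(topic_coverage(model, reference))  # 0.5
```

Here the first reference topic is covered (overlap 3/4) and the second is not, so the model covers half of the reference set. A real evaluation would likely need a more robust matcher than word overlap.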
Currently, an essential issue in speech synthesis is addressing the variability of human speech. One of the main sources of this diversity is the emotional state of the speaker. Most of the recent work in this area has focused on the prosodic aspects of speech and on rule-based formant synthesis experiments. Even when adopting an improved voice source, we cannot achieve a smiling happy voice or the menacing quality of cold anger. For this reason, we have performed two experiments aimed at developing a concatenative emotional synthesiser, a synthesiser that can copy the quality of an emotional voice without an explicit mathematical model.
The Athens Emotional States Inventory (AESI) addresses the development of ecologically valid procedures for collecting reliable and unbiased emotional data, with a view to computer interfaces with social and affective intelligence targeting patients with mental disorders. It proposes the design, recording and validation of an audiovisual database for five emotional states: anger, fear, joy, sadness and neutral. The items of the AESI consist of sentences, each having content indicative of the corresponding emotion. Emotional content was assessed through a survey of 40 young participants with a questionnaire following the Latin square design. The emotional sentences that were correctly identified by 85% of the participants were recorded in a soundproof room with microphones and cameras. A preliminary validation of AESI is performed through automatic emotion recognition experiments from speech. The resulting database contains 696 recorded utterances in Greek.
ArgSciChat is an argumentative dialogue dataset. It consists of 498 messages collected from 41 dialogues on 20 scientific papers. It can be used to evaluate conversational agents and further encourage research on argumentative scientific agents.
Here I provide the datasets I used for this analysis. They include the tweets I streamed using the Tweepy package for Python during the peak of the wildfire season in late summer/early fall of 2020.
This dataset is used for user identity linkage across two online social networks in Chinese. It covers two popular Chinese social platforms: Sina Weibo (https://weibo.com) and Douban (https://www.douban.com).
Optical images of printed circuit boards as well as detailed annotations of any text, logos, and surface-mount devices (SMDs). There are several hundred samples spanning a wide variety of manufacturing locations, sizes, node technology, applications, and more.
A Natural Language Resource for Learning to Recognize Misinformation about the COVID-19 and HPV Vaccines.