Teaching assistants (TAs) are widely employed in computer science courses to handle high enrollment while still offering students individual tutoring and detailed assessment. This dataset results from a multi-institutional, multi-national study of the challenges that TAs in computer science face. 180 reflective essays written by TAs at three institutions across Europe were analyzed and coded. The thematic analysis surfaced five main challenges: becoming a professional TA, student-focused challenges, assessment, defining and using best practice, and threats to best practice. All five challenges were identified in the essays from all three institutions, indicating that they are not strongly context-dependent. (2021-04-11)
The dataset contains 36,000 Bangla text samples labeled with Ekman's six basic emotions. The data was first introduced in the paper "Alternative non-BERT model choices for the textual classification in low-resource languages and environments". The dataset is balanced, with samples evenly distributed across the six classes.
Capriccio is a sentiment classification dataset of tweets that simulates data drift. It is created by slicing the Sentiment140 dataset (homepage, Hugging Face datasets) with a sliding window of 500,000 tweets, resulting in 38 slices. Each slice can thus represent the training/validation dataset of a sentiment classification model that is re-trained every day. Each slice has 425,000 tweets for training (file named %d_train.json) and 75,000 tweets for validation (file named %d_val.json).
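A minimal sketch of how one might enumerate and load the Capriccio slices, assuming the files sit in a local directory using the %d_train.json / %d_val.json naming described above. The directory layout, zero-based slice indexing, and plain-JSON content are assumptions, not part of the dataset specification.

```python
import json
import os

NUM_SLICES = 38  # one slice per simulated re-training day

def slice_paths(root, slice_id):
    """Return the (train, val) file paths for a given slice index."""
    if not 0 <= slice_id < NUM_SLICES:
        raise ValueError(f"slice_id must be in [0, {NUM_SLICES})")
    return (os.path.join(root, f"{slice_id}_train.json"),
            os.path.join(root, f"{slice_id}_val.json"))

def load_slice(root, slice_id):
    """Load one day's 425k-tweet training set and 75k-tweet validation set."""
    train_path, val_path = slice_paths(root, slice_id)
    with open(train_path) as f:
        train = json.load(f)
    with open(val_path) as f:
        val = json.load(f)
    return train, val
```

Iterating `load_slice` over consecutive slice indices then simulates the day-by-day data drift the dataset is designed to exercise.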
SentimentArcs’ reference corpus of novels consists of 25 narratives selected to form a diverse set of well-recognized novels that can serve as a benchmark for future studies. The composition of the corpus was constrained by copyright law as well as historical imbalances. Most works were obtained from the US and Australian Gutenberg Projects. The corpus is expected to grow in size and diversity over time.
This dataset contains samples of CTI (Cyber Threat Intelligence) data in natural language, labeled with the corresponding adversarial techniques from the MITRE ATT&CK framework.
The datasets of "Reinforcement Learning-enhanced Shared-account Cross-domain Sequential Recommendation" (TKDE 2022)
The datasets of "Time Interval-enhanced Graph Neural Network for Shared-account Cross-domain Sequential Recommendation" (TNNLs 2022)
We provide a custom synthetic bimodal dataset, called GeBiD, designed specifically for the comparison of the joint- and cross-generative capabilities of Multimodal Variational Autoencoders. It comprises RGB images of geometric primitives and textual descriptions. The dataset offers 5 levels of difficulty (based on the number of attributes) to find the minimal functioning scenario for each model. Moreover, its rigid structure enables automatic qualitative evaluation of the generated samples.
DistNLI is a synthesized benchmark that probes neural network models on distributivity in conjunctions for the NLI task in American English. It consists of minimal sentence pairs (premise and hypothesis) that differ in conjunction structure and in the distributivity-related linguistic phenomenon involved. DistNLI currently comprises 328 sentences (164 with distributive and 164 with ambiguous predicates), annotated by 4 proficient English speakers with backgrounds in NLP and linguistics. Given the specificity of the linguistic phenomenon involved and the dataset's size, DistNLI should only be used as an adversarial dataset for investigating the distributivity of verb predication.
This dataset is a collection of 5,348 links between bug-introducing and bug-fixing commit sets extracted from Mozilla's Bugzilla using bugbug. In this repository, you will find it in two formats:
WildQA is a video understanding dataset of videos recorded in outdoor settings. The dataset can be used to evaluate models for video question answering.
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
A fine-grained corpus for detecting, identifying, and correcting Chinese grammatical errors, collected mainly from multiple-choice questions in public school Chinese examinations, with multiple references. Online evaluation site for the test set: https://codalab.lisn.upsaclay.fr/competitions/8020
AesVQA is a dataset that contains 72,168 high-quality images and 324,756 pairs of aesthetic questions. The dataset addresses the task of aesthetic VQA and introduces subjectivity into VQA tasks.
ChiQA is a dataset designed for visual question answering tasks that measures not only relatedness but also answerability, which demands more fine-grained vision and language reasoning. It contains more than 40K questions and more than 200K question-image pairs. The questions are real-world, image-independent queries that are more varied and less biased.
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this multichannel naturalistic data resource. As an initial step to motivate a streamlined and collaborative effort from the speech and language community, UTDallas-CRSS is hosting a series of progressively complex tasks to promote advanced research on naturalistic “Big Data” corpora. This began with ISCA INTERSPEECH-2019: "The FEARLESS STEPS Challenge: Massive Naturalistic Audio (FS-#1)". The first edition of this challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the “First Step” towards extracting high-level information from such massive unlabeled corpora. As a natural progression following the successful inaugural challenge FS#1, the FEARLESS
CC-Riddle is a Chinese character riddle dataset covering the majority of common simplified Chinese characters, built by crawling riddles from the Web and generating brand-new ones. In the generation stage, the authors provide the Chinese phonetic alphabet, the decomposition, and an explanation of the solution character to the generation model, obtaining multiple riddle descriptions for each character. The generated riddles are then manually filtered, and the final dataset, CC-Riddle, is composed of both human-written riddles and filtered generated riddles.
Trailers12k is a movie trailer dataset comprising 12,000 titles associated with ten genres. It is distinguished from other datasets by a collection procedure aimed at providing a high-quality, publicly available dataset.
The Multi-domain Image Characteristic Dataset consists of thousands of images sourced from the internet. Each image falls under one of three domains: animals, birds, or furniture. There are five types under each domain and 200 images of each type, for a total of 3,000 images. The master file consists of two columns: the image name and the visible characteristics of that image. Every image was manually analyzed and its characteristics were written by hand, ensuring accuracy.
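A minimal sketch of reading the two-column master file described above into an image-to-characteristics mapping. The CSV format and the file path are assumptions for illustration; only the two-column layout (image name, visible characteristics) comes from the dataset description.

```python
import csv

def load_characteristics(path):
    """Map each image name to its manually written characteristics string."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                image_name, characteristics = row[0], row[1]
                mapping[image_name] = characteristics
    return mapping
```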
This repository provides full text and metadata for the ACL Anthology collection (80k articles/posters as of September 2022), including the PDF files and Grobid extractions of the PDFs.