Teaching assistants (TAs) are widely employed in computer science courses to handle high enrollment while still offering students individual tutoring and detailed assessment. This dataset results from a multi-institutional, multi-national study of the challenges that TAs in computer science face. 180 reflective essays written by TAs at three institutions across Europe were analyzed and coded. The thematic analysis surfaced five main challenges: becoming a professional TA, student-focused challenges, assessment, defining and using best practice, and threats to best practice. All five challenges were identified in the essays from all three institutions, indicating that they are not strongly context-dependent. (2021-04-11)
The dataset contains 36,000 Bangla text samples labeled with Ekman's six basic emotions. The data was first introduced in the paper "Alternative non-BERT model choices for the textual classification in low-resource languages and environments". The dataset is balanced, with samples evenly distributed across the six classes.
Capriccio is a sentiment classification dataset of tweets that simulates data drift. It is created by slicing the Sentiment140 dataset (homepage, Hugging Face datasets) with a sliding window of 500,000 tweets, resulting in 38 slices. Each slice can thus represent the training/validation dataset of a sentiment classification model that is re-trained every day. Each slice has 425,000 tweets for training (file named %d_train.json) and 75,000 tweets for validation (file named %d_val.json).
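A minimal sketch of how one might enumerate and load the Capriccio slices, assuming the files sit in a local directory using the %d_train.json / %d_val.json naming described above. The directory layout, zero-based slice indexing, and plain-JSON content are assumptions, not part of the dataset specification.

```python
import json
import os

NUM_SLICES = 38  # one slice per simulated re-training day

def slice_paths(root, slice_id):
    """Return the (train, val) file paths for a given slice index."""
    if not 0 <= slice_id < NUM_SLICES:
        raise ValueError(f"slice_id must be in [0, {NUM_SLICES})")
    return (os.path.join(root, f"{slice_id}_train.json"),
            os.path.join(root, f"{slice_id}_val.json"))

def load_slice(root, slice_id):
    """Load one day's 425k-tweet training set and 75k-tweet validation set."""
    train_path, val_path = slice_paths(root, slice_id)
    with open(train_path) as f:
        train = json.load(f)
    with open(val_path) as f:
        val = json.load(f)
    return train, val
```

Iterating `load_slice` over consecutive slice indices then simulates the day-by-day data drift the dataset is designed to exercise.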
SentimentArcs’ reference corpus of novels consists of 25 narratives selected to form a diverse set of well-recognized novels that can serve as a benchmark for future studies. The composition of the corpus was constrained by copyright law as well as historical imbalances. Most works were obtained from the US and Australian Gutenberg Projects. The corpus is expected to grow in size and diversity over time.
This dataset contains samples of CTI (Cyber Threat Intelligence) data in natural language, labeled with the corresponding adversarial techniques from the MITRE ATT&CK framework.
The datasets of "Reinforcement Learning-enhanced Shared-account Cross-domain Sequential Recommendation" (TKDE 2022)
The datasets of "Time Interval-enhanced Graph Neural Network for Shared-account Cross-domain Sequential Recommendation" (TNNLs 2022)
We provide a custom synthetic bimodal dataset, called GeBiD, designed specifically for the comparison of the joint- and cross-generative capabilities of Multimodal Variational Autoencoders. It comprises RGB images of geometric primitives and textual descriptions. The dataset offers 5 levels of difficulty (based on the number of attributes) to find the minimal functioning scenario for each model. Moreover, its rigid structure enables automatic qualitative evaluation of the generated samples.
DistNLI is a synthesized benchmark that probes neural network models on distributivity in conjunctions for the NLI task in American English. It consists of minimal sentence pairs (premise and hypothesis) that differ in conjunction structure and in the distributivity-related linguistic phenomenon involved. DistNLI currently comprises 328 sentences (164 with distributive and 164 with ambiguous predicates), annotated by 4 proficient English speakers with backgrounds in NLP and linguistics. Given the specificity of the linguistic phenomenon involved and the dataset's size, DistNLI should only be used as an adversarial dataset for investigating the distributivity of verb predication.
This dataset is a collection of 5,348 links between bug-introducing and bug-fixing commit sets extracted from Mozilla's Bugzilla using bugbug. In this repository, you will find it in two formats:
WildQA is a video understanding dataset of videos recorded in outdoor settings. The dataset can be used to evaluate models for video question answering.
We present CSL, a large-scale Chinese Scientific Literature dataset, which contains the titles, abstracts, keywords and academic fields of 396,209 papers. To our knowledge, CSL is the first scientific document dataset in Chinese.
A fine-grained corpus for detecting, identifying, and correcting Chinese grammatical errors, collected mainly from multiple-choice questions in public school Chinese examinations, with multiple references. Online evaluation site for the test set: https://codalab.lisn.upsaclay.fr/competitions/8020
AesVQA is a dataset that contains 72,168 high-quality images and 324,756 pairs of aesthetic questions. The dataset addresses the task of aesthetic VQA and introduces subjectivity into VQA tasks.
ChiQA is a dataset designed for visual question answering tasks that measures not only relatedness but also answerability, which demands more fine-grained vision and language reasoning. It contains more than 40K questions and more than 200K question-image pairs. The questions are real-world, image-independent queries that are more varied and less biased.
The Fearless Steps Initiative by UTDallas-CRSS led to the digitization, recovery, and diarization of 19,000 hours of original analog audio data, as well as the development of algorithms to extract meaningful information from this multichannel naturalistic data resource. As an initial step to motivate a streamlined and collaborative effort from the speech and language community, UTDallas-CRSS is hosting a series of progressively complex tasks to promote advanced research on naturalistic “Big Data” corpora. This began with ISCA INTERSPEECH-2019: "The FEARLESS STEPS Challenge: Massive Naturalistic Audio (FS-#1)". The first edition of this challenge encouraged the development of core unsupervised/semi-supervised speech and language systems for single-channel data with low resource availability, serving as the “First Step” towards extracting high-level information from such massive unlabeled corpora. As a natural progression following the successful inaugural challenge FS#1, the FEARLESS
CC-Riddle is a Chinese character riddle dataset covering the majority of common simplified Chinese characters, built by crawling riddles from the Web and generating brand-new ones. In the generation stage, the authors provide the Chinese phonetic alphabet, the decomposition, and an explanation of the solution character to the generation model, obtaining multiple riddle descriptions for each character. The generated riddles are then manually filtered, and the final dataset, CC-Riddle, is composed of both human-written riddles and filtered generated riddles.
Trailers12k is a movie trailer dataset comprising 12,000 titles associated with ten genres. It is distinguished from other datasets by a collection procedure aimed at providing a high-quality, publicly available dataset.
The Multi-domain Image Characteristic Dataset consists of thousands of images sourced from the internet. Each image falls under one of three domains: animals, birds, or furniture. There are five types under each domain and 200 images of each type, for a total of 3,000 images. The master file consists of two columns: the image name and the visible characteristics of that image. Every image was manually analyzed and its characteristics were written by hand, ensuring accuracy.
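A minimal sketch of reading the two-column master file described above into an image-to-characteristics mapping. The CSV format and the file path are assumptions for illustration; only the two-column layout (image name, visible characteristics) comes from the dataset description.

```python
import csv

def load_characteristics(path):
    """Map each image name to its manually written characteristics string."""
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f):
            if len(row) >= 2:
                image_name, characteristics = row[0], row[1]
                mapping[image_name] = characteristics
    return mapping
```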
This repository provides full text and metadata for the ACL Anthology collection (80k articles/posters as of September 2022), including the PDF files and Grobid extractions of the PDFs.