Datasets

19,997 machine learning datasets


Cyberbullying Classification

As social media usage becomes increasingly prevalent in every age group, a vast majority of citizens rely on this essential medium for day-to-day communication. Social media’s ubiquity means that cyberbullying can effectively impact anyone at any time or anywhere, and the relative anonymity of the internet makes such personal attacks more difficult to stop than traditional bullying.

2 papers | 0 benchmarks

Dynamic OLAT Dataset (ShanghaiTech MARS Dynamic OLAT Dataset)

To provide ground-truth supervision for video consistency modeling, we build a high-quality dynamic OLAT (one-light-at-a-time) dataset. Our capture system consists of a light stage with 114 LED light sources and a Phantom Flex4K-GS camera (a stationary, global-shutter, 4K ultra-high-speed camera running at 1000 fps), which yields a dynamic OLAT image set recorded at 25 fps using the overlapping method. The dataset provides sufficient semantic, temporal, and lighting consistency supervision to train our neural video portrait relighting scheme, which can generalize to in-the-wild scenarios.

2 papers | 0 benchmarks | Images, RGB Video, Videos

NMED-T (Naturalistic Music EEG Dataset - Tempo)

Losorelli, Steven, Nguyen, Duc T., Dmochowski, Jacek P., and Kaneshiro, Blair

2 papers | 0 benchmarks

LAW (The Laboratory for Web Algorithmics)

The Laboratory for Web Algorithmics (LAW) was established in 2002 at the Dipartimento di Scienze dell'Informazione (now merged into the Computer Science Department) of the Università degli studi di Milano.

2 papers | 0 benchmarks

ASOS Digital Experiments Dataset

A novel dataset that can support the end-to-end design and running of Online Controlled Experiments (OCE) with adaptive stopping.

2 papers | 0 benchmarks

CoVaxLies v2

CoVaxLies v2 includes 47 Misinformation Targets (MisTs) found on Twitter about the COVID-19 vaccines. Language experts annotated tweets as Relevant or Not Relevant, and then further annotated Relevant tweets with Stance towards each MisT. This collection is a first step in providing large-scale resources for misinformation detection and misinformation stance identification.

2 papers | 0 benchmarks | Texts

Study data

Challenges in Migrating Imperative Deep Learning Programs to Graph Execution: An Empirical Study

File Descriptions

File | Description
--- | ---
commit_categorizations.csv | Categorizations for the commits in our dataset.
commits.csv | Information for the commits in our dataset.
datasets.csv | Contains the names and descriptions of our datasets.
issue_categorizations.csv | Categorizations for the chosen issues from our dataset.
issues.csv | Information for the issues in our dataset.
pipeline_stages.csv | DL pipeline stages and their respective descriptions.
problem_categories.csv | Problem categories and their respective descriptions.
problem_causes.csv | Problem causes and their respective descriptions.
problem_fixes.csv | Problem fixes and their respective descriptions.
problem_symptoms.csv | Problem symptoms and their respective descriptions.
studied_subjects_commits.csv | Project data for commits.
studied_subjects_issues.csv | Project data for issues.

2 papers | 0 benchmarks | Texts

Iconary

The Iconary dataset is designed for testing multimodal communication with drawings and text.

2 papers | 0 benchmarks | Images, Texts

2D Moving Clusters

Contains $10^7$ points sampled from 20 clusters with incremental concept drift: on each batch (of size $1000$), the mean of each cluster moves a small random length in a random direction, and the means move independently of each other. This dataset should be used sequentially, in batches of $1000$.
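The drift process described for 2D Moving Clusters can be sketched as a small generator. This is only an illustration under stated assumptions, not the procedure used to build the dataset: the function name `drifting_batches`, the Gaussian cluster shape, the `spread` and `max_step` values, the initial layout of the means, and the choice to move the means after each batch is emitted are all hypothetical. Only the cluster count (20) and batch size (1000) come from the description.

```python
import math
import random
from typing import Iterator, List, Tuple

Point = Tuple[float, float]


def drifting_batches(
    n_batches: int,
    n_clusters: int = 20,    # from the description
    batch_size: int = 1000,  # from the description
    max_step: float = 0.05,  # assumed: upper bound on the "small" drift length
    spread: float = 0.1,     # assumed: per-cluster standard deviation
    seed: int = 0,
) -> Iterator[List[Tuple[Point, int]]]:
    """Yield batches of ((x, y), cluster_id) whose cluster means drift."""
    rng = random.Random(seed)
    # Assumed initial layout: means drawn uniformly from the unit square.
    means = [[rng.random(), rng.random()] for _ in range(n_clusters)]
    for _batch in range(n_batches):
        batch = []
        for _point in range(batch_size):
            k = rng.randrange(n_clusters)
            x = rng.gauss(means[k][0], spread)
            y = rng.gauss(means[k][1], spread)
            batch.append(((x, y), k))
        yield batch
        # Each mean moves independently: a small random length
        # in a uniformly random direction.
        for mean in means:
            length = rng.uniform(0.0, max_step)
            angle = rng.uniform(0.0, 2.0 * math.pi)
            mean[0] += length * math.cos(angle)
            mean[1] += length * math.sin(angle)


# Consume sequentially, one batch at a time.
# The full 10**7 points would correspond to 10_000 batches of 1000.
batches = list(drifting_batches(n_batches=3))
```

Because the means change between batches, shuffling points across batches would erase the drift, which is why the data is meant to be consumed in order.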

2 papers | 0 benchmarks

EasyCall corpus

EasyCall corpus is a dysarthric speech command dataset in Italian. The dataset consists of 21386 audio recordings from 24 healthy and 31 dysarthric speakers, whose individual degree of speech impairment was assessed by neurologists through the Therapy Outcome Measure.

2 papers | 0 benchmarks

Intel Lab Data

This dataset contains data collected from 54 sensors deployed in the Intel Berkeley Research lab between February 28th and April 5th, 2004. Mica2Dot sensors with weatherboards collected timestamped topology information, along with humidity, temperature, light, and voltage values once every 31 seconds. Data was collected using the TinyDB in-network query processing system, built on the TinyOS platform.

2 papers | 0 benchmarks

CRC100K (100,000 histological images of human colorectal cancer and healthy tissue)

This is a set of 100,000 non-overlapping image patches from hematoxylin & eosin (H&E) stained histological images of human colorectal cancer (CRC) and normal tissue. All images are 224x224 pixels (px) at 0.5 microns per pixel (MPP). The tissue classes are: adipose (ADI), background (BACK), debris (DEB), lymphocytes (LYM), mucus (MUC), smooth muscle (MUS), normal colon mucosa (NORM), cancer-associated stroma (STR), and colorectal adenocarcinoma epithelium (TUM). The images were manually extracted from N=86 H&E stained human cancer tissue slides from formalin-fixed paraffin-embedded (FFPE) samples from the NCT Biobank (National Center for Tumor Diseases, Heidelberg, Germany) and the UMM pathology archive (University Medical Center Mannheim, Mannheim, Germany). Tissue samples contained CRC primary tumor slides and tumor tissue from CRC liver metastases; normal tissue classes were augmented with non-tumorous regions from gastrectomy specimens to increase variability.
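As a quick worked example for CRC100K, the nine class abbreviations and the physical patch size can be written down in a few lines. The dictionary mirrors the class list in the description; the alphabetical label indexing is an assumption made here for illustration and may differ from any official label order.

```python
# Class abbreviations as listed in the dataset description.
CRC_CLASSES = {
    "ADI": "adipose",
    "BACK": "background",
    "DEB": "debris",
    "LYM": "lymphocytes",
    "MUC": "mucus",
    "MUS": "smooth muscle",
    "NORM": "normal colon mucosa",
    "STR": "cancer-associated stroma",
    "TUM": "colorectal adenocarcinoma epithelium",
}

# Assumed (not from the source): alphabetical order defines the integer label.
LABEL_TO_INDEX = {abbr: i for i, abbr in enumerate(sorted(CRC_CLASSES))}

PATCH_SIDE_PX = 224      # from the description
MICRONS_PER_PIXEL = 0.5  # from the description

# Physical side length of one patch: 224 px * 0.5 um/px = 112 um.
patch_side_um = PATCH_SIDE_PX * MICRONS_PER_PIXEL
# Physical area of one patch in square millimetres (about 0.0125 mm^2).
patch_area_mm2 = (patch_side_um / 1000.0) ** 2
```

Each patch therefore covers a square of tissue roughly 0.11 mm on a side.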

2 papers | 0 benchmarks | Biomedical, Images

NTU-X

NTU-X is an extended version of the popular NTU dataset.

2 papers | 0 benchmarks

I.PHI

The Packard Humanities Institute (PHI) database of ancient Greek inscriptions, including its geographical and chronological metadata, is processed into a machine-actionable format. The processed dataset is referred to as I.PHI.

2 papers | 6 benchmarks | Images

Cross-View Time Dataset

The appearance of the world varies dramatically not only from place to place but also from hour to hour and month to month. Every day billions of images capture this complex relationship, many of which are associated with precise time and location metadata. We propose to use these images to construct a global-scale, dynamic map of visual appearance attributes. Such a map enables fine-grained understanding of the expected appearance at any geographic location and time. Our approach integrates dense overhead imagery with location and time metadata into a general framework capable of mapping a wide variety of visual attributes. A key feature of our approach is that it requires no manual data annotation. We demonstrate how this approach can support various applications, including image-driven mapping, image geolocalization, and metadata verification.

2 papers | 2 benchmarks | Images

K-SportsSum

K-SportsSum is a sports game summarization dataset with two characteristics: (1) it collects a large amount of data from a large number of games, comprising 7,854 commentary-news pairs, and applies a manual cleaning process to improve quality; (2) unlike existing datasets, it narrows the knowledge gap by also providing a large-scale knowledge corpus that contains information on 523 sports teams and 14,724 sports players.

2 papers | 0 benchmarks | Texts

Nations

The Nations dataset is a small knowledge graph with 14 entities, 55 relations, and 1992 triples describing countries and their political relationships. This dataset is available for download from https://github.com/ZhenfengLei/KGDatasets.
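A knowledge graph like Nations is commonly handled as a list of (head, relation, tail) triples. The sketch below shows how the entity, relation, and triple counts quoted above could be checked. The sample triples and the helper names `summarize` and `index_by_relation` are hypothetical and used only for illustration; they are not copied from the dataset or its repository.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)

# Hypothetical sample triples for illustration only.
SAMPLE_TRIPLES: List[Triple] = [
    ("usa", "embassy", "uk"),
    ("uk", "embassy", "usa"),
    ("usa", "exports", "brazil"),
]


def summarize(triples: Iterable[Triple]) -> Dict[str, int]:
    """Count distinct entities, distinct relations, and total triples."""
    entities: Set[str] = set()
    relations: Set[str] = set()
    n_triples = 0
    for head, relation, tail in triples:
        entities.update((head, tail))
        relations.add(relation)
        n_triples += 1
    return {
        "entities": len(entities),
        "relations": len(relations),
        "triples": n_triples,
    }


def index_by_relation(triples: Iterable[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Group (head, tail) pairs under each relation name."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for head, relation, tail in triples:
        index[relation].append((head, tail))
    return dict(index)


summary = summarize(SAMPLE_TRIPLES)
by_relation = index_by_relation(SAMPLE_TRIPLES)
```

Run on the full set of triples, `summarize` would be expected to report the 14 entities, 55 relations, and 1992 triples stated in the description.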

2 papers | 0 benchmarks | Graphs

PET: A new Dataset for Process Extraction from Natural Language Text

The dataset contains 45 documents with narrative descriptions of business processes, together with their annotations. The documents are annotated with activities, gateways, actors, and flow information.

2 papers | 0 benchmarks | Texts

Heritage Health Prize

Heritage Provider Network is providing Competition Entrants with deidentified member data collected during a forty-eight month period that is allocated among three data sets (the "Data Sets"). Competition Entrants will use the Data Sets to develop and test their algorithms for accurately predicting the number of days that the members will spend in a hospital (inpatient or emergency room visit) during the 12-month period following the Data Set cut-off date.

2 papers | 0 benchmarks

Chest x-ray landmark dataset

A set of landmark annotations for the JSRT, Montgomery, and Shenzhen datasets and for a subset of the PadChest dataset.

2 papers | 0 benchmarks | Medical
