We introduce a large and comprehensive dataset to facilitate the study of several essential argument mining (AM) tasks in debating systems. We first review the existing subtasks (claim extraction, stance classification, evidence extraction) and then propose two integrated argument mining tasks: claim extraction with stance classification (CESC) and claim-evidence pair extraction (CEPE).
The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content characteristics. It serves as the test video collection for TRECVID AVS 2019 through 2021 and contains 1,082,649 web video clips with even more diverse content, no predominant characteristics, and low self-similarity.
Spoken versions of the Semantic Textual Similarity dataset for testing semantic sentence-level embeddings. Contains thousands of sentence pairs annotated by humans for semantic similarity. The spoken sentences can be used in sentence embedding models to test whether your model learns to capture sentence semantics. All sentences are available in 6 synthetic WaveNet voices, and a subset (5%) in 4 real voices recorded in a sound-attenuated booth. Code to train a visually grounded spoken sentence embedding model, along with evaluation code, is available at https://github.com/DannyMerkx/speech2image/tree/Interspeech21
NKL (short for NanKai Lines) is a dataset for semantic line detection. Semantic lines are meaningful line structures that outline the conceptual structure of natural images. The NKL dataset contains 5,000 images of various scenes. Each of these images is annotated by multiple skilled human annotators. The dataset is split into training and validation subsets. There are 4,000 images in the training set and 1,000 in the validation set.
SLNET is a collection of third-party Simulink models. It is curated by mining open-source repositories (GitHub and MATLAB Central) using SLNET-Miner (https://github.com/50417/SLNet_Miner).
Open Dataset: Mobility Scenario FIMU
Due to the highly variable sample size of the original BirdClef2020 dataset and the reproducibility issues this presents, we propose a pruned version of the set, in which samples longer than 180 s are removed along with classes containing fewer than 50 samples. This processing brings it more in line with other complex audio datasets and allows for experimentation on more consumer-friendly hardware.
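The two-step pruning described above (drop overlong recordings, then drop underpopulated classes) can be sketched as follows. This is a minimal illustration, not the authors' released code: the metadata table, its column names (`file`, `species`, `duration_s`), and the toy thresholds are assumptions; the real dataset uses 180 s and 50 samples.

```python
import pandas as pd

# Hypothetical per-recording metadata table (column names are assumptions).
meta = pd.DataFrame({
    "file": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "species": ["wren", "wren", "lark", "lark"],
    "duration_s": [30, 200, 45, 60],
})

MAX_DURATION_S = 180  # drop recordings longer than this (as in the dataset)
MIN_SAMPLES = 2       # per-class minimum (50 in the real dataset)

# Step 1: remove overlong recordings.
pruned = meta[meta["duration_s"] <= MAX_DURATION_S]

# Step 2: remove classes that fall below the per-class minimum
# after the duration filter.
counts = pruned["species"].value_counts()
keep = counts[counts >= MIN_SAMPLES].index
pruned = pruned[pruned["species"].isin(keep)]

print(sorted(pruned["file"]))  # ['c.wav', 'd.wav']
```

Note that the class-count filter runs after the duration filter, so a class can be dropped because its long recordings were removed first.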
2.2K neutral sentences from Wikipedia; 1.7K additionally labeled sentences generated by the Human-in-the-Loop procedure (based on the Korean Unsmile Dataset base model); 7.1K rule-generated neutral sentences.
This corpus is an annotation of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943. We were inspired by the UNL project to include this novel, so that different groups could compare representations on the same text.
For a detailed description, we refer to Section 3 in our research article.
GFP-GOWT1 mouse stem cells
HeLa cells stably expressing H2b-GFP
These files are supplementary material for “Generalized Seismic Phase Detection with Deep Learning” by Ross et al. (2018), BSSA (doi.org/10.1785/0120180080). The models were trained using Keras and TensorFlow, and can be used with these libraries. The training dataset contains 4.5 million seismograms evenly split between P-wave, S-wave, and pre-event noise classes. We encourage the use of this HDF5 dataset for training deep learning models, and hope that it and the model architecture in the paper can serve as a benchmark for future studies. For additional information please contact Zachary Ross (zross@caltech.edu).
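Since the training data is distributed as an HDF5 file, it can be read with `h5py` along the following lines. This is an illustrative sketch only: the dataset key `"X"`, the file name, and the array shape are assumptions for illustration, not the released file's documented layout (inspect the actual file with `h5py` or `h5dump` first).

```python
import h5py
import numpy as np

# Build a toy file mimicking an HDF5 seismogram archive.
# The key "X" and the (n_traces, n_samples) shape are assumptions.
with h5py.File("toy_seismograms.hdf5", "w") as f:
    f.create_dataset("X", data=np.zeros((8, 400), dtype="float32"))

# Read the traces back into memory as a NumPy array.
with h5py.File("toy_seismograms.hdf5", "r") as f:
    waveforms = f["X"][:]

print(waveforms.shape)  # (8, 400)
```

Slicing the dataset object (e.g. `f["X"][:1000]`) reads only the requested traces from disk, which is useful when the full 4.5-million-trace archive does not fit in memory.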
This is the first image-based network intrusion detection dataset. This large-scale dataset includes images derived from network traffic protocol communications, collected at 15 observation locations across different countries in Asia. The dataset is used to distinguish two different types of anomalies from benign network traffic. Each 48 × 48 image encodes multi-protocol communications within a 128-second window. The SIDD dataset can be applied to a broad range of tasks such as machine-learning-based network intrusion detection, non-IID federated learning, and so forth.
The CER initiated the Smart Metering Project in 2007 with the purpose of undertaking trials to assess the performance of Smart Meters, their impact on consumers’ energy consumption and the economic case for a wider national rollout. It is a collaborative energy industry-wide project managed by the CER and actively involving energy industry participants including the Sustainable Energy Authority of Ireland (SEAI), the Department of Communications, Energy and Natural Resources (DCENR), ESB Networks, Bord Gáis Networks, Electric Ireland, Bord Gáis Energy and other energy suppliers.
Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance. Twitter-COMMs is a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles. This dataset can be used to develop methods to detect misinformation on social media platforms related to these three topics.
ErAConD is a novel grammatical error correction (GEC) dataset consisting of parallel original and corrected utterances drawn from open-domain chatbot conversations.
Contains over 70,000 question-answer pairs from both structured tables and unstructured notes from a publicly available Electronic Health Record (EHR).
A novel dataset of document-grounded task-based dialogues, where an Information Giver (IG) provides instructions (by consulting a document) to an Information Follower (IF), so that the latter can successfully complete the task. In this unique setting, the IF can ask clarification questions which may not be grounded in the underlying document and require commonsense knowledge to be answered.