We introduce a large and comprehensive dataset to facilitate the study of several essential argument mining (AM) tasks in debating systems. We first review the existing subtasks (claim extraction, stance classification, evidence extraction) and then propose two integrated argument mining tasks: claim extraction with stance classification (CESC) and claim-evidence pair extraction (CEPE).
The dataset has been designed to represent true web videos in the wild, with good visual quality and diverse content characteristics. It serves as the test video collection for TRECVID AVS 2019 through 2021 and contains 1,082,649 web video clips with even more diverse content, no predominant characteristics, and low self-similarity.
Spoken versions of the Semantic Textual Similarity dataset for testing semantic sentence-level embeddings. Contains thousands of sentence pairs annotated by humans for semantic similarity. The spoken sentences can be used in sentence embedding models to test whether your model learns to capture sentence semantics. All sentences are available in 6 synthetic WaveNet voices, and a subset (5%) in 4 real voices recorded in a sound-attenuated booth. Code to train a visually grounded spoken sentence embedding model, along with evaluation code, is available at https://github.com/DannyMerkx/speech2image/tree/Interspeech21
NKL (short for NanKai Lines) is a dataset for semantic line detection. Semantic lines are meaningful line structures that outline the conceptual structure of natural images. The NKL dataset contains 5,000 images of various scenes. Each of these images is annotated by multiple skilled human annotators. The dataset is split into training and validation subsets. There are 4,000 images in the training set and 1,000 in the validation set.
SLNET is a collection of third-party Simulink models. It is curated by mining open-source repositories (GitHub and MATLAB Central) using SLNET-Miner (https://github.com/50417/SLNet_Miner).
Open Dataset: Mobility Scenario FIMU
Due to the highly variable sample size of the original BirdClef2020 dataset and the reproducibility issues this presents, we propose a pruned version of the set, in which samples longer than 180 s are removed along with classes containing fewer than 50 samples. This processing brings it more in line with other complex audio datasets and allows for experimentation on more consumer-friendly hardware.
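The two-step pruning described above (drop overlong recordings, then drop underpopulated classes) can be sketched as follows. This is a minimal illustration, not the authors' released code: the metadata table, its column names (`file`, `species`, `duration_s`), and the toy thresholds are assumptions; the real dataset uses 180 s and 50 samples.

```python
import pandas as pd

# Hypothetical per-recording metadata table (column names are assumptions).
meta = pd.DataFrame({
    "file": ["a.wav", "b.wav", "c.wav", "d.wav"],
    "species": ["wren", "wren", "lark", "lark"],
    "duration_s": [30, 200, 45, 60],
})

MAX_DURATION_S = 180  # drop recordings longer than this (as in the dataset)
MIN_SAMPLES = 2       # per-class minimum (50 in the real dataset)

# Step 1: remove overlong recordings.
pruned = meta[meta["duration_s"] <= MAX_DURATION_S]

# Step 2: remove classes that fall below the per-class minimum
# after the duration filter.
counts = pruned["species"].value_counts()
keep = counts[counts >= MIN_SAMPLES].index
pruned = pruned[pruned["species"].isin(keep)]

print(sorted(pruned["file"]))  # ['c.wav', 'd.wav']
```

Note that the class-count filter runs after the duration filter, so a class can be dropped because its long recordings were removed first.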
2.2K neutral sentences from Wikipedia; 1.7K additionally labeled sentences generated by the Human-in-the-Loop procedure (based on the Korean Unsmile Dataset base model); 7.1K rule-generated neutral sentences.
This corpus is an annotation of the novel The Little Prince by Antoine de Saint-Exupéry, published in 1943. We were inspired by the UNL project to include this novel, so that different groups could compare representations on the same text.
For a detailed description, we refer to Section 3 in our research article.
GFP-GOWT1 mouse stem cells
HeLa cells stably expressing H2b-GFP
These files are supplementary material for “Generalized Seismic Phase Detection with Deep Learning” by Ross et al. (2018), BSSA (doi.org/10.1785/0120180080). The models were trained using Keras and TensorFlow, and can be used with these libraries. The training dataset contains 4.5 million seismograms evenly split between P-wave, S-wave, and pre-event noise classes. We encourage the use of this HDF5 dataset for training deep learning models, and hope that it and the model architecture in the paper can serve as a benchmark for future studies. For additional information please contact Zachary Ross (zross@caltech.edu).
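Since the training data is distributed as an HDF5 file, it can be read with `h5py` along the following lines. This is an illustrative sketch only: the dataset key `"X"`, the file name, and the array shape are assumptions for illustration, not the released file's documented layout (inspect the actual file with `h5py` or `h5dump` first).

```python
import h5py
import numpy as np

# Build a toy file mimicking an HDF5 seismogram archive.
# The key "X" and the (n_traces, n_samples) shape are assumptions.
with h5py.File("toy_seismograms.hdf5", "w") as f:
    f.create_dataset("X", data=np.zeros((8, 400), dtype="float32"))

# Read the traces back into memory as a NumPy array.
with h5py.File("toy_seismograms.hdf5", "r") as f:
    waveforms = f["X"][:]

print(waveforms.shape)  # (8, 400)
```

Slicing the dataset object (e.g. `f["X"][:1000]`) reads only the requested traces from disk, which is useful when the full 4.5-million-trace archive does not fit in memory.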
This is the first image-based network intrusion detection dataset. This large-scale dataset includes images derived from network traffic protocol communications, collected at 15 observation locations across different countries in Asia. The dataset is used to distinguish two different types of anomalies from benign network traffic. Each 48 × 48 image encodes multi-protocol communications within a 128-second window. The SIDD dataset can be applied to a broad range of tasks such as machine-learning-based network intrusion detection, non-IID federated learning, and so forth.
The CER initiated the Smart Metering Project in 2007 with the purpose of undertaking trials to assess the performance of Smart Meters, their impact on consumers’ energy consumption and the economic case for a wider national rollout. It is a collaborative energy industry-wide project managed by the CER and actively involving energy industry participants including the Sustainable Energy Authority of Ireland (SEAI), the Department of Communications, Energy and Natural Resources (DCENR), ESB Networks, Bord Gáis Networks, Electric Ireland, Bord Gáis Energy and other energy suppliers.
Detecting out-of-context media, such as "mis-captioned" images on Twitter, is a relevant problem, especially in domains of high public significance. Twitter-COMMs is a large-scale multimodal dataset with 884k tweets relevant to the topics of Climate Change, COVID-19, and Military Vehicles. This dataset can be used to develop methods to detect misinformation on social media platforms related to these three topics.
ErAConD is a novel grammatical error correction (GEC) dataset consisting of parallel original and corrected utterances drawn from open-domain chatbot conversations.
Contains over 70,000 question-answer pairs from both structured tables and unstructured notes from a publicly available Electronic Health Record (EHR).
A novel dataset of document-grounded task-based dialogues, where an Information Giver (IG) provides instructions (by consulting a document) to an Information Follower (IF), so that the latter can successfully complete the task. In this unique setting, the IF can ask clarification questions which may not be grounded in the underlying document and require commonsense knowledge to be answered.