Datasets

19,997 machine learning datasets

Secim2023

Secim2023 is a comprehensive dataset for social media researchers to study the upcoming election, develop tools to prevent online manipulation, and gather novel information to inform the public.

3 papers | 0 benchmarks | Texts

ORU Diverse radar dataset

A dataset for evaluating radar localization in diverse environments. Download: https://drive.google.com/drive/folders/1uATfrAe-KHlz29e-Ul8qUbUKwPxBFIhP

3 papers | 0 benchmarks | LiDAR

TCAB (Text Classification Attack Benchmark)

Text Classification Attack Benchmark (TCAB) is a dataset for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers. TCAB includes 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. The process of generating attacks is automated, so TCAB can easily be extended to incorporate new text attacks and better classifiers as they are developed.

3 papers | 0 benchmarks | Texts

Physionet MI (Physionet EEG Motor Movement/Imagery Dataset)

This data set consists of over 1500 one- and two-minute EEG recordings, obtained from 109 volunteers [2].

3 papers | 0 benchmarks | EEG

Geoclidean-Constraints

The Geoclidean-Constraints dataset consists of 20 concepts and 40 tasks, created from permutations of line and circle construction rules with various constraints describing the relationships between objects. The dataset focuses on explicit constraints between geometric objects. The objects are denoted as follows: lines as L, circles as C, and triangles (constructed from three lines) as T.

3 papers | 0 benchmarks | Texts

TyDiP (A Dataset for Politeness Classification in Nine Typologically Diverse Languages)

A Dataset for Politeness Classification in Nine Typologically Diverse Languages (TyDiP) is a dataset containing three-way politeness annotations for 500 examples in each language, totaling 4.5K examples.

3 papers | 0 benchmarks | Texts

ArzEn (Corpus of Egyptian Arabic-English Code-switching)

Corpus of Egyptian Arabic-English Code-switching (ArzEn) is a spontaneous conversational speech corpus, obtained through informal interviews held at the German University in Cairo. The participants discussed broad topics, including education, hobbies, work, and life experiences. The corpus currently contains 12 hours of speech, comprising 6,216 utterances. The recordings were transcribed and translated into monolingual Egyptian Arabic and monolingual English.

3 papers | 0 benchmarks | Texts

NarraSum

NarraSum is a large-scale narrative summarization dataset. It contains 122K narrative documents, which are collected from plot descriptions of movies and TV episodes with diverse genres, and their corresponding abstractive summaries.

3 papers | 0 benchmarks | Texts

Wireless AI Research Dataset

Wireless AI Research Dataset is a flexible and easy-to-use dataset with realistic environments designed for various wireless AI tasks. It supports sensing tasks such as localization and environment reconstruction, MIMO tasks such as reflection system and beam-forming, and PHY tasks such as CSI feedback and channel estimation.

3 papers | 0 benchmarks

Tragic Talkers

Tragic Talkers is an audio-visual dataset consisting of excerpts from the "Romeo and Juliet" drama captured with microphone arrays and multiple co-located cameras for light-field video. Tragic Talkers provides ideal content for object-based media (OBM) production. It is designed to cover various conventional talking scenarios, such as monologues, two-people conversations, and interactions with considerable movement and occlusion, yielding 30 sequences captured from a total of 22 different points of view and two 16-element microphone arrays.

3 papers | 0 benchmarks | Videos

TbV Dataset (Trust, but Verify Dataset)

The TbV dataset is a large-scale dataset created to allow the community to improve the state of the art in machine learning tasks related to mapping, which are vital for self-driving.

3 papers | 0 benchmarks | LiDAR, Videos

E-NER

E-NER is a publicly available legal Named Entity Recognition (NER) data set. It contains 52 filings from the US SEC EDGAR database. The named entity tags are hand annotated.

3 papers | 0 benchmarks | Images

NusaCrowd

NusaCrowd is a collaborative initiative to collect and unite existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, the authors have brought together 137 datasets and 117 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their effectiveness has been demonstrated in multiple experiments.

3 papers | 0 benchmarks | Speech, Texts

CHAIRS dataset

CHAIRS is a large-scale motion-captured f-AHOI (full-body articulated human-object interaction) dataset, consisting of 17.3 hours of versatile interactions between 46 participants and 81 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process, as well as realistic and physically plausible full-body interactions.

3 papers | 0 benchmarks | 3D

RoFT (Real or Fake Text)

RoFT is a dataset of 21,000 human annotations of generated text. The task is boundary detection: given a passage that starts off as human-written, determine the point at which the text transitions to being machine-generated. The dataset also includes error annotations using the taxonomy introduced in the paper. The data can be used to train automatic detection systems, train automatic error correction, analyze the visibility of model errors, and compare performance across models. Data was collected using http://roft.io.

3 papers | 2 benchmarks | Texts

TextBox 2.0

TextBox 2.0 is a comprehensive and unified library for text generation, focusing on the use of pre-trained language models (PLMs). The library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs.

3 papers | 0 benchmarks | Texts

MusicNetEM

New refined labels for the MusicNet dataset, obtained by the EM process described in the paper: Ben Maman and Amit Bermano, "Unaligned Supervision for Automatic Music Transcription in The Wild".

3 papers | 0 benchmarks | Audio

SPEC5G

SPEC5G is a dataset for analyzing natural-language 5G cellular network protocol specifications. SPEC5G contains 3,547,587 sentences with 134M words, from 13,094 cellular network specifications and 13 online websites. It is designed for security-related text classification and summarisation.

3 papers | 0 benchmarks

CMMD (The Chinese Mammography Database)

Breast carcinoma is the second most common cancer among women worldwide. Early detection of breast cancer has been shown to increase the survival rate, thereby significantly increasing patients' lifespans. Mammography, a low-cost, noninvasive imaging tool, is widely used to diagnose breast disease at an early stage because of its high sensitivity. The recent popularization of artificial intelligence in computer-aided diagnosis creates opportunities for advances in areas such as (1) computer-aided detection, which locates suspect lesions such as masses and microcalcifications and leaves the classification to the radiologist; (2) computer-aided diagnosis, which characterizes the suspicious region of a lesion and/or estimates its probability of onset; and (3) the discovery of predictive image-based biomarkers by applying computational methods to mine potential relationships between image representation and molecular subtype, including luminal A, luminal B, HER2-positive, and triple-negative.

3 papers | 2 benchmarks | Medical

PubMedCite

PubMedCite is a domain-specific dataset with about 192K biomedical scientific papers and a large citation graph preserving 917K citation relationships between them. It is characterized by preserving the salient contents extracted from the full texts of references, and the weighted correlation between the salient contents.

3 papers | 0 benchmarks | Biomedical, Texts