Datasets

19,997 machine learning datasets

3MASSIV

A multilingual, multimodal, and multi-aspect, expertly annotated dataset of diverse short videos extracted from the short-video social media platform Moj. 3MASSIV comprises 50K short videos (~20 seconds average duration) and 100K unlabeled videos in 11 different languages, and captures popular short-video trends such as pranks, fails, romance, and comedy, expressed via unique audio-visual formats such as self-shot videos, reaction videos, lip-synching, and self-sung songs.

2 papers | 0 benchmarks | Videos

PeerSum

PeerSum is a new multi-document summarization (MDS) dataset built from peer reviews of scientific publications. It differs from existing MDS datasets in that the summaries (i.e., the meta-reviews) are highly abstractive and are real summaries of the source documents.

2 papers | 0 benchmarks | Texts

Visual Affordance Learning

A large-scale multi-view RGBD visual affordance learning dataset: a benchmark of 47,210 RGBD images from 37 object categories, annotated with 15 visual affordance categories, and 35 cluttered/complex scenes with different objects and multiple affordances. To the best of our knowledge, this is the first and the largest multi-view RGBD visual affordance learning dataset.

2 papers | 0 benchmarks | Images

PETCI (PETCI: A Parallel English Translation Dataset of Chinese Idioms)

PETCI is a Parallel English Translation dataset of Chinese Idioms, collected from an idiom dictionary and from Google and DeepL machine translation. PETCI contains 4,310 Chinese idioms with 29,936 English translations. These translations capture diverse translation errors and paraphrase strategies.

2 papers | 0 benchmarks | Texts

Biographical (Biographical: A Semi-Supervised Relation Extraction Dataset)

Biographical is a semi-supervised dataset for relation extraction (RE). The dataset, which is aimed at digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata.

2 papers | 0 benchmarks | Graphs, Texts

ORCAS-I (Queries Annotated with Intent using Weak Supervision)

A labelled version of the ORCAS click-based dataset of Web queries, which provides 18 million connections to 10 million distinct queries.

2 papers | 3 benchmarks | Texts

CareCall (CareCall for Seniors)

CareCall is a Korean dialogue dataset for role-satisfying dialogue systems. The dataset was composed from a few samples of human-written dialogues using in-context few-shot learning of large-scale LMs. Large-scale LMs can generate dialogues with a specific personality, given a prompt consisting of a brief description of the chatbot’s properties and a few dialogue examples. We use this method to build the entire dataset.

2 papers | 0 benchmarks | Texts

CAVES (A Dataset to facilitate Explainable Classification and Summarization of Concerns towards COVID Vaccines)

CAVES is the first large-scale dataset containing about 10k COVID-19 anti-vaccine tweets labelled with specific anti-vaccine concerns in a multi-label setting. It is also the first multi-label classification dataset that provides explanations for each of the labels. Additionally, the dataset provides class-wise summaries of all the tweets.

2 papers | 0 benchmarks | Texts

DRACO20K

The DRACO20K dataset is used to evaluate object canonicalization methods that estimate a canonical frame from a monocular input image.

2 papers | 0 benchmarks | 3D, Images, RGB-D

CEREBRUM-7T (Fast and Fully-volumetric Brain Segmentation of 7 Tesla MR Volumes)

Ultra-high field MRI enables sub-millimetre resolution imaging of the human brain, making it possible to disentangle complex functional circuits across different cortical depths. Segmentation, meant as the partition of MR brain images into multiple anatomical classes, is an essential step in many functional and structural neuroimaging studies. In this work, we design and test CEREBRUM-7T, an optimised end-to-end CNN architecture that segments a whole 7T T1w MRI brain volume at once, without the need to partition it into 2D or 3D tiles. Although deep learning (DL) methods have recently started to emerge in the 3T literature, to the best of our knowledge, CEREBRUM-7T is the first example of a DL architecture directly applied to 7T data. Training is performed in a weakly supervised fashion, since it exploits a ground truth (GT) with errors. The generated model is able to produce accurate multi-structure segmentation masks on six different classes in only a few seconds. In the experimental part, we s

2 papers | 0 benchmarks | MRI

SuMe (A Dataset Towards Summarizing Biomedical Mechanisms)

Can language models read biomedical texts and explain the biomedical mechanisms discussed? In this work we introduce a biomedical mechanism summarization task. Biomedical studies often investigate the mechanisms behind how one entity (e.g., a protein or a chemical) affects another in a biological context. The abstracts of these publications often include a focused set of sentences that present relevant supporting statements regarding such relationships, associated experimental evidence, and a concluding sentence that summarizes the mechanism underlying the relationship. We leverage this structure and create a summarization task, where the input is a collection of sentences and the main entities in an abstract, and the output includes the relationship and a sentence that summarizes the mechanism. Using a small amount of manually labeled mechanism sentences, we train a mechanism sentence classifier to filter a large biomedical abstract collection and create a summarization dataset with 2

2 papers | 0 benchmarks

AVCAffe (A Large Scale Audio-Visual Dataset of Cognitive Load and Affect for Remote Work)

We introduce AVCAffe, the first Audio-Visual dataset consisting of Cognitive load and Affect attributes. We record AVCAffe by simulating remote work scenarios over a video-conferencing platform, where subjects collaborate to complete a number of cognitively engaging tasks. AVCAffe is the largest originally collected (not collected from the Internet) affective dataset in the English language. We recruit 106 participants from 18 different countries of origin, aged 18 to 57, with a balanced male-female ratio. AVCAffe comprises a total of 108 hours of video, equivalent to more than 58,000 clips, along with task-based self-reported ground truth labels for arousal, valence, and cognitive load attributes such as mental demand, temporal demand, effort, and a few others. We believe AVCAffe would be a challenging benchmark for the deep learning research community given the inherent difficulty of classifying affect and cognitive load in particular. Moreover, our dataset f

2 papers | 0 benchmarks | Audio, Videos

HeriGraph (Multimodal Machine Learning Datasets on Graphs of Heritage Values and Attributes)

The dataset contains constructed multi-modal features (visual and textual), pseudo-labels (on heritage values and attributes), and graph structures (with temporal, social, and spatial links) built from User-Generated Content collected from the Flickr social media platform in three global cities containing UNESCO World Heritage properties (Amsterdam, Suzhou, Venice). The motivation for the data collection in this project is to provide datasets that are both directly applicable as a test bed for ML communities and theoretically informative for heritage and urban scholars drawing conclusions for planning decision-making.

2 papers | 0 benchmarks | Environment, Graphs, Images, Texts

Korpus Malti

General Corpora for the Maltese Language.

2 papers | 0 benchmarks | Texts

TBBR (Thermal Bridges on Building Rooftops)

The dataset of Thermal Bridges on Building Rooftops (TBBR dataset) consists of annotated combined RGB and thermal drone images with a height map. All images were converted to a uniform format of 3000×4000 pixels, aligned, and cropped to 2400×3400 to remove empty borders.

2 papers | 6 benchmarks | Hyperspectral images, Images, RGB-D

Telegraphic Summaries (Gold Corpus for Telegraphic Summarization)

README Created by Malireddy Chanakya & Srivenkata N Mounika Somisetty & Malireddy Chaitanya

2 papers | 0 benchmarks

BreastClassifications4 ([MIMBCD-UI] UTA4: Severity & Pathology Classifications Dataset)

Several datasets are fostering innovation in higher-level functions for everyone, everywhere. By providing this repository, we hope to encourage the research community to focus on hard problems. In this repository, we present the actual severity (BIRADS) and pathology (post-report) classifications provided by the Radiologist Director from the Radiology Department of Hospital Fernando Fonseca while diagnosing several patients (see dataset-uta4-dicom) from our User Tests and Analysis 4 (UTA4) study. Here, we provide a dataset of both severity (BIRADS) and pathology classifications for each patient diagnosis. The work and results were published at a top Human-Computer Interaction (HCI) conference, AVI 2020. Results were analyzed and interpreted from our Statistical Analysis charts. The user tests were carried out in clinical institutions, where clinicians diagnosed several patients for a Single-Modality vs Multi-Modality comparison. For example, in these t

2 papers | 0 benchmarks | Biomedical, Images, Medical, Tabular

FIJO (French Insurance Job Offer dataset)

This dataset was collected as part of the multidisciplinary project Femmes face aux défis de la transformation numérique : une étude de cas dans le secteur des assurances (Women Facing the Challenges of Digital Transformation: A Case Study in the Insurance Sector) at Université Laval, funded by the Future Skills Centre. It includes job offers, in French, from insurance companies between 2009 and 2020.

2 papers | 0 benchmarks | Texts

FeedbackQA

2 papers | 0 benchmarks | Texts

satp-zsm-stage1 (Replication Data for: Crossing the Linguistic Causeway: A Binational Approach for Translating Soundscape Attributes to zsm)

This is the replication data for the paper: "Crossing the Linguistic Causeway: A Binational Approach for Translating Soundscape Attributes to Bahasa Melayu".

2 papers | 0 benchmarks | Texts