SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review, and a span of the review is highlighted as the answer to the question (some questions have no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. A question such as "How much does this product weigh?" is factual (i.e., low subjectivity), while "Is this easy to use?" is subjective (i.e., high subjectivity).
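The pairing of questions with subjectivity labels suggests a natural preprocessing step: separating subjective from factual questions before modeling. The sketch below assumes an illustrative record schema (`question`, `answer_span`, `is_subjective`), which is not the dataset's official field layout.

```python
# Minimal sketch of partitioning SubjQA-style QA records by subjectivity.
# Field names here are assumptions for illustration, not the released schema.

def split_by_subjectivity(records):
    """Partition QA records into subjective and factual questions."""
    subjective = [r for r in records if r["is_subjective"]]
    factual = [r for r in records if not r["is_subjective"]]
    return subjective, factual

sample = [
    # answer_span gives (start, end) character offsets into the review;
    # None marks an unanswerable question.
    {"question": "Is this easy to use?", "answer_span": (12, 40), "is_subjective": True},
    {"question": "How much does this product weigh?", "answer_span": None, "is_subjective": False},
]
subjective, factual = split_by_subjectivity(sample)
```

Splitting this way makes it easy to evaluate a QA model separately on the low- and high-subjectivity portions of the data.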
SynthCity is a synthetic, full-colour Mobile Laser Scanning point cloud of 367.9 million points. Every point is assigned a label from one of nine categories.
TalkDown is a labelled dataset for condescension detection in context. It is derived from Reddit, a set of online communities that is diverse in content and tone. The dataset is built from COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending. It contains 3,255 positive (condescending) samples and 3,255 negative ones.
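Since each example ties a REPLY to a QUOTED span inside a COMMENT, a common first step is recovering the character offsets of that span. The record layout below (`comment`, `quoted`, `reply`, `label`) is a hypothetical schema for illustration, not the released format.

```python
# Illustrative sketch of a TalkDown-style example: the REPLY flags a
# QUOTED span inside the COMMENT as condescending. Field names are
# assumptions for demonstration only.

def quoted_span(example):
    """Return (start, end) character offsets of the quoted span, or None."""
    start = example["comment"].find(example["quoted"])
    if start == -1:
        return None
    return (start, start + len(example["quoted"]))

ex = {
    "comment": "Well, actually, anyone with basic sense would know that.",
    "quoted": "anyone with basic sense",
    "reply": "That quoted bit is really condescending.",
    "label": 1,  # 1 = condescending, 0 = not (hypothetical encoding)
}
span = quoted_span(ex)
```

Recovering offsets like this lets a model attend to the flagged span rather than the whole comment.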
A new dataset of textual stories describing events.
The Video-based Multimodal Summarization with Multimodal Output (VMSMO) corpus consists of 184,920 document-summary pairs, with 180,000 training pairs and 2,460 pairs each for validation and testing. The task for this dataset is to generate an appropriate textual summary of an article and to choose a suitable cover frame from the video accompanying the article.
Winogender Schemas is a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender.
BosphorusSign22k is a benchmark dataset for vision-based user-independent isolated Sign Language Recognition (SLR). The dataset is based on the BosphorusSign (Camgoz et al., 2016c) corpus, which was collected with the purpose of helping both the linguistic and computer science communities. It contains isolated videos of Turkish Sign Language glosses from three different domains: health, finance, and commonly used everyday signs. Videos in this dataset were performed by six native signers, which makes the dataset valuable for user-independent sign language studies.
VRAI is a large-scale vehicle ReID dataset for UAV-based intelligent applications. The dataset consists of 137,613 images of 13,022 vehicle instances. The images of each vehicle instance are captured by cameras of two DJI consumer UAVs at different locations, with a variety of view angles and flight altitudes (15 m to 80 m).
Thingi10K is a dataset of 3D-printing models. Specifically, it contains 10,000 models from featured "things" on thingiverse.com, suitable for testing 3D printing techniques such as structural analysis, shape optimization, or solid geometry operations.
Histopathological characterization of colorectal polyps makes it possible to tailor patients' management and follow-up, with the ultimate aim of avoiding or promptly detecting an invasive carcinoma. Colorectal polyp characterization relies on the histological analysis of tissue samples to determine the malignancy and dysplasia grade of the polyps. Deep neural networks achieve outstanding accuracy in medical pattern recognition; however, they require large sets of annotated training images. We introduce UniToPatho, an annotated dataset of 9536 hematoxylin and eosin stained patches extracted from 292 whole-slide images, meant for training deep neural networks for colorectal polyp classification and adenoma grading. The slides were acquired with a Hamamatsu Nanozoomer S210 scanner at 20× magnification (0.4415 μm/px).
The MRDA corpus consists of about 75 hours of speech from 75 naturally occurring meetings among 53 speakers. The tagset used for labeling is a modified version of the SWBD-DAMSL tagset. The corpus is annotated with three types of information: dialogue act segment boundaries, dialogue acts, and correspondences between dialogue acts.
A dataset of high-pT jets from simulations of LHC proton-proton collisions.
The NuCLS dataset contains over 220,000 labeled nuclei from breast cancer images from TCGA. These nuclei were annotated through the collaborative effort of pathologists, pathology residents, and medical students using the Digital Slide Archive. These data can be used in several ways to develop and validate algorithms for nuclear detection, classification, and segmentation, or as a resource to develop and evaluate methods for interrater analysis.
The MISAW dataset is composed of 27 sequences of micro-surgical anastomosis on artificial blood vessels, performed by 3 surgeons and 3 engineering students. The dataset contains video, kinematic, and procedural descriptions synchronized at 30 Hz. The procedural descriptions comprise the phases, steps, and activities performed by the participants.
L3CubeMahaSent is a large publicly available Marathi sentiment analysis dataset. It consists of manually labelled Marathi tweets.
LReID is a benchmark for lifelong person re-identification. It has been built from existing datasets, and it consists of two subsets: LReID-Seen and LReID-Unseen.
SenseReID is a person re-identification dataset for evaluating ReID models. It is captured from real surveillance cameras, and the person bounding boxes are obtained from a state-of-the-art detection algorithm. The dataset contains 1,717 identities in total.
Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its significantly large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address such issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts to create labeled datasets. However, existing datasets are limited in several respects (e.g., size, presence of duplicates) and are less suitable for supporting more advanced, data-hungry deep learning models.
The RIMES database (Reconnaissance et Indexation de données Manuscrites et de fac similÉS / Recognition and Indexing of handwritten documents and faxes) was created to evaluate automatic systems of recognition and indexing of handwritten letters. Of particular interest are cases such as those sent by postal mail or fax by individuals to companies or administrations.
ManyTypes4Py is a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation, and test sets by files. To extract type information from abstract syntax trees (ASTs), a lightweight static analyzer pipeline was developed and is provided with the dataset. Using this pipeline, the collected Python projects were analyzed, and the results of the AST analysis were stored in JSON-formatted files.
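Since the pipeline's output is stored as JSON per file, consuming it amounts to parsing those records and walking the extracted annotations. The structure below (files containing functions with parameter and return annotations) is an assumed layout for illustration; consult the released pipeline for the actual format.

```python
import json

# Hypothetical sketch of consuming ManyTypes4Py-style JSON output.
# The schema (src / set / funcs with params and ret) is illustrative
# only, not the dataset's documented format.

record = json.loads("""
{
  "src": "pkg/utils.py",
  "set": "train",
  "funcs": [
    {"name": "scale", "params": {"x": "float", "factor": "float"}, "ret": "float"},
    {"name": "log", "params": {"msg": "str"}, "ret": "None"}
  ]
}
""")

def count_annotations(rec):
    """Count parameter and return type annotations in one file record."""
    total = 0
    for fn in rec["funcs"]:
        total += len(fn["params"])   # one annotation per typed parameter
        total += 1 if fn["ret"] else 0  # plus one for an annotated return
    return total

n = count_annotations(record)
```

Because the train/validation/test split is by files, the per-file `set` field in a layout like this would let a loader route each record to the right split without re-partitioning.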