SubjQA is a question answering dataset that focuses on subjective (as opposed to factual) questions and answers. The dataset consists of roughly 10,000 questions over reviews from 6 different domains: books, movies, grocery, electronics, TripAdvisor (i.e. hotels), and restaurants. Each question is paired with a review, and a span of the review is highlighted as the answer to the question (some questions have no answer). Moreover, both questions and answer spans are assigned a subjectivity label by annotators. A question such as "How much does this product weigh?" is factual (i.e., low subjectivity), while "Is this easy to use?" is subjective (i.e., high subjectivity).
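The pairing of questions with subjectivity labels suggests a natural preprocessing step: separating subjective from factual questions before modeling. The sketch below assumes an illustrative record schema (`question`, `answer_span`, `is_subjective`), which is not the dataset's official field layout.

```python
# Minimal sketch of partitioning SubjQA-style QA records by subjectivity.
# Field names here are assumptions for illustration, not the released schema.

def split_by_subjectivity(records):
    """Partition QA records into subjective and factual questions."""
    subjective = [r for r in records if r["is_subjective"]]
    factual = [r for r in records if not r["is_subjective"]]
    return subjective, factual

sample = [
    # answer_span gives (start, end) character offsets into the review;
    # None marks an unanswerable question.
    {"question": "Is this easy to use?", "answer_span": (12, 40), "is_subjective": True},
    {"question": "How much does this product weigh?", "answer_span": None, "is_subjective": False},
]
subjective, factual = split_by_subjectivity(sample)
```

Splitting this way makes it easy to evaluate a QA model separately on the low- and high-subjectivity portions of the data.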
SynthCity is a synthetic, full-colour Mobile Laser Scanning point cloud of 367.9 million points. Every point is assigned a label from one of nine categories.
TalkDown is a labelled dataset for condescension detection in context. It is derived from Reddit, a set of online communities that is diverse in content and tone. The dataset is built from COMMENT and REPLY pairs in which the REPLY targets a specific quoted span (QUOTED) in the COMMENT as being condescending. It contains 3,255 positive (condescending) samples and 3,255 negative ones.
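Since each example ties a REPLY to a QUOTED span inside a COMMENT, a common first step is recovering the character offsets of that span. The record layout below (`comment`, `quoted`, `reply`, `label`) is a hypothetical schema for illustration, not the released format.

```python
# Illustrative sketch of a TalkDown-style example: the REPLY flags a
# QUOTED span inside the COMMENT as condescending. Field names are
# assumptions for demonstration only.

def quoted_span(example):
    """Return (start, end) character offsets of the quoted span, or None."""
    start = example["comment"].find(example["quoted"])
    if start == -1:
        return None
    return (start, start + len(example["quoted"]))

ex = {
    "comment": "Well, actually, anyone with basic sense would know that.",
    "quoted": "anyone with basic sense",
    "reply": "That quoted bit is really condescending.",
    "label": 1,  # 1 = condescending, 0 = not (hypothetical encoding)
}
span = quoted_span(ex)
```

Recovering offsets like this lets a model attend to the flagged span rather than the whole comment.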
A new dataset of textual stories describing events.
The Video-based Multimodal Summarization with Multimodal Output (VMSMO) corpus consists of 184,920 document-summary pairs, with 180,000 training pairs and 2,460 pairs each for validation and testing. The task for this dataset is to generate an appropriate textual summary of an article and to choose a suitable cover frame from the video accompanying the article.
Winogender Schemas is a novel, Winograd schema-style set of minimal pair sentences that differ only by pronoun gender.
BosphorusSign22k is a benchmark dataset for vision-based user-independent isolated Sign Language Recognition (SLR). The dataset is based on the BosphorusSign (Camgoz et al., 2016c) corpus, which was collected with the purpose of helping both the linguistic and computer science communities. It contains isolated videos of Turkish Sign Language glosses from three different domains: health, finance, and commonly used everyday signs. Videos in this dataset were performed by six native signers, which makes the dataset valuable for user-independent sign language studies.
VRAI is a large-scale vehicle ReID dataset for UAV-based intelligent applications. The dataset consists of 137,613 images of 13,022 vehicle instances. The images of each vehicle instance are captured by cameras of two DJI consumer UAVs at different locations, with a variety of view angles and flight altitudes (15 m to 80 m).
Thingi10K is a dataset of 3D-printing models. Specifically, it contains 10,000 models from featured "things" on thingiverse.com, suitable for testing 3D printing techniques such as structural analysis, shape optimization, or solid geometry operations.
Histopathological characterization of colorectal polyps makes it possible to tailor patients' management and follow-up, with the ultimate aim of avoiding or promptly detecting an invasive carcinoma. Colorectal polyp characterization relies on the histological analysis of tissue samples to determine the malignancy and dysplasia grade of the polyps. Deep neural networks achieve outstanding accuracy in medical pattern recognition; however, they require large sets of annotated training images. We introduce UniToPatho, an annotated dataset of 9536 hematoxylin and eosin stained patches extracted from 292 whole-slide images, meant for training deep neural networks for colorectal polyp classification and adenoma grading. The slides were acquired with a Hamamatsu Nanozoomer S210 scanner at 20× magnification (0.4415 μm/px).
The MRDA corpus consists of about 75 hours of speech from 75 naturally occurring meetings among 53 speakers. The tagset used for labeling is a modified version of the SWBD-DAMSL tagset. The corpus is annotated with three types of information: dialogue act segment boundaries, dialogue acts, and correspondences between dialogue acts.
A dataset of high-pT jets from simulations of LHC proton-proton collisions.
The NuCLS dataset contains over 220,000 labeled nuclei from breast cancer images from TCGA. These nuclei were annotated through the collaborative effort of pathologists, pathology residents, and medical students using the Digital Slide Archive. These data can be used in several ways to develop and validate algorithms for nuclear detection, classification, and segmentation, or as a resource to develop and evaluate methods for interrater analysis.
The MISAW dataset is composed of 27 sequences of micro-surgical anastomosis on artificial blood vessels, performed by 3 surgeons and 3 engineering students. The dataset contains video, kinematic, and procedural descriptions synchronized at 30 Hz. The procedural descriptions comprise the phases, steps, and activities performed by the participants.
L3CubeMahaSent is a large publicly available Marathi sentiment analysis dataset. It consists of manually labelled Marathi tweets.
LReID is a benchmark for lifelong person re-identification. It has been built from existing datasets, and it consists of two subsets: LReID-Seen and LReID-Unseen.
SenseReID is a person re-identification dataset for evaluating ReID models. It is captured from real surveillance cameras, and the person bounding boxes are obtained from a state-of-the-art detection algorithm. The dataset contains 1,717 identities in total.
Social networks are widely used for information consumption and dissemination, especially during time-critical events such as natural disasters. Despite its significantly large volume, social media content is often too noisy for direct use in any application. Therefore, it is important to filter, categorize, and concisely summarize the available content to facilitate effective consumption and decision-making. To address such issues, automatic classification systems have been developed using supervised modeling approaches, thanks to earlier efforts to create labeled datasets. However, existing datasets are limited in several respects (e.g., size, presence of duplicates) and are less suitable for supporting more advanced, data-hungry deep learning models.
The RIMES database (Reconnaissance et Indexation de données Manuscrites et de fac similÉS / Recognition and Indexing of handwritten documents and faxes) was created to evaluate automatic systems of recognition and indexing of handwritten letters. Of particular interest are cases such as those sent by postal mail or fax by individuals to companies or administrations.
ManyTypes4Py is a large Python dataset for machine learning (ML)-based type inference. The dataset contains a total of 5,382 Python projects with more than 869K type annotations. Duplicate source code files were removed to eliminate the negative effect of duplication bias. To facilitate training and evaluation of ML models, the dataset was split into training, validation, and test sets by files. To extract type information from abstract syntax trees (ASTs), a lightweight static analyzer pipeline was developed and is provided with the dataset. Using this pipeline, the collected Python projects were analyzed, and the results of the AST analysis were stored in JSON-formatted files.
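Since the pipeline's output is stored as JSON per file, consuming it amounts to parsing those records and walking the extracted annotations. The structure below (files containing functions with parameter and return annotations) is an assumed layout for illustration; consult the released pipeline for the actual format.

```python
import json

# Hypothetical sketch of consuming ManyTypes4Py-style JSON output.
# The schema (src / set / funcs with params and ret) is illustrative
# only, not the dataset's documented format.

record = json.loads("""
{
  "src": "pkg/utils.py",
  "set": "train",
  "funcs": [
    {"name": "scale", "params": {"x": "float", "factor": "float"}, "ret": "float"},
    {"name": "log", "params": {"msg": "str"}, "ret": "None"}
  ]
}
""")

def count_annotations(rec):
    """Count parameter and return type annotations in one file record."""
    total = 0
    for fn in rec["funcs"]:
        total += len(fn["params"])   # one annotation per typed parameter
        total += 1 if fn["ret"] else 0  # plus one for an annotated return
    return total

n = count_annotations(record)
```

Because the train/validation/test split is by files, the per-file `set` field in a layout like this would let a loader route each record to the right split without re-partitioning.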