19,997 machine learning datasets
19,997 dataset results
Consists of eye movements and verbal descriptions recorded synchronously over images.
Consists of 10K sentence pairs which are human-annotated for semantic relatedness and entailment. The dataset may be used for the evaluation of compositional distributional semantics models of Polish.
The CECW dataset is a color-extended version of the Cleanup World (CW) borrowed from the mobile-manipulation robot domain. CW refers to a world equipped with a movable object as well as four rooms in four colors, including "blue," "green," "red," and "yellow," which is designed as a simulation environment where the agent can act based on the instructions received. CW obeys a particular Geometric Linear Temporal Logic (GLTL) to parse commands by grammatical syntax, resulting in a total of 3,382 commands reflecting 39 GLTL expressions.
The dataset has 93 image stacks and their corresponding Extended Depth of Field (EDF) image acquired from cases with grades Nagative, LSIL or HSIL (The Bethesda System): - Negative: 16 - LSIL: 46 - HSIL: 31 The ground truth includes the grade labels for each frame and manually marked points inside cervical cells in each frame. There are in total 2705 manually marked points inside all frames: - Negative: 238 - LSIL: 1536 - HSIL: 931
Large-scale Chinese legal dataset for judgment prediction. \dataset contains more than 2.6 million criminal cases published by the Supreme People's Court of China, which are several times larger than other datasets in existing works on judgment prediction.
Chinese Literature NER RE is a Discourse-Level Named Entity Recognition and Relation Extraction Dataset for Chinese Literature Text. It is constructed from hundreds of Chinese literature articles.
The Chinese Traditional Painting dataset for style transfer contains 1000 content images and 100 style images. The content images are mostly the photorealistic scenes of mountain, lake, river, bridge, and buildings in regions south of the Yangtze River. It includes not only the scenes of China, but also beautiful pictures of Rhine, Alps, Yellow Stone, Grand Canyon, etc. The content images include diverse types of Chinese traditional paintings.
CITR Dataset consists of experimentally designed fundamental VCI scenarios (front, back, and lateral VCIs) and provides unique ID for each pedestrian, which is suitable for exploring a specific aspect of VCI. DUT dataset gives two ordinary and natural VCI scenarios in crowded university campus, which can be used for more general purpose VCI exploration.
CLAD (Compled and Long Activities Dataset) is an activity dataset which exhibits real-life and diverse scenarios of complex, temporally-extended human activities and actions. The dataset consists of a set of videos of actors performing everyday activities in a natural and unscripted manner. The dataset was recorded using a static Kinect 2 sensor which is commonly used on many robotic platforms. The dataset comprises of RGB-D images, point cloud data, automatically generated skeleton tracks in addition to crowdsourced annotations.
A satellite-based dataset called "CloudCast". It consists of 70080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The spatial resolution of the dataset is 928 × 1530 pixels (3 × 3 km per pixel) with 15-min intervals between frames for the period January 1, 2017, to December 31, 2018. All frames are centered and projected over Europe.
The Coherent Multiple Choice Narrative Cloze (CMCNC) dataset is an evaluation dataset for the multi-choice narrative cloze task, where the goal is to distinguish which event has been held out from a document from a small set of randomly drawn events.
A diverse dataset of written code-switched productions, curated from topical threads of multiple bilingual communities on the Reddit discussion platform, and explore questions that were mainly addressed in the context of spoken language thus far.
The Composed Quora dataset consists of questions extracted from Quora that are grouped together if they are asking the same thing. The dataset contains 60,400 groups of questions, each group with at least 3 questions that are asking the same.
Corpus of controversial news articles extracted from Twitter. Contains news from three different topics: Beef Ban – controversy over the slaughter and sale of beef on religious grounds (1543 articles) is localised to a particular region, mainly Indian subcontinent, while Gun Control – restrictions on carrying, using, or purchasing firearms (6494 articles) and Capital Punishment – use of the death penalty (7905 articles) are topical in various regions around the world.
Contains more than 5,000 images of 10,000 liquid containers in context labelled with volume, amount of content, bounding box annotation, and corresponding similar 3D CAD models.
COSTRA 1.0 is a dataset of complex sentence transformations. The dataset is intended for the study of sentence-level embeddings beyond simple word alternations or standard paraphrasing. The first version of the dataset is limited to sentences in Czech but the construction method is universal and the authors plan to use it also for other languages. The dataset consist of 4,262 unique sentences with average length of 10 words, illustrating 15 types of modifications such as simplification, generalization, or formal and informal language variation.
The Covid19-CountryImage dataset is a Twitter dataset which contains COVID-19-related tweets.
The COVID-19 Posteroanterior Chest X-Ray fused (CPCXR) dataset is generated by the fusion of three publicly available datasets: COVID-19 cxr image, Radiological Society of North America (RSNA), and U.S. national library of medicine (USNLM) collected Montgomery country - NLM(MC). The dataset consists of samples of diseases labeled as COVID-19, Tuberculosis, Other pneumonia (SARS, MERS, etc.), and Normal. The dataset can be utilized to train an evaulate deep learning and machine learning models as binary and multi-class classification problem.
A large-scale database including substantial CU partition data for HEVC intra- and inter-modes. This enables deep learning on the CU partition.
Provides two large-scale multi-step benchmarks for biometric identification, where the visual appearance of different classes are highly relevant.