19,997 machine learning datasets
A new open-domain question answering task that involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. The dataset covers 14,042 questions from NQ-open, an existing open-domain QA benchmark.
A benchmark Arabic dataset for commonsense understanding and validation, along with baseline research and models trained on the same dataset.
A dataset (in English, and also extended to Hindi) of human-written navigation and assembly instructions with the corresponding ground-truth trajectories.
A crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations.
Atlas is a dataset for e-commerce clothing product categorization. It consists of a high-quality product taxonomy focused on clothing products and contains 186,150 images under the clothing category, with 3 levels and 52 leaf nodes in the taxonomy.
This dataset contains Bangla handwritten numerals, basic characters and compound characters. It was collected from multiple geographical locations within Bangladesh and includes samples collected from a variety of age groups. The dataset can also be used for other classification problems, e.g., gender, age, district.
The Caltech Pedestrian Dataset consists of approximately 10 hours of 640x480 30Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames (in 137 segments, each approximately one minute long) with a total of 350,000 bounding boxes and 2,300 unique pedestrians were annotated. The annotation includes temporal correspondence between bounding boxes and detailed occlusion labels.
Approximately 240,000 documents were collected and aligned using the Hunalign tool.
CC-19 is a small new dataset related to the latest family of coronavirus, i.e., COVID-19. It contains 34,006 CT scan slices (images) from 98 subjects, of which 28,395 slices belong to COVID-positive patients.
The Composable activities dataset consists of 693 videos that contain activities in 16 classes performed by 14 actors. Each activity is composed of 3 to 11 atomic actions. RGB-D data for each sequence is captured using a Microsoft Kinect sensor, and the positions of relevant body joints are estimated.
A novel dataset that represents complex conversational interactions between two individuals via 3D pose. 8 pairwise interactions describing 7 separate conversation-based scenarios were collected using two Kinect depth sensors.
A benchmark dataset with 960 word pairs for Chinese wOrd Similarity, where all the words have two morphemes and span three part-of-speech (POS) tags, with human-annotated scores for similarity rather than relatedness.
The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 600 source-derived document pairs collected from the field of journalism. It can be used to evaluate monolingual text reuse detection systems in general, and for the Urdu language specifically.
COVID-CQ is a stance data set of user-generated content on Twitter in the context of COVID-19.
A benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation.
This dataset is constructed from free-access online fiction tagged with sci-fi, urban novel, love story, youth, etc. It is used for Writing Polishment with Simile (WPS), a task that aims to polish plain text with similes. All similes are extracted with rich regular expressions, and the extraction precision is estimated at 92% by labelling 500 randomly extracted samples. It contains 5M samples for training and 2.5k each for validation and test.
A dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.
A dataset based on simulated challenging conditions that correspond to adversaries that can occur in real-world environments and systems.
The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapping to entities (represented by their Freebase machine ids) as well as the type of the entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor-text-to-entity links than originally present in the Wikipedia markup. Moreover, it spans several languages, including English, Spanish, Italian, German, French and Arabic.
DECADE is a large-scale dataset of ego-centric videos from a dog's perspective as well as her corresponding movements.