Datasets

19,997 machine learning datasets

AmbigQA

AmbigQA is an open-domain question answering task that involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. The dataset covers 14,042 questions from NQ-open, an existing open-domain QA benchmark.

3 papers | 0 benchmarks | Texts

Arabic Dataset for Commonsense Validation

A benchmark Arabic dataset for commonsense understanding and validation, accompanied by baseline research and models trained on the same dataset.

3 papers | 0 benchmarks

ArraMon

A dataset (in English, and also extended to Hindi) of human-written navigation and assembly instructions with the corresponding ground-truth trajectories.

3 papers | 0 benchmarks

ASSET Corpus

A crowdsourced multi-reference corpus where each simplification was produced by executing several rewriting transformations.

3 papers | 0 benchmarks

Atlas

Atlas is a dataset for e-commerce clothing product categorization. It consists of a high-quality product taxonomy focused on clothing products, containing 186,150 images under the clothing category, with 3 levels and 52 leaf nodes in the taxonomy.

3 papers | 0 benchmarks | Images

BanglaLekha-Isolated

This dataset contains Bangla handwritten numerals, basic characters, and compound characters. It was collected from multiple geographical locations within Bangladesh and includes samples from a variety of age groups. It can also be used for other classification problems, namely gender, age, and district.

3 papers | 0 benchmarks | Images

Caltech Pedestrian Dataset

The Caltech Pedestrian Dataset consists of approximately 10 hours of 640x480, 30 Hz video taken from a vehicle driving through regular traffic in an urban environment. About 250,000 frames (in 137 segments, each approximately one minute long) with a total of 350,000 bounding boxes and 2,300 unique pedestrians were annotated. The annotation includes temporal correspondence between bounding boxes and detailed occlusion labels.

3 papers | 2 benchmarks

capes

Approximately 240,000 documents were collected and aligned using the Hunalign tool.

3 papers | 0 benchmarks

CC-19

CC-19 is a small new dataset related to the latest family of coronavirus, i.e. COVID-19. It contains 34,006 CT scan slices (images) belonging to 98 subjects, of which 28,395 slices belong to COVID-positive patients.

3 papers | 0 benchmarks | Images

Composable activities dataset

The Composable activities dataset consists of 693 videos that contain activities in 16 classes performed by 14 actors. Each activity is composed of 3 to 11 atomic actions. RGB-D data for each sequence is captured using a Microsoft Kinect sensor, and the positions of relevant body joints are estimated.

3 papers | 0 benchmarks | RGB-D, Videos

CONVERSE

A novel dataset that represents complex conversational interactions between two individuals via 3D pose. Eight pairwise interactions describing seven separate conversation-based scenarios were collected using two Kinect depth sensors.

3 papers | 0 benchmarks

COS960

A benchmark dataset of 960 word pairs for Chinese wOrd Similarity, where all words have two morphemes and fall under three part-of-speech (POS) tags, annotated by humans for similarity rather than relatedness.

3 papers | 0 benchmarks

COUNTER

The COUNTER (COrpus of Urdu News TExt Reuse) corpus contains 600 source-derived document pairs collected from the field of journalism. It can be used to evaluate monolingual text reuse detection systems in general, and for the Urdu language in particular.

3 papers | 0 benchmarks

COVID-CQ

COVID-CQ is a stance data set of user-generated content on Twitter in the context of COVID-19.

3 papers | 0 benchmarks | Texts

CPP (Chinese Polyphones with Pinyin)

A benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation.

3 papers | 1 benchmark

CS (Chinese Simile)

This dataset is constructed from free-access online fiction tagged with sci-fi, urban novel, love story, youth, etc. It is used for Writing Polishment with Simile (WPS), a task that aims to polish plain text with similes. All similes are extracted with rich regular expressions, and the extraction precision is estimated at 92% by labelling 500 randomly extracted samples. The dataset contains 5M samples for training and 2.5k each for validation and test.

3 papers | 0 benchmarks | Texts

CTC (COCO-Text Captioned)

A dataset that allows exploration of cross-modal retrieval where images contain scene-text instances.

3 papers | 0 benchmarks | Images

CURE-TSD (CURE Traffic Sign Detection)

CURE-TSD is based on simulated challenging conditions that correspond to adversaries that can occur in real-world environments and systems.

3 papers | 0 benchmarks

DAWT (Densely Annotated Wikipedia Texts)

The DAWT dataset consists of Densely Annotated Wikipedia Texts across multiple languages. The annotations include labeled text mentions mapped to entities (represented by their Freebase machine IDs) as well as the type of each entity. The dataset contains a total of 13.6M articles, 5.0B tokens, and 13.8M mention-entity co-occurrences. DAWT contains 4.8 times more anchor-text-to-entity links than originally present in the Wikipedia markup. It spans several languages, including English, Spanish, Italian, German, French, and Arabic.

3 papers | 0 benchmarks | Texts

DECADE

DECADE is a large-scale dataset of egocentric videos from a dog's perspective, along with the dog's corresponding movements.

3 papers | 0 benchmarks
