Datasets

3,148 machine learning datasets


WMT-SLT

We provide separate training, development and test data. The training data is available right away. The development and test data will be released in several stages, starting with a release of the development sources only.

1 paper | 0 benchmarks | RGB Video, Texts

MTTN

MTTN is a large-scale derived and synthesized dataset built on real prompts and indexed with popular image-text datasets such as MS-COCO and Flickr. MTTN consists of over 2.4M sentences divided over 5 stages, creating a combination amounting to over 12M pairs, along with a vocabulary of more than 300 thousand unique words, which creates an abundance of variations.

1 paper | 0 benchmarks | Texts

UICaption

UICaption is a dataset of 114k UI images paired with descriptions of their functionality. It is designed for the tasks of UI action entailment, instruction-based UI image retrieval, grounding referring expressions, and UI entity recognition.

1 paper | 0 benchmarks | Images, Texts

REN-20k Dataset

Reader Emotion News 20k Dataset

1 paper | 0 benchmarks | Texts

EHT (The English Headline Treebank)

The English Headline Treebank (EHT) contains 1,055 manually annotated and adjudicated Universal Dependencies (UD) syntactic dependency trees, intended to encourage research on improving NLP pipelines for English headlines.

1 paper | 0 benchmarks | Texts

DocOIE

We manually annotate 800 sentences from 80 documents in two domains (Healthcare and Transportation) to form a DocOIE dataset for evaluation.

1 paper | 0 benchmarks | Texts

ASR-RAMC-BIGCCSC: A CHINESE CONVERSATIONAL SPEECH CORPUS

The Rich Annotated Mandarin Conversational (RAMC) Speech Dataset includes 180 hours of Mandarin Chinese dialogue, split into 150, 10, and 20 hours for the training, development, and test sets respectively. It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered around one theme.

1 paper | 0 benchmarks | Audio, Texts

PIE (Performance Improving Code Edits)

PIE stands for Performance Improving Code Edits. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program’s performance.

1 paper | 0 benchmarks | Texts

Press Briefing Claim Dataset

The dataset contains a total of 53 press briefings from a time span of over four years (2017-2021). While, on average, one press briefing per month is held, the distribution is highly skewed towards recent years.

1 paper | 0 benchmarks | Texts

Dubbing Test Set

Dubbing Test Set consists of two subsets extracted from the En→De test set of COVOST-2, a large-scale multilingual speech translation corpus based on Common Voice. The first subset is created by randomly sampling 91 sentences (test91), while the second consists of 101 sentences randomly sampled from the longest 10% of the De part of the test set (test101).

1 paper | 0 benchmarks | Texts, Videos

OpenD5

OpenD5 is a meta-dataset that aggregates 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health, and uses a set of unified evaluation metrics: validity, relevance, novelty, and significance. It is designed for the new task, D5, which automatically discovers differences between two large corpora in a goal-driven way.

1 paper | 0 benchmarks | Texts

Winograd Automatic (Winograd)

The Winograd schema challenge comprises tasks with syntactic ambiguity that can be resolved with logic and reasoning.

1 paper | 1 benchmark | Texts

IoT-23 (IoT-23: A labeled dataset with malicious and benign IoT network traffic)

IoT-23 is a dataset of network traffic from Internet of Things (IoT) devices. It has 20 malware captures executed in IoT devices, and 3 captures of benign IoT device traffic. It was first published in January 2020, with captures ranging from 2018 to 2019. This IoT network traffic was captured in the Stratosphere Laboratory, AIC group, FEL, CTU University, Czech Republic. Its goal is to offer a large dataset of real and labeled IoT malware infections and IoT benign traffic for researchers to develop machine learning algorithms. This dataset and its research were funded by Avast Software. The malware was allowed to connect to the Internet.

1 paper | 0 benchmarks | Texts, Tracking

VTQA (Visual Text Question Answering)

VTQA is a dataset containing open-ended questions about image-text pairs. It requires the model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of this dataset is to develop and benchmark models that are capable of multimedia entity alignment, multi-step reasoning and open-ended answer generation. The VTQA dataset consists of 10,238 image-text pairs and 27,317 questions. The images are real images from the MSCOCO dataset, containing a variety of entities. The annotators are required to first annotate relevant text according to the image, then ask questions based on the image-text pair, and finally give an open-ended answer to each question.

1 paper | 0 benchmarks | Images, Texts

PubChem18 (PubChem 2018)

An open, large-scale dataset for zero-shot drug discovery derived from PubChem. We constructed a large public dataset extracted from PubChem (Kim et al., 2019; Preuer et al., 2018), an open chemistry database, and the largest collection of readily available chemical data. We take assays ranging from 2004 to 2018-05. It initially comprises 224,290,250 records of molecule-bioassay activity, corresponding to 2,120,854 unique molecules and 21,003 unique bioassays. We find that some molecule-bioassay pairs have multiple activity records, which may not all agree. We reduce every molecule-bioassay pair to exactly one activity measurement by applying majority voting. Molecule-bioassay pairs with ties are discarded. This step yields our final bioactivity dataset, which features 223,219,241 records of molecule-bioassay activity, corresponding to 2,120,811 unique molecules and 21,002 unique bioassays ranging from AID 1 to AID 1259411. Molecules range up to CID 132472079. The dataset has 3 di

1 paper | 0 benchmarks | Biology, Texts
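
The majority-voting step described for PubChem18 (one activity value per molecule-bioassay pair, with tied pairs discarded) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `majority_vote`, the tuple layout of a record, and the toy labels are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority_vote(records):
    """Collapse (molecule_id, assay_id, label) records to one label per pair.

    Pairs whose most frequent labels are tied are dropped.
    """
    votes = defaultdict(Counter)
    for molecule_id, assay_id, label in records:
        votes[(molecule_id, assay_id)][label] += 1

    resolved = {}
    for pair, counts in votes.items():
        ranked = counts.most_common(2)
        # A tie exists when the two most frequent labels have equal counts.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        resolved[pair] = ranked[0][0]
    return resolved


# Hypothetical toy records; the IDs and labels are illustrative only.
records = [
    (1, 100, "active"),
    (1, 100, "active"),
    (1, 100, "inactive"),
    (2, 100, "active"),
    (2, 100, "inactive"),
    (3, 200, "inactive"),
]
result = majority_vote(records)
```

In this toy input, the pair (2, 100) has one vote for each label and is therefore dropped, while the other two pairs keep their majority label.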

LPSC (Planetary Science Data Set)

This data set contains annotated text versions of 1635 two-page abstracts published at the Lunar and Planetary Science Conference from 1998 to 2020 of relevance to four Mars missions. The annotations were generated using named entity recognition and relation extraction provided by the MTE processing pipeline (available at https://github.com/wkiri/MTE), followed by manual review. Annotated entities include Element, Mineral, Property, and Target. Annotated relations include Contains(Target, Element | Mineral) and HasProperty(Target, Property). The extracted information (without full texts) is also available as a database (stored in .csv files) at https://pds-geosciences.wustl.edu/missions/mte/mte.htm .

1 paper | 0 benchmarks | Texts
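
The entity and relation types listed for LPSC suggest a small typed schema. The sketch below is one possible way to represent and check it; the class names, the `is_valid` helper, and the example strings are hypothetical and do not reflect the MTE pipeline's actual file format.

```python
from dataclasses import dataclass

# Entity types named in the dataset description.
ENTITY_TYPES = {"Element", "Mineral", "Property", "Target"}

# Allowed (head type, tail types) per relation, taken from the description:
# Contains(Target, Element | Mineral) and HasProperty(Target, Property).
RELATION_SCHEMA = {
    "Contains": ("Target", {"Element", "Mineral"}),
    "HasProperty": ("Target", {"Property"}),
}


@dataclass(frozen=True)
class Entity:
    text: str
    type: str


@dataclass(frozen=True)
class Relation:
    name: str
    head: Entity
    tail: Entity


def is_valid(relation):
    """Return True if the relation's argument types match the schema."""
    if relation.name not in RELATION_SCHEMA:
        return False
    head_type, tail_types = RELATION_SCHEMA[relation.name]
    return relation.head.type == head_type and relation.tail.type in tail_types


# Hypothetical example entities, for illustration only.
target = Entity("Example target", "Target")
mineral = Entity("hematite", "Mineral")
ok = is_valid(Relation("Contains", target, mineral))
bad = is_valid(Relation("HasProperty", target, mineral))
```

Here `ok` is true because a Target may contain a Mineral, and `bad` is false because HasProperty expects a Property as its second argument.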

AISIA-VN-Review-S (AISIA-VN-Review-F)

For the AISIA-VN-Review-S and AISIA-VN-Review-F datasets, we first collect 450K customer review comments from various e-commerce websites. Then, we manually label each review as either positive or negative, resulting in 358,743 positive reviews and 100,699 negative reviews. We call the full version of this sentiment classification dataset AISIA-VN-Review-F. However, in this work, we are interested in improving the model's performance when the training data are limited; thus, we only consider a subset of up to 25K training reviews and evaluate the model on another 170K reviews. We refer to this subset of the full dataset as AISIA-VN-Review-S. It is important to emphasize that our team spent a lot of time and effort manually classifying each review as positive or negative.

1 paper | 0 benchmarks | Texts

LEA-GCN-dataset

Caselaw4

ShapeIt
