3,148 machine learning datasets
We provide separate training, development and test data. The training data is available right away. The development and test data will be released in several stages, starting with a release of the development sources only.
MTTN is a large-scale derived and synthesized dataset built on real prompts and indexed with popular image-text datasets such as MS-COCO and Flickr. MTTN consists of over 2.4M sentences divided over 5 stages, which combine into over 12M pairs, with a vocabulary of more than 300 thousand unique words that provides an abundance of variations.
UICaption is a dataset of 114k UI images paired with descriptions of their functionality. It is designed for the tasks of UI action entailment, instruction-based UI image retrieval, grounding referring expressions, and UI entity recognition.
Reader Emotion News 20k Dataset
The English Headline Treebank (EHT) is an English headline treebank of 1,055 manually annotated and adjudicated universal dependency (UD) syntactic dependency trees to encourage research in improving NLP pipelines for English headlines.
We manually annotate 800 sentences from 80 documents in two domains (Healthcare and Transportation) to form a DocOIE dataset for evaluation.
A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset, including 180 hours of Mandarin Chinese dialogue: 150, 10 and 20 hours for the training, development and test sets respectively. It contains 351 multi-turn dialogues, each of which is a coherent and compact conversation centered around one theme.
PIE stands for Performance Improving Code Edits. PIE contains trajectories of programs, where a programmer begins with an initial, slower version and iteratively makes changes to improve the program’s performance.
Press Briefing Claim Dataset. The dataset contains a total of 53 press briefings from a time span of over four years (2017-2021). While, on average, one press briefing per month is held, the distribution is highly skewed towards recent years.
Dubbing Test Set consists of two subsets extracted from the En→De test set of COVOST-2, a large-scale multilingual speech translation corpus based on Common Voice. Specifically, the first subset is created by randomly sampling 91 sentences (test91), while the second consists of 101 sentences randomly sampled from the longest 10% of the De part of the test set (test101).
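The two sampling steps described for the Dubbing Test Set can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `sample_subsets`, the fixed seed, and measuring sentence length in characters are all assumptions, and the source does not say whether the two subsets may overlap.

```python
import random


def sample_subsets(pairs, n_random=91, n_long=101, top_fraction=0.10, seed=0):
    """Draw two subsets from a list of (en, de) sentence pairs.

    Hypothetical helper: length is measured in characters of the German
    side, which is an assumption; the source only says "longest 10%".
    """
    rng = random.Random(seed)

    # Subset 1: uniform random sample over the whole test set.
    random_subset = rng.sample(pairs, n_random)

    # Subset 2: uniform random sample restricted to the longest 10%
    # of pairs, ranked by the length of the German sentence.
    by_length = sorted(pairs, key=lambda p: len(p[1]), reverse=True)
    longest = by_length[: int(len(by_length) * top_fraction)]
    long_subset = rng.sample(longest, n_long)

    return random_subset, long_subset
```

`random.Random.sample` raises `ValueError` if the pool is smaller than the requested sample, so the test set must contain at least 1,010 pairs for the second subset to be drawn.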
OpenD5 is a meta-dataset which aggregates 675 open-ended problems ranging across business, social sciences, humanities, machine learning, and health, and uses a set of unified evaluation metrics: validity, relevance, novelty, and significance. It is designed for the new task, D5, that automatically discovers differences between two large corpora in a goal-driven way.
The Winograd schema challenge comprises tasks with syntactic ambiguity that can be resolved with logic and reasoning.
IoT-23 is a dataset of network traffic from Internet of Things (IoT) devices. It has 20 malware captures executed in IoT devices, and 3 captures of benign IoT device traffic. It was first published in January 2020, with captures ranging from 2018 to 2019. This IoT network traffic was captured in the Stratosphere Laboratory, AIC group, FEL, CTU University, Czech Republic. Its goal is to offer a large dataset of real and labeled IoT malware infections and IoT benign traffic for researchers to develop machine learning algorithms. This dataset and its research were funded by Avast Software. The malware was allowed to connect to the Internet.
VTQA is a dataset containing open-ended questions about image-text pairs. This dataset requires the model to align multimedia representations of the same entity to implement multi-hop reasoning between image and text and finally use natural language to answer the question. The aim of this dataset is to develop and benchmark models that are capable of multimedia entity alignment, multi-step reasoning and open-ended answer generation. The VTQA dataset consists of 10,238 image-text pairs and 27,317 questions. The images are real images from the MSCOCO dataset, containing a variety of entities. The annotators are required to first annotate relevant text according to the image, then ask questions based on the image-text pair, and finally answer each question in open-ended form.
A.2.1 AN OPEN, LARGE-SCALE DATASET FOR ZERO-SHOT DRUG DISCOVERY DERIVED FROM PUBCHEM We constructed a large public dataset extracted from PubChem (Kim et al., 2019; Preuer et al., 2018), an open chemistry database, and the largest collection of readily available chemical data. We take assays ranging from 2004 to 2018-05. It initially comprises 224,290,250 records of molecule-bioassay activity, corresponding to 2,120,854 unique molecules and 21,003 unique bioassays. We find that some molecule-bioassay pairs have multiple activity records, which may not all agree. We reduce every molecule-bioassay pair to exactly one activity measurement by applying majority voting. Molecule-bioassay pairs with ties are discarded. This step yields our final bioactivity dataset, which features 223,219,241 records of molecule-bioassay activity, corresponding to 2,120,811 unique molecules and 21,002 unique bioassays ranging from AID 1 to AID 1259411. Molecules range up to CID 132472079. The dataset has 3 di
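The deduplication step described for the PubChem-derived dataset (one activity measurement per molecule-bioassay pair by majority voting, with tied pairs discarded) can be sketched as below. This is an illustrative sketch only: the function name `majority_vote`, the tuple layout, and the example labels `"active"` and `"inactive"` are assumptions, not taken from the source.

```python
from collections import Counter, defaultdict


def majority_vote(records):
    """Collapse repeated (molecule, assay) records to a single label.

    `records` is an iterable of (molecule_id, assay_id, label) tuples.
    This layout is hypothetical; the source does not specify a schema.
    Pairs whose top labels are tied are dropped.
    """
    votes = defaultdict(Counter)
    for molecule_id, assay_id, label in records:
        votes[(molecule_id, assay_id)][label] += 1

    resolved = {}
    for pair, counts in votes.items():
        ranked = counts.most_common(2)
        # A tie exists when the two most frequent labels have equal counts.
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            continue
        resolved[pair] = ranked[0][0]
    return resolved
```

For example, a pair recorded twice as active and once as inactive resolves to active, while a pair recorded once with each label is discarded.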
This data set contains annotated text versions of 1635 two-page abstracts published at the Lunar and Planetary Science Conference from 1998 to 2020 of relevance to four Mars missions. The annotations were generated using named entity recognition and relation extraction provided by the MTE processing pipeline (available at https://github.com/wkiri/MTE), followed by manual review. Annotated entities include Element, Mineral, Property, and Target. Annotated relations include Contains(Target, Element | Mineral) and HasProperty(Target, Property). The extracted information (without full texts) is also available as a database (stored in .csv files) at https://pds-geosciences.wustl.edu/missions/mte/mte.htm .
For the AISIA-VN-Review-S and AISIA-VN-Review-F datasets, we first collect 450K customer review comments from various e-commerce websites. Then, we manually label each review as either positive or negative, resulting in 358,743 positive reviews and 100,699 negative reviews. We named this dataset the sentiment classification from reviews collected by AISIA, the full version (AISIA-VN-Review-F). However, in this work, we are interested in improving the model’s performance when the training data are limited; thus, we only consider a subset of up to 25K training reviews and evaluate the model on another 170K reviews. We refer to this subset of the full dataset as AISIA-VN-Review-S. It is important to emphasize that our team spent a lot of time and effort manually classifying each review as positive or negative.
The datasets of "Towards Lightweight Cross-domain Sequential Recommendation via External Attention-enhanced Graph Convolution Network" (DASFAA 2023)
Caselaw4 is a dataset of 350k common law judicial decisions from the U.S. Caselaw Access Project, of which 250k have been automatically annotated with binary outcome labels of AFFIRM and REVERSE.
The ShapeIt dataset introduced by Alper et al. (2023) consists of 109 nouns and noun phrases along with the basic shape normally associated with that item, chosen from the set {circle, rectangle, triangle}.