3,148 machine learning datasets
A fact-based text editing dataset built on the RotoWire dataset.
Pan+ChiPhoto is a Chinese character dataset built by combining two datasets: ChiPhoto and the Pan_Chinese_Character dataset. The images were mainly captured outdoors in Beijing and Shanghai, China, and cover various scenes such as signs, boards, advertisements, banners, and objects with text printed on their surfaces.
The IT Translation Task is a shared task introduced in the First Conference on Machine Translation. Compared to WMT 2016 News, this task brought several novelties to WMT:
The Biomedical Translation Shared Task was first introduced at the First Conference on Machine Translation. The task aims to evaluate systems for the translation of biomedical titles and abstracts from scientific publications. The data includes three language pairs (English ↔ Portuguese, English ↔ Spanish, English ↔ French) and two sub-domains: biological sciences and health sciences.
The Medical Translation Task of WMT 2014 addresses the problem of domain-specific and genre-specific machine translation. The task is split into two subtasks: summary translation, focused on translation of sentences from summaries of medical articles, and query translation, focused on translation of queries entered by users into medical information search engines. Both subtasks included translation between English and Czech, German, and French, in both directions.
News translation is a recurring WMT task. The test set is a collection of parallel corpora consisting of about 1500 English sentences translated into 5 languages (Czech, German, Finnish, French, Russian) and an additional 1500 sentences from each of the 5 languages translated into English. The sentences are taken from newspaper articles for each language pair, except for French, where the test set was drawn from user-generated comments on news articles (from The Guardian and Le Monde). The translation was done by professional translators.
Refer360° is a novel large-scale referring expression recognition dataset consisting of 17,137 instruction sequences and ground-truth actions for completing these instructions in 360° scenes.
The OQRanD and OQGenD datasets accompany the paper "Asking the Crowd: Question Analysis, Evaluation and Generation for Open Discussion on Online Forums" by Zi Chai, Xinyu Xing, Xiaojun Wan and Bo Huang. The paper was accepted at ACL'19.
ScienceExamCER is a collection of resources for studying explanation-centered inference, including explanation graphs for 1,680 questions, with 4,950 tablestore rows, and other analyses of the knowledge required to answer elementary and middle-school science questions.
The TupleInf Open IE dataset contains Open IE tuples extracted from 263K sentences that were used by the solver in “Answering Complex Questions Using Open Information Extraction” (referred to as Tuple KB, T). These sentences were collected from a large Web corpus using training questions from 4th and 8th grade as queries. The dataset contains 156K sentences collected for 4th grade questions and 107K sentences for 8th grade questions. Each sentence is followed by its Open IE v4 tuples in the extractor's simple format.
This dataset consists of virtual scenes rendered in MuJoCo, each with multiple views presented in multiple modalities: images and synthetic or natural language descriptions. Each scene consists of two or three objects placed in a square walled room. For each of the 10 camera viewpoints, the authors rendered a 3D view of the scene as seen from that viewpoint, along with a synthetically generated description of the scene.
The dataset collects all courses from XuetangX, one of the largest MOOC platforms in China, yielding 1,951 courses. The collected courses cover seven areas: computer science, economics, engineering, foreign language, math, physics, and social science. Course descriptions contain 131 words on average. The dataset also contains 706 job postings from the recruiting website operated by JD.com (JD) and 2,456 job postings from the website owned by Tencent corporation (Tencent). The collected job postings cover six areas: technical post, financial post, product post, design post, market post, and supply chain and engineering post.
The relational pattern similarity dataset is a new dataset built upon the work of Zeichner et al. (2012), which consists of relational patterns annotated with semantic inference labels. The dataset includes 5,555 pairs extracted by Reverb (Fader et al., 2011): 2,447 pairs with an inference relation and the remaining 3,108 pairs without one.
The satire dataset is a new multi-modal dataset of satirical and regular news articles. The satirical news is collected from four websites that explicitly declare themselves to be satire, and the regular news is collected from six mainstream news websites. Specifically, the satirical news websites the articles were collected from are The Babylon Bee, Clickhole, Waterford Whisper News, and The DailyER. The regular news websites are Reuters, The Hill, Politico, New York Post, Huffington Post, and Vice News. The headlines and the thumbnail images of the latest 1000 articles for each of the publications are collected. The dataset contains a total of 4000 satirical and 6000 regular news articles.
NText is an eight-million-word dataset extracted and preprocessed from nuclear research papers and theses.
The resource includes two datasets of document-level post-editing action sequences, one for English-French (En-Fr) and another for English-German (En-De). For each dataset, the action sequences for full documents are provided, along with an editor identifier. The sequences include edit operations from keystrokes, mouse actions, and waiting times.
Food.com Recipes and Interactions consists of 270K recipes and 1.4M user-recipe interactions (reviews) scraped from Food.com, covering a period of 18 years (January 2000 to December 2018).
CSAbstruct is a new dataset of annotated computer science abstracts with sentence labels according to their rhetorical roles. The key difference between this dataset and PUBMED-RCT is that PubMed abstracts are written according to a predefined structure, whereas computer science papers are free-form. Therefore, there is more variety in writing styles in CSABSTRUCT. CSABSTRUCT is collected from the Semantic Scholar corpus (Ammar et al., 2018). Each sentence is annotated by 5 workers on the Figure-eight platform with one of 5 categories {BACKGROUND, OBJECTIVE, METHOD, RESULT, OTHER}.
The SOLO Corpus comprises over 4 million English tweets, each of which contains at least one of the following tokens: solitude, lonely, and loneliness. The corpus has been collected to analyze the language and emotions associated with the state of being alone in English tweets.
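The selection criterion described for the SOLO Corpus can be illustrated with a minimal sketch. The three tokens come from the description above; the whole-word, case-insensitive matching rule and the function names `mentions_solo_token` and `filter_tweets` are assumptions made for illustration, not the corpus authors' actual collection procedure.

```python
import re

# Tokens named in the corpus description.
SOLO_TOKENS = ("solitude", "lonely", "loneliness")

# Assumption: whole-word, case-insensitive matching. The description does
# not say how tokens were matched when the corpus was collected.
_TOKEN_RE = re.compile(r"\b(" + "|".join(SOLO_TOKENS) + r")\b", re.IGNORECASE)


def mentions_solo_token(tweet: str) -> bool:
    """Return True if the tweet contains at least one of the SOLO tokens."""
    return _TOKEN_RE.search(tweet) is not None


def filter_tweets(tweets):
    """Keep only tweets that mention at least one SOLO token, in order."""
    return [t for t in tweets if mentions_solo_token(t)]
```

Under these assumptions, a tweet such as "So lonely." would be kept, while one that only says "alone again" would be dropped.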
The proposed dataset includes 1,309 short text instances from Adobe Spark. The dataset is a collection of publicly available sample texts created by different designers. It covers a variety of topics found in posters, flyers, motivational quotes and advertisements.