19,997 machine learning datasets
https://github.com/YatingMusic/compound-word-transformer
Who's Waldo is a dataset of 270K image–caption pairs depicting interactions of people, automatically mined from Wikimedia Commons. It is a benchmark dataset for person-centric visual grounding, the problem of linking people named in a caption to people pictured in an image.
NLPContributionGraph was first introduced as Task 11 at SemEval 2021. The task is defined on a dataset of Natural Language Processing (NLP) scholarly articles with their contributions structured to be integrable within Knowledge Graph infrastructures such as the Open Research Knowledge Graph. The structured contribution annotations are provided as (1) Contribution sentences: a set of sentences about the contribution in the article; (2) Scientific terms and relations: a set of scientific terms and relational cue phrases extracted from the contribution sentences; and (3) Triples: semantic statements that pair scientific terms with a relation, modeled toward subject-predicate-object RDF statements for KG building. The Triples are organized under three (mandatory) or more of twelve total information units (viz., ResearchProblem, Approach, Model, Code, Dataset, ExperimentalSetup, Hyperparameters, Baselines, Results, Tasks, Experiments, and AblationAnalysis).
Collected in the snow belt region of Michigan's Upper Peninsula, WADS is the first multi-modal dataset featuring dense point-wise labeled sequential LiDAR scans collected in severe winter weather.
This dataset of motions is free for all uses.
The DSSE-200 is a complex document layout dataset including various dataset styles. The dataset contains 200 images from pictures, PPT, brochure documents, old newspapers and scanned documents.
OPERAnet is a multimodal activity recognition dataset acquired from radio frequency and vision-based sensors. Approximately 8 hours of annotated measurements are provided, which are collected across two different rooms from 6 participants performing 6 activities, namely, sitting down on a chair, standing from sit, lying down on the ground, standing from the floor, walking and body rotating. The dataset has been acquired from four synchronized modalities for the purpose of passive Human Activity Recognition (HAR) as well as localization and crowd counting.
SustainBench is a collection of 15 benchmark tasks across 7 sustainable development goals (SDGs), including tasks related to economic development, agriculture, health, education, water and sanitation, climate action, and life on land. The goals for SustainBench are to:
Over 1.5K images selected from the public Kaggle DR Detection dataset; Five DR grades (DR0 / DR1 / DR2 / DR3 / DR4), re-labeled by a panel of 45 experienced ophthalmologists; Eight retinal lesion classes, including microaneurysm, intraretinal hemorrhage, hard exudate, cotton-wool spot, vitreous hemorrhage, preretinal hemorrhage, neovascularization and fibrous proliferation; Over 34K expert-labeled pixel-level lesion segments; Multi-task, i.e., lesion segmentation, lesion classification, and DR grading.
WildReceipt is a collection of receipts. For each photo, it contains a list of OCR annotations, each with a bounding box, text, and class.
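As a rough illustration of the per-photo structure described above, the following Python sketch models one receipt as a list of OCR entries. The class and field names (`OcrEntry`, `Receipt`, `box`, `label`), the box layout, and the sample values are all hypothetical; they are not taken from the dataset's actual file format.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class OcrEntry:
    # Hypothetical field names; the real annotation schema may differ.
    box: tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)
    text: str
    label: str


@dataclass
class Receipt:
    image_path: str
    entries: list[OcrEntry] = field(default_factory=list)

    def label_counts(self) -> Counter:
        """Count how many OCR entries carry each class label."""
        return Counter(entry.label for entry in self.entries)


# Made-up example values, for illustration only.
receipt = Receipt(
    image_path="receipt_0001.jpg",
    entries=[
        OcrEntry((10, 12, 180, 40), "ACME STORE", "store_name"),
        OcrEntry((10, 300, 90, 325), "TOTAL", "total_key"),
        OcrEntry((200, 300, 260, 325), "12.50", "total_value"),
    ],
)
```

Calling `receipt.label_counts()` gives a per-class tally, which is one simple way to inspect the class distribution of a single photo.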
Dataset [46 M] and readme: 42,306 movie plot summaries extracted from Wikipedia, plus aligned metadata extracted from Freebase, including (1) movie box office revenue, genre, release date, runtime, and language; and (2) character names and aligned information about the actors who portray them, including gender and estimated age at the time of the movie's release. Supplement: Stanford CoreNLP-processed summaries [628 M], i.e., all of the plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref).
|           | Train | Validation | Test    | Ranking Test |
| --------- | ----- | ---------- | ------- | ------------ |
| size      | 0.4M  | 50K        | 5K      | 800          |
| pos:neg   | 1:1   | 1:9        | 1.2:8.8 | -            |
| avg turns | 5.0   | 5.0        | 5.0     | 5.0          |
OpenFWI is a collection of large-scale open-source benchmark datasets for seismic full waveform inversion (FWI). OpenFWI caters to the geoscience and machine learning communities, with the aim of facilitating diversified, rigorous and reproducible research on machine learning-based FWI.
Many e-shops have started to mark up product data within their HTML pages using the schema.org vocabulary. The Web Data Commons project regularly extracts such data from the Common Crawl, a large public web crawl. The Web Data Commons Training and Test Sets for Large-Scale Product Matching contain product offers from different e-shops in the form of binary product pairs (with corresponding label "match" or "no match") for four product categories: computers, cameras, watches and shoes.
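To make the "binary product pair" idea concrete, here is a minimal Python sketch of how one labeled pair might be represented and summarized. The `ProductPair` class, its field names, the `match_rate` helper, and the sample offers are assumptions made for illustration and do not reflect the dataset's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ProductPair:
    # Hypothetical field names; not the dataset's actual schema.
    offer_a: str
    offer_b: str
    category: str   # one of: computers, cameras, watches, shoes
    is_match: bool  # True for "match", False for "no match"


def match_rate(pairs: list[ProductPair]) -> float:
    """Return the fraction of pairs labeled "match" (0.0 for an empty list)."""
    if not pairs:
        return 0.0
    return sum(p.is_match for p in pairs) / len(pairs)


# Made-up offers, for illustration only.
pairs = [
    ProductPair("Acme X100 camera 24MP", "ACME X-100 digital camera", "cameras", True),
    ProductPair("Acme X100 camera 24MP", "Acme Z5 camera bag", "cameras", False),
]
```

A helper like `match_rate` is a quick way to check the class balance of a split before training a pairwise matcher.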
This is a new dataset of news headlines and their frames related to the issue of gun violence in the United States. This Gun Violence Frame Corpus (GVFC) was curated and annotated by journalism and communication experts. The articles in this dataset are drawn from a sample of news articles from a list of 30 top U.S. news websites, defined in terms of website traffic, and were collected from four time periods over the course of 2018 in order to capture a diversity of articles.
Amazon-Fraud is a multi-relational graph dataset built upon the Amazon review dataset, which can be used in evaluating graph-based node classification, fraud detection, and anomaly detection models.
The NLC2CMD Competition hosted at NeurIPS 2020 aimed to bring the power of natural language processing to the command line. Participants were tasked with building models that can transform descriptions of command line tasks in English to their Bash syntax.
Objective: This article summarizes the preparation, organization, evaluation, and results of Track 2 of the 2018 National NLP Clinical Challenges shared task. Track 2 focused on extraction of adverse drug events (ADEs) from clinical records and evaluated 3 tasks: concept extraction, relation classification, and end-to-end systems. We perform an analysis of the results to identify the state of the art in these tasks, learn from it, and build on it.
This dataset, called RodoSol-ALPR dataset, contains 20,000 images captured by static cameras located at pay tolls owned by the Rodovia do Sol (RodoSol) concessionaire, which operates 67.5 kilometers of a highway (ES-060) in the Brazilian state of Espírito Santo.
The OULU-NPU face presentation attack detection database consists of 4,950 real access and attack videos. These videos were recorded using the front cameras of six mobile devices (Samsung Galaxy S6 edge, HTC Desire EYE, MEIZU X5, ASUS Zenfone Selfie, Sony XPERIA C5 Ultra Dual and OPPO N3) in three sessions with different illumination conditions and background scenes. The presentation attack types considered in the OULU-NPU database are print and video-replay. The 2D face artefacts were created using two printers and two display devices.