19,997 machine learning datasets
DPPIN is a collection of dynamic networks, which consists of twelve generated dynamic protein-protein interaction networks of yeast cells, stored in twelve folders.
This dataset contains vibration data recorded on a rotating drive train. This drive train consists of an electronically commutated DC motor and a shaft driven by it, which passes through a roller bearing. With the help of a 3D-printed holder, unbalances with different weights and different radii were attached to the shaft. Besides the strength of the unbalances, the rotation speed of the motor was also varied. This dataset can be used to develop and test algorithms for the automatic detection of unbalances on drive trains. Datasets for 4 differently sized unbalances and for the unbalance-free case were recorded. The vibration data was recorded at a sampling rate of 4096 values per second. Datasets for development (ID "D[0-4]") as well as for evaluation (ID "E[0-4]") are available for each unbalance strength. The rotation speed was varied between approx. 630 and 2330 RPM in the development datasets and between approx. 1060 and 1900 RPM in the evaluation datasets. For each measurement of
The Common Crawl corpus contains petabytes of data collected over 12 years of web crawling. The corpus contains raw web page data, metadata extracts and text extracts. Common Crawl data is stored on Amazon Web Services’ Public Data Sets and on multiple academic cloud platforms across the world.
CADSketchNet is an annotated collection of sketches of 3D CAD models.
AI Playground (AIP) is an open-source, Unreal Engine-based tool for generating and labeling virtual image data. With AIP, it is trivial to capture the same image under different conditions (e.g., fidelity, lighting, etc.) and with different ground truths (e.g., depth or surface normal values). AIP is easily extendable and can be used with or without code.
Dataset of 676 security vulnerability patches. In 2017, we mined the commit messages of 238 projects using regular expressions for each vulnerability (cf. Patterns). In 2020, we classified the vulnerabilities using the CWE taxonomy. Some vulnerabilities include score and severity information (CVEs).
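The mining step described above can be sketched roughly as follows. The pattern names and regular expressions here are hypothetical placeholders chosen for illustration; the dataset's own patterns (the "Patterns" it refers to) are not reproduced, and `classify_commit` is an assumed helper name.

```python
import re

# Hypothetical patterns for illustration only; these are NOT the
# dataset's actual regular expressions.
PATTERNS = {
    "xss": re.compile(r"\b(xss|cross[- ]site scripting)\b", re.IGNORECASE),
    "sqli": re.compile(r"\bsql[- ]?injection\b", re.IGNORECASE),
    "overflow": re.compile(r"\bbuffer[- ]overflow\b", re.IGNORECASE),
}


def classify_commit(message):
    """Return the sorted names of all patterns that match a commit message."""
    return sorted(
        name for name, pattern in PATTERNS.items() if pattern.search(message)
    )


if __name__ == "__main__":
    for msg in ("Fix XSS in comment form", "Update README"):
        print(msg, "->", classify_commit(msg))
```

A commit message can match more than one pattern, so the helper returns a list rather than a single label.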
This resource is designed to support research into Natural Language Generation, in particular neural data-to-text approaches, although it is not limited to these.
Cylinder in Crossflow is a synthetic dataset of unsteady laminar flow past a cylinder, which generates a vortex shedding pattern known as a von Kármán vortex street. The governing equations for this system are the incompressible Navier-Stokes equations. The cylinder has a diameter of 1 and the free stream velocity is 1. The kinematic viscosity $\nu$ is varied such that the Reynolds number is between 100 and 400. Symmetry boundary conditions are applied at the top and bottom edges of the domain and an open pressure boundary condition is applied at the outlet. Solutions are generated on an unstructured mesh of 6384 quad elements.
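As a small worked example, the viscosity range implied by the stated Reynolds numbers can be computed from the standard definition $Re = U D / \nu$, assuming the cylinder diameter is the characteristic length. The function name below is an assumption made for illustration.

```python
def kinematic_viscosity(reynolds, velocity=1.0, diameter=1.0):
    """Solve Re = U * D / nu for nu, using the cylinder diameter as length scale."""
    if reynolds <= 0:
        raise ValueError("Reynolds number must be positive")
    return velocity * diameter / reynolds


if __name__ == "__main__":
    # Endpoints of the Reynolds number range given in the description.
    for re_number in (100, 400):
        print(f"Re = {re_number}: nu = {kinematic_viscosity(re_number)}")
```

With unit diameter and unit free stream velocity, the viscosity is simply the reciprocal of the Reynolds number, so it ranges from 0.0025 (Re = 400) to 0.01 (Re = 100).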
Whereas the action recognition community has focused mostly on detecting simple actions like clapping, walking or jogging, the detection of fights or, in general, aggressive behaviors has been comparatively less studied. Such a capability may be extremely useful in some video surveillance scenarios, such as in prisons, psychiatric or elderly centers, or even in camera phones. After an analysis of previous approaches, we test the well-known Bag-of-Words framework used for action recognition on the specific problem of fight detection, along with two of the best action descriptors currently available: STIP and MoSIFT. For the purpose of evaluation and to foster research on violence detection in video, we introduce a new video database containing 1000 sequences divided into two groups: fights and non-fights. Experiments on this database and another one with fights from action movies show that fights can be detected with nearly 90% accuracy.
1000 songs were selected from the Free Music Archive (FMA). The annotated excerpts are available in the same package, with song IDs 1 to 1000. Some redundancies were identified, which reduced the dataset to 744 songs. The dataset is split into a development set (619 songs) and an evaluation set (125 songs). The extracted 45-second excerpts are all re-encoded to the same sampling frequency, i.e., 44100 Hz.
This dataset is the Hindi version of the standard English MSR-VTT dataset.
Wikidata-14M is a recommender system dataset for recommending items to Wikidata editors. It consists of 220,000 editors responsible for 14 million interactions with 4 million items.
Global WHEAT Dataset 2021 is an extension of the Global Wheat Dataset 2020. It is the first large-scale dataset for wheat head detection from field optical images. It includes a very large range of cultivars from different continents. Wheat is a staple crop grown all over the world, and consequently interest in wheat phenotyping spans the globe. Therefore, it is important that models developed for wheat phenotyping, such as wheat head detection networks, generalize between different growing environments around the world.
The Multinational Structured Address Dataset is a collection of addresses from 61 different countries. The addresses can be either "complete" (containing all the usual address components) or "incomplete" (missing some of the usual address components).
MyFood Dataset is an image database for segmenting images of Brazilian foods. It comprises 9 classes: rice, beans, boiled egg, fried egg, pasta, salad, roasted meat, apple and chicken breast. It has an average of 125 images per class and a total of 1250 images, split in a 60-20-20 ratio into training, validation and testing sets, respectively.
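A 60-20-20 split of this kind can be sketched as below. This is only an illustration under stated assumptions: the shuffling, the seed, and the `split_indices` helper are hypothetical, and the dataset's own split files should be used where available.

```python
import random


def split_indices(n_items, train_pct=60, val_pct=20, seed=0):
    """Shuffle item indices and split them into train/val/test lists.

    Integer percentages avoid floating-point rounding; the test set
    receives whatever remains after the train and validation shares.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # hypothetical fixed seed
    n_train = n_items * train_pct // 100
    n_val = n_items * val_pct // 100
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(1250)
    print(len(train), len(val), len(test))
```

For the 1250 images stated above, this yields 750 training, 250 validation and 250 testing images.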
RaidaR is a richly annotated image dataset of rainy street scenes. RaidaR consists of 58,542 real rainy images containing several rain-induced artifacts: fog, droplets, road reflections, etc. Of these, 5,000 images were carefully annotated with semantic segmentation and 3,658 with instance segmentation.
A dataset of tweets that reference the COVID-19 pandemic with emotion labels.
CalCROP21 is a georeferenced multi-spectral dataset of satellite imagery and crop labels. It is a semantic segmentation benchmark dataset for the diverse crops in the Central Valley region of California at 10 m spatial resolution, built using a robust image processing pipeline based on Google Earth Engine.
There is currently much interest and activity aimed at building powerful multi-purpose information systems. The agencies involved include DARPA, ARDA and NIST. Their programmes, for example DARPA's TIDES (Translingual Information Detection Extraction and Summarization) programme, ARDA's Advanced Question & Answering Program and NIST's TREC (Text Retrieval Conferences) programme cover a range of subprogrammes. These focus on different tasks requiring their own evaluation designs.
This is the dataset for the CGF 2021 paper "DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks".