19,997 machine learning datasets
Human keypoint dataset of anime/manga-style character illustrations. Extension of the AnimeDrawingsDataset, with additional features:
Technical information: dates range from 2017-09-11 to 2018-02-16 and the time interval is 1 minute. The data is stored as a MultiIndex CSV file that can be loaded with pandas.
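As a rough illustration of reading a MultiIndex CSV with pandas, the sketch below builds a tiny stand-in table, writes it to CSV, and reads it back. The column names (`asset_a`, `asset_b`, `open`, `close`), the two header levels, and the single timestamp index column are assumptions made for the example; the layout of the actual file is not described here, so the `header` and `index_col` arguments may need adjusting.

```python
import io

import pandas as pd

# Hypothetical stand-in data: two column levels and a 1-minute timestamp index.
columns = pd.MultiIndex.from_product(
    [["asset_a", "asset_b"], ["open", "close"]], names=["symbol", "field"]
)
index = pd.date_range("2017-09-11 00:00:00", periods=2, freq="min", name="time")
original = pd.DataFrame(
    [[1.0, 1.1, 2.0, 2.1], [1.1, 1.2, 2.1, 2.2]], index=index, columns=columns
)

# Write the table in the format pandas uses for MultiIndex columns.
buffer = io.StringIO()
original.to_csv(buffer)
buffer.seek(0)

# header=[0, 1] tells pandas that the first two rows together form the column
# MultiIndex; index_col=0 uses the first column (timestamps) as the row index.
df = pd.read_csv(buffer, header=[0, 1], index_col=0, parse_dates=True)

# Select one series by its (level 0, level 1) column key.
close_a = df[("asset_a", "close")]
print(close_a)
```

With a real file, the `buffer` argument would be replaced by the path to the downloaded CSV.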
The Mechanical MNIST Crack Path dataset contains Finite Element simulation results from phase-field models of quasi-static brittle fracture in heterogeneous material domains subjected to prescribed loading and boundary conditions. For all samples, the material domain is a square with a side length of $1$. There is an initial crack of fixed length ($0.25$) on the left edge of each domain. The bottom edge of the domain is fixed in $x$ (horizontal) and $y$ (vertical), the right edge of the domain is fixed in $x$ and free in $y$, and the left edge is free in both $x$ and $y$. The top edge is free in $x$, and in $y$ it is displaced such that, at each step, the displacement increases linearly from zero at the top right corner to the maximum displacement on the top left corner. Maximum displacement starts at $0.0$ and increases to $0.02$ by increments of $0.0001$ ($200$ simulation steps in total). The heterogeneous material distribution is obtained by adding rigid circular inclusions to the d
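The loading schedule described in this entry can be sketched in a few lines. The function names are hypothetical, and the coordinate convention (origin at the bottom left corner, so the left edge is at $x = 0$ and the right edge at $x = 1$) is an assumption; this is not the dataset's own generation code.

```python
def max_displacement_schedule(d_max=0.02, increment=0.0001):
    """Maximum top-edge displacement at each simulation step (after the initial 0.0)."""
    n_steps = round(d_max / increment)  # 200 steps for the values quoted above
    return [step * increment for step in range(1, n_steps + 1)]


def top_edge_displacement(x, d_step, side=1.0):
    """Prescribed y-displacement at position x on the top edge.

    Varies linearly from d_step at the top left corner (x = 0, assumed)
    to zero at the top right corner (x = side).
    """
    return d_step * (1.0 - x / side)


schedule = max_displacement_schedule()
print(len(schedule))                             # number of simulation steps
print(top_edge_displacement(0.5, schedule[-1]))  # mid-edge value at the final step
```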
Each dataset in the Mechanical MNIST collection contains the results of 70,000 (60,000 training examples + 10,000 test examples) finite element simulations of a heterogeneous material subject to large deformation. Mechanical MNIST is generated by first converting the MNIST bitmap images (http://www.pymvpa.org/datadb/mnist.html) to 2D heterogeneous blocks of material. Consistent with the MNIST bitmap ($28 \times 28$ pixels), the material domain is a $28 \times 28$ unit square. All simulations are conducted with the FEniCS computing platform (https://fenicsproject.org). The code to reproduce these simulations is hosted on GitHub (https://github.com/elejeune11/Mechanical-MNIST/tree/master/generate_dataset).
KanHope is a code-mixed hope speech dataset for equality, diversity, and inclusion in Kannada, an under-resourced Dravidian language. The dataset consists of 6,176 user-generated comments in code-mixed Kannada crawled from YouTube and manually labelled as bearing hope speech or not-hope speech.
The dataset contains traffic traces collected from 3 different VR applications. Researchers can use this dataset to replicate the behavior of real VR traffic directly in their studies, e.g., their simulations. Further information can be found in the repository.
MobIE is a German-language dataset which is human-annotated with 20 coarse- and fine-grained entity types and entity linking information for geographically linkable entities. The dataset consists of 3,232 social media texts and traffic reports with 91K tokens, and contains 20.5K annotated entities, 13.1K of which are linked to a knowledge base. A subset of the dataset is human-annotated with seven mobility-related, n-ary relation types, while the remaining documents are annotated using a weakly-supervised labeling approach implemented with the Snorkel framework.
BugRepo maintains a collection of bug reports that are publicly available for research purposes. Bug reports are a main data source for facilitating NLP-based research in software engineering. We categorize the datasets into the following research directions.
The dataset covers five domains (synthetic, document, street view, handwritten, and car license) with over five million images.
The Potsdam Commentary Corpus (PCC) is a corpus of 220 German newspaper commentaries (2,900 sentences, 44,000 tokens) taken from the online issues of the Märkische Allgemeine Zeitung (MAZ subcorpus) and Tagesspiegel (ProCon subcorpus). It is annotated with a range of different types of linguistic information.
To automatically generate Python and assembly programs used for security exploits, we curated a large dataset for feeding NMT techniques. A sample in the dataset consists of a snippet of code from these exploits and its corresponding description in the English language. We collected exploits from publicly available databases (exploitdb, shellstorm), public repositories (e.g., GitHub), and programming guidelines. In particular, we focused on exploits targeting Linux, the most common OS for security-critical network services, running on IA-32 (i.e., the 32-bit version of the x86 Intel Architecture). The dataset is stored in the folder EVIL/datasets and consists of two parts: i) Encoders: a Python dataset, which contains Python code used by exploits to encode the shellcode; ii) Decoders: an assembly dataset, which includes shellcode and decoders to revert the encoding.
Large-scale shadows from buildings in a city play an important role in determining the environmental quality of public spaces. They can be both beneficial, such as for pedestrians during summer, and detrimental, by impacting vegetation and by blocking direct sunlight. Determining the effects of shadows requires the accumulation of shadows over time across different periods in a year. In our paper "Shadow Accrual Maps: Efficient Accumulation of City-Scale Shadows over Time", we present a simple yet efficient class of approaches that uses the properties of sun movement to track the changing position of shadows within a fixed time interval. This repository presents the computed shadow information for New York City, Chicago, Los Angeles, Boston and Washington DC.
We introduce the SHAD3S dataset, which, for a given contour representation of a mesh under a given illumination condition, provides the illumination masks on the object, a shadow mask on the ground, and the object's diffuse and sketch renders.
This dataset consists of an unpaired and a paired set of images captured by two different smartphone cameras: Samsung Galaxy S9 and iPhone X. The unpaired set includes 196 images captured by each smartphone camera (392 in total). The paired set includes 115 pairs of images used for testing. In addition to this paired set, we have another small set of 22 anchor paired images
This failure dataset contains information on the events collected in the OpenStack cloud computing platform during three different campaigns of fault-injection experiments performed with three different workloads.
The emerging MPEG point cloud codecs (V-PCC and G-PCC variants) are assessed, and best practices for rate allocation are investigated [1]. For this purpose, three experiments are conducted. In the first experiment, a rigorous evaluation of the codecs is performed, adopting test conditions dictated by experts of the group on a carefully selected set of models, using both subjective and objective quality assessment methodologies. In the other two experiments, different rate allocation schemes for geometry-only and geometry-plus-color encoding are subjectively evaluated, in order to draw conclusions on the best-performing approaches in terms of perceived quality for a given bit rate.
Pano3D is a new benchmark for depth estimation from spherical panoramas. Its goal is to drive progress for this task in a consistent and holistic manner. The Pano3D 360 depth estimation benchmark provides a standard Matterport3D train and test split, as well as a secondary GibsonV2 partitioning for testing and training. The latter is used for zero-shot cross-dataset transfer performance assessment and is decomposed into 3 different splits, each one focusing on a specific generalization axis.
Fashion-MMT is a large-scale bilingual product description dataset that contains over 114k noisy and 40k manually cleaned description translations with multiple product images.
The Graphine dataset contains 2,010,648 terminology-definition pairs organized in 227 directed acyclic graphs. Each node in a graph is associated with a terminology and its definition. Terminologies are organized from coarse-grained ones to fine-grained ones in each graph.
MuCo-VQA consists of large-scale (3.7M) multilingual and code-mixed VQA datasets in multiple languages: Hindi (hi), Bengali (bn), Spanish (es), German (de), and French (fr), plus the code-mixed language pairs en-hi, en-bn, en-fr, en-de and en-es.