19,997 machine learning datasets
19,997 dataset results
AwA Pose is a large scale animal keypoint dataset with ground truth annotations for keypoint detection of quadruped animals from images.
Grammatical error correction dataset for text from Wikipedia.
The M5Product dataset is a large-scale multi-modal pre-training dataset with coarse and fine-grained annotations for E-products.
In this paper, we present AnlamVer, which is a semantic model evaluation dataset for Turkish designed to evaluate word similarity and word relatedness tasks while discriminating those two relations from each other. Our dataset consists of 500 word-pairs annotated by 12 human subjects, and each pair has two distinct scores for similarity and relatedness. Word-pairs are selected to enable the evaluation of distributional semantic models by multiple attributes of words and word-pair relations such as frequency, morphology, concreteness and relation types (e.g., synonymy, antonymy). Our aim is to provide insights to semantic model researchers by evaluating models in multiple attributes. We balance dataset word-pairs by their frequencies to evaluate the robustness of semantic models concerning out-of-vocabulary and rare words problems, which are caused by the rich derivational and inflectional morphology of the Turkish language. (from the original abstract of the dataset paper)
Draper VDISC Dataset - Vulnerability Detection in Source Code
Bentham manuscripts refers to a large set of documents that were written by the renowned English philosopher and reformer Jeremy Bentham (1748-1832). Volunteers of the Transcribe Bentham initiative transcribed this collection. Currently, >6 000 documents or > 25 000 pages have been transcribed using this public web platform. For our experiments, we used the BenthamR0 dataset a part of the Bentham manuscripts.
ParaShoot is the first question answering dataset in modern Hebrew. The dataset follows the format and crowdsourcing methodology of SQuAD, and contains approximately 3000 annotated examples, similar to other question-answering datasets in low-resource languages.
XTD10 is a dataset for cross-lingual image retrieval and tagging consisting of the MSCOCO2014 caption test dataset annotated in 7 languages that were collected using a crowdsourcing platform.
iShape is an irregular shape dataset for instance segmentation. iShape contains six sub-datasets with one real and five synthetics, each represents a scene of a typical irregular shape.
The Restaurant-ACOS dataset is constructed based on the SemEval 2016 Restaurant dataset (Pontiki et al., 2016) and its expansion datasets (Fan et al., 2019; Xu et al., 2020). The SemEval 2016 Restaurant dataset (Pontiki et al., 2016) was annotated with explicit and implicit aspects, categories, and sentiment. (Fan et al., 2019; Xu et al., 2020) further added the opinion annotations. We integrate their annotations to construct aspect-category-opinion-sentiment quadruples and further annotate the implicit opinions. The Restaurant-ACOS dataset contains 2286 sentences with 3658 quadruples. It is worth noting that the Restaurant-ACOS is available for all subtasks in ABSA, including aspect-based sentiment classification, aspect-sentiment pair extraction, aspect-opinion pair extraction, aspect-opinion sentiment triple extraction, aspect-category-sentiment triple extraction, etc.
Laptop-ACOS is a brand new Laptop dataset collected from the Amazon platform in the years 2017 and 2018 (covering ten types of laptops under six brands such as ASUS, Acer, Samsung, Lenovo, MBP, MSI, and so on). It contains 4,076 review sentences, much larger than the SemEval Laptop datasets. For Laptop-ACOS, we annotate the four elements and their corresponding quadruples all by ourselves. We employ the aspect categories defined in the SemEval 2016 Laptop dataset. The Laptop-ACOS dataset contains 4076 sentences with 5758 quadruples. As we have mentioned, a large percentage of the quadruples contain implicit aspects or implicit opinions . By comparing two datasets, it can be observed that Laptop-ACOS has a higher percentage of implicit opinions than Restaurant-ACOS . It is worth noting that the Laptop-ACOS is available for all subtasks in ABSA, including aspect-based sentiment classification, aspect-sentiment pair extraction, aspect-opinion pair extraction, aspect-opinion sentiment tri
A dataset of approximately 75,000 phrases and sentences, syntactically analyzed as typelogical derivations (i.e. proofs of modal intuitionistic linear logic, or programs of the corresponding λ calculus). Analyses were obtained by transforming the dependency graphs of the Lassy-Small corpus.
MovingFashion is a dataset for video-to-shop, the task of retrieving clothes which are worn in social media videos. MovingFashion is composed of 14,855 social videos, each one of them associated with e-commerce "shop" images where the corresponding clothing items are clearly portrayed.
The eBDtheque database is a selection of one hundred comic pages from America, Japan (manga) and Europe.
The DCM dataset is composed of 772 annotated images from 27 golden age comic books. We freely collected them from the free public domain collection of digitized comic books Digital Comics Museum. One album per available publisher was selected to get as many different styles as possible. We made ground-truth bounding boxes of all panels, all characters (body + faces), small or big, human-like or animal-like.
The twitter emoji dataset obtained from CodaLab comprises of 50 thousand tweets along with the associated emoji label. Each tweet in the dataset has a corresponding numerical label which maps to a specific emoji. The emojis are of the 20 most frequent emojis and hence the labels range from 0 to 19
Hi-resolution three-dimensional nonrigid shapes in a variety of poses for non-rigid shape similarity and correspondence experiments. The database contains a total of 80 objects, including 11 cats, 9 dogs, 3 wolves, 8 horses, 6 centaurs, 4 gorillas, 12 female figures, and two different male figures, containing 7 and 20 poses. Typical vertex count is about 50,000. Objects within the same class have the same triangulation and an equal number of vertices numbered in a compatible way. This can be used as a per-vertex ground truth correspondence in correspondence experiments. Two representations are available: MATLAB file (.mat) and ASCII text files containing the 1-based list of triangular faces (.tri), and a list of vertex XYZ coordinates (.vert). A .png thumbnail is available for each object.
IndoNLI is the first human-elicited NLI dataset for Indonesian consisting of nearly 18K sentence pairs annotated by crowd workers and experts.
IndoNLG is a benchmark to measure natural language generation (NLG) progress in three low-resource—yet widely spoken—languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case of NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks.
To the best of our knowledge, there is no prior dataset specifically constructed for composition assessment. To support the research on this task, we build a dataset upon the existing AADB dataset, from which we collect a total of 9,958 real-world photos. We adopt a composition rating scale from 1 to 5, where a larger score indicates better composition. We make annotation guidelines for composition quality rating and train five individual raters who specialize in fine art. So for each image, we can obtain five composition scores ranging from 1 to 5. Given the subjective nature of human aesthetic activity, we perform sanity check and consistency analysis. We use 240 additional “sanity check” images during annotating to roughly verify the validness of our annotations. We also examine the consistency of composition ratings provided by five individual raters (see Supplementary). We average the composition scores as the ground-truth composition mean score for each image. More details about