19,997 machine learning datasets
19,997 dataset results
Caltech Fish Counting Dataset (CFC) is a large-scale dataset for detecting, tracking, and counting fish in sonar videos. This dataset contains over 1,500 videos sourced from seven different sonar cameras.
Florence 4D is a dataset that consists of dynamic sequences of 3D face models, where a combination of synthetic and real identities exhibit an unprecedented variety of 4D facial expressions, with variations that include the classical neutral-apex transition, but generalize to expression-to-expression. It is designed for research in 4D facial analysis, with a particular focus on dynamic expressions.
BAFMD contains images posted on Twitter during the pandemic from around the world with more images from underrepresented race and age groups to mitigate the problem for the face mask detection task.
Title2Event is a large-scale sentence-level dataset for benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages.
COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on singing style and expression of famous solo-singers.
Conversational Stance Detection (CSD) is a dataset with annotations of stances and the structures of conversation threads. It consists of 500 conversation threads (including 500 posts and 5376 comments) from six major social media platforms in Hong Kong.
DOORS is a dataset designed for boulders recognition, centroid regression, segmentation, and navigation applications. The dataset is divided into two sets:
Stanceosaurus is a corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, it introduces a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance.
Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64-dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% of the data to 40.0%), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure).
The data set contains several speakers. The 5 largest are listed individually, the rest are summarized as other. All audio files have a sampling rate of 44.1kHz. For each speaker, there is a clean variant in addition to the full data set, where the quality is even higher. Furthermore, there are various statistics. The dataset can also be used for automatic speech recognition (ASR) if audio files are converted to 16 kHz.
Tobacco800 is a public subset of the complex document image processing (CDIP) test collection constructed by Illinois Institute of Technology, assembled from 42 million pages of documents (in 7 million multi-page TIFF images) released by tobacco companies under the Master Settlement Agreement and originally hosted at UCSF.
his dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction. It has been used for the Display Advertising Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge/
The UFPR-Periocular dataset has 16,830 images of both eyes (33,660 cropped images of each eye) from 1,122 subjects (2,244 classes).
This dataset contains the data for the paper 'Using Multiple Instance Learning for Explainable Solar Flare Prediction'.
POPGym is designed to benchmark memory in deep reinforcement learning. It contains a set of environments and a collection of memory model baselines. The environments are all Partially Observable Markov Decision Process (POMDP) environments following the Openai Gym interface. Our environments follow a few basic tenets:
SeaTurtleID is a public large-scale, long-span dataset with sea turtle photographs captured in the wild. The dataset is suitable for benchmarking re-identification methods and evaluating several other computer vision tasks. It consists of 7774 high-resolution photographs of 400 unique individuals collected within 12 years in 1081 encounters. Each photograph is accompanied by rich metadata, e.g., identity label, head segmentation mask, and encounter timestamp.
transform the ImageNet-1K classification datatset for Chinese models by translating labels and prompts into Chinese.
Click to add a brief description of the dataset (Markdown and LaTeX enabled).
CORSMAL is a dataset for estimating the position and orientation in 3D (or 6D pose) of an object from a single view. The dataset consists of 138,240 images of rendered hands and forearms holding 48 synthetic objects, split into 3 grasp categories over 30 real backgrounds.
A vast amount of information in the biomedical domain is available as natural language free text. An increasing number of documents in the field are written in languages other than English. Therefore, it is essential to develop resources, methods and tools that address Natural Language Processing in the variety of languages used by the biomedical community. In this paper, we report on the development of an extensive corpus of biomedical documents in French annotated at the entity and concept level. Three text genres are covered, comprising a total of 103,056 words. Ten entity categories corresponding to UMLS Semantic Groups were annotated, using automatic pre-annotations validated by trained human annotators. The pre-annotation method was found helful for entities and achieved above 0.83 precision for all text genres. Overall, a total of 26,409 entity annotations were mapped to 5,797 unique UMLS concepts.