Datasets

19,997 machine learning datasets

CFC (Caltech Fish Counting Dataset)

Caltech Fish Counting Dataset (CFC) is a large-scale dataset for detecting, tracking, and counting fish in sonar videos. This dataset contains over 1,500 videos sourced from seven different sonar cameras.

3 papers | 0 benchmarks | Videos

Florence 4D

Florence 4D is a dataset of dynamic sequences of 3D face models, in which a combination of synthetic and real identities exhibits an unprecedented variety of 4D facial expressions. The variations include the classical neutral-apex transition but also generalize to expression-to-expression transitions. It is designed for research in 4D facial analysis, with a particular focus on dynamic expressions.

3 papers | 0 benchmarks | 3D

BAFMD (Bias-Aware Face Mask Detection Dataset)

BAFMD contains images posted on Twitter from around the world during the pandemic, with more images from underrepresented race and age groups to mitigate bias in the face mask detection task.

3 papers | 0 benchmarks | Images

Title2Event

Title2Event is a large-scale sentence-level dataset for benchmarking Open Event Extraction without restricting event types. Title2Event contains more than 42,000 news titles in 34 topics collected from Chinese web pages.

3 papers | 0 benchmarks | Texts

COSIAN (a collection of singing voice annotation)

COSIAN is an annotation collection of Japanese popular (J-POP) songs, focusing on the singing style and expression of famous solo singers.

3 papers | 0 benchmarks | Audio

Conversational Stance Detection

Conversational Stance Detection (CSD) is a dataset annotated with stances and the structures of conversation threads. It consists of 500 conversation threads (500 posts and 5,376 comments) from six major social media platforms in Hong Kong.

3 papers | 0 benchmarks | Texts

DOORS (Dataset fOr bOuldeRs Segmentation)

DOORS is a dataset designed for boulder recognition, centroid regression, segmentation, and navigation applications. The dataset is divided into two sets:

3 papers | 0 benchmarks | 3D, Images

Stanceosaurus

Stanceosaurus is a corpus of 28,033 tweets in English, Hindi, and Arabic annotated with stance towards 251 misinformation claims. The claims in Stanceosaurus originate from 15 fact-checking sources that cover diverse geographical regions and cultures. Unlike existing stance datasets, it introduces a more fine-grained 5-class labeling strategy with additional subcategories to distinguish implicit stance.

3 papers | 0 benchmarks | Texts

Placenta

Placenta is a benchmark dataset for node classification in an underexplored domain: predicting microanatomical tissue structures from cell graphs in placenta histology whole slide images. Cell graphs are large (>1 million nodes per image), node features are varied (64 dimensions of 11 types of cells), class labels are imbalanced (9 classes ranging from 0.21% to 40.0% of the data), and cellular communities cluster into heterogeneously distributed tissues of widely varying sizes (from 11 nodes to 44,671 nodes for a single structure).

3 papers | 1 benchmark | Biomedical, Graphs, Medical

HUI speech corpus (Hof University iisys speech dataset)

The dataset contains several speakers. The five largest are listed individually; the rest are grouped as "other". All audio files have a sampling rate of 44.1 kHz. For each speaker, there is a higher-quality clean variant in addition to the full dataset. Various statistics are also provided. The dataset can also be used for automatic speech recognition (ASR) if the audio files are converted to 16 kHz.
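As a rough illustration of the 44.1 kHz to 16 kHz conversion mentioned above, the following sketch computes the integer up/down factors that a polyphase resampler (for example SciPy's `resample_poly`) would take, along with the resulting clip length. The constant and function names are hypothetical and not part of the HUI corpus tooling; the actual resampling step is left to whichever audio library is in use.

```python
from fractions import Fraction

# Rates taken from the dataset description; names are illustrative only.
SOURCE_RATE_HZ = 44_100
TARGET_RATE_HZ = 16_000

# Reduce 16000/44100 to lowest terms to get integer resampling factors.
ratio = Fraction(TARGET_RATE_HZ, SOURCE_RATE_HZ)
UP, DOWN = ratio.numerator, ratio.denominator


def resampled_length(num_samples: int) -> int:
    """Return the number of samples after resampling, rounded up."""
    return -(-num_samples * UP // DOWN)


if __name__ == "__main__":
    print(f"upsample by {UP}, downsample by {DOWN}")
    print("one second of audio ->", resampled_length(SOURCE_RATE_HZ), "samples")
```

Using exact integer factors avoids accumulating rounding error when many files are converted in a batch.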

3 papers | 3 benchmarks | Audio

Tobacco800

Tobacco800 is a public subset of the complex document image processing (CDIP) test collection constructed by Illinois Institute of Technology, assembled from 42 million pages of documents (in 7 million multi-page TIFF images) released by tobacco companies under the Master Settlement Agreement and originally hosted at UCSF.

3 papers | 0 benchmarks | Images

Criteo Display Advertising Challenge

This dataset contains feature values and click feedback for millions of display ads. Its purpose is to benchmark algorithms for clickthrough rate (CTR) prediction. It was used for the Display Advertising Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge/

3 papers | 0 benchmarks

UFPR-Periocular

The UFPR-Periocular dataset has 16,830 images of both eyes (33,660 cropped images of each eye) from 1,122 subjects (2,244 classes).

3 papers | 0 benchmarks | Images

IRIS Multiple Instance Learning Dataset

This dataset contains the data for the paper 'Using Multiple Instance Learning for Explainable Solar Flare Prediction'.

3 papers | 0 benchmarks | Hyperspectral images, Physics

POPGym (Partially Observable Process Gym)

POPGym is designed to benchmark memory in deep reinforcement learning. It contains a set of environments and a collection of memory model baselines. All of the environments are Partially Observable Markov Decision Process (POMDP) environments that follow the OpenAI Gym interface. Our environments follow a few basic tenets:

3 papers | 0 benchmarks | Environment
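To make the POPGym description more concrete, here is a minimal sketch of a partially observable environment that mimics the classic Gym `reset`/`step` convention (4-tuple return). It is not taken from POPGym: the class `RepeatPreviousEnv` and its parameters are hypothetical, and it does not import Gym at all. The agent is rewarded for outputting the symbol it observed `delay` steps earlier, so the current observation alone is not enough to act well.

```python
import random
from collections import deque


class RepeatPreviousEnv:
    """Toy POMDP (hypothetical, not part of POPGym).

    Each step shows a random symbol. The rewarded action is the symbol
    shown `delay` steps before the current observation, so the agent
    needs memory to score.
    """

    def __init__(self, num_symbols=4, delay=2, episode_length=10, seed=None):
        self.num_symbols = num_symbols
        self.delay = delay
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        # Keeps the current observation plus the `delay` before it.
        self.history = deque(maxlen=delay + 1)
        self.t = 0

    def reset(self):
        """Start a new episode and return the first observation."""
        self.history.clear()
        self.t = 0
        obs = self.rng.randrange(self.num_symbols)
        self.history.append(obs)
        return obs

    def step(self, action):
        """Return (observation, reward, done, info), classic Gym style."""
        if len(self.history) == self.delay + 1:
            # history[0] is the observation from `delay` steps ago.
            reward = 1.0 if action == self.history[0] else 0.0
        else:
            reward = 0.0  # Not enough history yet to define a target.
        self.t += 1
        done = self.t >= self.episode_length
        obs = self.rng.randrange(self.num_symbols)
        self.history.append(obs)
        return obs, reward, done, {}
```

A memoryless policy can only guess in this environment, whereas a policy that stores its last few observations can collect the reward on every step once the history is full.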

SeaTurtleID

SeaTurtleID is a public, large-scale, long-span dataset of sea turtle photographs captured in the wild. The dataset is suitable for benchmarking re-identification methods and evaluating several other computer vision tasks. It consists of 7,774 high-resolution photographs of 400 unique individuals collected over 12 years in 1,081 encounters. Each photograph is accompanied by rich metadata, e.g., identity label, head segmentation mask, and encounter timestamp.

3 papers | 0 benchmarks | Images

ImageNet_CN (Chinese ImageNet Classification)

ImageNet_CN adapts the ImageNet-1K classification dataset for Chinese models by translating the labels and prompts into Chinese.

3 papers | 1 benchmark | Images

Beijing PM2.5

3 papers | 0 benchmarks

CORSMAL

CORSMAL is a dataset for estimating the position and orientation in 3D (or 6D pose) of an object from a single view. The dataset consists of 138,240 images of rendered hands and forearms holding 48 synthetic objects, split into 3 grasp categories over 30 real backgrounds.

3 papers | 0 benchmarks | Images

The QUAERO French Medical Corpus

A vast amount of information in the biomedical domain is available as natural language free text. An increasing number of documents in the field are written in languages other than English. Therefore, it is essential to develop resources, methods, and tools that address Natural Language Processing in the variety of languages used by the biomedical community. In this paper, we report on the development of an extensive corpus of biomedical documents in French annotated at the entity and concept level. Three text genres are covered, comprising a total of 103,056 words. Ten entity categories corresponding to UMLS Semantic Groups were annotated, using automatic pre-annotations validated by trained human annotators. The pre-annotation method was found helpful for entities and achieved above 0.83 precision for all text genres. Overall, a total of 26,409 entity annotations were mapped to 5,797 unique UMLS concepts.

3 papers | 0 benchmarks | Medical, Texts
