Datasets

19,997 machine learning datasets

XorQA-IN

Given a question in an Indic language and a passage in English, generate a short answer span. We provide both an English and target language answer span in the annotations.

2 papers | 0 benchmarks | Texts

BlendMimic3D (A Synthetic Dataset for Human Pose Estimation)

BlendMimic3D is a pioneering synthetic dataset developed using Blender, designed to enhance Human Pose Estimation (HPE) research. This dataset features diverse scenarios including self-occlusions, object-based occlusions, and out-of-frame occlusions, tailored for the development and testing of advanced HPE models.

2 papers | 0 benchmarks | Images

destruction (destruction detection dataset)

This dataset contains pre- and post-destruction images, along with segmentation labels for the test images.

2 papers | 0 benchmarks | Images

bcTCGA (The Cancer Genome Atlas Program)

This data set comes from breast cancer tissue samples deposited to The Cancer Genome Atlas (TCGA) project. TCGA contains data on tumour samples that were assayed on several platforms; this data set compiles results obtained using Agilent mRNA expression microarrays.

2 papers | 0 benchmarks | Tabular

news20 (NewsWeeder: learning to filter netnews)

Two datasets featuring binary and multi-class classification. The datasets were first introduced by K. Lang [1]. They can, for instance, be accessed at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.

2 papers | 0 benchmarks | Tabular

e2006 (10-K Corpus)

From the official description:

2 papers | 0 benchmarks | Tabular

Sakuga-42M

Sakuga-42M is a large-scale hand-drawn cartoon video dataset for academic research purposes. It comprises 42 million cartoon keyframes covering various artistic styles, regions, and years, with comprehensive semantic annotations including video-text description pairs, anime tags, and content taxonomies. The dataset is intended to support researchers in their exploration of more effective and practical solutions for creating cartoons.

2 papers | 0 benchmarks | Videos

Leipzig Corpora

The Leipzig Corpora Collection presents corpora in different languages using the same format and comparable sources. All data are available as plain text files and can be imported into a MySQL database by using the provided import script. They are intended both for scientific use by corpus linguists and for applications such as knowledge extraction programs. The corpora are identical in format and similar in size and content. They contain randomly selected sentences in the language of the corpus and are available in sizes from 10,000 sentences up to 1 million sentences. The sources are either newspaper texts or texts randomly collected from the web. The texts are split into sentences. Non-sentences and foreign language material were removed. Because word co-occurrence information is useful for many applications, these data are precomputed and included as well. For each word, the most significant words appearing as immediate left or right neighbor or appearing anywhere within the

2 papers | 0 benchmarks | Texts

PMOA-CITE

The dataset used in the experiments in the paper "Modeling citation worthiness by using attention‑based bidirectional long short‑term memory networks and interpretable models".

2 papers | 0 benchmarks | Texts

OpenStreetView-5M

OpenStreetView-5M establishes a new open benchmark for geolocation by providing a large, open, and clean dataset. As detailed below, OpenStreetView-5M improves upon several limitations of current geolocation datasets.

2 papers | 2 benchmarks

NSD (Natural Scenes Dataset)

The Natural Scenes Dataset (NSD) is a large-scale fMRI dataset collected at ultra-high-field (7T) strength at the Center for Magnetic Resonance Research (CMRR) at the University of Minnesota. The dataset consists of whole-brain, high-resolution (1.8-mm isotropic, 1.6-s sampling rate) fMRI measurements of 8 healthy adult subjects while they viewed thousands of color natural scenes over the course of 30–40 scan sessions. While viewing these images, subjects were engaged in a continuous recognition task in which they reported whether they had seen each given image at any point in the experiment. These data constitute a massive benchmark dataset for computational models of visual representation and cognition, and can support a wide range of scientific inquiry.

2 papers | 0 benchmarks

BVI-LOWLIGHT (BVI-LOWLIGHT: FULLY REGISTERED DATASETS FOR LOW-LIGHT IMAGE AND VIDEO ENHANCEMENT)

Low-light images and video footage often exhibit issues due to the interplay of various parameters such as aperture, shutter speed, and ISO settings. These interactions can lead to distortions, especially in extreme lighting conditions. This distortion is primarily caused by the inverse relationship between decreasing light intensity and increasing photon noise, which gets amplified with higher sensor gain. Additionally, secondary characteristics like white balance and color effects can also be adversely affected and may require post-processing correction. These distortions not only impact the perceived quality of the images but also pose significant challenges for machine learning tasks, including classification and object detection. This is particularly evident when considering the susceptibility of deep learning networks to adversarial examples.

2 papers | 0 benchmarks

arXivCS

Source

2 papers | 0 benchmarks | Graphs

PARKS-Gaze

Appearance-based gaze estimation systems have shown great progress recently, yet the performance of these techniques depends on the datasets used for training. Most of the existing gaze estimation datasets set up in interactive settings were recorded in laboratory conditions, and those recorded in in-the-wild conditions display limited head pose and illumination variations. Further, we observed little attention so far towards precision evaluations of existing gaze estimation approaches. In this work, we present a large gaze estimation dataset, PARKS-Gaze, with wider head pose and illumination variation and with multiple samples for a single Point of Gaze (PoG). The dataset contains 974 minutes of data from 28 participants with a head pose range of ±60° in both yaw and pitch directions. Our within-dataset and cross-dataset evaluations and precision evaluations indicate that the proposed dataset is more challenging and enables models to generalize on unseen participants better than the existing

2 papers | 0 benchmarks | Images

PanCancer Multimodal (HoneyBee)

Dataset Card for The Cancer Genome Atlas (TCGA) Multimodal Dataset

2 papers | 0 benchmarks | Images, Medical, Tabular, Texts

DARK FACE (DARK FACE: Face Detection in Low Light Condition)

The DARK FACE dataset provides 6,000 real-world low-light images captured at nighttime at teaching buildings, streets, bridges, overpasses, parks, etc., all labeled with bounding boxes of human faces, as the main training and/or validation sets. We also provide 9,000 unlabeled low-light images collected from the same setting. Additionally, we provide a unique set of 789 paired low-light/normal-light images captured in controllable real lighting conditions (but not necessarily containing faces), which can be used as part of the training data at the participants' discretion. There will be a hold-out testing set of 4,000 low-light images, with human face bounding boxes annotated.

2 papers | 0 benchmarks | Images

MIMIC-ED-Assist

To support the machine learning (ML) community in developing a time-cost-effective diagnostic assistant, we collaborate with ED clinicians to curate a benchmark, called MIMIC-ED-Assist, that is derived from MIMIC-IV and MIMIC-ED. MIMIC-ED-Assist is designed to test the ability of AI systems to provide both accurate and time-cost-saving laboratory recommendations. Our benchmark consists of two prediction targets identified by our clinical collaborators to reflect patient risk: critical outcomes, which include patient death and ICU transfer, and lengthened ED stay, defined as ED LOS exceeding 24 hours. Accurately identifying patients at high risk of these outcomes reduces time-cost by allowing clinicians to perform timely interventions and efficiently allocate resources. MIMIC-ED-Assist mirrors real-world ED practices by grouping individual laboratory tests into commonly performed groups, e.g., complete blood count (CBC). MIMIC-ED-Assist then tests AI systems on their ability to recomme

2 papers | 0 benchmarks

CPsyCounE

A general multi-turn dialogue evaluation dataset covering nine topics. Each topic has five representative cases, resulting in a comprehensive evaluation dataset of 45 cases.

2 papers | 0 benchmarks

Surgical Scene Graph Generation

The training subset consists of 15 robotic nephrectomy procedures captured on the da Vinci X or Xi system. There are 149 frames per video sequence, and the dimension of each frame is 1280x1024. Segmentation annotations are provided with 10 different classes, including instruments, kidneys, and other objects in the surgical scenario. The main differences from the 2017 instrument segmentation dataset are the annotation of kidney parenchyma, surgical objects such as suturing needles, suturing thread clips, and additional instruments. We annotated the graphical representation of the interaction between the surgical instruments and the defective tissue in the surgical scene with the help of our clinical expertise with the da Vinci Xi robotic system. We also delineate the bounding box to identify all the surgical objects. Kidney and instruments are represented as nodes and active edges annotated as the interaction class in the graph. In total, 12 kinds of interactions were identified to generat

2 papers | 0 benchmarks

Im4Sketch

Im4Sketch is a large-scale dataset with a shape-oriented set of classes for image-to-sketch generalization. It consists of a collection of natural images from 874 categories for training and validation, and sketches from 393 categories (a subset of the natural image categories) for testing.

2 papers | 2 benchmarks | Images
