Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

19,997 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

19,997 dataset results

ToT (Test of Time)

ToT is a benchmark for evaluating LLMs on temporal reasoning.

5 papers · 0 benchmarks

InfiniBench (InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding)

We introduce InfiniBench, a comprehensive benchmark for very long video understanding, which presents: 1) the longest video duration, averaging 76.34 minutes; 2) the largest number of question-answer pairs, 108.2K; 3) diverse questions that examine nine different skills and include both multiple-choice and open-ended questions; 4) a human-centric design, as the video sources come from movies and daily TV shows, with human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing large multimodal models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and open-source models. The evaluation reveals significant challenges in our benchmark: even the best models, such as Gemini, struggle, reaching only 42.72% average accuracy and an average score of 2.71 out of 5.

5 papers · 0 benchmarks

HARPER (Exploring 3D Human Pose Estimation and Forecasting from the Robot’s Perspective: The HARPER Dataset)

We introduce HARPER, a novel dataset for 3D body pose estimation and forecasting in dyadic interactions between users and Spot, the quadruped robot manufactured by Boston Dynamics. The key novelty is the focus on the robot's perspective, i.e., on the data captured by the robot's sensors. These data make 3D body pose analysis challenging because a sensor close to the ground captures humans only partially. The scenario underlying HARPER includes 15 actions, of which 10 involve physical contact between the robot and users. The corpus contains not only the recordings of Spot's built-in stereo cameras but also those of a 6-camera OptiTrack system (all recordings are synchronized), yielding ground-truth skeletal representations with sub-millimeter precision. In addition, the corpus includes reproducible benchmarks on 3D human pose estimation, human pose forecasting, and collision prediction, all based on publicly available baseline approaches. This enables future HARPER users to …

5 papers · 18 benchmarks · 3D, Images, RGB-D, Videos

Gen-Video

The first dataset for AI-generated video detection.

5 papers · 0 benchmarks · Videos

S.MID (SeMantic InDustry)

SeMantic InDustry (S.MID) is a dataset designed to advance LiDAR semantic segmentation, specifically for robotic applications and large-scale industrial scenes. The dataset is based on a hybrid solid-state LiDAR (Livox Mid-360). To create S.MID, researchers used an industrial robot to collect a total of 38,904 frames of LiDAR data at 10 Hz across various substations. The point clouds are annotated into 25 categories under professional guidance (14 categories for the single-frame segmentation task).

5 papers · 1 benchmark · LiDAR

OccScanNet


5 papers · 0 benchmarks

M3GIA


5 papers · 0 benchmarks · Images, Texts

OAG-L1-Field

A popular dataset for node classification on heterogeneous graphs.

5 papers · 2 benchmarks

MolOpt (Molecular Optimization)

Open-source benchmark for Practical Molecular Optimization (PMO), designed to facilitate the transparent and reproducible evaluation of algorithmic advances in molecular optimization. The benchmark supports 25 molecular design algorithms on 23 tasks, with a particular focus on sample efficiency (the number of oracle calls).

5 papers · 0 benchmarks
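The core idea behind oracle-call sample efficiency can be sketched with a small wrapper that counts scoring-function evaluations under a fixed budget. This is a minimal illustration, not MolOpt's actual API: the `BudgetedOracle` class, the `logp_like` toy scorer, and the budget value are all assumptions for demonstration.

```python
class BudgetedOracle:
    """Wraps a molecular scoring function, counting every call and
    refusing further evaluations once a fixed budget is exhausted."""

    def __init__(self, score_fn, budget):
        self.score_fn = score_fn
        self.budget = budget
        self.calls = 0

    def __call__(self, molecule):
        if self.calls >= self.budget:
            raise RuntimeError("oracle budget exhausted")
        self.calls += 1
        return self.score_fn(molecule)


def logp_like(molecule):
    # Toy stand-in scorer: count carbon atoms in a SMILES-like string.
    # This is NOT a real logP estimate; it only makes the example runnable.
    return molecule.count("C")


oracle = BudgetedOracle(logp_like, budget=3)
scores = [oracle("CCO"), oracle("CCCC"), oracle("C")]
# After three evaluations the budget is spent; a fourth call raises.
```

An optimization algorithm evaluated this way is compared by the quality it reaches per oracle call, which is the sample-efficiency framing the PMO description refers to.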

Guacamol

Benchmark for de novo molecular design

5 papers · 0 benchmarks · Biomedical

RadGenome-ChestCT


5 papers · 0 benchmarks

Blizzard Challenge 2013 (Blizzard Challenge 2013 - English language tasks)

The English data for voice building was obtained, prepared, and provided to the challenge by Lessac Technologies Inc., having originally come from the publisher Voice Factory International Inc. It comprises speech from one female professional narrator and actress, Catherine ‘Bobbie’ Byers, reading the text of a collection of classic novels. These had been divided by the publishers of the original audiobooks into a number of genres, such as “Classic Novels”, “Women’s Classics”, “Young Readers”, and so on.

5 papers · 3 benchmarks · Audio

MSD (Mirror Segmentation Dataset)

We construct the first large-scale mirror dataset, named MSD. It includes 4,018 pairs of images containing mirrors and their corresponding manually annotated masks.

5 papers · 6 benchmarks

KITTI360pose

The KITTI360Pose dataset covers a total area of 15.51 square kilometers across nine urban regions and consists of 43,381 point cloud–text pairs.

5 papers · 1 benchmark

IAM (line-level) (Line-level Handwritten Text Recognition on IAM)

The IAM database contains 13,353 images of handwritten lines of text created by 657 writers. The transcribed texts come from the Lancaster-Oslo/Bergen Corpus of British English. In total, the 657 writers contributed 1,539 handwritten pages comprising 115,320 words, and the database is categorized as part of the modern collection. It is labeled at the sentence, line, and word levels.

5 papers · 4 benchmarks · Images, Texts

Organizational Graph

A small RDF Knowledge Graph using FOAF and VCard.

5 papers · 0 benchmarks

Loong

We propose a novel long-context benchmark, 🐉 Loong, aligned with realistic scenarios through extended multi-document question answering (QA). Loong contains 11 documents per test instance on average, spanning three real-world scenarios in English and Chinese: (1) Financial Reports, (2) Legal Cases, and (3) Academic Papers. Loong also introduces new evaluation tasks from the perspectives of Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Furthermore, Loong features inputs of varying lengths (e.g., 10K-50K, 50K-100K, 100K-200K, beyond 200K) and evaluation tasks of diverse difficulty, enabling fine-grained assessment of LLMs across different context lengths and task complexities.

5 papers · 0 benchmarks · Texts
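The length-bucketed evaluation described above can be sketched as a simple bucketing function. This is an illustrative assumption, not Loong's actual tooling: the bucket edges mirror the ranges quoted in the description, and the whitespace token counter is a crude stand-in for a real tokenizer.

```python
def length_bucket(text):
    """Map an input to a coarse context-length bucket by token count.
    Whitespace splitting is a toy tokenizer used only for illustration."""
    n = len(text.split())
    if n < 50_000:
        return "10K-50K"
    if n < 100_000:
        return "50K-100K"
    if n < 200_000:
        return "100K-200K"
    return ">200K"


# Group hypothetical test instances by bucket before scoring each bucket
# separately, which is what makes fine-grained length-wise assessment possible.
instances = ["w " * 30_000, "w " * 120_000]
buckets = [length_bucket(t) for t in instances]
```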

Nightrain

A synthetically generated database of night-time videos degraded by weather.

5 papers · 1 benchmark · Videos

Beam-Splitter Deblurring (BSD) (3ms-24ms)

Using the proposed beam-splitter acquisition system, we have collected a new real-world video deblurring dataset (BSD).

5 papers · 4 benchmarks

AutoHallusion

Large vision-language models (LVLMs) are prone to hallucinations, where certain contextual cues in an image can trigger the language module to produce overconfident and incorrect reasoning about abnormal or hypothetical objects. While some benchmarks have been developed to investigate LVLM hallucinations, they often rely on hand-crafted corner cases whose failure patterns may not generalize well, and fine-tuning on these examples could undermine their validity. To address this, we aim to scale up the number of cases through an automated approach, reducing human bias in crafting such corner cases. This motivates the development of AutoHallusion, the first automated benchmark-generation approach, which employs several key strategies to create a diverse range of hallucination examples. Our generated visual-question pairs pose significant challenges to LVLMs, requiring them to overcome contextual biases and distractions to arrive at correct answers. AutoHallusion enables us to …

5 papers · 1 benchmark
Page 228 of 1000