19,997 machine learning datasets
From PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects (Section 5.1, Dataset). Synthetic dataset: The synthetic 3D models we use for evaluation are from the PartNet-Mobility dataset [49, 27, 4], a large-scale dataset of articulated objects across 46 categories. We select instances from 10 categories for our experiments. For each articulation state, we randomly sample 64-100 views covering the upper hemisphere of the object to simulate real-world capture. We then render RGB images and acquire camera parameters and object masks with Blender [6] to create our training data. Real-world dataset: The real data we use for experiments comes from the MultiScan dataset [25], which scans real-world indoor scenes containing articulated objects in multiple states. We use the reconstructed mesh of an object in two states as ground truth for evaluation and the real RGB frames as training data.
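The view-sampling step described above lends itself to a short illustration. Below is a minimal sketch of sampling camera positions uniformly over the upper hemisphere around an object, as is done for each articulation state; the radius, the uniform-area sampling scheme, and the function name are illustrative assumptions, not the authors' actual Blender pipeline.

```python
import numpy as np

def sample_upper_hemisphere_views(n_views, radius=2.0, seed=0):
    """Sample camera centers uniformly on the upper hemisphere around the origin.

    A minimal illustration of the view-sampling step described above;
    the radius and the sampling scheme are assumptions for illustration.
    """
    rng = np.random.default_rng(seed)
    # Uniform azimuth in [0, 2*pi); sampling cos(elevation) ~ U(0, 1) gives
    # uniform coverage by area over the upper hemisphere (z >= 0).
    azimuth = rng.uniform(0.0, 2.0 * np.pi, size=n_views)
    cos_elev = rng.uniform(0.0, 1.0, size=n_views)
    sin_elev = np.sqrt(1.0 - cos_elev ** 2)
    x = radius * sin_elev * np.cos(azimuth)
    y = radius * sin_elev * np.sin(azimuth)
    z = radius * cos_elev
    return np.stack([x, y, z], axis=-1)  # (n_views, 3) camera positions

# e.g. 64-100 views per articulation state, as in the description above
views = sample_upper_hemisphere_views(np.random.randint(64, 101))
print(views.shape)
```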
We construct USIS10K, the first large-scale dataset for the underwater salient instance segmentation task, containing 10,632 images with pixel-level annotations across 7 categories. To the best of our knowledge, this is the largest salient instance segmentation dataset to date, and it provides both Class-Agnostic and Multi-Class labels.
This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems which cover a wide range of languages.
SynthPAI was created to provide a dataset for investigating the personal attribute inference (PAI) capabilities of LLMs on online texts. Due to the privacy concerns associated with real-world data, open datasets are rare, if not non-existent, in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.
MSRVTT-CTN Dataset: This dataset contains CTN annotations for the MSRVTT-CTN benchmark dataset in JSON format. It has three files for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.
MSVD-CTN Dataset: This dataset contains CTN annotations for the MSVD-CTN benchmark dataset in JSON format. It has three files for the train, test, and validation splits. For project details, visit https://narrativebridge.github.io/.
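Both CTN datasets ship as one JSON file per split, so a minimal loading sketch may be useful. The file names and the layout of each split file are assumptions for illustration; consult the project page above for the actual schema.

```python
import json
from pathlib import Path

def load_ctn_split(root, split):
    """Load one CTN annotation split from its JSON file.

    The file naming scheme (e.g. "train.json") and the dictionary layout
    are assumptions; check the project page for the real schema.
    """
    path = Path(root) / f"{split}.json"
    with path.open() as f:
        return json.load(f)

# Hypothetical usage: three files for the train, validation, and test splits.
# annotations = {s: load_ctn_split("MSVD-CTN", s) for s in ("train", "val", "test")}
```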
A Large Vision-Language Model Knowledge Editing Benchmark
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, we present the BIOSCAN-5M Insect dataset to the machine learning community. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, geographical information, and specimen size.
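To make the multi-modal structure concrete, below is a hedged sketch of what a single specimen record might look like; the field names and types are illustrative assumptions derived from the modalities listed above, not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BioscanSpecimen:
    """Illustrative per-specimen record mirroring the modalities listed above.

    Field names are assumptions, not the dataset's actual column names.
    """
    image_path: str                 # specimen image
    taxonomy: dict                  # taxonomic labels, e.g. {"order": ..., "family": ...}
    dna_barcode: str                # raw nucleotide barcode sequence
    barcode_index_number: str       # assigned barcode index number (BIN)
    country: Optional[str] = None   # geographical information
    specimen_size_mm: Optional[float] = None  # specimen size
```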
CAsT-snippets is a high-quality dataset for conversational information seeking containing snippet-level annotations for all queries in the TREC CAsT 2020 and 2022 datasets. It enables the development of answer generation methods that are grounded in relevant snippets within paragraphs and supports automatic evaluation of the generated answers in terms of completeness; a training/test split is provided for this purpose.
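Since the snippet-level annotations are intended to support automatic completeness evaluation, a minimal sketch of such a check is shown below. The recall-style scoring over annotated snippets is an assumption about how one might use the data, not the dataset's official metric.

```python
def snippet_completeness(answer: str, gold_snippets: list[str]) -> float:
    """Fraction of annotated gold snippets whose content appears in a generated answer.

    A deliberately simple, illustrative recall-style measure; the dataset's
    official evaluation protocol may differ.
    """
    if not gold_snippets:
        return 0.0
    answer_tokens = set(answer.lower().split())
    covered = 0
    for snippet in gold_snippets:
        snippet_tokens = set(snippet.lower().split())
        # Count a snippet as covered if most of its tokens occur in the answer.
        overlap = len(snippet_tokens & answer_tokens) / max(len(snippet_tokens), 1)
        if overlap >= 0.5:
            covered += 1
    return covered / len(gold_snippets)
```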
In this paper, we introduce a novel benchmarking framework designed specifically for evaluating data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents, covering aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process in which LLMs themselves generate and annotate the benchmarks with a human in the loop. A novel language, DSEAL, is proposed, and the four derived benchmarks significantly improve benchmark scalability and coverage while greatly reducing human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects.
SecQA is a specialized dataset created for evaluating Large Language Models (LLMs) in the domain of computer security. It consists of multiple-choice questions, generated with GPT-4 from the Computer Systems Security: Planning for Success textbook, aimed at assessing LLMs' understanding and application of computer security knowledge.
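A common way to use a multiple-choice dataset like SecQA is a simple accuracy loop over the questions. The sketch below assumes a generic `ask_model` callable and a record layout with a question, answer choices, and the correct letter; none of these names are specified by the dataset description above and are purely illustrative.

```python
def evaluate_multiple_choice(questions, ask_model):
    """Score a model on multiple-choice questions by exact match on the answer letter.

    `questions` is assumed to be a list of dicts with "question", "choices",
    and "answer" (a letter such as "A"); `ask_model` is any callable that
    returns the model's chosen letter. Both are illustrative assumptions.
    """
    correct = 0
    for q in questions:
        prompt = q["question"] + "\n" + "\n".join(
            f"{letter}. {choice}" for letter, choice in zip("ABCD", q["choices"])
        )
        prediction = ask_model(prompt).strip().upper()[:1]
        correct += int(prediction == q["answer"].strip().upper())
    return correct / len(questions)
```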
The LIAR dataset has been widely adopted by fake news detection researchers since its release, and alongside this research the community has provided a variety of feedback for improving it. We incorporated this feedback and released LIAR2, a new benchmark dataset of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We use an 8:1:1 split ratio for the training, test, and validation sets; details are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed on Hugging Face and GitHub, and statistical information for LIAR and LIAR2 is provided in the accompanying table.
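The 8:1:1 ratio mentioned above is straightforward to reproduce on any labeled collection; the sketch below shows one way to shuffle and slice a list of examples into train/test/validation portions. The shuffle seed and the boundary handling are assumptions for illustration, not the exact procedure used to build LIAR2.

```python
import random

def split_8_1_1(examples, seed=42):
    """Split a list of examples into train/test/validation sets with an 8:1:1 ratio.

    The shuffle seed and the exact boundary handling are illustrative
    assumptions, not the procedure used to construct LIAR2 itself.
    """
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    n_train = int(0.8 * n)
    n_test = int(0.1 * n)
    train = examples[:n_train]
    test = examples[n_train:n_train + n_test]
    valid = examples[n_train + n_test:]
    return train, test, valid
```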
Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Scene text recognition in motion is a challenging problem, since textual cues typically appear for only a short time span and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on RoadTextVQA, highlighting the significant potential for improvement in this domain and the usefulness of the dataset.
MUSES offers 2500 multi-modal scenes, evenly distributed across various combinations of weather conditions (clear, fog, rain, and snow) and types of illumination (daytime, nighttime). Each image includes high-quality 2D pixel-level panoptic annotations and class-level and novel instance-level uncertainty annotations. Further, each adverse-condition image has a corresponding image of the same scene taken under clear-weather, daytime conditions. The annotation process for MUSES utilizes all available sensor data, allowing the annotators to also reliably label degraded image regions that are still discernible in other modalities. This results in better pixel coverage in the annotations and creates a more challenging evaluation setup.
LayoutBench-COCO is a diagnostic benchmark that examines layout-guided image generation models on arbitrary, unseen layouts. Unlike LayoutBench, LayoutBench-COCO consists of OOD layouts of real objects and supports zero-shot evaluation. LayoutBench-COCO measures 4 skills (Number, Position, Size, Combination), with objects drawn from MS COCO. The new "combination" split consists of layouts with two objects in different spatial relations, and the remaining three splits are similar to those of LayoutBench. Download dataset at: https://huggingface.co/datasets/j-min/layoutbench-coco
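Since the dataset is hosted on the Hugging Face Hub, it can presumably be pulled with the `datasets` library as sketched below. The repository id comes from the URL above, but the available configurations and split names are assumptions, so check the dataset card for the exact identifiers.

```python
from datasets import load_dataset

# Hypothetical usage: the repository id is taken from the URL above; the
# available configurations and split names are assumptions (see the dataset card).
layoutbench_coco = load_dataset("j-min/layoutbench-coco")
print(layoutbench_coco)
```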
MENSA: Movie Scene Saliency Dataset. Dataset Summary: MENSA (Movie Scene Saliency Dataset) is from the paper "Select and Summarize: Scene Saliency for Movie Script Summarization" and consists of movie scripts and their corresponding summaries. Each scene in a movie script is annotated with a scene saliency label. The training set contains silver labels, which are automatically generated, while the validation and test sets contain human-annotated gold labels.