Papers With Code 2 | ML Benchmarks, SotA Results & Code

DTGB (Dynamic Text-attributed Graph Benchmark)

We introduce Dynamic Text-attributed Graph Benchmark (DTGB), a collection of large-scale, time-evolving graphs from diverse domains, with nodes and edges enriched by dynamically changing text attributes and categories. To facilitate the use of DTGB, we design standardized evaluation procedures based on four real-world use cases: future link prediction, destination node retrieval, edge classification, and textual relation generation. These tasks require models to understand both dynamic graph structures and natural language, highlighting the unique challenges posed by dynamic text-attributed graphs (DyTAGs).

2 papers | 0 benchmarks | Graphs, Texts, Time series

CrimeBB

Underground hacking forums

2 papers | 0 benchmarks

RuWiki-Good

2 papers | 0 benchmarks

PostNauka

2 papers | 0 benchmarks

News SEO Dataset (Detection and Discovery of Misinformation Sources using Attributed Webgraphs)

Search Engine Optimization (SEO) attributes provide strong signals for predicting news site reliability. We introduce a novel attributed webgraph dataset with labeled news domains and their connections to outlinking and backlinking domains. Finally, we introduce and evaluate a novel graph-based algorithm for discovering previously unknown misinformation news sources.

2 papers | 0 benchmarks | Graphs

EVD4UAV

EVD4UAV is an altitude-sensitive benchmark dataset designed to evade vehicle detection in Unmanned Aerial Vehicle (UAV) imagery. This dataset is specifically curated to facilitate the study of adversarial patch-based vehicle detection attacks in UAV images. The EVD4UAV dataset comprises a diverse set of images captured at various altitudes with fine-grained annotations, making it a robust platform for evaluating the performance of object detectors under adversarial conditions. Notably, the dataset includes around 3,000 images depicting winter scenarios where vehicles may be partially or fully covered by snow, providing a unique challenge for vehicle detection algorithms.

2 papers | 7 benchmarks | Images

MMSD2.0 (Towards a Reliable Multi-modal Sarcasm Detection System)

Multi-modal sarcasm detection has attracted much recent attention. Nevertheless, the existing benchmark (MMSD) has some shortcomings that hinder the development of reliable multi-modal sarcasm detection systems: (1) there are some spurious cues in MMSD, which lead models to learn biases; (2) the negative samples in MMSD are not always reasonable. To solve these issues, we introduce MMSD2.0, a corrected dataset that fixes the shortcomings of MMSD by removing the spurious cues and re-annotating the unreasonable samples. Meanwhile, we present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives (i.e., text, image, and text-image interaction view) for multi-modal sarcasm detection. Extensive experiments show that MMSD2.0 is a valuable benchmark for building reliable multi-modal sarcasm detection systems and that multi-view CLIP can significantly outperform the previous best baselines (with a 5.6% improvement).

2 papers | 0 benchmarks | Images, Texts

ZIQI-Eval

2 papers | 0 benchmarks

Radar Dataset (DIAT-μRadHAR: Radar micro-Doppler Signature dataset for Human Suspicious Activity Recognition)

2 papers | 4 benchmarks

QuRe

Generalized quantifiers (e.g., few, most) indicate the proportion of cases in which a predicate is satisfied. QuRe is a quantifier reasoning dataset from the paper "Pragmatic Reasoning Unlocks Quantifier Semantics for Foundation Models". It includes real-world sentences from Wikipedia and human annotations of generalized quantifiers from English speakers.

2 papers | 0 benchmarks | Texts

PCB-Bank

2 papers | 0 benchmarks

E.T. the Exceptional Trajectories

2 papers | 6 benchmarks | 3D, 3D meshes, Texts, Videos

DSEval-Exercise

In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (i.e., DSEAL) has been proposed, and the derived four benchmarks have significantly improved the benchmark scalability and coverage, with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reve

2 papers | 0 benchmarks

DSEval-SO

In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (i.e., DSEAL) has been proposed, and the derived four benchmarks have significantly improved the benchmark scalability and coverage, with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reve

2 papers | 0 benchmarks

iFashion Alibaba (iFashion Alibaba: Personalized outfits)

1. 1.01 million outfits, 583K fashion items, with context information.
2. 0.28 billion user click actions from 3.57 million users.

2 papers | 0 benchmarks

City Street

City Street: We collected a multi-view video dataset of a busy city street using 5 synchronized cameras. The videos are about 1 hour long with 2.7k (2704×1520) resolution at 30 fps. We select Cameras 1, 3 and 4 for the experiment (see Fig. 6 bottom). The cameras’ intrinsic and extrinsic parameters are estimated using the calibration algorithm from [52]. 500 multi-view images are uniformly sampled from the videos; the first 300 are used for training and the remaining 200 for testing. The ground-truth 2D and 3D annotations are obtained as follows. The head positions in the first camera view are annotated manually, and then projected to the other views and adjusted manually. Next, for the second camera view, new people (not seen in the first view) are also annotated and then projected to the other views. This process is repeated until all people in the scene are annotated and associated across all camera views. Our dataset has larger crowd numbers (70-150), compared with PETS (20-40) and Duk

2 papers | 0 benchmarks
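
The sampling and split described for City Street can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper name `uniform_sample_indices`, the exact frame count (assuming exactly one hour at 30 fps), and the choice of evenly spaced frame indices as the meaning of "uniformly sampled" are all assumptions.

```python
def uniform_sample_indices(total_frames: int, num_samples: int) -> list[int]:
    """Return num_samples frame indices spaced evenly across [0, total_frames)."""
    step = total_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


# Assumption: exactly 1 hour of video at 30 fps (the description says "about 1 hour").
TOTAL_FRAMES = 60 * 60 * 30

# 500 sampled time steps; each step corresponds to one image per selected camera.
frames = uniform_sample_indices(TOTAL_FRAMES, 500)

# Chronological split as described: first 300 for training, remaining 200 for testing.
train_frames = frames[:300]
test_frames = frames[300:]
```

Because the split is chronological rather than random, the test frames all come from the later part of the recording under these assumptions.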

WDC-PAVE (Web Data Commons - Product Attribute Value Extraction)

The dataset contains 1,420 human-annotated product offers, systematically selected from the Web Data Commons Product Matching Corpus, featuring 24,582 annotated attribute-value pairs, making it a valuable resource for both product attribute-value extraction and product matching tasks. The normalized gold standard contains the standardized attribute-value pairs as described below.

2 papers | 2 benchmarks | Texts

EconLogicQA

EconLogicQA is a benchmark designed to test the sequential reasoning skills of large language models (LLMs) in economics, business, and supply chain management. It diverges from typical benchmarks by requiring models to understand and sequence multiple interconnected events, capturing complex economic logic. The benchmark includes multi-event scenarios and a thorough suite of evaluations to assess proficiency in economic contexts.

2 papers | 1 benchmark

HOI-Synth (HOI-Synth benchmark)

The HOI-Synth benchmark extends three egocentric datasets designed to study hand-object interaction detection (EPIC-KITCHENS VISOR, EgoHOS, and ENIGMA-51) with automatically labeled synthetic data obtained through a novel HOI generation pipeline.

2 papers | 0 benchmarks | Images, RGB-D

MMVR (Millimeter-wave Multi-View Radar (MMVR) Dataset)

2 papers | 0 benchmarks | Environment
