19,997 machine learning datasets
We introduce the Dynamic Text-attributed Graph Benchmark (DTGB), a collection of large-scale, time-evolving graphs from diverse domains, with nodes and edges enriched by dynamically changing text attributes and categories. To facilitate the use of DTGB, we design standardized evaluation procedures based on four real-world use cases: future link prediction, destination node retrieval, edge classification, and textual relation generation. These tasks require models to understand both dynamic graph structures and natural language, highlighting the unique challenges posed by dynamic text-attributed graphs (DyTAGs).
Underground hacking forums
Search Engine Optimization (SEO) attributes provide strong signals for predicting news site reliability. We introduce a novel attributed webgraph dataset with labeled news domains and their connections to outlinking and backlinking domains. Finally, we introduce and evaluate a novel graph-based algorithm for discovering previously unknown misinformation news sources.
EVD4UAV is an altitude-sensitive benchmark dataset for evading vehicle detection in Unmanned Aerial Vehicle (UAV) imagery. It is specifically curated to facilitate the study of adversarial patch-based attacks on vehicle detection in UAV images. The EVD4UAV dataset comprises a diverse set of images captured at various altitudes with fine-grained annotations, making it a robust platform for evaluating the performance of object detectors under adversarial conditions. Notably, the dataset includes around 3,000 images depicting winter scenarios where vehicles may be partially or fully covered by snow, providing a unique challenge for vehicle detection algorithms.
Multi-modal sarcasm detection has attracted much recent attention. Nevertheless, the existing benchmark (MMSD) has some shortcomings that hinder the development of reliable multi-modal sarcasm detection systems: (1) there are some spurious cues in MMSD, which bias model learning; (2) the negative samples in MMSD are not always reasonable. To solve the aforementioned issues, we introduce MMSD2.0, a corrected dataset that fixes the shortcomings of MMSD by removing the spurious cues and re-annotating the unreasonable samples. Meanwhile, we present a novel framework called multi-view CLIP that is capable of leveraging multi-grained cues from multiple perspectives (i.e., text, image, and text-image interaction views) for multi-modal sarcasm detection. Extensive experiments show that MMSD2.0 is a valuable benchmark for building reliable multi-modal sarcasm detection systems and that multi-view CLIP can significantly outperform the previous best baselines (with a 5.6% improvement).
Generalized quantifiers (e.g., few, most) are used to indicate the proportions to which predicates are satisfied. QuRe is a quantifier reasoning dataset from the paper "Pragmatic Reasoning Unlocks Quantifier Semantics for Foundation Models". It includes real-world sentences from Wikipedia and human annotations of generalized quantifiers from English speakers.
In this paper, we introduce a novel benchmarking framework designed specifically for evaluations of data science agents. Our contributions are three-fold. First, we propose DSEval, an evaluation paradigm that enlarges the evaluation scope to the full lifecycle of LLM-based data science agents. We also cover aspects including but not limited to the quality of the derived analytical solutions or machine learning models, as well as potential side effects such as unintentional changes to the original data. Second, we incorporate a novel bootstrapped annotation process that lets LLMs themselves generate and annotate the benchmarks with a "human in the loop". A novel language (i.e., DSEAL) has been proposed, and the four derived benchmarks have significantly improved benchmark scalability and coverage, with largely reduced human labor. Third, based on DSEval and the four benchmarks, we conduct a comprehensive evaluation of various data science agents from different aspects. Our findings reve
1. 1.01 million outfits, 583K fashion items, with context information. 2. 0.28 billion user click actions from 3.57 million users.
City Street: We collected a multi-view video dataset of a busy city street using 5 synchronized cameras. The videos are about 1 hour long with 2.7k (2704×1520) resolution at 30 fps. We select Cameras 1, 3 and 4 for the experiment (see Fig. 6 bottom). The cameras' intrinsic and extrinsic parameters are estimated using the calibration algorithm from [52]. 500 multi-view images are uniformly sampled from the videos; the first 300 are used for training and the remaining 200 for testing. The ground-truth 2D and 3D annotations are obtained as follows. The head positions in the first camera view are annotated manually, and then projected to the other views and adjusted manually. Next, for the second camera view, new people (not seen in the first view) are also annotated and then projected to the other views. This process is repeated until all people in the scene are annotated and associated across all camera views. Our dataset has larger crowd numbers (70-150), compared with PETS (20-40) and Duk
The dataset contains 1,420 human-annotated product offers, systematically selected from the Web Data Commons Product Matching Corpus, featuring 24,582 annotated attribute-value pairs, making it a valuable resource for both product attribute-value extraction and product matching tasks. The normalized gold standard contains the standardized attribute-value pairs as described below.
EconLogicQA is a benchmark designed to test the sequential reasoning skills of large language models (LLMs) in economics, business, and supply chain management. It diverges from typical benchmarks by requiring models to understand and sequence multiple interconnected events, capturing complex economic logics. The benchmark includes multi-event scenarios and a thorough suite of evaluations to assess proficiency in economic contexts.
The HOI-Synth benchmark extends three egocentric datasets designed to study hand-object interaction detection, EPIC-KITCHENS VISOR, EgoHOS, and ENIGMA-51, with automatically labeled synthetic data obtained through a novel HOI generation pipeline.