Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images (3,275)
  • Texts (3,148)
  • Videos (1,019)
  • Audio (486)
  • Medical (395)
  • 3D (383)
  • Time series (298)
  • Graphs (285)
  • Tabular (271)
  • Speech (199)
  • RGB-D (192)
  • Environment (148)
  • Point cloud (135)
  • Biomedical (123)
  • LiDAR (95)
  • RGB Video (87)
  • Tracking (78)
  • Biology (71)
  • Actions (68)
  • 3D meshes (65)
  • Tables (52)
  • Music (48)
  • EEG (45)
  • Hyperspectral images (45)
  • Stereo (44)
  • MRI (39)
  • Physics (32)
  • Interactive (29)
  • Dialog (25)
  • MIDI (22)
  • 6D (17)
  • Replay data (11)
  • Financial (10)
  • Ranking (10)
  • CAD (9)
  • fMRI (7)
  • Parallel (6)
  • Lyrics (2)
  • PSG (2)

3,148 dataset results

THAR Dataset (Targeted Hate Speech Against Religion)

The increase in religiously motivated hate on social media is clear and ongoing. These platforms have become fertile ground for the dissemination of hate speech directed at religious communities, resulting in tangible repercussions in the real world. Much of the current research concerning the automated identification of hateful content on social media focuses on English-language content. There is comparatively less exploration in low-resource languages such as Hindi. As social media users increasingly utilize their regional languages for expression, it becomes crucial to dedicate appropriate research efforts to hate speech detection in these languages.

0 papers • 0 benchmarks • Texts

MIMIC Meme Dataset (Misogyny Identification in Multimodal Internet Content in Hindi-English Code-Mix Language)

This dataset endeavors to fill the research void by presenting a meticulously curated collection of misogynistic memes in a code-mixed language of Hindi and English. It introduces two sub-tasks: the first entails a binary classification to determine the presence of misogyny in a meme, while the second task involves categorizing the misogynistic memes into multiple labels, including Objectification, Prejudice, and Humiliation.
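As a concrete illustration of the two sub-tasks, each meme's labels can be encoded as a binary misogyny flag (sub-task 1) plus a multi-label vector over the fine-grained categories (sub-task 2). This is a minimal sketch; the encoding and label order are assumptions, not the dataset's official format:

```python
# Fine-grained categories named in the description (order is an assumption).
FINE_LABELS = ["Objectification", "Prejudice", "Humiliation"]

def encode(meme_labels):
    """Sub-task 1: binary misogyny flag; sub-task 2: multi-label indicator vector."""
    is_misogynous = int(bool(meme_labels))
    multi = [int(lab in meme_labels) for lab in FINE_LABELS]
    return is_misogynous, multi

assert encode([]) == (0, [0, 0, 0])
assert encode(["Prejudice", "Humiliation"]) == (1, [0, 1, 1])
```

A multi-label setup like this lets one meme carry several categories at once, which a single multi-class label could not express.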

0 papers • 0 benchmarks • Images, Texts

Vript (🎬 Vript: A Video Is Worth Thousands of Words)

We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400K clips). The annotation of this dataset is inspired by the video script: to make a video, one first writes a script organizing how the scenes will be shot, deciding the content, shot type (medium shot, close-up, etc.), and camera movement (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene has a caption of ~145 words. Beyond the vision modality, we transcribe the voice-over into text and provide it along with the video title as background information for annotating the videos.

0 papers • 0 benchmarks • Texts, Videos

Arena-Hard

The Arena-Hard benchmark is a high-quality benchmarking tool for Large Language Models (LLMs) developed by LMSYS Org. It was designed to address the limitations of traditional benchmarks, which are often static or close-ended.

0 papers • 0 benchmarks • Texts

NIAN (Needle in a Needlestack)

The Needle in a Needlestack (NIAN) is a new benchmark designed to measure how well Large Language Models (LLMs) pay attention to the information in their context window.

0 papers • 0 benchmarks • Texts

GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation

The GenAI-Bench benchmark consists of 1,600 challenging real-world text prompts sourced from professional designers. Compared to benchmarks such as PartiPrompt and T2I-CompBench, GenAI-Bench captures a wider range of aspects of compositional text-to-visual generation, ranging from basic (scene, attribute, relation) to advanced (counting, comparison, differentiation, logic). The benchmark also collects human alignment ratings (1-to-5 Likert scales) on images and videos generated by ten leading models, such as Stable Diffusion, DALL-E 3, Midjourney v6, Pika v1, and Gen2.

0 papers • 0 benchmarks • Images, Texts, Videos

Multimodal Needle in a Haystack (MMNeedle)

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image-level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs.
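The stitching-and-labeling protocol described above can be sketched as follows: equally sized images are tiled into a grid to lengthen the visual context, and the retrieval label for a needle is derived automatically from its position. A minimal NumPy sketch; the function names and grid layout are assumptions, not the benchmark's actual code:

```python
import numpy as np

def stitch_grid(images, rows, cols):
    """Stitch equally sized images into a rows x cols grid (one long-context input)."""
    h, w = images[0].shape[:2]
    canvas = np.zeros((rows * h, cols * w) + images[0].shape[2:], dtype=images[0].dtype)
    for idx, img in enumerate(images):
        r, c = divmod(idx, cols)
        canvas[r * h:(r + 1) * h, c * w:(c + 1) * w] = img
    return canvas

def needle_label(needle_idx, cols):
    """Automatically generate the (row, col) retrieval label for a target sub-image."""
    return divmod(needle_idx, cols)

# Hypothetical example: four 8x8 grayscale tiles; the "needle" is tile 3.
tiles = [np.full((8, 8), i, dtype=np.uint8) for i in range(4)]
stitched = stitch_grid(tiles, rows=2, cols=2)
assert stitched.shape == (16, 16)
assert needle_label(3, cols=2) == (1, 1)
```

Because the label is a pure function of the tile index, ground truth for sub-image retrieval needs no manual annotation.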

0 papers • 0 benchmarks • Images, Texts

IndirectRequests

IndirectRequests is an LLM-generated dataset of user utterances in a task-oriented dialogue setting where the user does not directly specify their preferred slot value.

0 papers • 0 benchmarks • Texts

AutoRobust

This dataset comprises the dynamic analysis reports generated by CAPEv2 for both malware and goodware. We source the goodware as in Dambra et al. (https://arxiv.org/abs/2307.14657), who build a dataset spanning 2012 to 2020 from the community-maintained packages of Chocolatey. The malware are sourced from VirusTotal, namely Portable Executable samples from 2017 to 2020 released for academic purposes. In total, the assembled dataset contains 26,200 PE samples: 8,600 (33%) goodware and 17,675 (67%) malware.

0 papers • 0 benchmarks • Texts

AntM2C (Ant-Group Multi-Scenario Multi-Modal CTR dataset)

We release a large-scale Multi-Scenario Multi-Modal CTR dataset named AntM2C, built from real industrial data from Alipay. This dataset offers an impressive breadth and depth of information, covering CTR data from four diverse business scenarios: advertisements, consumer coupons, mini-programs, and videos. Unlike existing datasets, AntM2C provides not only ID-based features but also five textual features and one image feature for both users and items, supporting finer-grained multi-modal CTR prediction.
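To make the feature layout concrete, one record of such a dataset might combine ID-based features, the five textual features, the image feature, and a click label. This is a hypothetical sketch of the layout; the field names are assumptions, not AntM2C's published schema:

```python
from dataclasses import dataclass

# Hypothetical record layout mirroring the description: ID-based features plus
# five textual features and one image feature per side. Field names are assumptions.
@dataclass
class AntM2CRecord:
    scenario: str        # one of: advertisements, coupons, mini-programs, videos
    user_id: str
    item_id: str
    user_texts: list     # five textual features for the user side
    item_texts: list     # five textual features for the item side
    item_image: object   # single image feature (e.g., raw bytes or a URL)
    click: int           # CTR label: 1 = click, 0 = no click

rec = AntM2CRecord("advertisements", "u1", "i1", ["t"] * 5, ["t"] * 5, None, 1)
assert rec.click in (0, 1)
```

Keeping the scenario as an explicit field is what makes multi-scenario modeling (e.g., shared encoders with scenario-specific heads) straightforward to set up.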

0 papers • 0 benchmarks • Images, Tabular, Texts

InpaintCOCO

InpaintCOCO is a benchmark for understanding fine-grained concepts in multimodal (vision-language) models, similar to Winoground. To our knowledge, InpaintCOCO is the first benchmark consisting of image pairs with minimal differences, so that visual representations can be analyzed in a more standardized setting.

0 papers • 0 benchmarks • Images, Texts

POPCORN (POPCORN: Fictional and Synthetic Intelligence Reports for Named Entity Recognition and Relation Extraction Tasks)

POPCORN is a French dataset consisting of 400 validation texts and 400 training texts, all written and annotated manually. The texts are concise and factual, resembling information reports. The annotations, based on the ontology described below, allow for the training and evaluation of models in Information Extraction tasks, including Named Entity Recognition, Coreference Resolution, and Relation Extraction.

0 papers • 0 benchmarks • Texts

CASIA-CXR

Medical report generation (MRG), which aims to automatically generate a textual description of a specific medical image (e.g., a chest X-ray), has recently received increasing research interest. Building on the success of image captioning, MRG has become achievable. However, generating language-specific radiology reports poses a challenge for data-driven models due to their reliance on paired image-report chest X-ray datasets, which are labor-intensive, time-consuming, and costly to build. In this paper, we introduce a chest X-ray benchmark dataset, namely CASIA-CXR, consisting of high-resolution chest radiographs accompanied by narrative reports originally written in French. To the best of our knowledge, this is the first public chest radiograph dataset with medical reports in this particular language. Importantly, we also propose a simple yet effective multimodal encoder-decoder contextually-guided framework for medical report generation in French.

0 papers • 0 benchmarks • Images, Medical, Texts

CodeSCAN (ScreenCast ANalysis for Video Programming Tutorials)

CodeSCAN is the first large-scale and diverse dataset of coding screenshots, with pixel-perfect annotations.

0 papers • 0 benchmarks • Images, Texts

Flexural Capacity Database for reinforced UHPC Beams

A high-quality dataset forms the foundation for machine-learning-based prediction of structural load capacity. This study therefore collected 222 sets of flexural capacity test data for rectangular-section reinforced UHPC beams. The key factors affecting a UHPC beam's flexural capacity include geometric parameters, material properties, and reinforcement details. Eight fundamental structural features with significant impact on flexural capacity were selected as machine learning inputs: beam section width (b), beam section height (h), compressive strength of UHPC cubes (fcu), tensile strength of UHPC (ft), volume fraction of steel fibers (Vf), aspect ratio of steel fibers (Lf/Df), longitudinal reinforcement ratio (ρs), and yield strength of longitudinal reinforcement (fy). The output is the flexural capacity (Mu).
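The eight inputs above map naturally onto a fixed-order feature vector for a regression model predicting Mu. A minimal sketch; the dictionary keys and the example values are assumptions for illustration, not records from the dataset:

```python
import numpy as np

# The eight input features named in the description, in a fixed order
# (key names are assumptions for illustration).
FEATURES = ["b", "h", "fcu", "ft", "Vf", "Lf_Df", "rho_s", "fy"]

def to_feature_vector(beam: dict) -> np.ndarray:
    """Arrange one beam record into the model's input order; the target is Mu."""
    return np.array([beam[f] for f in FEATURES], dtype=float)

# Hypothetical beam record (values are illustrative, not taken from the dataset).
beam = {"b": 200.0, "h": 400.0, "fcu": 150.0, "ft": 9.0,
        "Vf": 2.0, "Lf_Df": 65.0, "rho_s": 0.02, "fy": 400.0}
x = to_feature_vector(beam)
assert x.shape == (8,)
```

Fixing the feature order once keeps train and inference pipelines consistent, which matters more than usual with only 222 samples.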

0 papers • 0 benchmarks • Texts

MIE Articles Dataset

This dataset contains 4,606 articles presented at MIE (Medical Informatics Europe) conferences from 1996 to 2024. The data were extracted from PubMed, and topic extraction and affiliation parsing were then applied.

0 papers • 0 benchmarks • Texts

TS-TR (Turkish Scene Text Recognition Dataset)

The Turkish Scene Text Recognition (TS-TR) dataset was primarily developed to fill the gap in non-English text recognition resources, specifically addressing the unique challenges presented by the Turkish language, such as special characters and diacritics. This dataset mirrors real-world conditions with texts displayed in various fonts, sizes, orientations, and complex backgrounds from multiple urban and rural environments. Such diversity ensures the training of models that can generalize across different scenarios, including varying lighting conditions and complex visual layouts.

0 papers • 0 benchmarks • Images, Texts

SILD (Survey Item Linking Dataset)

This dataset contains a collection of texts from publications from a broad range of social science domains (e.g., economics, politics, psychology, etc.). The texts are annotated with labels for Survey Item Linking (SIL), an Entity Linking (EL) task. SIL is divided into two sub-tasks: Mention Detection (MD), a binary text classification task, and Entity Disambiguation (ED), a sentence similarity task. Sentences that mention survey items are labeled with the IDs of entities from a knowledge base (GSIM). SILD contains 20,454 sentences in English and German from 100 publications.

0 papers • 0 benchmarks • Texts

English-Pashto Language Dataset (EPLD)

The English-Pashto Language Dataset (EPLD) is a comprehensive resource aimed at providing linguistic insight into the Pashto language. It covers the basics of communication, such as counting, the alphabet, pronouns, and sentences used in everyday life. Every entry is translated from English to Pashto for clarity, and the data is carefully proofread and verified by native speakers and language experts. Pashto has multiple variations and accents depending on geography; this dataset addresses key differences in Pashto words and sounds, which may be similar to or different from English depending on gender, the tense of the statement, the relationship of the speaker, and so on. The dataset is designed to support language learning, natural language processing (NLP) research, and computational linguistics studies focusing on Pashto.

0 papers • 0 benchmarks • Tabular, Texts

MM-IQ

IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet, artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this critical gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms.

0 papers • 0 benchmarks • Images, Texts
Page 157 of 158