The increase in religiously motivated hate on social media is clear and ongoing. These platforms have become fertile ground for hate speech directed at religious communities, with tangible real-world repercussions. Most current research on the automated identification of hateful content on social media focuses on English-language content, with comparatively little exploration of low-resource languages such as Hindi. As social media users increasingly express themselves in their regional languages, it becomes crucial to dedicate research effort to hate speech detection in these languages.
This dataset helps fill this research void by presenting a meticulously curated collection of misogynistic memes in code-mixed Hindi-English. It introduces two sub-tasks: the first is a binary classification of whether a meme is misogynistic; the second categorizes misogynistic memes into multiple labels, including Objectification, Prejudice, and Humiliation.
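Purely as an illustration of the two sub-tasks, one meme annotation could be encoded as below; the field names and the exact label set are assumptions for this sketch, not the dataset's published schema.

# Hypothetical encoding of the two sub-tasks described above; the real
# release may use different field names or additional fine-grained labels.
meme_annotation = {
    "meme_id": "meme_0123",        # assumed identifier
    "misogynistic": True,          # Sub-task 1: binary classification
    "labels": {                    # Sub-task 2: multi-label categorization
        "Objectification": True,
        "Prejudice": False,
        "Humiliation": True,
    },
}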
We construct a fine-grained video-text dataset with 12K annotated high-resolution videos (~400k clips). The annotation scheme is inspired by video scripts: before shooting a video, one first writes a script that organizes how each scene will be shot, deciding the content, the shot type (medium shot, close-up, etc.), and how the camera moves (panning, tilting, etc.). We therefore extend video captioning to video scripting by annotating videos in the format of video scripts. Unlike previous video-text datasets, we densely annotate entire videos without discarding any scenes, and each scene receives a caption of ~145 words. Beyond the vision modality, we transcribe the voice-over into text and provide it along with the video title as additional background information for annotation.
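As a sketch of this script-style annotation, one annotated video might be structured as follows; all field names here are assumptions, not the dataset's actual release format.

# Hypothetical structure of one script-style annotation.
video_annotation = {
    "video_id": "vid_00042",                     # assumed identifier
    "title": "A day at the fishing harbor",      # background context for annotators
    "voice_over_transcript": "The boats return at dawn ...",
    "scenes": [
        {
            "start": "00:00:03.2",
            "end": "00:00:11.8",
            "caption": "A fisherman in a yellow raincoat hauls a net "
                       "over the gunwale while gulls circle overhead ...",
            # In the dataset, each scene caption averages ~145 words.
            "shot_type": "medium shot",          # e.g. medium shot, close-up
            "camera_movement": "panning",        # e.g. panning, tilting
        },
        # ... every scene in the video is annotated; none are discarded.
    ],
}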
The Arena-Hard benchmark is a high-quality benchmarking tool for Large Language Models (LLMs) developed by LMSYS Org. It was designed to address the limitations of traditional benchmarks, which are often static or close-ended.
The Needle in a Needlestack (NIAN) is a new benchmark designed to measure how well Large Language Models (LLMs) pay attention to the information in their context window.
The GenAI-Bench benchmark consists of 1,600 challenging real-world text prompts sourced from professional designers. Compared to benchmarks such as PartiPrompt and T2I-CompBench, GenAI-Bench captures a wider range of aspects of compositional text-to-visual generation, from basic (scene, attribute, relation) to advanced (counting, comparison, differentiation, logic). GenAI-Bench also collects human alignment ratings (1-to-5 Likert scales) on images and videos generated by ten leading models, such as Stable Diffusion, DALL-E 3, Midjourney v6, Pika v1, and Gen2.
Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address this gap, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image-level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs.
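A minimal sketch of the kind of grid stitching and automatic sub-image labeling this protocol describes; the tile size, grid shape, and function names are assumptions, and the actual MMNeedle pipeline may differ.

from PIL import Image

def stitch_grid(image_paths, rows, cols, tile_size=(224, 224)):
    """Stitch rows x cols images into one canvas to lengthen the
    visual context, in the spirit of MMNeedle's image stitching."""
    w, h = tile_size
    canvas = Image.new("RGB", (cols * w, rows * h))
    for idx, path in enumerate(image_paths[: rows * cols]):
        r, c = divmod(idx, cols)                       # grid position of this tile
        tile = Image.open(path).convert("RGB").resize((w, h))
        canvas.paste(tile, (c * w, r * h))
    return canvas

def needle_label(needle_index, cols):
    """Automatically derive the sub-image-level retrieval label as a
    (row, column) position in the stitched grid."""
    return divmod(needle_index, cols)

# Usage (hypothetical paths): stitch a 4x4 haystack whose needle is tile 9.
# haystack = stitch_grid([f"img_{i}.png" for i in range(16)], rows=4, cols=4)
# print(needle_label(9, cols=4))  # -> (2, 1): row 2, column 1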
IndirectRequests is an LLM-generated dataset of user utterances in a task-oriented dialogue setting where the user does not directly specify their preferred slot value.
This dataset comprises the dynamic analysis reports generated by CAPEv2 for both malware and goodware. The goodware is sourced as in Dambra et al. (https://arxiv.org/abs/2307.14657), who built a dataset spanning 2012 to 2020 through the community-maintained packages of Chocolatey. The malware is sourced from VirusTotal, namely Portable Executable samples from 2017 to 2020 released for academic purposes. In total, the assembled dataset contains 26,200 PE samples: 8,600 (33%) goodware and 17,675 (67%) malware.
We release AntM2C, a large-scale Multi-Scenario Multi-Modal CTR dataset built from real industrial data from Alipay. The dataset offers impressive breadth and depth of information, covering CTR data from four diverse business scenarios: advertisements, consumer coupons, mini-programs, and videos. Unlike existing datasets, AntM2C provides not only ID-based features but also five textual features and one image feature for both users and items, supporting more fine-grained multi-modal CTR prediction.
InpaintCOCO is a benchmark for understanding fine-grained concepts in multimodal (vision-language) models, similar to Winoground. To our knowledge, InpaintCOCO is the first benchmark consisting of image pairs with minimal differences, so that visual representations can be analyzed in a more standardized setting.
POPCORN is a French dataset consisting of 400 validation texts and 400 training texts, all written and annotated manually. The texts are concise and factual, resembling information reports. The annotations, based on the ontology described below, allow for the training and evaluation of models in Information Extraction tasks, including Named Entity Recognition, Coreference Resolution, and Relation Extraction.
Medical report generation (MRG), which aims to automatically generate a textual description of a specific medical image (e.g., a chest X-ray), has recently received increasing research interest. Building on the success of image captioning, MRG has become achievable. However, generating language-specific radiology reports poses a challenge for data-driven models because they rely on paired image-report chest X-ray datasets, which are labor-intensive, time-consuming, and costly to create. In this paper, we introduce a chest X-ray benchmark dataset, namely CASIA-CXR, consisting of high-resolution chest radiographs accompanied by narrative reports originally written in French. To the best of our knowledge, this is the first public chest radiograph dataset with medical reports in this language. Importantly, we propose a simple yet effective multimodal encoder-decoder contextually-guided framework for medical report generation in French. We validated our framework through intra-language evaluations.
CodeSCAN is the first large-scale and diverse dataset of coding screenshots with pixel-perfect annotations.
A high-quality dataset forms the foundation for machine learning-based predictions of structural load capacity. This study therefore collected 222 sets of flexural capacity test data for rectangular-section reinforced UHPC beams. The key factors affecting a UHPC beam's flexural capacity include geometric parameters, material properties, and reinforcement details. Eight fundamental structural features that significantly impact the flexural capacity of UHPC beams were selected as input parameters for machine learning: beam section width (b), beam section height (h), compressive strength of UHPC cubes (fcu), tensile strength of UHPC (ft), volume fraction of steel fibers (Vf), aspect ratio of steel fibers (Lf/Df), longitudinal reinforcement ratio (ρs), and yield strength of longitudinal reinforcement (fy). The output is the flexural capacity (Mu).
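A minimal sketch of how such a tabular regression task might be set up; the file name, column names, and choice of model are assumptions for illustration, not prescribed by the study.

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# The eight input features and the target named in the description;
# the CSV path and column names below are hypothetical.
FEATURES = ["b", "h", "fcu", "ft", "Vf", "Lf_Df", "rho_s", "fy"]
TARGET = "Mu"

df = pd.read_csv("uhpc_beams.csv")  # assumed file holding the 222 test records
X_train, X_test, y_train, y_test = train_test_split(
    df[FEATURES], df[TARGET], test_size=0.2, random_state=42
)

model = GradientBoostingRegressor(random_state=42)  # one plausible regressor
model.fit(X_train, y_train)
print("R^2 on held-out beams:", r2_score(y_test, model.predict(X_test)))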
This dataset contains 4,606 articles from 1996 to 2024 presented at MIE (Medical Informatics Europe) conferences. The data were extracted from PubMed, and topic extraction and affiliation parsing were performed on them.
The Turkish Scene Text Recognition (TS-TR) dataset was primarily developed to fill the gap in non-English text recognition resources, specifically addressing the unique challenges presented by the Turkish language, such as special characters and diacritics. This dataset mirrors real-world conditions with texts displayed in various fonts, sizes, orientations, and complex backgrounds from multiple urban and rural environments. Such diversity ensures the training of models that can generalize across different scenarios, including varying lighting conditions and complex visual layouts.
This dataset contains a collection of texts from publications across a broad range of social science domains (e.g., economics, politics, psychology). The texts are annotated with labels for Survey Item Linking (SIL), an Entity Linking (EL) task. SIL is divided into two sub-tasks: Mention Detection (MD), a binary text classification task, and Entity Disambiguation (ED), a sentence similarity task. Sentences that mention survey items are labeled with the IDs of entities from a knowledge base (GSIM). SILD contains 20,454 sentences in English and German from 100 publications.
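Purely as an illustration of the two sub-tasks, annotated sentences could look like the sketch below; the field names and the GSIM identifier are hypothetical, not taken from SILD.

# Hypothetical shape of one annotated sentence; SILD's real format
# and GSIM IDs may differ.
positive_example = {
    "sentence": "Respondents rated their trust in parliament on an 11-point scale.",
    "language": "en",
    "mentions_survey_item": True,   # Mention Detection: binary label
    "gsim_ids": ["GSIM-12345"],     # Entity Disambiguation: hypothetical KB entity ID
}
negative_example = {
    "sentence": "We discuss the implications of these findings.",
    "language": "en",
    "mentions_survey_item": False,
    "gsim_ids": [],
}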
The English-Pashto Language Dataset (EPLD) is a comprehensive resource aimed at providing linguistic insight into the Pashto language. It covers the basics of communication, such as counting, the alphabet, pronouns, and sentences used in everyday life. Every entry is translated from English to Pashto for clarity and ease of understanding, and the data has been carefully proofread and verified by native speakers and language experts. Pashto has multiple variations and accents depending on geography, and the dataset addresses key differences in Pashto words and sounds, which may vary with gender, the tense of the statement, the relationship between speakers, and similar factors in ways that differ from English. EPLD is designed to support language learning, natural language processing (NLP) research, and computational linguistics studies focusing on Pashto.
IQ testing has served as a foundational methodology for evaluating human cognitive capabilities, deliberately decoupling assessment from linguistic background, language proficiency, or domain-specific knowledge to isolate core competencies in abstraction and reasoning. Yet artificial intelligence research currently lacks systematic benchmarks to quantify these critical cognitive dimensions in multimodal systems. To address this gap, we propose MM-IQ, a comprehensive evaluation framework comprising 2,710 meticulously curated test items spanning 8 distinct reasoning paradigms.