3,148 machine learning datasets
Overview
MILU (Multi-task Indic Language Understanding Benchmark) is a comprehensive evaluation dataset designed to assess the performance of Large Language Models (LLMs) across 11 Indic languages. It spans 8 domains and 42 subjects, reflecting both general and culturally specific knowledge from India.
This dataset was curated for Search Engine Optimization (SEO) analysis tasks, including categorization and spam detection. It covers 12 diverse topics: basketball, books, cats, gardening, history, movies, music, recipes, sports, technology, travel, and weather. Some topics have hierarchical relationships, such as sports and basketball, while others are closely related (e.g., movies and music) or unrelated (e.g., basketball and gardening), with varying degrees of overlap among them. For each topic, approximately 300 search queries were generated using large language models (LLMs) like GPT, Llama, and Claude. The top 10 URLs from the Google Search Console’s search engine results page (SERP) were retrieved for each query.
A comprehensive Turkish dataset for question-answering tasks in the medical domain
Code and Data for Replication of "Microsimulation Estimates of Decision Uncertainty and Value of Information Are Biased but Consistent"
We introduce a dataset consisting of 1,314 samples, including users' tweets and bios. Each user's job title is found via Wikipedia crawling. The challenge of multiple job titles per user is handled with a semantic word embedding and clustering method. A job prediction method is then introduced based on a deep neural network and TF-IDF features. We also use hashtags and emojis in the tweets for job prediction. Results show that users' job titles on Twitter can be predicted with 54% accuracy across nine categories.
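To make the TF-IDF feature step concrete, here is a minimal pure-Python sketch of the standard TF-IDF weighting; the paper's exact tokenization and weighting variant are not specified, so this is illustrative only.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    Uses the plain tf * log(N / df) scheme; real pipelines often
    apply smoothing and normalization on top of this.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy bios: "python" occurs in 2 of 3 documents, so it is weighted
# lower than terms unique to a single document, such as "chef".
docs = [["data", "scientist", "python"],
        ["python", "developer"],
        ["chef", "recipes"]]
w = tfidf(docs)
```

Rarer terms receive higher weights, which is why TF-IDF vectors are a reasonable input for a job-category classifier.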
Sentiment analysis is pivotal in Natural Language Processing for understanding opinions and emotions in text. While advances in sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags behind, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and IFDHN, a novel model incorporating fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations on the ArSen dataset demonstrate the efficacy of IFDHN and highlight future research directions in ASA.
SoliDiffy Differencing Contract Pairs and Edit Scripts Dataset
The project creates and maintains two main datasets to assist with research and evaluation of Solidity smart contract differencing:
GPTKB is a large general-domain knowledge base (KB) constructed entirely from a large language model (LLM). It demonstrates the feasibility of large-scale KB construction from LLMs, while highlighting specific challenges arising around entity recognition, entity and property canonicalization, and taxonomy construction.
A large collection of human-written natural language questions and their corresponding SPARQL queries over federated bioinformatics knowledge graphs (KGs), collected over several years across different research groups at the SIB Swiss Institute of Bioinformatics. The collection comprises more than 1,000 example questions and queries, including 65 federated queries. We propose a methodology to uniformly represent the examples with minimal metadata, based on existing standards. Furthermore, we introduce an extensive set of open-source applications, including query graph visualizations and smart query editors, easily reusable by KG maintainers who adopt the proposed methodology.
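A federated query is one that joins data across endpoints via SPARQL's SERVICE clause. The sketch below shows a question–query pair in that spirit; the metadata fields, endpoints, and predicates are illustrative assumptions, not the collection's exact schema or contents.

```python
# Hypothetical question–query record (field names are illustrative).
example = {
    "question": "Which proteins have an enzyme annotation linked to a "
                "labeled Rhea reaction?",
    "endpoint": "https://sparql.uniprot.org/sparql",
    "federates_with": ["https://sparql.rhea-db.org/sparql"],
    "query": """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?protein ?label WHERE {
  ?protein a up:Protein ;
           up:enzyme ?enzyme .
  # The SERVICE clause is what makes the query federated: the inner
  # pattern is evaluated at the remote Rhea endpoint.
  SERVICE <https://sparql.rhea-db.org/sparql> {
    ?reaction rdfs:label ?label .
  }
}
""",
}

# The presence of a SERVICE clause distinguishes the 65 federated
# queries from the rest of the collection.
is_federated = "SERVICE" in example["query"]
```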
COMFORT is an evaluation protocol for systematically assessing the spatial reasoning capabilities of vision-language models (VLMs).
This corpus contains data files generated as part of the NOVIC paper. This includes the complete Object Noun Dictionary, the exact templates used for the multiset prompt templating strategy, and a large dataset of 1.8M LLM-generated and templated captions organized by target noun. The captions were generated based on all of the target nouns in the Object Noun Dictionary.
DAVIS-Edit is a curated testing benchmark for video editing. This dataset contains two evaluation settings, i.e., text- and image-based editing. In addition, it offers two types of annotations for both prompt modalities, covering editing scenarios with similar (DAVIS-Edit-S) and changing (DAVIS-Edit-C) shapes, so as to address the shape inconsistency problem in video-to-video editing.
Overview: This collection contains three synthetic datasets produced by gpt-4o-mini for sentiment analysis and PDT (Product Desirability Toolkit) testing. Each dataset contains 1,000 hypothetical software product reviews, with the aim of producing diverse sentiment and text. The datasets were created as part of the research described in:
Large language models (LLMs) excel in high-resource languages but face notable challenges in low-resource languages like Mongolian. The release of MM-Eval, comprising 1,840 tasks in total (569 syntax, 677 semantics, 344 knowledge, and 250 reasoning), offers valuable insights for advancing NLP and LLMs in low-resource languages like Mongolian.
AdvSuffixes - Information
AdvSuffixes is a curated dataset of adversarial prompts and suffixes designed to evaluate and enhance the robustness of large language models (LLMs) against adversarial attacks. By appending these suffixes to standard prompts, researchers and developers can explore and analyze how LLMs respond to potentially harmful input scenarios. This dataset is heavily inspired by AdvBench.
This dataset provides a curated collection of approved drug Simplified Molecular Input Line Entry System (SMILES) strings and their associated protein sequences. Each small molecule has been approved by at least one regulatory body, ensuring the safety and relevance of the data for computational applications. The dataset includes 1,660 approved small molecules and their 2,093 related protein targets.
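When SMILES strings like these are fed to sequence models, a common first step is regex-based tokenization. The sketch below uses a widely used pattern (bracket atoms, two-letter halogens, ring-closure digits, bond symbols); it is a generic preprocessing example, not part of this dataset's own tooling.

```python
import re

# Common SMILES tokenization pattern: bracket atoms first, then
# two-letter halogens (Br, Cl), organic-subset atoms, bonds, ring
# closures, and branch parentheses.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
    r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles
    return tokens

# Aspirin: each character here happens to be its own token.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
# Sodium chloride: bracket atoms are kept as single tokens.
print(tokenize_smiles("[Na+].[Cl-]"))
```

The ordering of alternatives matters: bracket atoms and two-letter halogens must be matched before single letters, or `Cl` would be split into `C` and a dangling `l`.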
The dataset contains 140 paragraphs from climate change reports with associated aspect-based (i.e., query-focused) summaries that were produced by experts specifically for policy-makers.