Datasets

3,148 machine learning datasets

vqa-nle-llava

A synthetic VQA NLE dataset, generated with LLaVA-1.5 using features from the GQA dataset. Total number of unique data points: 66,684.

YesBut Dataset (https://yesbut-dataset.github.io)

Understanding satire and humor is a challenging task for even current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical) and release a high-quality dataset YesBut, consisting of 2547 images, 1084 satirical and 1463 non-satirical, containing different artistic styles, to evaluate those tasks. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario which is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in Zero-Sh

1 paper | 0 benchmarks | Images, Texts

MMInstruct-GPT4V (MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity)

Vision-language supervised fine-tuning effectively enhances VLLM performance, but existing visual instruction tuning datasets have limitations:

1 paper | 0 benchmarks | Images, Texts

ReALFRED

ReALFRED is an embodied instruction-following benchmark.

1 paper | 0 benchmarks | Images, Texts

CAS-VSR-S101

CAS-VSR-S101 is a new large-scale, in-the-wild Mandarin dataset with 101.1 hours of data. The videos are sourced from broadcast news and conversational programs in Chinese, covering a highly diverse set of topics, speakers and filming conditions. Utterance lengths are naturally distributed between 0.01s and 10.57s, and image quality and resolution vary. News accounts for 82.4% of the programs. 70.4% of the utterances depict news anchors, hosts and correspondents, while 29.6% are those of interviewees and guests. Male and female appearances are relatively balanced, at a ratio of approximately 1.5 : 1. The dataset is divided into train, validation and test sets by TV channel to minimize speaker overlap, at a ratio of roughly 8 : 1 : 1.5 in terms of duration. The validation and test sets are composed of programs broadcast on provincial TV channels. The dataset is available for academic use under a license.

1 paper | 4 benchmarks | Audio, Speech, Texts, Videos
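
For the CAS-VSR-S101 entry above, the stated 8 : 1 : 1.5 duration ratio can be turned into rough per-split hours. The sketch below assumes the ratio applies to the full 101.1 hours. The function name `split_hours` is hypothetical, and the resulting figures are estimates derived from the rounded ratio, not numbers reported by the dataset authors.

```python
def split_hours(total_hours, ratio):
    """Distribute total_hours across splits in proportion to the ratio weights."""
    weight_sum = sum(ratio.values())
    return {name: total_hours * weight / weight_sum for name, weight in ratio.items()}


# Assumption: the "roughly 8 : 1 : 1.5" ratio covers all 101.1 hours of data.
hours = split_hours(101.1, {"train": 8, "validation": 1, "test": 1.5})

for name, value in hours.items():
    print(f"{name}: about {value:.1f} hours")
```

Because the source gives the ratio only approximately, the true split durations may differ from these estimates.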

DalleStreet

A dataset of images obtained from DALL-E 3 for 67 countries and 10 concept classes, similar to DollarStreet images.

1 paper | 0 benchmarks | Images, Texts

WorldCuisines

Vision Language Models (VLMs) often struggle with culture-specific knowledge, particularly in languages other than English and in underrepresented cultural contexts. To evaluate their understanding of such knowledge, we introduce WorldCuisines, a massive-scale benchmark for multilingual and multicultural, visually grounded language understanding. This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date. It includes tasks for identifying dish names and their origins. We provide evaluation datasets in two sizes (12k and 60k instances) alongside a training dataset (1 million instances). Our findings show that while VLMs perform better with correct location context, they struggle with adversarial contexts and predicting specific regional cuisines and languages. To support future research, we release

1 paper | 0 benchmarks | Images, Texts

SemiEvol

1 paper | 0 benchmarks | Texts

decompile-ghidra-100k

This release, decompile-ghidra-100k, is a subset of 100k training samples (25k per optimization level). We provide a training script that runs in ~3.5 hours on a single A100 40G GPU. The resulting model achieves a 0.26 re-executability rate at a total cost of under $20, allowing quick replication of LLM4Decompile.

1 paper | 0 benchmarks | Texts

RaTE-NER

The RaTE-NER dataset is a large-scale radiological named entity recognition (NER) dataset. It includes 13,235 manually annotated sentences from 1,816 reports within the MIMIC-IV database, spanning 9 imaging modalities and 23 anatomical regions.

1 paper | 0 benchmarks | Medical, Texts

Dhoroni (Dhoroni: A Multi-Perspective Bengali Climate Change and Environmental News Dataset)

Climate change poses critical challenges globally, disproportionately affecting low-income countries that often lack resources and linguistic representation on the international stage. Despite Bangladesh's status as one of the most vulnerable nations to climate impacts, research gaps persist in Bengali-language studies related to climate change and NLP. To address this disparity, we introduce ধরণী (Dhoroni), a novel Bengali (Bangla) climate change and environmental news dataset comprising 2300 annotated Bangla news articles, offering multiple perspectives such as political influence, scientific/statistical data, authenticity, stance detection, and stakeholder involvement. Furthermore, we present an in-depth exploratory analysis of Dhoroni and introduce the BanglaBERT-Dhoroni family, a novel baseline family for climate stance detection in Bangla, fine-tuned on our dataset. This research contributes significantly to enhancing accessibility and analysis of climate discourse in Bengali (Ban

1 paper | 4 benchmarks | Texts

UKIL-DB-EN

Bangladesh's legal system struggles with major challenges like delays, complexity, high costs, and millions of unresolved cases, which deter many from pursuing legal action due to lack of knowledge or financial constraints. This research seeks to develop a specialized Large Language Model (LLM) to assist in the Bangladeshi legal system. We created UKIL-DB-EN, an English corpus of Bangladeshi legal documents, by collecting and scraping data on various legal acts. We fine-tuned the GPT-2 model on this dataset to develop GPT2-UKIL-EN, an LLM focused on providing legal assistance in English. The model was rigorously evaluated using semantic assessments, including case studies supported by expert opinions. The evaluation provided promising results, demonstrating the potential for the model to assist in legal matters within Bangladesh. Our work represents the first structured effort toward building an AI-based legal assistant for Bangladesh. While the results are encouraging, further refinem

1 paper | 0 benchmarks | Texts

PolyMATH

PolyMATH is a challenging benchmark aimed at evaluating the general cognitive reasoning abilities of MLLMs. PolyMATH comprises 5,000 manually collected high-quality images of cognitive textual and visual challenges across 10 distinct categories, including pattern recognition, spatial reasoning, and relative reasoning.

1 paper | 0 benchmarks | Images, Texts

Keyphrases CS&Math Russian

The dataset contains abstracts of CS/Math articles (in Russian) obtained from two online sources. For each article, the publication year, journal name, authors, title, keyphrases, and abstract are provided.

1 paper | 0 benchmarks | Texts

misinfo-general

We introduce misinfo-general, a benchmark dataset for evaluating misinformation models’ ability to perform out-of-distribution generalisation. Misinformation changes rapidly, much more quickly than moderators can annotate at scale, resulting in a shift between the training and inference data distributions. Misinformation models therefore need to be able to perform out-of-distribution generalisation, an understudied problem in existing datasets.

1 paper | 0 benchmarks | Texts

Car_Price_Prediction (Second_Hand-Car_Price_Prediction)

This dataset includes the following fields: Company Name, Car Model, Car Type, Fuel Type, Transmission, Engine (cc), Mileage, Kms_driven, Buyers, Horsepower (kw), Year, Price (Lakhs).

1 paper | 1 benchmark | Financial, Texts

CCI 3.0-HQ

To address the scarcity of high-quality safety datasets in Chinese, we open-sourced the CCI (Chinese Corpora Internet) dataset on November 29, 2023. Building on this foundation, we continued to expand the data sources, adopted stricter data cleaning methods, and completed the construction of the CCI 3.0 dataset. This dataset is composed of high-quality, reliable Internet data from trusted sources. After further, stricter filtering, the released CCI 3.0-HQ corpus is about 500GB in size.

1 paper | 0 benchmarks | Texts

Bengali Social Media Depressive Dataset (BSMDD)

Our dataset, BSMDD, was collected from various open social media platforms and translated and annotated by native Bengali speakers with expertise in both language and mental health. It contains 21,910 cleaned samples, including 10,961 labeled as Depressed and 10,949 as Non-Depressed. The dataset is publicly accessible, providing a valuable resource for further research in depression detection in Bengali social media content. The expert annotation process, conducted by professionals, ensures high validity, making BSMDD particularly important for advancing mental health research through social media analysis. This dataset is also published on Mendeley.

1 paper | 0 benchmarks | Medical, Texts

Perfume Co-Preference Network

The Perfume Co-Preference Network dataset comprises comprehensive user reviews and ratings collected from the Persian retail platform Atrafshan. This dataset, central to our research on community detection in fragrance preferences, includes 36,434 comments from 7,387 unique users, providing insights into consumer sentiment towards various perfumes. It is designed to facilitate the analysis of user preferences through sentiment analysis, allowing for the clustering of perfumes based on shared attributes.

1 paper | 0 benchmarks | Graphs, Tables, Texts

The Write & Improve Corpus 2024

We present a new annotated corpus of written learner English, derived from essays submitted to the learning platform Write & Improve (W&I). Users of W&I are presented with automated scoring and feedback on grammatical errors, and are encouraged to act on their error feedback, submitting multiple versions of their essays for any given prompt. We build the corpus on this interplay between users and prompts, collecting sets of essays submitted by users for a selected list of 50 popular prompts. The prompts include 20 aimed at beginner learners of English, 20 aimed at intermediate learners, and 10 at advanced learners. This distribution reflects the greater use of W&I by beginner and intermediate learners of English. We ensured that the prompts were not likely to elicit personal information and covered a broad range of tasks and topics. This list of prompts enabled us to identify 5050 essay sets written by 766 users, forming the basis for the Write & Improve Corpus, which is being made ava

1 paper | 0 benchmarks | Texts
