Datasets

3,148 machine learning datasets


TACO-BAAI (Topics in Algorithmic Code generation dataset)

TACO (Topics in Algorithmic Code generation dataset) is a dataset focused on algorithmic code generation. It is designed to provide a more challenging training dataset and evaluation benchmark for code generation models. The dataset consists of programming competition problems that are more difficult and closer to real-world programming scenarios. Rather than testing only the implementation of predefined functions, it emphasizes improving and evaluating a model's understanding and reasoning abilities in practical application scenarios.

1 paper | 1 benchmark | Texts

TuPyE-Dataset (Portuguese Hate Speech Expanded Dataset)

TuPyE, an expanded version of TuPy, comprises 43,668 annotated documents selected for hate speech detection across diverse social network contexts. It adds supplementary annotations and merges data from Fortuna et al. (2019), Leite et al. (2020), and Vargas et al. (2022) with 10,000 original documents from the TuPy-Dataset.

1 paper | 0 benchmarks | Texts

Sequential Instructions

This is the sequential instructions dataset from the paper "Understanding the Effects of RLHF on LLM Generalisation and Diversity". The dataset uses the alpaca_eval format.

1 paper | 0 benchmarks | Texts

Machine_Mindset_MBTI_dataset

MBTI has four dimensions, and each dimension consists of two opposite attributes.

1 paper | 0 benchmarks | Texts
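With two opposite attributes in each of four dimensions, there are 2^4 = 16 possible MBTI types. The short sketch below enumerates them with `itertools.product`. The letter codes are the conventional MBTI abbreviations, assumed here rather than taken from the dataset's own files, which may label the types differently.

```python
from itertools import product

# Conventional MBTI dimensions (an assumption, not read from the dataset):
# Extraversion/Introversion, Sensing/Intuition, Thinking/Feeling, Judging/Perceiving.
MBTI_DIMENSIONS = [("E", "I"), ("S", "N"), ("T", "F"), ("J", "P")]


def all_mbti_types() -> list[str]:
    """Pick one attribute from each dimension for every possible combination."""
    return ["".join(combo) for combo in product(*MBTI_DIMENSIONS)]


types = all_mbti_types()
print(len(types))  # 16
```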

Forex News Annotated Dataset for Sentiment Analysis

This dataset contains news headlines relevant to five key forex pairs: AUDUSD, EURCHF, EURUSD, GBPUSD, and USDJPY. The data was extracted from two reputable platforms, Forex Live and FXstreet, over a period of 86 days from January to May 2023, and comprises 2,291 unique news headlines. Each headline is accompanied by its forex pair, timestamp, source, author, URL, and the corresponding article text.

The data was collected by a custom web-scraping service running on a virtual machine. The service periodically retrieves the latest news for a specified forex pair (ticker) from each platform and parses all available information, extracting details such as the article's timestamp, author, and URL. The URL is then used to retrieve the full text of each article. This acquisition process repeats approximately every 15 minutes.

1 paper | 0 benchmarks | Texts
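The acquisition process described for the forex dataset can be sketched as a single polling cycle. This is a minimal illustration, not the authors' service: the `Headline` record, the fetcher callables, and deduplication by URL are all assumptions, and the stub fetcher is a hypothetical stand-in for the real Forex Live and FXstreet scrapers.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable, Iterable


# Hypothetical record layout: field names are illustrative, but the fields mirror
# the description (pair, timestamp, source, author, URL, article text).
@dataclass(frozen=True)
class Headline:
    pair: str
    title: str
    timestamp: datetime
    source: str
    author: str
    url: str
    text: str


PAIRS = ["AUDUSD", "EURCHF", "EURUSD", "GBPUSD", "USDJPY"]
POLL_INTERVAL_SECONDS = 15 * 60  # "approximately every 15 minutes"

# Hypothetical fetcher type: returns (title, timestamp, author, url) tuples for one pair.
ListingFetcher = Callable[[str], Iterable[tuple[str, datetime, str, str]]]


def poll_once(
    listing_fetchers: dict[str, ListingFetcher],
    fetch_article: Callable[[str], str],
    seen_urls: set[str],
) -> list[Headline]:
    """Run one collection cycle over every source and pair, skipping URLs seen before."""
    new_items: list[Headline] = []
    for source, fetch_listing in listing_fetchers.items():
        for pair in PAIRS:
            for title, ts, author, url in fetch_listing(pair):
                if url in seen_urls:
                    continue  # already collected in an earlier cycle
                seen_urls.add(url)
                # The article URL is used to retrieve the full text.
                text = fetch_article(url)
                new_items.append(Headline(pair, title, ts, source, author, url, text))
    return new_items


# Hypothetical stub standing in for a real scraper, used only to exercise the loop.
def stub_listing(pair: str):
    return [(f"{pair} headline", datetime(2023, 1, 2, 9, 0), "Analyst", f"https://example.com/{pair}")]


seen: set[str] = set()
first = poll_once({"ExampleSource": stub_listing}, lambda url: "full text", seen)
second = poll_once({"ExampleSource": stub_listing}, lambda url: "full text", seen)
print(len(first), len(second))  # 5 0
```

In a long-running service, `poll_once` would be called in a loop with `time.sleep(POLL_INTERVAL_SECONDS)` between cycles. Deduplicating on URL keeps headlines unique, at the cost of recording a story that appears under two pairs only once, under the first pair that returned it.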

ShortPersianEmo

ShortPersianEmo is a new dataset for emotion recognition in short Persian texts. It is a single-label dataset containing 5,472 short Persian texts collected from Twitter and Digikala. The texts are annotated according to Rachael Jack’s emotional model into five classes: happiness, sadness, anger, fear, and other. Unlike existing publicly available datasets, which impose no restriction on text length, ShortPersianEmo focuses specifically on short texts. The average text length in the ShortPersianEmo dataset is 56 words. Table 1 presents a comparison between ShortPersianEmo and other datasets from the literature for emotion detection in Persian text. For more information on this dataset, please read our paper. If you use this dataset in any research work, please cite our paper.

1 paper | 3 benchmarks | Texts

REBUS (A Robust Evaluation Benchmark of Understanding Symbols)

Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both images and text as input. Virtually all of these models have been announced within the past year, creating a significant need for benchmarks that evaluate their ability to reason truthfully and accurately on a diverse set of tasks. When Google announced Gemini (Gemini Team et al., 2023), it showcased the model's ability to solve rebuses: wordplay puzzles that involve creatively adding and subtracting letters from words derived from text and images. The diversity of rebuses allows for a broad evaluation of multimodal reasoning capabilities, including image recognition, multi-step reasoning, and understanding the human creator’s intent. We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse categories, including hand-drawn and digital images created by nine contributors. Samples are presented in Table 1. Notably, GPT-4V, the most powe…

1 paper | 1 benchmark | Images, Texts

EAPD (Expert-labeled Aesthetics Perception Database)

An expert benchmark aiming to comprehensively evaluate the aesthetic perception capacities of MLLMs.

1 paper | 0 benchmarks | Images, Texts

HarmfulTasks (Harmful and Malicious Tasks for LLMs in Jailbreaking Prompts)

This dataset consists of 225 malicious tasks, which were integrated into ten distinct jailbreaking prompts. The malicious tasks were divided into five categories, namely,

1 paper | 0 benchmarks | Texts

Super-CLEVR-3D

Super-CLEVR-3D is a visual question answering (VQA) dataset whose questions concern the explicit 3D configuration of objects in images (i.e., 3D poses, parts, and occlusion). It contains objects from five categories: aeroplanes, buses, bicycles, cars, and motorbikes. The rendered objects come from the CGParts dataset and use the same setting as the Super-CLEVR dataset.

1 paper | 0 benchmarks | 3D, Images, Texts

Vashantor: A Large-scale Multilingual Benchmark Dataset for Automated Translation of Bangla Regional Dialects to Bangla Language

The Vashantor dataset consists of 32,500 sentences from different regions, including Chittagong, Noakhali, Sylhet, Barishal, and Mymensingh. It covers two language formats, "Bangla" and "Banglish". Each combination of region and language format has a specified number of training, testing, and validation samples. The dataset details are as follows:

1 paper | 0 benchmarks | Texts

Aerial Landmarks Recognition Dataset

We filter the landmarks in the Google Landmarks dataset, match them with their OpenStreetMap polygons, and keep only those located in the United States, resulting in 602 landmarks. Then, we obtain the latest high-resolution aerial images of these polygons through the National Agriculture Imagery Program (NAIP) of the United States Department of Agriculture (USDA). Finally, we construct multiple-choice questions asking for the name of each landmark, with incorrect answer options drawn from other landmarks in the same category.

1 paper | 0 benchmarks | Images, Texts
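The question-construction step for the aerial landmarks dataset can be sketched as follows. This is a minimal illustration under assumptions the description does not state: the number of distractors (three), the question wording, skipping landmarks whose category has too few other members, and the toy input names are all hypothetical.

```python
import random
from collections import defaultdict


def build_questions(
    landmarks: list[tuple[str, str]], num_distractors: int = 3, seed: int = 0
) -> list[dict]:
    """Build one multiple-choice question per (name, category) landmark.

    Incorrect options are drawn from other landmarks in the same category.
    """
    rng = random.Random(seed)  # seeded for reproducible option order
    by_category: dict[str, list[str]] = defaultdict(list)
    for name, category in landmarks:
        by_category[category].append(name)

    questions = []
    for name, category in landmarks:
        others = [n for n in by_category[category] if n != name]
        if len(others) < num_distractors:
            continue  # hypothetical policy: too few same-category distractors
        options = rng.sample(others, num_distractors) + [name]
        rng.shuffle(options)
        questions.append({
            "question": "Which landmark is shown in this aerial image?",  # hypothetical wording
            "options": options,
            "answer": options.index(name),  # index of the correct option
        })
    return questions


# Toy input: four landmarks in one category, one in another.
sample = [("Bridge A", "bridge"), ("Bridge B", "bridge"), ("Bridge C", "bridge"),
          ("Bridge D", "bridge"), ("Dam A", "dam")]
qs = build_questions(sample)
print(len(qs))  # 4, because "Dam A" has no same-category distractors
```

Drawing distractors from the same category keeps the wrong options plausible, so a model cannot answer simply by recognizing the type of structure in the image.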

PINS100

We assembled PINS100, a benchmark of electronic component pinouts containing 100 common parts frequently used in circuits on high-traffic electronics tutorial websites such as the Arduino Project Hub and Autodesk Tinkercad Circuits. Components range from 2 to 40 pins and span a large assortment of part categories, including passives (e.g., resistors and capacitors), inputs (e.g., switches), outputs (e.g., LEDs, motors, relays), sensors, integrated circuits, power regulators, logic (e.g., 7400-series AND and OR gates), and microcontrollers (e.g., Arduino, Raspberry Pi).

1 paper | 0 benchmarks | Texts

MICRO25

To assess a model’s ability to create microcontroller-driven electronic devices, we developed MICRO25, a benchmark of 25 tasks intended for the common Arduino microcontroller ecosystem. These tasks, shown in Table 2, span five core categories: input, interface protocols, output, sensors, and logic. Each task tests either a specific fundamental competency required to build basic microcontroller-driven electronic devices or the integration of several competencies into larger design flows.

1 paper | 0 benchmarks | Texts

InfoLossQA

The goal of InfoLossQA is to generate a series of QA pairs that reveal to lay readers what information a simplified text lacks compared to its original.

1 paper | 0 benchmarks | Texts

NC-SentNoB (Noise Classification on SentNoB)

This is a multilabel dataset used for noise identification in the paper "A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts", accepted at the 9th Workshop on Noisy and User-generated Text (W-NUT 2024), co-located with EACL 2024.

1 paper | 0 benchmarks | Texts

TARA

TARA is a dataset for tool-augmented reward modeling, which includes comprehensive comparison data of human preferences and detailed tool invocation processes.

1 paper | 0 benchmarks | Texts

bigscience/P3 (bigscience/P3, split='ai2_arc_ARC_Challenge_pick_the_most_correct_option')

This dataset consists of challenging reasoning questions in multiple-choice format.

1 paper | 0 benchmarks | Texts

Concept-1K

Concept-1K contains 1,023 novel concepts from six domains: economy, culture, science and technology, environment, education, and health and medical. It has 16,653 QA pairs for training and testing, corresponding to 16,653 knowledge points from the 1,023 concepts. It was proposed to evaluate forgetting in large language models and the effectiveness of incremental learning algorithms.

1 paper | 0 benchmarks | Texts

LLM Generated Spear Phishing Emails

This dataset comprises high-quality, targeted spear-phishing emails generated by a proprietary system that combines LLMs with knowledge graphs. The dataset is released primarily to promote and facilitate further research on spear-phishing detection.

1 paper | 0 benchmarks | Texts
