Datasets

3,148 machine learning datasets

HumanEval-XL

We introduce HumanEval-XL, a massively multilingual code generation benchmark crafted to address the lack of benchmarks that evaluate generalization across natural languages (NLs). HumanEval-XL connects 23 NLs and 12 programming languages (PLs) and comprises 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing assessment of how well they understand different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
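The stated totals are consistent with a fully parallel design. The following is a minimal sketch of that arithmetic, assuming prompts are spread evenly over every NL-PL pair; the per-pair count is derived here and is not stated in the description.

```python
# Figures taken from the HumanEval-XL description.
NUM_NATURAL_LANGUAGES = 23
NUM_PROGRAMMING_LANGUAGES = 12
TOTAL_PROMPTS = 22_080

# Assumption: every NL-PL pair holds the same number of prompts.
num_pairs = NUM_NATURAL_LANGUAGES * NUM_PROGRAMMING_LANGUAGES
prompts_per_pair, remainder = divmod(TOTAL_PROMPTS, num_pairs)

print(f"NL-PL pairs: {num_pairs}")
print(f"Prompts per pair: {prompts_per_pair} (remainder {remainder})")
```

A remainder of zero means the total divides evenly across the pairs, which is what parallel data would imply.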

6 papers | 0 benchmarks | Texts

EvalCrafter Text-to-Video (ECTV) Dataset

This dataset contains around 10,000 videos generated by various methods using the Prompt list. These videos have been evaluated using the EvalCrafter framework, which assesses generative models across visual, content, and motion qualities using 17 objective metrics and subjective user opinions.

6 papers | 5 benchmarks | Texts, Videos

EarthVQA (A multi-modal multi-task VQA dataset for remote sensing)

Earth vision research typically focuses on extracting geospatial object locations and categories but neglects the exploration of relations between objects and comprehensive reasoning. Based on city planning needs, we develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis. The EarthVQA dataset contains 6,000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded.

6 papers | 2 benchmarks | Images, Texts

WikiNews Dataset (WikiNews Arabic Diacritization Benchmark Dataset)

The WikiNews Arabic Diacritization dataset is a test set composed of 70 WikiNews articles (most are from 2013 and 2014) that cover a variety of themes, namely: politics, economics, health, science and technology, sports, arts, and culture. The articles are evenly distributed among the themes (10 per theme). The articles contain 18,300 words in around 400 sentences (each line is treated as a sentence).

6 papers | 0 benchmarks | Texts

MidiCaps

The MidiCaps dataset [1] is a large-scale dataset of 168,385 MIDI music files with descriptive text captions and a set of extracted musical features.

6 papers | 0 benchmarks | Midi, Texts

READ2016(line-level) (Line-level Handwritten Text Recognition on READ 2016)

This dataset arises from the READ project (Horizon 2020).

6 papers | 4 benchmarks | Images, Texts

GTSinger (GTSinger: A Global Multi-Technique Singing Corpus with Realistic Music Scores for All Singing Tasks)

The scarcity of high-quality and multi-task singing datasets significantly hinders the development of diverse controllable and personalized singing tasks, as existing singing datasets suffer from low quality, limited diversity of languages and singers, absence of multi-technique information and realistic music scores, and poor task suitability. To tackle these problems, we present GTSinger, a large Global, multi-Technique, free-to-use, high-quality singing corpus with realistic music scores, designed for all singing tasks, along with its benchmarks. Particularly, (1) we collect 80.59 hours of high-quality songs, forming the largest recorded singing dataset; (2) 20 professional singers across nine languages offer diverse timbres and styles; (3) we provide controlled comparison and phoneme-level annotations of six singing techniques, helping technique modeling and control; (4) GTSinger offers realistic music scores, assisting real-world musical composition; (5) singing voices are accompa

6 papers | 0 benchmarks | Audio, Music, Speech, Texts

MuChoMusic

MuChoMusic is a benchmark designed to evaluate music understanding in multimodal language models focused on audio. It includes 1,187 multiple-choice questions validated by human annotators, based on 644 music tracks from two publicly available music datasets. These questions cover a wide variety of genres and assess knowledge and reasoning across several musical concepts and their cultural and functional contexts. The benchmark provides a holistic evaluation of five open-source models, revealing challenges such as over-reliance on the language modality and highlighting the need for better multimodal integration.

6 papers | 0 benchmarks | Audio, Music, Texts

GEdit-Bench-EN

GEdit-Bench-EN is a new benchmark, grounded in real-world usage, developed to support more authentic and comprehensive evaluation of image editing models.

6 papers | 3 benchmarks | Images, Texts

Wikipedia Person and Animal Dataset

This dataset gathers 428,748 person and 12,236 animal infoboxes with descriptions, based on the Wikipedia dump (2018/04/01) and Wikidata (2018/04/12).

5 papers | 7 benchmarks | Texts

MemexQA

A large, realistic multimodal dataset consisting of real personal photos and crowd-sourced questions/answers.

5 papers | 1 benchmark | Images, Texts

Species-800

Species-800 is a corpus of species entities based on manually annotated abstracts. It comprises 800 PubMed abstracts that contain identified organism mentions. To increase the taxonomic diversity of mentions in the corpus, the 800 abstracts were collected by selecting 100 abstracts from each of the following 8 categories: bacteriology, botany, entomology, medicine, mycology, protistology, virology and zoology. Species-800 has been annotated with a focus on the species level; however, higher taxa mentions (such as genera, families and orders) have also been considered.

5 papers | 1 benchmark | Texts

LINNAEUS

LINNAEUS is general-purpose dictionary-matching software capable of processing multiple document formats in the biomedical domain (MEDLINE, PMC, BMC, OTMI, text, etc.). It can produce multiple types of output (XML, HTML, tab-separated-value file, or save to a database). It can also act as a server (including load balancing across several servers), allowing clients to request matching over a network. A package with files for recognizing and identifying species names is available for LINNAEUS, showing 94% recall and 97% precision compared to LINNAEUS-species-corpus.

5 papers | 1 benchmark | Texts

Jester (Jokes)

6.5 million anonymous ratings of jokes by users of the Jester Joke Recommender System.

5 papers | 0 benchmarks | Texts

ToLD-Br (Toxic Language Detection for Brazilian Portuguese)

The Toxic Language Detection for Brazilian Portuguese (ToLD-Br) is a dataset with tweets in Brazilian Portuguese annotated according to different toxic aspects.

5 papers | 2 benchmarks | Texts

HowToVQA69M

A dataset of 69,270,581 video clip, question and answer triplets (v, q, a). HowToVQA69M is two orders of magnitude larger than any of the currently available VideoQA datasets.

5 papers | 0 benchmarks | Texts, Videos

TutorialVQA

TutorialVQA is a new type of dataset used to find answer spans in tutorial videos. The dataset includes about 6,000 triples, each consisting of a video, a question, and an answer span, manually collected from screencast tutorial videos with spoken narratives for photo-editing software.

5 papers | 0 benchmarks | Texts, Videos

STACKEX

STACKEX expands beyond the only existing genre (i.e., academic writing) in keyphrase generation tasks.

5 papers | 0 benchmarks | Texts

MATINF (Maternal and Infant Dataset)

The Maternal and Infant (MATINF) Dataset is a large-scale Chinese dataset jointly labeled for classification, question answering and summarization in the domain of maternity and baby care. An entry in the dataset includes four fields: question (Q), description (D), class (C) and answer (A).
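As an illustration of the four-field entry, here is a minimal sketch of a record type. The attribute names, the single-letter keys, and the `from_mapping` helper are hypothetical and do not reflect the dataset's actual file format; the values are placeholders.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class MatinfEntry:
    """One MATINF entry with the four fields named in the description."""
    question: str     # Q
    description: str  # D
    label: str        # C ("class" is a reserved word in Python)
    answer: str       # A

    @classmethod
    def from_mapping(cls, row: Mapping[str, str]) -> "MatinfEntry":
        # Assumption: rows are keyed by the single-letter field codes.
        return cls(
            question=row["Q"],
            description=row["D"],
            label=row["C"],
            answer=row["A"],
        )


# Placeholder values only; not real dataset content.
entry = MatinfEntry.from_mapping(
    {"Q": "<question>", "D": "<description>", "C": "<class>", "A": "<answer>"}
)
print(entry.label)
```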

5 papers | 0 benchmarks | Texts

RUN (The RUN Dataset)

The RUN dataset is based on OpenStreetMap (OSM). The map contains rich layers and an abundance of entities of different types. Each entity is complex and can contain (at least) four labels: name, type, is building=y/n, and house number. An entity can spread over several tiles. As the maps do not overlap, only very few entities are shared among them. The RUN dataset aligns NL navigation instructions to coordinates of their corresponding route on the OSM map.
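The entity labels and the instruction-to-route alignment described above could be modeled roughly as follows. This is a sketch under stated assumptions: the class names, attribute names, tile identifiers, and the coordinate representation are all hypothetical, and the values are placeholders rather than dataset content.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


def parse_building_flag(raw: str) -> bool:
    """Convert the y/n building label into a boolean."""
    value = raw.strip().lower()
    if value not in {"y", "n"}:
        raise ValueError(f"expected 'y' or 'n', got {raw!r}")
    return value == "y"


@dataclass
class MapEntity:
    """A map entity carrying the four labels named in the description."""
    name: Optional[str]
    entity_type: str
    is_building: bool
    house_number: Optional[str]
    # An entity can spread over several tiles (hypothetical identifiers).
    tile_ids: List[str] = field(default_factory=list)


@dataclass
class NavigationExample:
    """An NL instruction aligned to the coordinates of its route."""
    instruction: str
    # Assumption: the route is an ordered list of coordinate pairs.
    route: List[Tuple[float, float]]


entity = MapEntity(
    name="<name>",
    entity_type="<type>",
    is_building=parse_building_flag("y"),
    house_number=None,
    tile_ids=["tile_a", "tile_b"],
)
example = NavigationExample(
    instruction="<instruction>",
    route=[(0.0, 0.0), (1.0, 1.0)],
)
print(entity.is_building, len(example.route))
```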

5 papers | 0 benchmarks | Texts
