nvBench is a large-scale NL2VIS (natural language to visualization) benchmark containing 25,750 (NL, VIS) pairs from 750 tables over 105 domains, synthesized from (NL, SQL) benchmarks to support the cross-domain natural-language-query-to-visualization task.
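To make the pair structure concrete, here is a minimal sketch of what an (NL, VIS) pair might look like; the field names and chart specification below are hypothetical illustrations, not nvBench's actual schema.

```python
# Hypothetical sketch of an (NL, VIS) pair; field names are
# illustrative and do not reflect nvBench's actual file format.
pair = {
    "nl_query": "Show the number of employees in each department as a bar chart.",
    "vis": {
        "chart_type": "bar",         # mark type requested by the query
        "x": "department",           # categorical axis
        "y": "COUNT(employee_id)",   # aggregated measure
    },
    "source_table": "employees",
}

def requested_chart_type(p):
    """Read off the chart type that the NL query asks for."""
    return p["vis"]["chart_type"]
```

An NL2VIS system is then evaluated on mapping the `nl_query` text to a specification like `vis` for tables it has not seen during training.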
KoBEST is a benchmark for Korean language reasoning. It consists of five Korean-language downstream tasks, designed by professional Korean linguists to require advanced Korean linguistic knowledge.
There are eight essay sets, each generated from a single prompt. Selected essays range in average length from 150 to 550 words per response. Some of the essays depend on source information and others do not. All responses were written by students in grades 7 through 10. All essays were hand graded and double-scored. Each of the eight data sets has its own unique characteristics; the variability is intended to test the limits of your scoring engine's capabilities.
Given 10 minimally contrastive (highly similar) images and a complex description of one of them, the task is to retrieve the correct image. Most of the images are sourced from videos, and both the descriptions and the retrievals are produced by humans.
MAVEN-ERE is a dataset designed for event relation extraction tasks containing 103,193 event coreference chains, 1,216,217 temporal relations, 57,992 causal relations, and 15,841 subevent relations.
DocILE is a large dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features:
MultI-Modal In-Context Instruction Tuning (MIMIC-IT) is a dataset for instruction tuning of multi-modal models, motivated by the Flamingo model's upstream interleaved-format pretraining dataset. Each data sample consists of a queried image-instruction-answer triplet, with the instruction and answer tailored to the image, together with context. The context contains a series of image-instruction-answer triplets that contextually correlate with the queried triplet, emulating the relationship between the context and the queried image-text pair found in the MMC4 dataset.
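As a rough sketch, a sample with its in-context triplets could be represented as follows; the field names and values are assumptions for illustration, not MIMIC-IT's actual format.

```python
# Illustrative structure only: a queried image-instruction-answer
# triplet plus contextually related triplets. Field names are assumed.
sample = {
    "query": {
        "image": "img_001.jpg",
        "instruction": "What is the person in the photo doing?",
        "answer": "They are riding a bicycle along the beach.",
    },
    "context": [
        {
            "image": "img_000.jpg",
            "instruction": "Describe the scene.",
            "answer": "A sunny boardwalk next to the sea.",
        },
    ],
}

def num_context_triplets(s):
    """Count the in-context triplets accompanying the queried triplet."""
    return len(s["context"])
```

During instruction tuning, the model conditions on the `context` triplets before answering the `query` instruction, mirroring the interleaved format described above.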
We scrape data from GooBix, which contains 156 games of 5 × 5 mini crosswords. The goal is not just to solve the task, as more general crosswords can be readily solved with specialized NLP pipelines that leverage large-scale retrieval instead of LMs. Rather, we aim to explore the limits of LMs as general problem solvers that explore their own thoughts and guide their own exploration with deliberate reasoning as a heuristic.
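A 5 × 5 crossword state fits in a very small data structure. The sketch below (our own representation, not GooBix's format) stores the grid as five row strings and exposes the columns, which is what a solver needs to check that across- and down-answers stay mutually consistent.

```python
# Minimal assumed representation of a 5x5 mini crossword state.
EMPTY = "."

def new_grid(n=5):
    """Start with an n x n grid of unfilled cells."""
    return [EMPTY * n for _ in range(n)]

def fill_row(grid, i, word):
    """Return a new grid with row i filled by `word` (non-destructive)."""
    assert len(word) == len(grid[i])
    out = list(grid)
    out[i] = word.upper()
    return out

def column(grid, j):
    """Read column j, so down-clues can be checked against filled rows."""
    return "".join(row[j] for row in grid)
```

Keeping `fill_row` non-destructive makes it easy for a search procedure to branch on candidate answers and backtrack, which is the kind of deliberate exploration the task is meant to probe.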
Multi-Modal Reading (MMR) Benchmark includes 550 annotated question-answer pairs across 11 distinct tasks involving texts, fonts, visual elements, bounding boxes, spatial relations, and grounding, with carefully designed evaluation metrics.
The MAGE dataset provides a large set of generated texts using 27 LLMs from seven different groups: OpenAI GPT, LLaMA, GLM130B, FLAN-T5, OPT, BigScience, and EleutherAI. In total, the dataset contains 432,682 texts, along with two additional sets. The first is an additional test set with texts from unseen domains generated by an unseen model, namely GPT-4. The second set is designed to evaluate the robustness of detectors against paraphrasing attacks. To achieve this, the GPT-3.5-turbo model was employed to paraphrase the sentences from the first set, with all paraphrased texts treated as machine-generated.
AMPS contains over 100,000 problems pulled from Khan Academy and approximately 5 million problems generated from manually designed Mathematica scripts.
This is the home of a collaborative data collection effort by U. Chicago and TTI-Chicago researchers. This is to our knowledge the first collection of American Sign Language fingerspelling data "in the wild," that is in naturally occurring (online) video. The collection consists of two data set releases, ChicagoFSWild and ChicagoFSWild+.
FM-IQA is a question-answering dataset containing over 150,000 images and 310,000 freestyle Chinese question-answer pairs and their English translations.
The Chinese judicial reading comprehension (CJRC) dataset contains approximately 10K documents and almost 50K questions with answers. The documents come from judgment documents and the questions are annotated by law experts.
DiscoFuse was created by applying a rule-based splitting method on two corpora - sports articles crawled from the Web, and Wikipedia. See the paper for a detailed description of the dataset generation process and evaluation.
The QMUL underGround Re-IDentification (GRID) dataset contains 250 pedestrian image pairs. Each pair contains two images of the same individual seen from different camera views. All images are captured from 8 disjoint camera views installed in a busy underground station. The figures beside show a snapshot of each camera view of the station and sample images from the dataset. The dataset is challenging due to variations in pose, colours, and lighting, as well as poor image quality caused by low spatial resolution.
CHALET is a 3D house simulator with support for navigation and manipulation. Unlike existing systems, CHALET supports both a wide range of object manipulations and complex environment layouts consisting of multiple rooms. The range of object manipulations includes the ability to pick up and place objects, toggle the state of objects such as taps or televisions, open or close containers, and insert or remove objects from these containers. In addition, the simulator comes with 58 rooms that can be combined to create houses, including 10 default house layouts. CHALET is therefore suitable for setting up challenging environments for various AI tasks that require complex language understanding and planning, such as navigation, manipulation, instruction following, and interactive question answering.
Griddly is an environment for grid-world based research. It provides a highly optimized game state and rendering engine with a flexible high-level interface for configuring environments. Not only does Griddly offer simple interfaces for single-player, multi-player, and RTS games, but it also provides multiple rendering methods, configurable partial observability, and interfaces for procedural content generation.