19,997 machine learning datasets
19,997 dataset results
Sentiment analysis of codemixed tweets.
Haze4k is a synthesized dataset with 4,000 hazy images, in which each hazy image has the associate ground truths of a latent clean image, a transmission map, and an atmospheric light ma
ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed by an active agent, the action(s) it performs and the corresponding scene locations. ROAD comprises videos originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event.
ProofNet is a benchmark for autoformalization and formal proving of undergraduate-level mathematics. The ProofNet benchmarks consists of 371 examples, each consisting of a formal theorem statement in Lean 3, a natural language theorem statement, and natural language proof. The problems are primarily drawn from popular undergraduate pure mathematics textbooks and cover topics such as real and complex analysis, linear algebra, abstract algebra, and topology.
Android in the Wild (AitW) is a dataset for device-control research which is orders of magnitude larger than current datasets. The dataset contains human demonstrations of device interactions, including the screens and actions, and corresponding natural language instructions. It consists of 715k episodes spanning 30k unique instructions, four versions of Android (v10–13), and eight device types (Pixel 2 XL to Pixel 6) with varying screen resolutions. It contains multi-step tasks that require semantic understanding of language and visual context.
In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs show much worse performance on class-level code generation compared to on standalone method-level code generation benchmarks like HumanEval; and the method-level coding ability cannot equivalently reflect the class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still exhibit dominate superior than other LLMs on class-level code generation, and the second-tier models includes Instruct-Starcoder, Instruct-Codegen, and Wizardcoder with very similar performance.
A benchmark dataset for out-of-distribution detection. ImageNet-1k is in-distribution, while Textures is out-of-distribution.
Certainly! FinanceBench is a groundbreaking benchmark designed for evaluating the performance of large language models (LLMs) in the domain of financial question answering (QA). Here are the key details about FinanceBench:
We contribute an IntentQA dataset with diverse intents in daily social activities.
This dataset details the energy consumption of appliances in a low-energy building over 4.5 months. Data was collected at 10-minute intervals.
AndroidWorld is an environment for building and benchmarking autonomous computer control agents.
Pulsar candidates collected during the HTRU survey. Pulsars are a type of star, of considerable scientific interest. Candidates must be classified in to pulsar and non-pulsar classes to aid discovery.
ImageNet VID is a large-scale public dataset for video object detection and contains more than 1M frames for training and more than 100k frames for validation.
Street Scene is a dataset for video anomaly detection. Street Scene consists of 46 training and 35 testing high resolution 1280×720 video sequences taken from a USB camera overlooking a scene of a two-lane street with bike lanes and pedestrian sidewalks during daytime. The dataset is challenging because of the variety of activity taking place such as cars driving, turning, stopping and parking; pedestrians walking, jogging and pushing strollers; and bikers riding in bike lanes. In addition the videos contain changing shadows, moving background such as a flag and trees blowing in the wind, and occlusions caused by trees and large vehicles. There are a total of 56,847 frames for training and 146,410 frames for testing, extracted from the original videos at 15 frames per second. The dataset contains a total of 205 naturally occurring anomalous events ranging from illegal activities such as jaywalking and illegal U-turns to simply those that do not occur in the training set such as pets be
The SUN09 dataset consists of 12,000 annotated images with more than 200 object categories. It consists of natural, indoor and outdoor images. Each image contains an average of 7 different annotated objects and the average occupancy of each object is 5% of image size. The frequencies of object categories follow a power law distribution.
The exact pre-processing steps used to construct the MNIST dataset have long been lost. This leaves us with no reliable way to associate its characters with the ID of the writer and little hope to recover the full MNIST testing set that had 60K images but was never released. The official MNIST testing set only contains 10K randomly sampled images and is often considered too small to provide meaningful confidence intervals. The QMNIST dataset was generated from the original data found in the NIST Special Database 19 with the goal to match the MNIST preprocessing as closely as possible. QMNIST is licensed under the BSD-style license.
ContactPose is a dataset of hand-object contact paired with hand pose, object pose, and RGB-D images. ContactPose has 2306 unique grasps of 25 household objects grasped with 2 functional intents by 50 participants, and more than 2.9 M RGB-D grasp images.
The PASCAL-Scribble Dataset is an extension of the PASCAL dataset with scribble annotations for semantic segmentation. The annotations follow two different protocols. In the first protocol, the PASCAL VOC 2012 set is annotated, with 20 object categories (aeroplane, bicycle, ...) and one background category. There are 12,031 images annotated, including 10,582 images in the training set and 1,449 images in the validation set. In the second protocol, the 59 object/stuff categories and one background category involved in the PASCAL-CONTEXT dataset are used. Besides the 20 object categories in the first protocol, there are 39 extra categories (snow, tree, ...) included. This protocol is followed to annotate the PASCAL-CONTEXT dataset. 4,998 images in the training set have been annotated.
MNIST8M is derived from the MNIST dataset by applying random deformations and translations to the dataset.
The StarCraft II Learning Environment (S2LE) is a reinforcement learning environment based on the game StarCraft II. The environment consists of three sub-components: a Linux StarCraft II binary, the StarCraft II API and PySC2. The StarCraft II API allows programmatic control of StarCraft II. It can be used to start a game, get observations, take actions, and review replays. PyC2 is a Python environment that wraps the StarCraft II API to ease the interaction between Python reinforcement learning agents and StarCraft II. It defines an action and observation specification, and includes a random agent and a handful of rule-based agents as examples. It also includes some mini-games as challenges and visualization tools to understand what the agent can see and do.