3,148 machine learning datasets
To effectively evaluate OmniCount across open-vocabulary, supervised, and few-shot counting tasks, a dataset covering a broad spectrum of visual categories, with multiple instances and classes per image, is essential. Existing counting datasets focus primarily on a single object category, such as humans or vehicles, and fall short for multi-label object counting. Multi-class datasets like MS COCO exist, but their sparse object appearances limit their utility for counting. To address this gap, we created a new dataset of 30,230 images spanning 191 diverse categories, including kitchen utensils, office supplies, vehicles, and animals. With per-image instance counts ranging from 1 to 160 and an average count of 10, the dataset fills this void and establishes a benchmark for assessing counting models in varied scenarios.
The MPII Human Pose Descriptions dataset extends the widely used MPII Human Pose Dataset with rich textual annotations. These annotations are generated by several state-of-the-art large language models (LLMs) and include detailed descriptions of the activities being performed, the number of people present, and their specific poses.
WikiFactDiff is a dataset designed as a resource for performing atomic factual knowledge updates on language models, with the goal of aligning them with current knowledge. It describes the evolution of factual knowledge between two dates, named T_old and T_new, in the form of semantic triples. To enable the evaluation of knowledge editing algorithms (such as ROME, MEND, and MEMIT), these triples are verbalized, and neighboring facts are identified to check for potential bleedover.
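The triple-plus-verbalization structure described above can be illustrated with a minimal sketch. This is not the actual WikiFactDiff schema; the class, field names, and the template in `verbalize` are hypothetical, chosen only to show how a semantic triple whose object changes between T_old and T_new could be turned into a natural-language probe:

```python
# Illustrative sketch, NOT the actual WikiFactDiff schema: a factual update
# is a semantic triple whose object differs between T_old and T_new, plus a
# template-based verbalization used to probe or edit a language model.
from dataclasses import dataclass

@dataclass
class FactUpdate:
    subject: str      # e.g. an entity label
    relation: str     # e.g. "head of government"
    old_object: str   # object value at T_old
    new_object: str   # object value at T_new

def verbalize(fact: FactUpdate, obj: str) -> str:
    """Turn a triple into a natural-language statement (hypothetical template)."""
    return f"The {fact.relation} of {fact.subject} is {obj}."

update = FactUpdate("United Kingdom", "head of government",
                    "Boris Johnson", "Rishi Sunak")
statement = verbalize(update, update.new_object)
# Neighboring facts (triples sharing the relation but with other subjects)
# would be verbalized the same way to check that an edit does not bleed over.
```
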
FABSA is an aspect-based sentiment analysis dataset of customer feedback (Trustpilot, Google Play, and Apple App Store reviews).
A multilingual explainable fact-checking dataset on the 2022 Russia-Ukraine conflict.
Raw negotiation transcripts generated for the paper "Evaluating Language Model Agency through Negotiations". The data include transcripts from self-play (a model negotiates against an independent copy of itself; Section 4.1 of the paper) and cross-play (a model negotiates against a different model; Section 4.2). The dataset comprises 2,926 transcripts (942 self-play, 1,984 cross-play).
The dataset contains two few-shot fine-grained chemical entity extraction datasets, based on the human-annotated ChemNER+ and CHEMET corpora. For each dataset, we randomly sample a subset based on the frequency of each type class: we first set the maximum number of entity mentions $k$ for the most frequent entity type in the dataset, then randomly sample the other types so that the distribution of each type matches the original dataset. We use $k \in \{6, 9, 12, 15, 18\}$ as the potential maximum entity mention counts. The ChemNER+ and CHEMET few-shot datasets contain 52 and 28 types, respectively.
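The proportional sampling procedure described above can be sketched as follows. This is a hypothetical reimplementation, not the authors' code: the function name, the tie-breaking (`max(1, ...)`) choice, and the toy mention list are assumptions made for illustration:

```python
# Hypothetical sketch of the proportional few-shot sampling described above:
# fix k mentions for the most frequent entity type, then scale every other
# type's sample size so the type distribution matches the full dataset.
import random
from collections import Counter

def sample_few_shot(mentions, k, seed=0):
    """mentions: list of (entity_text, type_label) pairs from the full dataset."""
    rng = random.Random(seed)
    counts = Counter(label for _, label in mentions)
    max_count = max(counts.values())
    subset = []
    for label, count in counts.items():
        # Preserve the original proportions: the most frequent type gets k mentions.
        n = max(1, round(k * count / max_count))
        pool = [m for m in mentions if m[1] == label]
        subset.extend(rng.sample(pool, min(n, len(pool))))
    return subset

# Toy corpus: 30 CHEM, 15 PROT, 6 ELEM mentions; with k=6 the subset keeps
# roughly the same 30:15:6 ratio (6 CHEM, 3 PROT, 1 ELEM).
mentions = ([("benzene", "CHEM")] * 30 + [("kinase", "PROT")] * 15
            + [("Fe", "ELEM")] * 6)
subset = sample_few_shot(mentions, k=6)
```
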
The Food Recall Incidents dataset consists of 7,546 short texts (5 to 360 characters each): the titles of food recall announcements (hence referred to as titles), crawled from 24 public food safety authority websites by Agroknow. The texts are written in six languages, with English (6,644) and German (888) the most common, followed by French (8), Greek (4), Italian (1), and Danish (1). Most texts were authored after 2010 and describe recalls of specific food products due to specific hazards. Experts manually classified each text into four groups of classes describing hazards and products at two levels of granularity:
Dataset overview: vanilla.csv contains interactions without specific role-play instructions; boss.csv contains interactions where ChatGPT plays the role of the user's boss; classmate.csv contains interactions with ChatGPT acting as the user's classmate. Each turn was coded for the motive behind the user's response and the perceived naturalness of ChatGPT's responses.
The dataset was collected from DOTA 2 using the OpenDota API via Python. It consists of DOTA 2 in-game chat messages manually categorized into three classes: non-toxic, mild (toxicity), and toxic.
ConQA is a dataset built from the intersection of Visual Genome and MS-COCO. Its goal is to provide a new benchmark for text-to-image retrieval using queries that are shorter and less descriptive than the commonly used MS-COCO or Flickr captions. ConQA consists of 80 queries, divided into 50 conceptual and 30 descriptive queries. A descriptive query mentions some of the objects in the image (for instance, "people chopping vegetables"), while a conceptual query does not mention objects, or refers to them only in a general context (e.g., "working class life").
A large-scale, consumer-to-consumer (C2C) marketplace e-commerce dataset.
This dataset is based on the movie review polarity dataset (v2.0) collected and maintained by Bo Pang and Lillian Lee. Their dataset (we'll call it PL2.0) consists of 1000 positive and 1000 negative movie reviews obtained from the Internet Movie Database (IMDb) review archive.
"My ridiculous dog is amazing." [sentiment: positive]
VlogQA consists of 10,076 question-answer pairs based on 1,230 transcript documents sourced from YouTube, an extensive source of user-uploaded content, covering food and travel topics in the Vietnamese language. The dataset supports research on Vietnamese spoken-based machine reading comprehension.
The $\text{BEAR}$ dataset and its larger version, $\text{BEAR}_{\text{big}}$, are benchmarks for evaluating common factual knowledge contained in language models.
The dataset contains the training and test data for the SOftware Mention Detection challenge. The data is derived from the SoMeSci Knowledge Graph of software mentions.
[Real or Fake]: Fake Job Description Prediction. This dataset contains 18K job descriptions, of which about 800 are fake. The data consist of both textual information and meta-information about the jobs. The dataset can be used to build classification models that learn to identify fraudulent job descriptions.
Overview: The LaMini Dataset is an instruction dataset generated using h2ogpt-gm-oasst1-en-2048-falcon-40b-v2. It is designed for instruction-tuning pre-trained models to specialize them for a variety of downstream tasks.
Dataset Generation