The ESP dataset (Evaluation for Styled Prompt dataset) is a new benchmark for zero-shot domain-conditional caption generation. It evaluates a model's ability to generate diverse, domain-specific language conditioned on the same image. The dataset comprises 4.8k captions for 1k images from the COCO Captions test set, covering five everyday text domains collected via Amazon MTurk: blog, social media, instruction, story, and news.
A Filipino multi-modal language dataset for text+visual tasks, consisting of 351,755 Filipino news articles gathered from Filipino news outlets.
X-Wines is a consistent wine dataset containing 100,646 wine instances and 21 million real user ratings. The data were collected from the open Web in 2022 and pre-processed for wide, free use. Ratings are on a 1–5 scale and were given over a 10-year period (2012–2021) for wines produced in 62 different countries.
The dataset contains 105,811 information-seeking conversations converted from MS MARCO. It was constructed to alleviate the data-scarcity problem in conversational search. Taking multiple intents and contextual information into account, this large-scale, intent-oriented, context-aware dataset was built automatically from the web search session data in MS MARCO. It can be used to train and evaluate conversational search systems.
MoralChoice is a survey dataset for evaluating the moral beliefs encoded in LLMs. It consists of:
- Survey question meta-data: 1,367 hypothetical moral scenarios, each consisting of a description/context and two potential actions:
  - Low-ambiguity moral scenarios (687): one action is clearly preferred over the other.
  - High-ambiguity moral scenarios (680): neither action is clearly preferred.
- Survey question templates: 3 hand-curated question templates.
- Survey responses: outputs from 28 open- and closed-source LLMs.
The CIDII dataset is a binary-classification dataset consisting of two classes: correct information and disinformation related to Islamic issues. It accompanies our paper "Disinformation Detection about Islamic Issues on Social Media Using Deep Learning Techniques", published in the MJCS journal: https://ejournal.um.edu.my/index.php/MJCS/article/view/41935
This dataset is described in the ALTA 2023 Shared Task and associated CodaLab competition.
This dataset is described in the ALTA 2022 Shared Task and associated CodaLab competition.
A review of raw subjective scores and of the data manipulation performed before and after refining Mean Opinion Scores (MOS).
The dataset consists of 3,265 text samples, each the concatenation of the lines spoken by one fictional character. Texts are extracted from 400 theatre plays written by 132 different authors. Overall, the dataset contains 3,419,136 words, a mean of 1,047.2 words per character. Each text entry carries binary labels for the character's gender (male or female) and five personality traits (extraversion, agreeableness, openness, neuroticism, conscientiousness). An auxiliary part of the dataset includes author-level labels for the authors' gender, country of origin, and years of life.
Introduction: The scientific publishing landscape is expanding rapidly, making it challenging for researchers to stay up to date with the evolving literature. Natural Language Processing (NLP) has emerged as a potent approach to automating knowledge extraction from this vast body of publications and preprints. Tasks such as Named-Entity Recognition (NER) and Named-Entity Linking (NEL), in conjunction with context-dependent semantic interpretation, offer promising and complementary approaches to extracting structured information and revealing key concepts. Results: We present the SourceData-NLP dataset, produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) and their role in the experimental design.
This dataset contains 20 multiple-choice questions on various legal topics, covering aspects such as the workings of the European Commission, types of legal documents, court procedures, legal definitions, and European Union and United Kingdom law, among others.
The Second HAREM was an evaluation exercise in Portuguese Named Entity Recognition. It aimed to refine the text-annotation process, building on the First HAREM. Challenges included adapting the guidelines to new texts and establishing a unified document with directives from both editions.
This dataset was taken from the SIGARRA information system at the University of Porto (UP). Each organic unit has its own domain and publishes academic news. We collected a sample of 1,000 news articles and manually annotated 905 of them using the Brat rapid annotation tool. The dataset consists of three files: a CSV file containing news published between 2016-12-14 and 2017-03-01; a ZIP archive with one directory per organic unit, each containing a text file and an annotations file per news article; and an XML file containing the complete set of news in a format similar to that of the HAREM dataset. This dataset is particularly suitable for training named entity recognition models.
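Since the per-article annotations come from the Brat rapid annotation tool, they follow Brat's standoff format (one `.txt` plus one `.ann` file per article). The following is a minimal sketch of reading entity lines from such a file; `parse_brat_entities` is a hypothetical helper, not part of the dataset distribution, and it assumes simple contiguous spans:

```python
def parse_brat_entities(ann_text):
    """Extract entity ('T') lines from Brat standoff annotations.

    Assumes simple contiguous spans ("Label start end"); Brat also
    allows discontinuous spans ("start end;start end"), which this
    sketch does not handle.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relation, event, attribute, and note lines
        tid, span, surface = line.split("\t")
        label, start, end = span.split()
        entities.append((tid, label, int(start), int(end), surface))
    return entities
```

Each tuple gives the entity id, its label, its character offsets in the article text, and the annotated surface string.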
PropBankPT (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed of 3,406 sentences and 44,598 tokens translated from the Wall Street Journal. This PropBank was created through a semi-automatic analysis with double-blind annotation followed by adjudication. The resulting dataset contains three levels of information: phrase constituency, grammatical functions, and phrase semantic roles. The main motivation behind this resource was to build a high-quality dataset with semantic information that could support the development of automatic semantic role labelers for Portuguese. Its development started under the METANET4U project (at: http://metanet4u.eu/), whose main goal is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing.
Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003 [1], and since then, two revisions have been made in order to improve the quality of the resource [2, 3]. The corpus is available for download split into train, development and test sections. These are 76%, 4% and 20% of the corpus total, respectively (the reason for the unusual numbers is that the corpus was first split into 80%/20% train/test, and then 5% of the train section was set aside for development). This split was used in [3], and new POS tagging research with Mac-Morpho is encouraged to follow it in order to make consistent comparisons possible.
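The split arithmetic above (an 80%/20% train/test split, then 5% of the train section set aside, yielding 76%/4%/20% of the total) can be sketched as follows; `split_corpus` is a hypothetical helper, not part of the Mac-Morpho distribution, and assumes the corpus is a sequential list of sentences:

```python
def split_corpus(sentences):
    """Reproduce the Mac-Morpho 76%/4%/20% train/dev/test proportions."""
    n = len(sentences)
    # First split: 80% train / 20% test.
    cut = int(n * 0.8)
    train_full, test = sentences[:cut], sentences[cut:]
    # Then the last 5% of the train section becomes development,
    # i.e. 4% of the full corpus, leaving 76% for training.
    dev_cut = int(len(train_full) * 0.95)
    train, dev = train_full[:dev_cut], train_full[dev_cut:]
    return train, dev, test
```

On a corpus of 1,000 sentences this yields 760/40/200 sentences, matching the stated 76%/4%/20% proportions.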
A dataset specifically tailored to the biotech news sector, aiming to transcend the limitations of existing benchmarks. It is rich in complex content, comprising biotech news articles covering a wide range of events and thus providing a more nuanced view of information-extraction challenges.
ALFI (Annotations for Label-Free Images) is a dataset of images and annotations for label-free microscopy imaging. It consists of 29 time-lapse image sequences with various annotations (pixel-wise segmentation masks, object-wise bounding boxes, and tracking information), made publicly available to the scientific community through figshare.
Numbers Station Text to SQL
The CTV-Dataset (CTV stands for Cyclist Top-View) is a trajectory dataset for cyclist behaviour in mixed-traffic environments (also known as shared spaces). It is meant to enlarge the datasets available to the community, focusing on cyclists as the main road users, to support research on understanding and predicting cyclist behaviour in shared spaces. The dataset results from an experiment conducted at TU Clausthal to capture possible interaction scenarios with other road users, such as pedestrians and cars, in shared spaces. The scenarios were recorded with a drone at 4K (3840×2160) resolution and 29.97 fps to ensure high-quality results, and the trajectories were extracted using an in-house computer vision algorithm.