The ESP dataset (Evaluation for Styled Prompt dataset) is a new benchmark for zero-shot domain-conditional caption generation. It evaluates a model's ability to generate diverse, domain-specific language conditioned on the same image. The dataset comprises 4.8k captions for 1k images from the COCO Captions test set, covering five everyday text domains collected via Amazon MTurk: blog, social media, instruction, story, and news.
A Filipino multi-modal language dataset for text+visual tasks, consisting of 351,755 Filipino news articles gathered from Filipino news outlets.
X-Wines is a consistent wine dataset containing 100,646 wine instances and 21 million real user ratings. The data were collected from the open Web in 2022 and pre-processed for wide, free use. Ratings are on a 1–5 scale and were given over a 10-year period (2012–2021) for wines produced in 62 different countries.
The dataset contains 105,811 information-seeking conversations converted from MS MARCO. It was constructed to alleviate the data-scarcity problem in conversational search. Taking multiple intents and contextual information into account, this large-scale, intent-oriented, context-aware dataset was built automatically from the web search session data in MS MARCO. It can be used to train and evaluate conversational search systems.
MoralChoice is a survey dataset for evaluating the moral beliefs encoded in LLMs. It consists of:
- Survey question meta-data: 1,367 hypothetical moral scenarios, each consisting of a description/context and two potential actions:
  - Low-ambiguity moral scenarios (687): one action is clearly preferred over the other.
  - High-ambiguity moral scenarios (680): neither action is clearly preferred.
- Survey question templates: 3 hand-curated question templates.
- Survey responses: outputs from 28 open- and closed-source LLMs.
The CIDII dataset is a binary-classification dataset consisting of two classes: correct information and disinformation related to Islamic issues. It accompanies our paper "Disinformation Detection about Islamic Issues on Social Media Using Deep Learning Techniques", published in the MJCS journal: https://ejournal.um.edu.my/index.php/MJCS/article/view/41935
This dataset is described in the ALTA 2023 Shared Task and associated CodaLab competition.
This dataset is described in the ALTA 2022 Shared Task and associated CodaLab competition.
A review of raw subjective scores and of the data manipulation performed before and after refining Mean Opinion Scores (MOS).
The dataset consists of 3,265 text samples, each the concatenation of the lines spoken by one fictional character. Texts are extracted from 400 theatre plays written by 132 different authors. Overall, the dataset contains 3,419,136 words, a mean of 1,047.2 words per character. Each text entry carries binary labels for the character's gender (male or female) and five personality traits (extraversion, agreeableness, openness, neuroticism, conscientiousness). An auxiliary part of the dataset includes author-level labels for the authors' gender, country of origin, and years of life.
Introduction: The scientific publishing landscape is expanding rapidly, making it challenging for researchers to stay up to date with the evolving literature. Natural Language Processing (NLP) has emerged as a potent approach to automating knowledge extraction from this vast body of publications and preprints. Tasks such as Named-Entity Recognition (NER) and Named-Entity Linking (NEL), in conjunction with context-dependent semantic interpretation, offer promising and complementary approaches to extracting structured information and revealing key concepts. Results: We present the SourceData-NLP dataset, produced through the routine curation of papers during the publication process. A unique feature of this dataset is its emphasis on the annotation of bioentities in figure legends. We annotate eight classes of biomedical entities (small molecules, gene products, subcellular components, cell lines, cell types, tissues, organisms, and diseases) and their role in the experimental design.
This dataset contains 20 multiple-choice questions on various legal topics, covering aspects such as the workings of the European Commission, types of legal documents, court procedures, legal definitions, and European Union and United Kingdom law, among others.
The Second HAREM was an evaluation exercise in Portuguese Named Entity Recognition. It aimed to refine the text-annotation process, building on the First HAREM. Challenges included adapting the guidelines to new texts and establishing a unified document with directives from both editions.
This dataset was taken from the SIGARRA information system at the University of Porto (UP). Each organic unit has its own domain and publishes academic news. We collected a sample of 1,000 news articles and manually annotated 905 of them using the Brat rapid annotation tool. The dataset consists of three files: a CSV file containing news published between 2016-12-14 and 2017-03-01; a ZIP archive with one directory per organic unit, each containing a text file and an annotations file per news article; and an XML file containing the complete set of news in a format similar to that of the HAREM dataset. This dataset is particularly suitable for training named entity recognition models.
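Since the per-article annotations come from the Brat rapid annotation tool, they follow Brat's standoff format (one `.txt` plus one `.ann` file per article). The following is a minimal sketch of reading entity lines from such a file; `parse_brat_entities` is a hypothetical helper, not part of the dataset distribution, and it assumes simple contiguous spans:

```python
def parse_brat_entities(ann_text):
    """Extract entity ('T') lines from Brat standoff annotations.

    Assumes simple contiguous spans ("Label start end"); Brat also
    allows discontinuous spans ("start end;start end"), which this
    sketch does not handle.
    """
    entities = []
    for line in ann_text.splitlines():
        if not line.startswith("T"):
            continue  # skip relation, event, attribute, and note lines
        tid, span, surface = line.split("\t")
        label, start, end = span.split()
        entities.append((tid, label, int(start), int(end), surface))
    return entities
```

Each tuple gives the entity id, its label, its character offsets in the article text, and the annotated surface string.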
PropBankPT (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed of 3,406 sentences and 44,598 tokens translated from the Wall Street Journal. This PropBank was created through a semi-automatic analysis with double-blind annotation followed by adjudication. The resulting dataset contains three levels of information: phrase constituency, grammatical functions, and phrase semantic roles. The main motivation behind this resource was to build a high-quality dataset with semantic information that could support the development of automatic semantic role labelers for Portuguese. Its development started under the METANET4U project (at: http://metanet4u.eu/), whose main goal is to contribute to the establishment of a pan-European digital platform that makes available language resources and services, encompassing both datasets and software tools, for speech and language processing.
Mac-Morpho is a corpus of Brazilian Portuguese texts annotated with part-of-speech tags. Its first version was released in 2003 [1], and since then, two revisions have been made in order to improve the quality of the resource [2, 3]. The corpus is available for download split into train, development and test sections. These are 76%, 4% and 20% of the corpus total, respectively (the reason for the unusual numbers is that the corpus was first split into 80%/20% train/test, and then 5% of the train section was set aside for development). This split was used in [3], and new POS tagging research with Mac-Morpho is encouraged to follow it in order to make consistent comparisons possible.
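The split arithmetic above (an 80%/20% train/test split, then 5% of the train section set aside, yielding 76%/4%/20% of the total) can be sketched as follows; `split_corpus` is a hypothetical helper, not part of the Mac-Morpho distribution, and assumes the corpus is a sequential list of sentences:

```python
def split_corpus(sentences):
    """Reproduce the Mac-Morpho 76%/4%/20% train/dev/test proportions."""
    n = len(sentences)
    # First split: 80% train / 20% test.
    cut = int(n * 0.8)
    train_full, test = sentences[:cut], sentences[cut:]
    # Then the last 5% of the train section becomes development,
    # i.e. 4% of the full corpus, leaving 76% for training.
    dev_cut = int(len(train_full) * 0.95)
    train, dev = train_full[:dev_cut], train_full[dev_cut:]
    return train, dev, test
```

On a corpus of 1,000 sentences this yields 760/40/200 sentences, matching the stated 76%/4%/20% proportions.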
A dataset specifically tailored to the biotech news sector, aiming to transcend the limitations of existing benchmarks. It is rich in complex content, comprising biotech news articles covering a wide range of events and thus providing a more nuanced view of information-extraction challenges.
ALFI (Annotations for Label-Free Images) is a dataset of images and annotations for label-free microscopy imaging. It consists of 29 time-lapse image sequences with various annotations (pixel-wise segmentation masks, object-wise bounding boxes, and tracking information), made publicly available to the scientific community through figshare.
Numbers Station Text to SQL
The CTV-Dataset (CTV stands for Cyclist Top-View) is a trajectory dataset for cyclist behaviour in mixed-traffic environments (also known as shared spaces). It is meant to enlarge the datasets available to the community, focusing on cyclists as the main road users, to support research on understanding and predicting cyclist behaviour in shared spaces. The dataset results from an experiment conducted at TU Clausthal to capture possible interaction scenarios with other road users, such as pedestrians and cars, in shared spaces. The scenarios were recorded with a drone at 4K (3840×2160) resolution and 29.97 fps to ensure high-quality results, and the trajectories were extracted using an in-house computer vision algorithm.