ViMQ is a Vietnamese dataset of medical questions from patients, with sentence-level and entity-level annotations for the Intent Classification and Named Entity Recognition tasks. It contains Vietnamese medical questions crawled from the online consultation section between patients and doctors on www.vinmec.com, the website of a Vietnamese general hospital. Each consultation consists of a question regarding a specific health issue of a patient and a detailed response provided by a clinical expert. The dataset covers health issues that fall into a wide range of categories, including common illness, cardiology, hematology, cancer, pediatrics, etc. We removed sections where users ask about information about the hospital and selected 9,000 questions for the dataset.
ReSQ is a real-world Spatial Question Answering dataset with human-generated questions built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations.
CoScript is a constrained language planning dataset, which consists of 55,000 scripts.
GeoGLUE is a GeoGraphic Language Understanding Evaluation benchmark, which consists of six geographic text-related tasks, including geographic textual similarity on recall, geographic elements tagging, geographic composition analysis, geographic where what cut, and geographic entity alignment. All tasks' datasets are collected from openly released resources.
AfriQA is a cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages, where relevant passages are retrieved in a high-resource language spoken in the corresponding region and answers are translated into the source language. The dataset enables the development of more equitable QA technology.
CREMP is a resource generated for the rapid development and evaluation of machine learning models for macrocyclic peptides. CREMP contains 36,198 unique macrocyclic peptides and their high-quality structural ensembles generated using the Conformer-Rotamer Ensemble Sampling Tool (CREST).
SYNTH-PEDES is by far the largest person dataset with image-text pairs to date, containing 312,321 identities, 4,791,711 images, and 12,138,157 textual descriptions.
This research examines the case of customers' default payments in Taiwan and compares the predictive accuracy of the probability of default among six data mining methods. From the perspective of risk management, the predictive accuracy of the estimated probability of default is more valuable than the binary classification result - credible or not credible clients. Because the real probability of default is unknown, this study presents a novel Sorting Smoothing Method to estimate it. With the real probability of default as the response variable (Y) and the predictive probability of default as the independent variable (X), the simple linear regression result (Y = A + BX) shows that the forecasting model produced by the artificial neural network has the highest coefficient of determination; its regression intercept (A) is close to zero, and its regression coefficient (B) is close to one. Therefore, among the six data mining techniques, the artificial neural network is the only one that can accurately estimate the real probability of default.
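The evaluation above can be sketched in a few lines. This is a minimal illustration, not the paper's exact procedure: it assumes the Sorting Smoothing Method works by sorting observations by predicted probability and averaging the binary outcomes of the 2n+1 nearest neighbours, then regresses the smoothed "real" probability on the prediction to read off A, B, and the coefficient of determination. The toy data and window size n=50 are invented for the example.

```python
import numpy as np

def sorting_smoothing(pred_prob, defaulted, n=50):
    """Estimate the 'real' probability of default: sort observations by
    predicted probability, then average the binary default outcomes of
    the 2n+1 nearest neighbours (window shrinks at the edges)."""
    order = np.argsort(pred_prob)
    y = defaulted[order].astype(float)
    smoothed = np.array([y[max(0, i - n): i + n + 1].mean()
                         for i in range(len(y))])
    return pred_prob[order], smoothed

# Toy data: outcomes drawn so that the "model" predictions are calibrated
rng = np.random.default_rng(0)
p = rng.uniform(0, 1, 2000)                    # predicted probabilities
yb = (rng.uniform(0, 1, 2000) < p).astype(int) # binary default outcomes

x, y_real = sorting_smoothing(p, yb)
# Simple linear regression Y = A + B*X; a well-calibrated model should
# give an intercept A near 0 and a slope B near 1.
B, A = np.polyfit(x, y_real, 1)
r2 = np.corrcoef(x, y_real)[0, 1] ** 2
print(f"A={A:.3f}  B={B:.3f}  R^2={r2:.3f}")
```

With calibrated toy predictions, the fitted intercept comes out near zero and the slope near one, mirroring the criterion the abstract uses to rank the six methods.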
The GDELT Project is a large-scale initiative that monitors our world by analyzing global news coverage from a wide variety of sources.
This is a medical multiple-choice question dataset with explanations that can be used to interpret the answers. The data comes from the Chinese Pharmacist Examination. Each item has a question, five options, a gold_answer, and a gold_explanation.
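Based on the fields listed above, a record might look like the following. The concrete field layout, question text, and option keys here are hypothetical illustrations, not taken from the released files:

```python
# Hypothetical record mirroring the described fields: a question, five
# options, a gold_answer, and a gold_explanation. Actual key names and
# structure in the released dataset may differ.
record = {
    "question": "Which of the following drugs is a beta-blocker?",
    "options": {"A": "Amoxicillin", "B": "Metoprolol", "C": "Omeprazole",
                "D": "Metformin", "E": "Loratadine"},
    "gold_answer": "B",
    "gold_explanation": "Metoprolol is a selective beta-1 adrenergic "
                        "receptor blocker.",
}

def is_correct(record, predicted_option):
    """Score a model's multiple-choice prediction against gold_answer."""
    return predicted_option == record["gold_answer"]

print(is_correct(record, "B"))  # True
```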
JDsearch is a personalized product search dataset comprised of real user queries and diverse user-product interaction types (clicking, adding to cart, following, and purchasing) collected from JD.com, a popular Chinese online shopping platform. More specifically, the authors sample about 170,000 active users on a specific date, then record all their interacted products and issued queries in one year, without removing any tail users and products. This finally results in roughly 12,000,000 products, 9,400,000 real searches, and 26,000,000 user-product interactions.
This dataset consists of 1,634 biomedical abstracts, expert-annotated for the purpose of extracting information about the efficacy of drug combinations from the scientific literature. Beyond its practical utility, the dataset also presents a unique NLP challenge, as the first relation extraction dataset consisting of variable-length relations. Furthermore, the relations in this dataset predominantly require language understanding beyond the sentence level, adding to the challenge of this task. We provide a promising baseline model (see the paper/repo) and identify clear areas for further improvement. We ask that new methods on this dataset be posted to our public leaderboard to improve visibility: https://leaderboard.allenai.org/drug_combo/submissions/public
ICSI Meeting Corpus in JSON format.
This work introduces Zambezi Voice, an open-source multilingual speech resource for Zambian languages. It contains two collections of datasets: unlabelled audio recordings of radio news and talk show programs (160 hours) and labelled data (over 80 hours) consisting of read speech recorded from text sourced from publicly available literature books. The dataset is created for speech recognition but can be extended to multilingual speech processing research for both supervised and unsupervised learning approaches. To our knowledge, this is the first multilingual speech dataset created for Zambian languages. We exploit pretraining and cross-lingual transfer learning by finetuning the Wav2Vec2.0 large-scale multilingual pre-trained model to build end-to-end (E2E) speech recognition baseline models. The dataset is released publicly under a Creative Commons BY-NC-ND 4.0 license and can be accessed through the project repository.
A Russian dataset of emotional speech dialogues, assembled from roughly 3.5 hours of live speech by actors who voiced pre-assigned emotions in dialogues of about 3 minutes each. Each sample contains the name of its part in the original studio source, a speech file (16,000 or 44,100 Hz) of a human voice, one of 7 emotion labels, and a speech-to-text transcription of the utterance.
The main goal of the data collection is to acquire highly natural conversations that cover a wide variety of styles and scenarios. In total, the presented corpus spans five domains: Food, Hotel, Nightlife, Shopping mall, and Sightseeing. Controlled by our various task settings, the collected dialogues cover between one and four domains per dialogue, and are thus of greatly varying length and complexity. There are 808 single-task dialogues that contain a single venue target and 4,298 multi-task dialogues involving two to four venue targets. These venues usually differ in domain.
This dataset consists of social media polls collected from Weibo, a popular Chinese microblogging platform. It is intended for empirical study of social media polls and analysis of user engagement patterns.
We ran 21 recommender systems on three datasets (BeerAdvocate, LibraryThing, and MovieLens 1M). The output of these recommenders was evaluated using the rec_eval tool. We also measured statistically significant improvements using a permutation test. The output of both tools can be found in the data.
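The significance testing mentioned above can be sketched as a paired sign-flip permutation test on per-query metric scores. This is a generic illustration, not the exact procedure or data used in this resource; the per-query scores below are invented:

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Two-sided paired (sign-flip) permutation test on per-query metric
    scores of two systems. Under the null hypothesis the sign of each
    per-query difference is arbitrary, so we randomly flip signs and
    count how often the permuted mean difference is at least as extreme
    as the observed one."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    observed = abs(diff.mean())
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diff.size))
    perm_means = np.abs((signs * diff).mean(axis=1))
    return (perm_means >= observed).mean()

# Hypothetical per-query scores for two recommenders
a = np.array([0.61, 0.55, 0.70, 0.64, 0.59, 0.73, 0.66, 0.58, 0.69, 0.62])
b = np.array([0.52, 0.50, 0.63, 0.60, 0.55, 0.65, 0.61, 0.51, 0.60, 0.57])
p_value = paired_permutation_test(a, b)
print(f"p-value: {p_value:.4f}")
```

Because system A beats system B on every query here, very few random sign assignments produce a mean difference as large as the observed one, so the p-value is small.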
TREC Submissions for all Ad Hoc Retrieval runs.
Extended test cases for APPS, as well as generated code.