19,997 machine learning datasets
CommonMT is a dataset for evaluating commonsense reasoning in neural machine translation. It contains three types of test suites:
- Lexical ambiguity
- Contextless syntactic ambiguity
- Contextual syntactic ambiguity
The Wino-X dataset is a multilingual collection of Winograd Schemas. It was introduced as a tool for evaluating coreference resolution (CoR) and commonsense reasoning (CSR) capabilities of computational models. The dataset contains schemas in German, French, and Russian, aligned with their English counterparts.
The ScaLA dataset is a linguistic acceptability dataset for the Scandinavian languages: Danish, Norwegian Bokmål, Norwegian Nynorsk, Swedish, Icelandic, and Faroese. Developed as part of the ScandEval benchmarking platform, it consists of sentences in these languages that are either grammatically correct or incorrect, and it is designed to evaluate the ability of language models to distinguish between the two. ScaLA is one of the contributions of the ScandEval project, which aims to advance the state of natural language processing for the Scandinavian languages.
For nearly 30 years, arXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science and everything in between, including mathematics, statistics, electrical engineering, quantitative biology, and economics. This rich corpus of information offers significant, but sometimes overwhelming, depth.
The 8TAGS dataset is a corpus created specifically for the evaluation of sentence representations in Polish. It consists of approximately 50,000 sentences annotated with eight topic labels: film, history, food, medicine, motorization, work, sport, and technology. The dataset was generated automatically by extracting sentences from headlines and short descriptions of articles posted on the Polish social networking site wykop.pl. The corpus contains cleaned, tokenized, unambiguous sentences, each longer than 30 characters and tagged with exactly one of the selected categories. Classification accuracy on this dataset is reported as part of the evaluation of sentence representations in Polish.
This is a dataset for knowledge editing. It contains six tasks: ZsRE, $Wiki_{recent}$, $Wiki_{counterfact}$, WikiBio, ConvSent, and Sanitation. This repo provides the data for the first four tasks; the data for ConvSent and Sanitation can be obtained from their original papers.
Kitsune Network Attack Dataset
This is a collection of nine network attack datasets captured from either an IP-based commercial surveillance system or a network of IoT devices. Each dataset contains millions of network packets and a different cyber attack within it.
ABSTRACT In this project, we propose a new comprehensive, realistic cyber security dataset of IoT and IIoT applications, called Edge-IIoTset, which can be used by machine-learning-based intrusion detection systems in two different modes, namely, centralized and federated learning. Specifically, the proposed testbed is organized into seven layers: the Cloud Computing Layer, Network Functions Virtualization Layer, Blockchain Network Layer, Fog Computing Layer, Software-Defined Networking Layer, Edge Computing Layer, and IoT and IIoT Perception Layer. In each layer, we propose new emerging technologies that satisfy the key requirements of IoT and IIoT applications, such as the ThingsBoard IoT platform, the OPNFV platform, Hyperledger Sawtooth, digital twins, the ONOS SDN controller, Mosquitto MQTT brokers, Modbus TCP/IP, etc. The IoT data are generated from various IoT devices (more than 10 types), such as low-cost digital sensors for sensing temperature and humidity, an ultrasonic sensor, Wate
Taxi speed data at 15-minute intervals from 156 sensors on major roads of Luohu District, Shenzhen, China, from Jan. 1 to Jan. 31, 2015.
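The stated dimensions imply a fixed tensor shape: 31 days at 96 fifteen-minute intervals per day gives 2,976 timesteps across 156 sensors. A minimal sketch of that layout, using a synthetic array (the actual file format and field names are not specified by the description):

```python
import numpy as np

# Implied shape: Jan. 1-31, 2015 at 15-minute resolution
# -> 31 days x 96 intervals/day = 2976 timesteps, 156 sensors.
timesteps = 31 * 24 * 4   # 2976
n_sensors = 156

# Placeholder for the real data; speeds in km/h are an assumption.
rng = np.random.default_rng(0)
speeds = rng.uniform(10.0, 60.0, size=(timesteps, n_sensors))

print(speeds.shape)  # (2976, 156)
```

Any model consuming this dataset (e.g. for traffic forecasting) would typically slide a window over the first axis to form input/target pairs.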
We introduce the Clothing Attribute Dataset for promoting research in learning visual attributes for objects. The dataset contains 1856 images, with 26 ground truth clothing attributes such as "long-sleeves", "has collar", and "striped pattern". The labels were collected using Amazon Mechanical Turk.
The historical color image dataset was collected for the task of automatically estimating the age of historical color photos. Each image is annotated with its associated decade, covering five decades from the 1930s to the 1970s, with 265 images per category.
MetaHate: A Dataset for Unifying Efforts on Hate Speech Detection
MetaHate is a meta-collection of 36 hate speech datasets from social media comments.
CHOCOLATE is a benchmark for detecting and correcting factual inconsistency in generated chart captions. It consists of captions produced by six advanced models, which are categorized into three subsets:
CIDAR contains 10,000 instructions and their outputs. The dataset was created by selecting around 9,109 samples from the Alpagasus dataset and translating them into Arabic using ChatGPT. In addition, around 891 Arabic grammar instructions from the website Ask the Teacher were appended. All 10,000 samples were reviewed by around 12 reviewers.
We introduced this dataset in Points2Surf, a method that turns point clouds into meshes.
We established a 3D evaluation benchmark, 3D MM-Vet, to assess four levels of capability in embodied interaction scenarios, ranging from basic perception to control statement generation.
A large-scale human activity recognition dataset collected in a free-living environment from 151 participants.
For each dataset we provide a short description as well as several characterization metrics: the number of instances (m), number of attributes (d), number of labels (q), cardinality (Card), density (Dens), diversity (Div), average imbalance ratio per label (avgIR), ratio of unconditionally dependent label pairs by chi-square test (rDep), and complexity, defined as m × q × d as in [Read 2010]. Cardinality measures the average number of labels associated with each instance, and density is cardinality divided by the number of labels. Diversity is the number of distinct labelsets present in the dataset divided by the number of possible labelsets. The avgIR measures the average degree of imbalance across all labels; the greater the avgIR, the more imbalanced the dataset. Finally, rDep measures the proportion of label pairs that are dependent at 99% confidence. A broader description of all the characterization metrics and the partition methods used is given in
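The definitions above can be sketched directly on a binary label matrix. The toy matrix `Y` below is illustrative, not taken from any of the listed datasets, and the code follows the stated definitions rather than the benchmark's own implementation (the chi-square-based rDep is omitted):

```python
import numpy as np

# Toy multi-label matrix: m=4 instances, q=3 labels.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [1, 0, 0]])
m, q = Y.shape

card = Y.sum(axis=1).mean()               # Card: avg. labels per instance
dens = card / q                           # Dens: cardinality / num. labels
labelsets = {tuple(row) for row in Y}
div = len(labelsets) / 2 ** q             # Div: present / possible labelsets
counts = Y.sum(axis=0)                    # per-label frequencies
avg_ir = (counts.max() / counts).mean()   # avgIR: mean imbalance ratio

print(card, dens, div)  # 1.5 0.5 0.5
```

For this matrix, Card = 6/4 = 1.5, Dens = 1.5/3 = 0.5, and four of the 2^3 = 8 possible labelsets occur, so Div = 0.5; the per-label imbalance ratios are (1, 1.5, 3), giving avgIR ≈ 1.83.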