19,997 machine learning datasets
This dataset can be used for semantic textual similarity tasks. It consists of duplicate and non-duplicate GitHub issues, with 18,565, 1,547, and 1,548 samples in the train, validation, and test sets, respectively.
GVLQA is the first vision-language QA dataset for general graph reasoning. It contains a base set, GVLQA-BASE, and four image-augmented subsets (GVLQA-AUGLY, GVLQA-AUGNO, GVLQA-AUGNS, and GVLQA-AUGET) whose samples correspond to those of the base set. It covers 7 graph reasoning tasks: cycle detection, connectivity, topological ordering, shortest path, maximum flow, bipartite matching number, and Hamilton path. Utility: 1) evaluate the graph reasoning capabilities of VLMs or LLMs; 2) serve as a pretraining dataset that helps models acquire fundamental graph comprehension and reasoning abilities.
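To make two of the listed tasks concrete, here is a minimal sketch of cycle detection and topological ordering on a small directed graph. It is not taken from GVLQA: the function name `topological_order`, the edge-list input format, and the toy graphs are assumptions made for illustration, and the dataset's own sample format may differ.

```python
from collections import deque


def topological_order(num_nodes, edges):
    """Return a topological ordering of a directed graph, or None if it has a cycle.

    Hypothetical helper for illustration (Kahn's algorithm); nodes are 0..num_nodes-1
    and edges is a list of (source, target) pairs.
    """
    adjacency = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for src, dst in edges:
        adjacency[src].append(dst)
        in_degree[dst] += 1

    # Start from every node that has no incoming edges.
    queue = deque(n for n in range(num_nodes) if in_degree[n] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in adjacency[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    # If some nodes were never emitted, they lie on or behind a cycle.
    return order if len(order) == num_nodes else None


acyclic = topological_order(4, [(0, 1), (0, 2), (1, 3), (2, 3)])
cyclic = topological_order(3, [(0, 1), (1, 2), (2, 0)])
print(acyclic, cyclic)
```

A single routine answers both questions here: a returned list is a valid ordering, while `None` signals that the graph contains a cycle.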
EyeDentify, a dataset specifically designed for pupil diameter estimation based on webcam images.
🎯 DART-Math
VinDr-Mammo is a large-scale benchmark dataset of full-field digital mammography consisting of 5,000 four-view exams with breast-level assessments and finding annotations. Each exam was independently double read, with any discordance resolved through arbitration by a third radiologist.
FuLG is a comprehensive Romanian language corpus comprising 150 billion tokens, carefully extracted from Common Crawl. This extensive dataset is the result of rigorous filtering and deduplication processes applied to 95 Common Crawl snapshots. The compressed dataset is 289 GB.
A novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region.
MOMAland is an open source Python library for developing and comparing multi-objective multi-agent reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API.
MultiOOD is the first benchmark for Multimodal OOD Detection and covers diverse dataset sizes and modalities. MultiOOD comprises five video datasets with over 85,000 video clips in total. The datasets vary in the number of classes, ranging from 7 to 229, and in size, spanning from 3k to 57k. Video, optical flow, and audio are used as the different modalities.
PetFace is a large-scale animal face re-identification dataset that includes 257,484 unique individuals across 13 families and 319 breeds. PetFace has fine-grained annotation (sex, breeds, color, and patterns).
Existing raindrop removal datasets have two shortcomings. First, they consist of images captured by cameras with a focus on the background, leading to the presence of blurry raindrops. To our knowledge, none of these datasets include images where the focus is specifically on raindrops, which results in a blurry background. Second, these datasets predominantly consist of daytime images, thereby lacking nighttime raindrop scenarios. Consequently, algorithms trained on these datasets may struggle to perform effectively in raindrop-focused or nighttime scenarios. The absence of datasets specifically designed for raindrop-focused and nighttime raindrops constrains research in this area. In this paper, we introduce a large-scale, real-world raindrop removal dataset called Raindrop Clarity. Raindrop Clarity comprises 15,186 high-quality pairs/triplets (raindrops, blur, and background) of images with raindrops and the corresponding clear background images. There are 5,442 daytime raindrop imag
This is the 4 sketch style (4SKST) dataset from the research paper "Semi-supervised reference-based sketch extraction using a contrastive learning framework". The dataset consists of sketches in four different styles, each paired with a color image.
ivrit.ai is a database of Hebrew audio and text content.
Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark for robust image-text matching/retrieval models. It contains 100K image-text pairs consisting of website pages and multilingual website meta-descriptions (98,000 pairs for training, 1,000 for validation, and 1,000 for testing). NoW has two main characteristics: it has no human annotations, and its noisy pairs are naturally captured. The source image data of NoW is obtained by taking screenshots when accessing web pages on a mobile user interface (MUI) with 720 $\times$ 1280 resolution, and we parse the meta-description field in the HTML source code as the captions. In NCR (predecessor of NCL), each image in all datasets was preprocessed using the Faster-RCNN detector provided by the Bottom-up Attention Model to generate 36 region proposals, and each proposal was encoded as a 2048-dimensional feature. Thus, following NCR, we release the features instead of raw images for fair comparison. However, we can not just
This repository contains the code, data, and models of the paper titled "Math Word Problem Solving by Generating Linguistic Variants of Problem Statements" published in the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop).
AVSync15 is a high-quality synchronized audio-video dataset curated from VGGSound. It is carefully curated with both automatic and manual steps, ensuring:
Understanding sentences that contain mathematical expressions in text form poses significant challenges. To address this, the importance of converting these expressions into a compiled formula is highlighted. For instance, the expression ``x equals minus b plus or minus the square root of b squared minus four a c, all over two a'' from automatic speech recognition (ASR) is more readily comprehensible when displayed as a compiled formula $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. To develop a text-to-formula conversion system, we can break down the process into text-to-LaTeX and LaTeX-to-formula conversions, with the latter managed by various existing LaTeX engines. However, the former approach has been notably hindered by the severe scarcity of text-to-LaTeX paired data, which presents a significant challenge in this field. In this context, we introduce MathBridge, the first extensive dataset for translating mathematical spoken expressions into LaTeX, to establish a robust baseline for