Datasets

19,997 machine learning datasets


GIS (Github Issue Similarity)

This dataset can be used for semantic textual similarity tasks. It consists of duplicate and non-duplicate GitHub issues. It has 18,565, 1,547, and 1,548 samples in the train, validation, and test sets, respectively.

2 papers · 0 benchmarks
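
As a quick check on the split sizes quoted above, the following sketch computes the total sample count and the share of each split. The variable names are illustrative only and are not part of the dataset's own tooling.

```python
# Split sizes as stated in the GIS description above.
split_sizes = {"train": 18565, "validation": 1547, "test": 1548}

# Total number of samples across all three splits.
total = sum(split_sizes.values())

# Fraction of the whole dataset that each split represents.
fractions = {name: size / total for name, size in split_sizes.items()}

for name, size in split_sizes.items():
    print(f"{name}: {size} samples ({fractions[name]:.1%})")
print(f"total: {total} samples")
```

The split works out to roughly 86% train and about 7% each for validation and test.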

GVLQA (Graph Vision-Language Question-Answering)

GVLQA is the first vision-language QA dataset for general graph reasoning. It contains a base set, GVLQA-BASE, and four image-augmented subsets (GVLQA-AUGLY, GVLQA-AUGNO, GVLQA-AUGNS, and GVLQA-AUGET) whose samples correspond to those of the base set. It covers 7 graph reasoning tasks: cycle detection, connectivity, topological ordering, shortest path, maximum flow, bipartite matching, and Hamiltonian path. It can be used to (1) evaluate the graph reasoning capabilities of VLMs or LLMs, and (2) serve as a pretraining dataset that helps models acquire fundamental graph comprehension and reasoning abilities.

2 papers · 0 benchmarks · Graphs, Images
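
To make three of the listed task types concrete, here is a minimal sketch of reference solvers for cycle detection, connectivity, and topological ordering. It assumes graphs given as plain edge lists over integer node ids; this representation and the function names are hypothetical and do not reflect the actual GVLQA file format or evaluation code.

```python
from collections import deque


def has_cycle(num_nodes, edges):
    """Return True if the undirected graph contains a cycle (union-find)."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            # Both endpoints are already connected, so this edge closes a cycle.
            return True
        parent[root_u] = root_v
    return False


def is_connected(num_nodes, edges, source, target):
    """Return True if target is reachable from source in the undirected graph (BFS)."""
    adjacency = [[] for _ in range(num_nodes)]
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False


def topological_order(num_nodes, directed_edges):
    """Return one topological order of a directed graph, or None if it has a cycle (Kahn's algorithm)."""
    adjacency = [[] for _ in range(num_nodes)]
    in_degree = [0] * num_nodes
    for u, v in directed_edges:
        adjacency[u].append(v)
        in_degree[v] += 1
    queue = deque(node for node in range(num_nodes) if in_degree[node] == 0)
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for neighbor in adjacency[node]:
            in_degree[neighbor] -= 1
            if in_degree[neighbor] == 0:
                queue.append(neighbor)
    return order if len(order) == num_nodes else None
```

Solvers like these could serve as ground-truth checkers when scoring a model's answers, under the stated assumption about the graph encoding.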

EyeDentify

EyeDentify is a dataset specifically designed for pupil diameter estimation from webcam images.

2 papers · 0 benchmarks

DART-Math-Hard

🎯 DART-Math

2 papers · 0 benchmarks · Texts

VinDr-Mammo

VinDr-Mammo is a large-scale benchmark dataset of full-field digital mammography consisting of 5,000 four-view exams with breast-level assessment and finding annotations. Each exam was independently double read, with any discordance resolved through arbitration by a third radiologist.

2 papers · 0 benchmarks · Images

FuLG

FuLG is a comprehensive Romanian language corpus comprising 150 billion tokens, carefully extracted from Common Crawl. This extensive dataset is the result of rigorous filtering and deduplication processes applied to 95 Common Crawl snapshots. The compressed dataset is 289 GB in size.

2 papers · 0 benchmarks · Texts

LatamXIX (19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction)

LatamXIX is a novel dataset of 19th-century Latin American press texts that addresses the lack of specialized corpora for historical and linguistic analysis in this region.

2 papers · 0 benchmarks

MOMAland

MOMAland is an open source Python library for developing and comparing multi-objective multi-agent reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API.

2 papers · 0 benchmarks

TurEV-DB

2 papers · 0 benchmarks

RefinedWeb

2 papers · 0 benchmarks

MultiOOD (Multimodal Out-of-Distribution Detection Benchmark)

MultiOOD is the first benchmark for multimodal OOD detection and covers diverse dataset sizes and modalities. MultiOOD comprises five video datasets with over 85,000 video clips in total. The datasets vary in the number of classes, ranging from 7 to 229, and in size, spanning from 3k to 57k. Video, optical flow, and audio are used as the different modalities.

2 papers · 0 benchmarks · Audio, Videos

PetFace

PetFace is a large-scale animal face re-identification dataset that includes 257,484 unique individuals across 13 families and 319 breeds. PetFace has fine-grained annotations (sex, breed, color, and pattern).

2 papers · 0 benchmarks

RaindropClarity (A Dual-Focused Dataset for Day and Night Raindrop Removal)

Existing raindrop removal datasets have two shortcomings. First, they consist of images captured by cameras with a focus on the background, leading to the presence of blurry raindrops. To our knowledge, none of these datasets include images where the focus is specifically on raindrops, which results in a blurry background. Second, these datasets predominantly consist of daytime images, thereby lacking nighttime raindrop scenarios. Consequently, algorithms trained on these datasets may struggle to perform effectively in raindrop-focused or nighttime scenarios. The absence of datasets specifically designed for raindrop-focused and nighttime raindrops constrains research in this area. In this paper, we introduce a large-scale, real-world raindrop removal dataset called Raindrop Clarity. Raindrop Clarity comprises 15,186 high-quality pairs/triplets (raindrops, blur, and background) of images with raindrops and the corresponding clear background images. There are 5,442 daytime raindrop imag

2 papers · 0 benchmarks · Images

4SKST (4 Sketch Style)

This is the 4 Sketch Style (4SKST) dataset from the research paper "Semi-supervised reference-based sketch extraction using a contrastive learning framework". The dataset consists of sketches in four different styles paired with color images.

2 papers · 0 benchmarks

ivrit.ai (Database of Hebrew Audio and Text Content)

ivrit.ai is a database of Hebrew audio and text content.

2 papers · 0 benchmarks

NoW (Noise of Web)

Noise of Web (NoW) is a challenging noisy correspondence learning (NCL) benchmark for robust image-text matching/retrieval models. It contains 100K image-text pairs consisting of website pages and multilingual website meta-descriptions (98,000 pairs for training, 1,000 for validation, and 1,000 for testing). NoW has two main characteristics: it has no human annotations, and its noisy pairs are naturally captured. The source image data of NoW is obtained by taking screenshots when accessing web pages on a mobile user interface (MUI) at 720 $\times$ 1280 resolution, and we parse the meta-description field in the HTML source code as the captions. In NCR (the predecessor of NCL), each image in all datasets was preprocessed using the Faster-RCNN detector provided by the Bottom-up Attention Model to generate 36 region proposals, and each proposal was encoded as a 2048-dimensional feature. Thus, following NCR, we release the features instead of raw images for fair comparison. However, we cannot just

2 papers · 0 benchmarks · Images, Texts

ParaMAWPS (Paraphrased Math Word Problem Solving Repository)

This repository contains the code, data, and models of the paper titled "Math Word Problem Solving by Generating Linguistic Variants of Problem Statements" published in the Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop).

2 papers · 4 benchmarks · Texts

ETH and UCY

2 papers · 0 benchmarks

AVSync15

AVSync15 is a high-quality synchronized audio-video dataset curated from VGGSound. It is carefully curated with both automatic and manual steps, ensuring:

2 papers · 0 benchmarks · Audio, Videos

MathBridge (MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into LaTeX Formulas for Improved Readability)

Understanding sentences that contain mathematical expressions in text form poses significant challenges. To address this, the importance of converting these expressions into a compiled formula is highlighted. For instance, the expression "x equals minus b plus or minus the square root of b squared minus four a c, all over two a" from automatic speech recognition (ASR) is more readily comprehensible when displayed as a compiled formula $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. To develop a text-to-formula conversion system, we can break down the process into text-to-LaTeX and LaTeX-to-formula conversions, with the latter managed by various existing LaTeX engines. However, the former approach has been notably hindered by the severe scarcity of text-to-LaTeX paired data, which presents a significant challenge in this field. In this context, we introduce MathBridge, the first extensive dataset for translating mathematical spoken expressions into LaTeX, to establish a robust baseline for

2 papers · 0 benchmarks
