19,997 machine learning datasets
A collection of 1000 public domain volumes that were scanned as part of the Google Book Search project. It is being distributed to support research in a variety of disciplines. Each volume comes with the scanned images, OCR output, page tags and basic metadata. The volumes in this dataset are written in 4 languages: English, French, Italian and Spanish. This document describes the organization of the dataset and the file formats.
Despite recent improvements in open-domain dialogue models, state-of-the-art models are trained and evaluated on short conversations with little context. In contrast, the long-term conversation setting has hardly been studied. In this work we collect and release a human-human dataset consisting of multiple chat sessions in which the speaking partners learn about each other's interests and discuss the things they have learnt from past sessions. We show that existing models trained on existing datasets perform poorly in this long-term conversation setting in both automatic and human evaluations, and we study long-context models that can perform much better. In particular, we find that retrieval-augmented methods and methods with an ability to summarize and recall previous conversations outperform the standard encoder-decoder architectures currently considered state of the art.
IISc VEED-Dynamic consists of 200 diverse indoor and outdoor scenes. The videos are rendered using Blender, and the blend files for the scenes are mainly obtained from blendswap and turbosquid. Four different camera trajectories are added to each scene, giving a total of 800 videos. Motion is added to pre-existing objects in the scene, or new objects are added and animated. The videos are rendered at full HD resolution (1920 x 1080) at 30 fps and contain 12 frames each.
FINDSum is a large-scale dataset for long text and multi-table summarization. It is built on 21,125 annual reports from 3,794 companies and has two subsets for summarizing each company’s results of operations and liquidity.
DeePhy is a novel DeepFake Phylogeny dataset consisting of 5040 DeepFake videos generated using three different generation techniques. It is one of the first datasets to incorporate the concept of DeepFake Phylogeny, which refers to generating DeepFakes by applying multiple generation techniques sequentially.
A novel dataset for identifying privacy policy compliance of Android third-party libraries.
Digital-Twin Tracking Dataset (DTTD) is a novel RGB-D dataset to enable further research of the problem and extend potential solutions towards longer ranges and millimeter-level localization accuracy. In total, 103 scenes of 10 common off-the-shelf objects with rich textures are recorded, with each frame annotated with a per-pixel semantic segmentation and ground-truth object poses provided by a commercial motion capture system.
The paper used 500 scanned cover pages (i.e., front pages) of electronic theses and dissertations. The dataset contains several intermediate datasets, briefly discussed in the paper.
The temporal variability in calving front positions of marine-terminating glaciers permits inference on the frontal ablation. Frontal ablation, the sum of the calving rate and the melt rate at the terminus, significantly contributes to the mass balance of glaciers. Therefore, the glacier area has been declared as an Essential Climate Variable product by the World Meteorological Organization. The presented dataset provides the necessary information for training deep learning techniques to automate the process of calving front delineation. The dataset includes Synthetic Aperture Radar (SAR) images of seven glaciers distributed around the globe. Five of them are located in Antarctica: Crane, Dinsmoore-Bombardier-Edgeworth, Mapple, Jorum and the Sjörgen-Inlet Glacier. The remaining glaciers are the Jakobshavn Isbrae Glacier in Greenland and the Columbia Glacier in Alaska. Several images were taken for each glacier, forming a time series. The time series lie in the time span between 1995 an
Non-contrast head/brain CT of patients with head trauma or stroke symptoms.
Traffic signs are among the most important sources of information guiding vehicles, and the detection of traffic signs is an important component of autonomous driving and intelligent transportation systems. Constructing a traffic sign dataset with many samples and sufficient attribute categories will promote the development of traffic sign detection research. In this paper, we propose a new Chinese traffic sign detection benchmark, which adds more than 4,000 real traffic scene images and corresponding detailed annotations to our CCTSDB 2017, and replaces many of the original easily detected images with difficult samples to adapt to the complex and changing detection environment. Because of the increased number of difficult samples, the new benchmark can improve the robustness of the detection network to some extent compared with the old version. At the same time, we create new dedicated test sets and categorize them according to three aspects: category meanings, sign sizes, and weather c
This is the set of graphs used in the Exact track of the PACE 2022 challenge for computing the Directed Feedback Vertex Set. It consists of 200 labelled directed graphs. The graphs range in size from N=512 to N=131072 vertices, with up to 1315170 edges. The graphs are mostly not symmetric (an edge from u->v does not imply an edge from v->u), although some are symmetric. The graph labels are integers ranging from 1 to N.
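The two properties mentioned in this description, symmetry and the feedback vertex set itself, can be illustrated with a minimal Python sketch. It assumes a graph is already loaded as an in-memory list of (u, v) pairs with vertex labels 1 to N; the function names and the toy graph are hypothetical and are not taken from the challenge materials, and the challenge's file format is not parsed here.

```python
from collections import defaultdict


def is_symmetric(edges):
    """Return True if every edge u->v has a matching edge v->u."""
    edge_set = set(edges)
    return all((v, u) in edge_set for (u, v) in edge_set)


def is_feedback_vertex_set(n, edges, removed):
    """Check that deleting `removed` leaves an acyclic directed graph.

    Vertices are assumed to be labelled 1..n. Acyclicity is tested with
    Kahn's algorithm: a graph is acyclic exactly when every remaining
    vertex can be peeled off in topological order.
    """
    removed = set(removed)
    successors = defaultdict(list)
    indegree = {v: 0 for v in range(1, n + 1) if v not in removed}
    for u, v in edges:
        if u in removed or v in removed:
            continue  # edges touching a deleted vertex disappear
        successors[u].append(v)
        indegree[v] += 1

    ready = [v for v, d in indegree.items() if d == 0]
    peeled = 0
    while ready:
        u = ready.pop()
        peeled += 1
        for v in successors[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return peeled == len(indegree)


# Hypothetical toy graph: a directed 3-cycle 1->2->3->1 plus the edge 3->4.
toy_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
print(is_symmetric(toy_edges))
print(is_feedback_vertex_set(4, toy_edges, {1}))
print(is_feedback_vertex_set(4, toy_edges, {4}))
```

This only verifies a candidate set; finding a smallest such set, which is what the Exact track asks for, is a much harder problem and is not attempted here.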
Dataset for 'Jet Flavor Classification in High-Energy Physics with Deep Neural Networks'
This is the dataset used in the PACE 2016 challenge, Track B, which was computing minimal Feedback Vertex Set. This competition focused on exact solutions, i.e. provably minimal feedback vertex sets (and no heuristic solutions). This should not be confused with the PACE 2022 challenge, which focused on directed feedback vertex set, and has its own entries on PapersWithCode (exact and heuristic).
Accurately tracking the six degree-of-freedom pose of an object in real scenes is an important task in computer vision and augmented reality with numerous applications. Although a variety of algorithms for this task have been proposed, it remains difficult to evaluate existing methods in the literature as oftentimes different sequences are used and no large benchmark datasets close to real-world scenarios are available. In this paper, we present a large object pose tracking benchmark dataset consisting of RGB-D video sequences of 2D and 3D targets with ground-truth information. The videos are recorded under various lighting conditions, different motion patterns and speeds with the help of a programmable robotic arm. We present extensive quantitative evaluation results of the state-of-the-art methods on this benchmark dataset and discuss the potential research directions in this field.
The Fetoscopic Placental Vessel Segmentation and Registration (FetReg2021) challenge was organized as part of the MICCAI2021 Endoscopic Vision (EndoVis) challenge. Through the FetReg2021 challenge, we released the first large-scale multi-centre dataset of the fetoscopy laser photocoagulation procedure. The dataset contains 2,718 pixel-wise annotated images (for background, vessel, fetus, tool classes) from 24 different in vivo TTTS fetoscopic surgeries and 24 unannotated video clips containing 9,616 frames for training and testing. The dataset is useful for the development of generalized and robust semantic segmentation and video mosaicking algorithms for long duration fetoscopy videos.
PATIS is a Persian language dataset for intent detection and slot filling.
RuWorldTree is a QA dataset with multiple-choice elementary-level science questions, which evaluate the understanding of core science facts.
RuOpenBookQA is a QA dataset with multiple-choice elementary-level science questions which probe the understanding of core science facts.
CheGeKa is a Jeopardy!-like Russian QA dataset collected from the official Russian quiz database ChGK.