19,997 machine learning datasets
This dataset contains five notable histological artifacts: blur, blood (hemorrhage), air bubbles, folded tissue, and damaged tissue. A description of the dataset can be found at https://zenodo.org/records/10809442.
The dataset used to pre-train NuNER, introduced in the paper "NuNER: Entity Recognition Encoder Pre-training via LLM-Annotated Data".
The Polaris dataset offers a large-scale, diverse benchmark for evaluating metrics for image captioning, surpassing existing datasets in terms of size, caption diversity, number of human judgments, and granularity of the evaluations. It includes 131,020 generated captions and 262,040 reference captions. The generated captions have a vocabulary of 3,154 unique words and the reference captions have a vocabulary of 22,275 unique words.
DrivAerNet is a large-scale, high-fidelity CFD dataset of 3D industry-standard car shapes designed for data-driven aerodynamic design. It comprises 4000 high-quality 3D car meshes and their corresponding aerodynamic performance coefficients, alongside full 3D flow field information.
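A rough illustration of what one sample in such a dataset might bundle together is sketched below in Python; the field names, shapes, and values are hypothetical assumptions, not taken from the DrivAerNet release.

```python
from dataclasses import dataclass
import numpy as np

# Hypothetical per-sample record for a CFD car dataset; field names, shapes,
# and values are illustrative only, not the official DrivAerNet format.
@dataclass
class CarSample:
    mesh_vertices: np.ndarray     # (N, 3) surface mesh geometry
    drag_coefficient: float       # aerodynamic performance coefficient (Cd)
    flow_field: np.ndarray        # (M, 7) volumetric points: x, y, z, u, v, w, p

sample = CarSample(np.zeros((100_000, 3)), 0.29, np.zeros((500_000, 7)))
print(f"Cd = {sample.drag_coefficient}")
```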
NPHardEval4V is a dynamic reasoning benchmark designed to evaluate the reasoning capabilities of Multimodal Large Language Models (MLLMs).
ImageNet-D contains 4,835 test images featuring diverse backgrounds (3,764), textures (498), and materials (573). Generated with diffusion models, ImageNet-D achieves higher image fidelity and collection efficiency than prior work. Evaluation results show that ImageNet-D causes a significant accuracy drop for a range of vision models, from the standard ResNet visual classifier to recent foundation models such as CLIP and MiniGPT-4, reducing their accuracy by up to 60%.
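A quick sanity check that the stated per-category counts add up to the 4,835 test images; the figures are taken from the description above and the variable names are ours.

```python
# Reported ImageNet-D composition (numbers from the description above).
composition = {"background": 3_764, "texture": 498, "material": 573}

total = sum(composition.values())
assert total == 4_835  # matches the stated number of test images

for category, count in composition.items():
    print(f"{category}: {count} images ({count / total:.1%})")
```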
We introduce a Vietnamese speech recognition dataset in the medical domain comprising 16 hours of labeled medical speech, 1,000 hours of unlabeled medical speech, and 1,200 hours of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world's largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms, and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country.
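The stated durations break down as follows; this is a minimal Python sketch using only the numbers quoted above.

```python
# Reported VietMed audio durations in hours (from the description above).
durations_h = {
    "labeled medical speech": 16,
    "unlabeled medical speech": 1_000,
    "unlabeled general-domain speech": 1_200,
}

total_h = sum(durations_h.values())
print(f"Total audio: {total_h} hours")  # 2,216 hours
for subset, hours in durations_h.items():
    print(f"  {subset}: {hours} h ({hours / total_h:.1%})")
```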
The MA-52 dataset provides a whole-body perspective, including gestures and upper- and lower-limb movements, aiming to reveal comprehensive micro-action cues. In detail, MA-52 contains 52 micro-action categories along with seven body-part labels, and encompasses a full array of realistic and natural micro-actions, covering 205 participants and 22,422 video instances collated from psychological interviews.
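A minimal sketch of how one annotated MA-52 instance might be represented; the field names and values are hypothetical, not taken from the official release.

```python
from dataclasses import dataclass

# Hypothetical annotation record for one MA-52 video instance; field names
# and values are illustrative only.
@dataclass
class MicroActionInstance:
    video_id: str
    participant_id: int          # one of 205 participants
    micro_action_label: int      # one of 52 micro-action categories
    body_part_label: int         # one of 7 body-part labels

sample = MicroActionInstance("interview_0001_clip_042", 17, 35, 3)
print(sample)
```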
Tran Viet Khoa, Do Hai Son, Dinh Thai Hoang, Nguyen Linh Trung, Tran Thi Thuy Quynh, Diep N. Nguyen, Nguyen Viet Ha, and Eryk Dutkiewicz, "Collaborative Learning for Cyberattack Detection in Blockchain Networks," IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 54, no. 7, pp. 3920-3933, Apr. 2024.
Emotion recognition is a higher-level approach, or special case, of sentiment analysis. In this task, the result is not expressed as a polarity (positive or negative) or as a rating (from 1 to 5), but at a more detailed level of sentiment analysis in which results are expressed as emotions such as sadness, enjoyment, anger, disgust, fear, and surprise. Emotion recognition plays a critical role in measuring the brand value of a product by recognizing specific emotions in customers' comments. In this study, we achieved two targets. First, we built a standard Vietnamese Social Media Emotion Corpus (UIT-VSMEC) with 6,927 human-annotated sentences and six emotion labels, contributing to emotion recognition research in Vietnamese, which is a low-resource language in Natural Language Processing (NLP). Second, we assessed and measured machine learning and deep neural network models on UIT-VSMEC. As a result, the Convolutional Neural Network (CNN) model achieved the best performance among the evaluated models.
The LLM-Seg40K dataset contains 14K images in total. The dataset is divided into training, validation, and test sets, containing 11K, 1K, and 2K images, respectively. In the training split, each image has 3.95 questions on average and the average question length is 15.2 words. The training set contains 1,458 different categories in total.
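A quick check of the stated split sizes and a rough estimate of the number of training questions; the numbers are taken from the description above, and 3.95 questions per image is the reported training average.

```python
# Reported LLM-Seg40K splits (from the description above).
splits = {"train": 11_000, "val": 1_000, "test": 2_000}

assert sum(splits.values()) == 14_000  # matches the stated 14K total

avg_questions_per_image = 3.95         # training split average
approx_train_questions = splits["train"] * avg_questions_per_image
print(f"~{approx_train_questions:,.0f} training questions")  # ~43,450
```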
In recent years, visual question answering (VQA) has attracted attention from the research community because of its highly promising applications (such as virtual assistance in intelligent cars, assistive devices for blind people, or information retrieval from document images using natural language queries) and its challenges. The VQA task requires methods capable of fusing information from questions and images to produce appropriate answers. Neural visual question answering models have achieved tremendous growth on large-scale datasets, which are mostly for resource-rich languages such as English. However, available datasets narrow the VQA task to an answer-selection or answer-classification task. We argue that this form of VQA is far from human ability and eliminates the challenge of the answering aspect of the VQA task by selecting answers rather than generating them. In this paper, we introduce the OpenViVQA (Open-domain Vietnamese Visual Question Answering) dataset.
EVJVQA, the first multilingual Visual Question Answering dataset covering three languages (English, Vietnamese, and Japanese), is released as part of this task. EVJVQA includes question-answer pairs created by humans for a set of images taken in Vietnam, with each answer created from the input question and the corresponding image. EVJVQA consists of 33,000+ question-answer pairs for evaluating mQA models.
To the best of our knowledge, the dataset used in this project is the largest crack segmentation dataset so far. It contains around 11,200 images merged from 12 publicly available crack segmentation datasets.
FollowIR tests whether retrieval models can handle queries that contain fine-grained instructions.
The Cat Facial Landmarks in the Wild (CatFLW) dataset contains 2,079 images of cats' faces in various environments and conditions, annotated with 48 facial landmarks and a bounding box on the cat's face.
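A minimal sketch of what a single CatFLW annotation might look like; the field names and array layout are illustrative assumptions, not the dataset's actual file format.

```python
import numpy as np

# Hypothetical representation of one CatFLW annotation; field names and array
# layout are illustrative only, not taken from the official release.
annotation = {
    "image": "cat_0001.jpg",
    "bbox_xywh": np.array([120, 80, 340, 360]),      # face bounding box
    "landmarks": np.zeros((48, 2)),                  # 48 (x, y) facial landmarks
}

assert annotation["landmarks"].shape == (48, 2)
print(f"{annotation['image']}: {len(annotation['landmarks'])} landmarks")
```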
We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and metadata. Together with the corpus, we provide annotations on case outcomes for 638 cases, done by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4, and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.
The Human-Animal-Cartoon (HAC) dataset consists of seven actions ('sleeping', 'watching tv', 'eating', 'drinking', 'swimming', 'running', and 'opening door') performed by humans, animals, and cartoon figures, forming three different domains. The 3,381 video clips were collected from the internet, with around 1,000 per domain, and three modalities are provided in the dataset: video, audio, and optical flow.
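A small sketch of the domain and modality structure described above; the nested layout is our own illustration, not the dataset's on-disk organization.

```python
from itertools import product

# HAC structure as described above; the layout below is illustrative only.
domains = ["human", "animal", "cartoon"]
modalities = ["video", "audio", "optical_flow"]
actions = ["sleeping", "watching tv", "eating", "drinking",
           "swimming", "running", "opening door"]

# One stream per (domain, modality) pair: 3 x 3 = 9 combinations.
for domain, modality in product(domains, modalities):
    print(f"{domain}/{modality}: {len(actions)} action classes")
```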
EGSet12 is a small dataset of twelve original, real solo electric guitar performances (31.65 seconds average duration, totaling 379.8 seconds). These pieces were composed by a professional musician and guitar player, showcasing the full tonal range of the electric guitar across diverse melodies and chord complexities. EGSet12 encompasses a broad spectrum of styles, including pop, funk, jazz, and twelve-tone, reflecting varied tonalities, keys, rhythms, and modes. EGSet12 was performed by a single professional guitarist using a Sire T7 Telecaster guitar and a Yamaha B15 amplifier. The performance setup allowed the performer to freely select the guitar's volume and tone knob settings, as well as techniques like alternate picking, hybrid picking, and palm muting. We captured the performances using an ECM8000 microphone positioned 15 centimeters from the amplifier and connected to a UMC202HD audio interface (original sampling rate of 48,000 Hz; each file's stereo channels are copies of each other).
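A quick arithmetic check that the stated average and total durations are consistent; the numbers come from the description above.

```python
# Reported EGSet12 statistics (from the description above).
num_pieces = 12
avg_duration_s = 31.65

total_s = num_pieces * avg_duration_s
print(f"Total duration: {total_s:.1f} s")  # 379.8 s, matching the stated total
```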
MMToM-QA is the first multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds. MMToM-QA consists of 600 questions. Each question is paired with a clip of the full activity in a video (as RGB-D frames), as well as a text description of the scene and the actions taken by the person in that clip. All questions have two choices. The questions are categorized into seven types, evaluating belief inference and goal inference in rich and diverse situations. Each belief inference type has 100 questions, totaling 300 belief questions; each goal inference type has 75 questions, totaling 300 goal questions. The questions are paired with 134 videos of a person looking for daily objects in household environments.
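The stated counts imply three belief-inference types and four goal-inference types; a quick check in Python using only the numbers quoted above.

```python
# Reported MMToM-QA question breakdown (derived from the description above).
belief_types, per_belief_type = 3, 100   # 300 belief questions / 100 per type
goal_types, per_goal_type = 4, 75        # 300 goal questions / 75 per type

belief_total = belief_types * per_belief_type    # 300
goal_total = goal_types * per_goal_type          # 300
assert belief_total + goal_total == 600          # matches the stated 600 questions
assert belief_types + goal_types == 7            # matches the seven question types
print(belief_total, goal_total)
```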