Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Datasets

3,148 machine learning datasets

Filter by Modality

  • Images: 3,275
  • Texts: 3,148
  • Videos: 1,019
  • Audio: 486
  • Medical: 395
  • 3D: 383
  • Time series: 298
  • Graphs: 285
  • Tabular: 271
  • Speech: 199
  • RGB-D: 192
  • Environment: 148
  • Point cloud: 135
  • Biomedical: 123
  • LiDAR: 95
  • RGB Video: 87
  • Tracking: 78
  • Biology: 71
  • Actions: 68
  • 3D meshes: 65
  • Tables: 52
  • Music: 48
  • EEG: 45
  • Hyperspectral images: 45
  • Stereo: 44
  • MRI: 39
  • Physics: 32
  • Interactive: 29
  • Dialog: 25
  • MIDI: 22
  • 6D: 17
  • Replay data: 11
  • Financial: 10
  • Ranking: 10
  • CAD: 9
  • fMRI: 7
  • Parallel: 6
  • Lyrics: 2
  • PSG: 2

3,148 dataset results

Cambridge Law Corpus (The Cambridge Law Corpus: A Dataset for Legal AI Research)

We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and metadata. Together with the corpus, we provide annotations of case outcomes for 638 cases, prepared by legal experts. Using our annotated data, we have trained and evaluated case outcome extraction with GPT-3, GPT-4, and RoBERTa models to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.

4 papers · 0 benchmarks · Texts

MMToM-QA (Multimodal Theory of Mind Question Answering)

MMToM-QA is the first multimodal benchmark to evaluate machine Theory of Mind (ToM), the ability to understand people's minds. MMToM-QA consists of 600 questions. Each question is paired with a clip of the full activity in a video (as RGB-D frames), as well as a text description of the scene and the actions taken by the person in that clip. All questions have two choices. The questions are categorized into seven types (three belief-inference types and four goal-inference types), evaluating belief and goal inference in rich and diverse situations. Each belief-inference type has 100 questions (300 belief questions in total), and each goal-inference type has 75 questions (300 goal questions in total). The questions are paired with 134 videos of a person looking for daily objects in household environments.

4 papers · 0 benchmarks · Images, RGB Video, RGB-D, Texts, Videos

LingOly

This dataset is a benchmark for complex reasoning abilities in large language models, drawing on United Kingdom Linguistics Olympiad problems which cover a wide range of languages.

4 papers · 2 benchmarks · Texts

SynthPAI (SynthPAI: A Synthetic Dataset for Personal Attribute Inference)

SynthPAI was created to provide a dataset for investigating the personal attribute inference (PAI) capabilities of LLMs on online text. Due to the privacy concerns associated with real-world data, open datasets are rare to non-existent in the research community. SynthPAI is a synthetic dataset that aims to fill this gap.

4 papers · 1 benchmark · Texts

MSRVTT-CTN (MSRVTT Causal-Temporal Narrative)

This dataset contains Causal-Temporal Narrative (CTN) annotations for the MSRVTT-CTN benchmark in JSON format, split across three files for the train, test, and validation sets. For project details, visit https://narrativebridge.github.io/.

4 papers · 3 benchmarks · Texts, Videos

MSVD-CTN (MSVD Causal-Temporal Narrative)

This dataset contains Causal-Temporal Narrative (CTN) annotations for the MSVD-CTN benchmark in JSON format, split across three files for the train, test, and validation sets. For project details, visit https://narrativebridge.github.io/.

4 papers · 3 benchmarks · Texts, Videos
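Both CTN datasets ship their annotations as three JSON files, one per split. A minimal Python loading sketch; the file names here are assumptions for illustration (the actual names are listed on the project page):

```python
import json

# Load one CTN annotation split from its JSON file.
def load_ctn_split(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Hypothetical file names; substitute the names from the release.
splits = {name: load_ctn_split(f"msrvtt_ctn_{name}.json")
          for name in ("train", "val", "test")}
print({name: len(data) for name, data in splits.items()})
```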

VLKEB

A Large Vision-Language Model Knowledge Editing Benchmark

4 papers · 0 benchmarks · Images, Texts

CAsT-snippets

CAsT-snippets is a high-quality dataset for conversational information seeking containing snippet-level annotations for all queries in the TREC CAsT 2020 and 2022 datasets. It enables the development of answer generation methods that are grounded in relevant snippets within paragraphs, and it allows for the automatic evaluation of the generated answers in terms of completeness; a training/test split is provided for such use.

4 papers · 0 benchmarks · Texts

SecQA

SecQA is a specialized dataset created for the evaluation of Large Language Models (LLMs) in the domain of computer security. It consists of multiple-choice questions, generated using GPT-4 and the textbook Computer Systems Security: Planning for Success, aimed at assessing LLMs' understanding and application of computer security knowledge.

4 papers · 0 benchmarks · Texts
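Since SecQA is multiple-choice, evaluation reduces to comparing a model's selected option against the keyed answer. A hypothetical scoring sketch; the field names and the answer_fn interface are assumptions, not SecQA's documented schema:

```python
# Score a model on multiple-choice questions by exact option match.
def accuracy(questions, answer_fn):
    correct = 0
    for q in questions:
        # answer_fn returns one of the option letters, e.g. "A".."D".
        if answer_fn(q["question"], q["choices"]) == q["answer"]:
            correct += 1
    return correct / len(questions)

# Invented example question, purely for illustration.
sample = [{"question": "Which control mitigates SQL injection?",
           "choices": {"A": "Parameterized queries", "B": "Longer passwords",
                       "C": "Disk encryption", "D": "Rate limiting"},
           "answer": "A"}]
print(accuracy(sample, lambda q, c: "A"))  # 1.0 on this toy example
```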

LIAR2

The LIAR dataset has been widely used by fake news detection researchers since its release, and alongside this research the community has provided a variety of feedback to improve it. We adopted this feedback and released the LIAR2 dataset, a new benchmark of ~23k statements manually labeled by professional fact-checkers for fake news detection tasks. We use a split ratio of 8:1:1 for the training, test, and validation sets; details are provided in the paper "An Enhanced Fake News Detection System With Fuzzy Deep Learning". The LIAR2 dataset can be accessed on Hugging Face and GitHub, and statistical information for LIAR and LIAR2 is provided in the accompanying table.

4 papers · 3 benchmarks · Texts
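For illustration, the stated 8:1:1 ratio partitions ~23k examples into roughly 18.4k/2.3k/2.3k. A minimal sketch of such a split; LIAR2 ships official splits, so this only demonstrates how the ratio divides the data:

```python
import random

# Partition examples into train/test/validation at an 8:1:1 ratio.
def split_811(examples, seed=0):
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_test = int(0.8 * n), int(0.1 * n)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])

train, test, val = split_811(list(range(23000)))
print(len(train), len(test), len(val))  # 18400 2300 2300
```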

RoadTextVQA

Text and signs around roads provide crucial information for drivers, vital for safe navigation and situational awareness. Recognizing scene text in motion is a challenging problem, as textual cues typically appear only for a short time span and early detection at a distance is necessary. Systems that exploit such information to assist the driver should not only extract and incorporate visual and textual cues from the video stream but also reason over time. To address this issue, we introduce RoadTextVQA, a new dataset for the task of video question answering (VideoQA) in the context of driver assistance. RoadTextVQA consists of 3,222 driving videos collected from multiple countries, annotated with 10,500 questions, all based on text or road signs present in the driving videos. We assess the performance of state-of-the-art video question answering models on RoadTextVQA, highlighting the significant potential for improvement in this domain and the usefulness of the dataset for advancing research on driver assistance.

4 papers · 1 benchmark · Texts, Videos

LayoutBench-COCO - Number

LayoutBench-COCO is a diagnostic benchmark that examines layout-guided image generation models on arbitrary, unseen layouts. Unlike LayoutBench, LayoutBench-COCO consists of OOD layouts of real objects and supports zero-shot evaluation. LayoutBench-COCO measures 4 skills (Number, Position, Size, Combination), whose objects are from MS COCO. The new 'Combination' split consists of layouts with two objects in different spatial relations, and the remaining three splits are similar to those of LayoutBench. Download the dataset at: https://huggingface.co/datasets/j-min/layoutbench-coco

4 papers · 1 benchmark · Images, Texts
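Since the dataset is hosted on the Hugging Face hub at the URL above, it can presumably be loaded with the datasets library. A minimal sketch, assuming the repository exposes data files compatible with load_dataset:

```python
from datasets import load_dataset

# Load LayoutBench-COCO from the hub repository given in the entry above.
# Whether the four skills appear as configs or splits is an assumption;
# inspect the returned DatasetDict to see the actual layout.
ds = load_dataset("j-min/layoutbench-coco")
print(ds)
```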

LayoutBench-COCO - Position

The Position split of LayoutBench-COCO; see the full description under "LayoutBench-COCO - Number" above.

4 papers · 1 benchmark · Images, Texts

LayoutBench-COCO - Size

The Size split of LayoutBench-COCO; see the full description under "LayoutBench-COCO - Number" above.

4 papers · 1 benchmark · Images, Texts

LayoutBench-COCO - Combination

The Combination split of LayoutBench-COCO; see the full description under "LayoutBench-COCO - Number" above.

4 papers · 1 benchmark · Images, Texts

LAM(line-level) (The LAM Dataset: A Novel Benchmark for Line-Level Handwritten Text Recognition)

Handwritten Text Recognition (HTR) is an open problem at the intersection of Computer Vision and Natural Language Processing. The main challenges when dealing with historical manuscripts are due to the preservation of the paper support, the variability of the handwriting (even of the same author over a wide time span), and the scarcity of data from ancient, poorly represented languages. With the aim of fostering research on this topic, in this paper we present the Ludovico Antonio Muratori (LAM) dataset, a large line-level HTR dataset of Italian ancient manuscripts written by a single author over 60 years. The dataset comes in two configurations: a basic split and a date-based split that takes into account the age of the author. The first setting is intended to study HTR on ancient documents in Italian, while the second focuses on the ability of HTR systems to recognize text written by the same writer in time periods for which training data are not available. For both configurations, baseline results are reported.

4 papers · 4 benchmarks · Images, Texts

RetVQA (Retrieval-Based Visual Question Answering)

The RetVQA dataset is a large-scale dataset designed for Retrieval-Based Visual Question Answering (RetVQA). RetVQA is a more challenging task than traditional VQA, as it requires models to retrieve relevant images from a pool of images before answering a question. The need for RetVQA stems from the fact that information needed to answer a question may be spread across multiple images.

4 papers · 2 benchmarks · Images, Texts
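A hypothetical sketch of the two-stage retrieve-then-answer pipeline the task implies; retrieve_scores and answer are stand-ins for real retriever and VQA models, not part of the RetVQA release:

```python
# Two-stage RetVQA pipeline: rank the image pool by relevance to the
# question, then answer conditioned on the top-k retrieved images only.
def retvqa(question, image_pool, retrieve_scores, answer, k=5):
    ranked = sorted(image_pool,
                    key=lambda img: retrieve_scores(question, img),
                    reverse=True)
    relevant = ranked[:k]
    return answer(question, relevant)
```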

MuseASTE (MuSe-CarASTE: A comprehensive dataset for aspect sentiment triplet extraction in automotive review videos)

  • A new benchmark dataset for Aspect Sentiment Triplet Extraction (ASTE).
  • The first ASTE dataset in the automotive domain.
  • The largest ASTE dataset to date, with annotations for over 28,295 sentences.
  • Includes complex aspects not verbatim present in the sentence.
  • Domain: aspect-based sentiment analysis, ASTE, opinion mining, recommender systems.
  • Four baseline SOTA models implemented on the dataset.

4 papers · 1 benchmark · Texts
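For illustration, an ASTE annotation attaches (aspect, opinion, sentiment) triplets to each sentence. A minimal sketch of that structure; the example triplets are invented, not taken from MuSe-CarASTE:

```python
from dataclasses import dataclass

# One ASTE annotation: an aspect term, the opinion term about it, and
# the sentiment polarity of that opinion.
@dataclass
class Triplet:
    aspect: str     # aspect term (may not appear verbatim in the sentence)
    opinion: str    # opinion term expressing the sentiment
    sentiment: str  # "positive" | "negative" | "neutral"

sentence = "The infotainment system feels sluggish but looks great."
triplets = [Triplet("infotainment system", "sluggish", "negative"),
            Triplet("infotainment system", "looks great", "positive")]
```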

SCOUT: The Situated Corpus of Understanding Transactions

The Situated Corpus Of Understanding Transactions (SCOUT) is a multi-modal collection of human-robot dialogue in the task domain of collaborative exploration. The corpus was constructed from multi-phased Wizard-of-Oz experiments where human participants gave verbal instructions to a remotely-located robot to move and gather information about its surroundings. Each dialogue involved a human Commander, a Dialogue Manager (DM), and a Robot Navigator (RN), and took place in physical or simulated environments.

4 papers · 0 benchmarks · Dialog, Images, Interactive, LiDAR, Texts

MixSet (Mixcase Dataset)

MixSet comprises a total of 3.6k mixcase instances, featuring a blend of HWT (human-written text) and MGT (machine-generated text).

4 papers · 0 benchmarks · Texts
Page 73 of 158