Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.


Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.


DisCoVQA: Temporal Distortion-Content Transformers for Video Quality Assessment

Haoning Wu, Chaofeng Chen, Liang Liao, Jingwen Hou, Wenxiu Sun, Qiong Yan, Weisi Lin

2022-06-20 · Video Quality Assessment · Visual Question Answering (VQA) · Time Series Analysis

Paper · PDF · Code (official)

Abstract

The temporal relationships between frames and their influence on video quality assessment (VQA) remain under-studied in existing works. These relationships lead to two important types of effects on video quality. Firstly, some temporal variations (such as shaking, flicker, and abrupt scene transitions) cause temporal distortions and lead to extra quality degradation, while other variations (e.g. those related to meaningful events) do not. Secondly, the human visual system often pays different levels of attention to frames with different content, giving them different importance to the overall video quality. Based on the prominent time-series modeling ability of transformers, we propose a novel and effective transformer-based VQA method to tackle these two issues. To better differentiate temporal variations and thus capture temporal distortions, we design a transformer-based Spatial-Temporal Distortion Extraction (STDE) module. To model temporal quality attention, we propose the encoder-decoder-like Temporal Content Transformer (TCT). We also introduce temporal sampling on features to reduce the input length for the TCT, improving the learning effectiveness and efficiency of this module. Combining the STDE and the TCT, the proposed Temporal Distortion-Content Transformers for Video Quality Assessment (DisCoVQA) reaches state-of-the-art performance on several VQA benchmarks without any extra pre-training datasets, with up to 10% better generalization ability than existing methods. We also conduct extensive ablation experiments to demonstrate the effectiveness of each part of our proposed model, and provide visualizations showing that the proposed modules achieve our intention of modeling these temporal effects. We will publish our code and pretrained weights later.
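The temporal sampling mentioned in the abstract — shortening a long per-frame feature sequence before it enters the TCT — can be sketched minimally. This is a hypothetical illustration using uniform index spacing (the function name and spacing scheme are assumptions, not the authors' released code):

```python
def temporal_sample(features, target_len):
    """Uniformly subsample a sequence of per-frame features to at most
    target_len entries, reducing the transformer's input length.
    A minimal sketch; the paper's actual sampling scheme may differ."""
    n = len(features)
    if n <= target_len:
        return list(features)
    # Pick evenly spaced frame indices spanning the whole sequence.
    idx = [round(i * (n - 1) / (target_len - 1)) for i in range(target_len)]
    return [features[i] for i in idx]

frames = list(range(100))  # stand-in for 100 per-frame feature vectors
sampled = temporal_sample(frames, 16)
print(len(sampled))  # → 16
```

Keeping the first and last frames in the sampled set preserves the sequence's temporal extent, so scene-level variations are still visible to the downstream module at a fraction of the input length.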

Results

| Task                     | Dataset      | Metric | Value | Model    |
|--------------------------|--------------|--------|-------|----------|
| Video Understanding      | LIVE-VQC     | PLCC   | 0.844 | DisCoVQA |
| Video Understanding      | KoNViD-1k    | PLCC   | 0.86  | DisCoVQA |
| Video Understanding      | LIVE-FB LSVQ | PLCC   | 0.85  | DisCoVQA |
| Video Quality Assessment | LIVE-VQC     | PLCC   | 0.844 | DisCoVQA |
| Video Quality Assessment | KoNViD-1k    | PLCC   | 0.86  | DisCoVQA |
| Video Quality Assessment | LIVE-FB LSVQ | PLCC   | 0.85  | DisCoVQA |
| Video                    | LIVE-VQC     | PLCC   | 0.844 | DisCoVQA |
| Video                    | KoNViD-1k    | PLCC   | 0.86  | DisCoVQA |
| Video                    | LIVE-FB LSVQ | PLCC   | 0.85  | DisCoVQA |
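PLCC (Pearson Linear Correlation Coefficient), the metric reported above, measures linear agreement between a model's predicted quality scores and subjective ground-truth scores. A minimal computation in plain Python (note that VQA papers often fit a nonlinear logistic mapping to predictions before computing PLCC, which this sketch omits):

```python
from math import sqrt

def plcc(pred, gt):
    """Pearson linear correlation between predicted and ground-truth scores."""
    n = len(pred)
    mp = sum(pred) / n
    mg = sum(gt) / n
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    sp = sqrt(sum((p - mp) ** 2 for p in pred))
    sg = sqrt(sum((g - mg) ** 2 for g in gt))
    return cov / (sp * sg)

# Perfectly linearly related scores give PLCC = 1.0
print(plcc([1, 2, 3, 4], [2, 4, 6, 8]))  # → 1.0
```

Values range from -1 to 1, with 1 indicating perfect linear correlation; the 0.84–0.86 scores in the table indicate strong agreement with human ratings on these benchmarks.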

Related Papers

- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning (2025-07-17)
- Emergence of Functionally Differentiated Structures via Mutual Information Optimization in Recurrent Neural Networks (2025-07-17)
- MGFFD-VLM: Multi-Granularity Prompt Learning for Face Forgery Detection with VLM (2025-07-16)
- Describe Anything Model for Visual Question Answering on Text-rich Images (2025-07-16)
- Evaluating Attribute Confusion in Fashion Text-to-Image Generation (2025-07-09)
- LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation (2025-07-09)
- Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder (2025-06-28)
- Bridging Video Quality Scoring and Justification via Large Multimodal Models (2025-06-26)