TasksSotADatasetsPapersMethodsSubmitAbout
Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Explore

Notable BenchmarksAll SotADatasetsPapersMethods

Community

Submit ResultsAbout

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Papers/HERMES: temporal-coHERent long-forM understanding with Epi...

HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics

Gueter Josmy Faure, Jia-Fong Yeh, Min-Hung Chen, Hung-Ting Su, Shang-Hong Lai, Winston H. Hsu

2024-08-30zero-shot long video question answeringFormVideo Classification
PaperPDFCode(official)

Abstract

Existing research often treats long-form videos as extended short videos, leading to several limitations: inadequate capture of long-range dependencies, inefficient processing of redundant information, and failure to extract high-level semantic concepts. To address these issues, we propose a novel approach that more accurately reflects human cognition. This paper introduces HERMES: temporal-coHERent long-forM understanding with Episodes and Semantics, a model that simulates episodic memory accumulation to capture action sequences and reinforces them with semantic knowledge dispersed throughout the video. Our work makes two key contributions: First, we develop an Episodic COmpressor (ECO) that efficiently aggregates crucial representations from micro to semi-macro levels, overcoming the challenge of long-range dependencies. Second, we propose a Semantics ReTRiever (SeTR) that enhances these aggregated representations with semantic information by focusing on the broader context, dramatically reducing feature dimensionality while preserving relevant macro-level information. This addresses the issues of redundancy and lack of high-level concept extraction. Extensive experiments demonstrate that HERMES achieves state-of-the-art performance across multiple long-video understanding benchmarks in both zero-shot and fully-supervised settings.

Results

TaskDatasetMetricValueModel
VideoBreakfastAccuracy (%)95.2HERMES
VideoCOINAccuracy (%)93.5HERMES
Video ClassificationBreakfastAccuracy (%)95.2HERMES
Video ClassificationCOINAccuracy (%)93.5HERMES

Related Papers

FreeAudio: Training-Free Timing Planning for Controllable Long-Form Text-to-Audio Generation2025-07-11ActAlign: Zero-Shot Fine-Grained Video Classification via Language-Guided Sequence Alignment2025-06-28Controlled Retrieval-augmented Context Evaluation for Long-form RAG2025-06-24FormGym: Doing Paperwork with Agents2025-06-17FreeQ-Graph: Free-form Querying with Semantic Consistent Scene Graph for 3D Scene Understanding2025-06-16Direct Reasoning Optimization: LLMs Can Reward And Refine Their Own Reasoning for Open-Ended Tasks2025-06-16Exploring Audio Cues for Enhanced Test-Time Video Model Adaptation2025-06-14ARGUS: Hallucination and Omission Evaluation in Video-LLMs2025-06-09