
Multimodal Large Models Are Effective Action Anticipators

Binglu Wang, Yao Tian, Shunzhou Wang, Le Yang

2025-01-01 · Long Term Action Anticipation · Action Anticipation · Specificity

Paper · PDF · Code (official)

Abstract

The task of long-term action anticipation demands solutions that can effectively model temporal dynamics over extended periods while deeply understanding the inherent semantics of actions. Traditional approaches, which primarily rely on recurrent units or Transformer layers to capture long-term dependencies, often fall short in addressing these challenges. Large Language Models (LLMs), with their robust sequential modeling capabilities and extensive commonsense knowledge, present new opportunities for long-term action anticipation. In this work, we introduce the ActionLLM framework, a novel approach that treats video sequences as successive tokens, leveraging LLMs to anticipate future actions. Our baseline model simplifies the LLM architecture by setting future tokens, incorporating an action tuning module, and reducing the textual decoder layer to a linear layer, enabling straightforward action prediction without the need for complex instructions or redundant descriptions. To further harness the commonsense reasoning of LLMs, we predict action categories for observed frames and use sequential textual clues to guide semantic understanding. In addition, we introduce a Cross-Modality Interaction Block, designed to explore the specificity within each modality and capture interactions between vision and textual modalities, thereby enhancing multimodal tuning. Extensive experiments on benchmark datasets demonstrate the superiority of the proposed ActionLLM framework, encouraging a promising direction to explore LLMs in the context of action anticipation. Code is available at https://github.com/2tianyao1/ActionLLM.git.
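
The abstract outlines the baseline design: observed frames become a token sequence, learnable future tokens are appended, and a single linear layer replaces the textual decoder. Below is a minimal, hypothetical sketch of that idea in PyTorch. It is not the authors' implementation (see the linked repository for that): a small TransformerEncoder stands in for the frozen LLM backbone, and all module names, dimensions, and the number of future tokens are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ActionAnticipatorSketch(nn.Module):
    """Hypothetical sketch of the abstract's baseline idea: project visual
    features into a token sequence, append learnable "future tokens", run a
    sequence model over them, and decode actions with a single linear layer.
    """
    def __init__(self, feat_dim=2048, model_dim=512, num_future=8, num_classes=48):
        super().__init__()
        # Map per-frame visual features into the token embedding space.
        self.visual_proj = nn.Linear(feat_dim, model_dim)
        # Learnable placeholder tokens, one per future step to anticipate.
        self.future_tokens = nn.Parameter(torch.randn(num_future, model_dim))
        # Stand-in backbone; the paper uses an LLM (causal), this encoder is
        # bidirectional and serves only to make the sketch self-contained.
        layer = nn.TransformerEncoderLayer(model_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        # A plain linear head in place of a full text decoder.
        self.head = nn.Linear(model_dim, num_classes)

    def forward(self, frame_feats):                              # (B, T, feat_dim)
        B = frame_feats.size(0)
        obs = self.visual_proj(frame_feats)                      # (B, T, model_dim)
        fut = self.future_tokens.unsqueeze(0).expand(B, -1, -1)  # (B, F, model_dim)
        tokens = torch.cat([obs, fut], dim=1)                    # observed + future
        hidden = self.backbone(tokens)
        # Classify only the future-token positions.
        return self.head(hidden[:, -fut.size(1):])               # (B, F, num_classes)

model = ActionAnticipatorSketch()
logits = model(torch.randn(2, 16, 2048))  # 2 clips, 16 observed frames each
print(logits.shape)                       # torch.Size([2, 8, 48])
```

The sketch omits the paper's action tuning module, the textual clues from observed-frame action predictions, and the Cross-Modality Interaction Block; it only illustrates the token-sequence framing and the linear decoding head described in the abstract.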

Related Papers

RadiomicsRetrieval: A Customizable Framework for Medical Image Retrieval Using Radiomics Features (2025-07-11)
PoseMaster: Generating 3D Characters in Arbitrary Poses from a Single Image (2025-06-26)
IMC-PINN-FE: A Physics-Informed Neural Network for Patient-Specific Left Ventricular Finite Element Modeling with Image Motion Consistency and Biomechanical Parameter Estimation (2025-06-25)
Predicting Readiness to Engage in Psychotherapy of People with Chronic Pain Based on their Pain-Related Narratives (2025-06-25)
On Context-Content Uncertainty Principle (2025-06-25)
Visual-Semantic Knowledge Conflicts in Operating Rooms: Synthetic Data Curation for Surgical Risk Perception in Multimodal Large Language Models (2025-06-25)
WebGuard++: Interpretable Malicious URL Detection via Bidirectional Fusion of HTML Subgraphs and Multi-Scale Convolutional BERT (2025-06-24)
Modeling the influences of non-local connectomic projections on geometrically constrained cortical dynamics (2025-06-24)