Guided Attention for Interpretable Motion Captioning

Karim Radouane, Julien Lagarde, Sylvie Ranwez, Andon Tchechmedjiev

2023-10-11

Text Generation, Action Localization, Spatio-Temporal Video Grounding, Motion Captioning, Motion Generation

Abstract

Diverse and extensive work has recently been conducted on text-conditioned human motion generation. However, the reverse direction, motion captioning, has not seen comparable progress. In this paper, we introduce a novel architecture design that enhances text generation quality by emphasizing interpretability through spatio-temporal and adaptive attention mechanisms. To encourage human-like reasoning, we propose methods for guiding attention during training, emphasizing relevant skeleton areas over time and distinguishing motion-related words. We discuss and quantify our model's interpretability using relevant histograms and density distributions. Furthermore, we leverage interpretability to derive fine-grained information about human motion, including action localization, body part identification, and the distinction of motion-related words. Finally, we discuss the transferability of our approaches to other tasks. Our experiments demonstrate that attention guidance leads to interpretable captioning while enhancing performance compared to non-interpretable state-of-the-art systems with higher parameter counts. The code is available at: https://github.com/rd20karim/M2T-Interpretable.

Results

Task	Dataset	Metric	Value	Model
Motion Captioning	HumanML3D	BERTScore	40.3	ST-MLP
Motion Captioning	HumanML3D	BLEU-4	25	ST-MLP
Motion Captioning	KIT Motion-Language	BERTScore	41.2	ST-MLP
Motion Captioning	KIT Motion-Language	BLEU-4	24.4	ST-MLP
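
For readers unfamiliar with the BLEU-4 metric reported above, the following is a minimal sketch of corpus-level BLEU-4 (clipped 1- to 4-gram precision with a brevity penalty). It is not the paper's evaluation script: the whitespace tokenization, the absence of smoothing, the 0-100 scale, and the function names `ngrams` and `corpus_bleu4` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(references, hypotheses):
    """Corpus-level BLEU-4 on a 0-100 scale (illustrative, unsmoothed).

    references: one list of reference token lists per sample.
    hypotheses: one hypothesis token list per sample.
    """
    matches = [0] * 4   # clipped n-gram matches, orders 1..4
    totals = [0] * 4    # hypothesis n-gram counts, orders 1..4
    hyp_len = ref_len = 0
    for refs, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        # Use the reference length closest to the hypothesis (ties: shorter).
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, 5):
            hyp_counts = ngrams(hyp, n)
            # Maximum count of each n-gram over all references (for clipping).
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * brevity_penalty * math.exp(log_precision)


# Hypothetical captions, used only to exercise the function.
refs = [["a person walks forward slowly".split()]]
hyps = ["a person walks forward".split()]
score = corpus_bleu4(refs, hyps)
```

In this toy case every n-gram of the hypothesis appears in the reference, so the score is lowered only by the brevity penalty for the shorter caption. Published scores depend on tokenization and smoothing choices, so numbers from this sketch are not directly comparable to the table.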
