Modality Selection and Skill Segmentation via Cross-Modality Attention
Jiawei Jiang, Kei Ota, Devesh K. Jha, Asako Kanezaki
Abstract
Incorporating additional sensory modalities, such as tactile and audio, into robotic foundation models poses significant challenges due to the curse of dimensionality. This work addresses the issue through modality selection. We propose a cross-modality attention (CMA) mechanism that identifies and selectively utilizes the modalities most informative for action generation at each timestep. We further extend CMA to segment primitive skills from expert demonstrations, and leverage this segmentation to train a hierarchical policy capable of solving long-horizon, contact-rich manipulation tasks.
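As a concrete illustration, below is a minimal sketch of what a cross-modality attention block of this kind might look like in PyTorch, assuming the mechanism resembles standard dot-product attention in which a per-timestep action query attends over one encoded token per modality. The module and tensor names (CrossModalityAttention, modality_tokens, etc.) and all shapes are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of a cross-modality attention (CMA) block; names and
# shapes are assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalityAttention(nn.Module):
    """Scores each modality token against a per-timestep action query and
    returns a fused feature plus the attention weights, which can be
    inspected for modality selection or skill segmentation."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, query: torch.Tensor, modality_tokens: torch.Tensor):
        # query:           (batch, dim)           one action query per timestep
        # modality_tokens: (batch, num_mod, dim)  one encoded token per modality
        q = self.q_proj(query).unsqueeze(1)             # (batch, 1, dim)
        k = self.k_proj(modality_tokens)                # (batch, num_mod, dim)
        v = self.v_proj(modality_tokens)                # (batch, num_mod, dim)
        attn = F.softmax((q @ k.transpose(1, 2)) * self.scale, dim=-1)
        fused = (attn @ v).squeeze(1)                   # (batch, dim)
        return fused, attn.squeeze(1)                   # weights: (batch, num_mod)

# Example: vision/tactile/audio tokens for a batch of 8 timesteps.
cma = CrossModalityAttention(dim=64)
fused, weights = cma(torch.randn(8, 64), torch.randn(8, 3, 64))
# weights[t] indicates which modality dominates at timestep t; a change in
# the dominant modality over time is one plausible cue for segmenting
# primitive skills from a demonstration.
```

Under these assumptions the attention weights serve double duty: they softly select modalities at each timestep, and their evolution over a trajectory provides a signal that could be thresholded or clustered to delimit primitive skills for a hierarchical policy.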