Papers With Code 2

A community resource for machine learning research: papers, code, benchmarks, and state-of-the-art results.

Data sourced from the PWC Archive (CC-BY-SA 4.0). Built by the community, for the community.

Do we really need temporal convolutions in action segmentation?

Dazhao Du, Bing Su, Yu Li, Zhongang Qi, Lingyu Si, Ying Shan

2022-05-26 · Action Segmentation · Action Classification
Paper · PDF · Code (official)

Abstract

Action classification has made great progress, but segmenting and recognizing actions in long untrimmed videos remains a challenging problem. Most state-of-the-art methods focus on designing temporal convolution-based models, but the inflexibility of temporal convolutions and the difficulty of modeling long-term temporal dependencies restrict the potential of these models. Transformer-based models, with their flexible sequence-modeling capability, have recently been applied to various tasks. However, the lack of inductive bias and the inefficiency of handling long video sequences limit the application of Transformers to action segmentation. In this paper, we design a pure Transformer-based model without temporal convolutions, called the Temporal U-Transformer (TUT), by incorporating temporal sampling. The U-Transformer architecture reduces complexity while introducing an inductive bias that adjacent frames are likely to belong to the same class, but the coarse resolutions it introduces lead to misclassified boundaries. We observe that the similarity distribution between a boundary frame and its neighboring frames depends on whether the boundary frame marks the start or the end of an action segment. We therefore propose a boundary-aware loss, based on the distribution of similarity scores between frames from the attention modules, to strengthen boundary recognition. Extensive experiments demonstrate the effectiveness of our model.
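
The abstract's point that coarse temporal resolutions cause boundary misclassification can be seen in a toy round trip: downsampling a frame-wise label sequence by striding and restoring it with nearest-neighbor upsampling preserves segment identity but can shift a segment boundary by up to the stride. This is an illustrative sketch only; the helper names below are hypothetical and not from the paper's code.

```python
def downsample(labels, stride):
    """Keep every `stride`-th frame label (coarse temporal resolution)."""
    return labels[::stride]

def upsample(labels, stride, length):
    """Nearest-neighbor upsampling back to the original frame rate."""
    out = []
    for lab in labels:
        out.extend([lab] * stride)
    return out[:length]

# Two actions, A then B; the true boundary sits at frame 5.
frames = ["A"] * 5 + ["B"] * 5
restored = upsample(downsample(frames, 2), 2, len(frames))
print(restored.index("B"))  # boundary drifts from frame 5 to frame 6
```

The drift grows with the stride, which is why a model operating only at coarse resolutions needs an extra mechanism, such as the paper's boundary-aware loss, to recover precise boundaries.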

Results

Task                  Dataset     Metric       Value   Model
Action Localization   50 Salads   Acc          87.4    EUT
Action Localization   50 Salads   Edit         82.9    EUT
Action Localization   50 Salads   F1@10%       89.2    EUT
Action Localization   50 Salads   F1@25%       87.5    EUT
Action Localization   50 Salads   F1@50%       81      EUT
Action Localization   GTEA        Acc          77      EUT
Action Localization   GTEA        Edit         83.9    EUT
Action Localization   GTEA        F1@10%       88.2    EUT
Action Localization   GTEA        F1@25%       87.2    EUT
Action Localization   GTEA        F1@50%       74      EUT
Action Localization   Breakfast   Acc          75      EUT
Action Localization   Breakfast   Average F1   69.3    EUT
Action Localization   Breakfast   Edit         74.6    EUT
Action Localization   Breakfast   F1@10%       76.2    EUT
Action Localization   Breakfast   F1@25%       71.8    EUT
Action Localization   Breakfast   F1@50%       59.8    EUT
Action Segmentation   50 Salads   Acc          87.4    EUT
Action Segmentation   50 Salads   Edit         82.9    EUT
Action Segmentation   50 Salads   F1@10%       89.2    EUT
Action Segmentation   50 Salads   F1@25%       87.5    EUT
Action Segmentation   50 Salads   F1@50%       81      EUT
Action Segmentation   GTEA        Acc          77      EUT
Action Segmentation   GTEA        Edit         83.9    EUT
Action Segmentation   GTEA        F1@10%       88.2    EUT
Action Segmentation   GTEA        F1@25%       87.2    EUT
Action Segmentation   GTEA        F1@50%       74      EUT
Action Segmentation   Breakfast   Acc          75      EUT
Action Segmentation   Breakfast   Average F1   69.3    EUT
Action Segmentation   Breakfast   Edit         74.6    EUT
Action Segmentation   Breakfast   F1@10%       76.2    EUT
Action Segmentation   Breakfast   F1@25%       71.8    EUT
Action Segmentation   Breakfast   F1@50%       59.8    EUT
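
For readers unfamiliar with the metrics above: Acc is frame-wise accuracy, Edit is the segmental edit score (a normalized Levenshtein distance over segment label sequences), and F1@k% is the segmental F1 with an IoU overlap threshold of k%. The pure-Python sketch below follows the widely used MS-TCN-style evaluation protocol in spirit; the function names and the greedy matching are illustrative, not the paper's own evaluation code.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) segments; end is exclusive."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def frame_accuracy(pred, gt):
    """Percentage of frames whose predicted label matches ground truth."""
    return 100 * sum(p == g for p, g in zip(pred, gt)) / len(gt)

def f1_at_k(pred, gt, k):
    """Segmental F1 at IoU threshold k (0 < k <= 1), greedy one-to-one matching."""
    p_segs, g_segs = segments(pred), segments(gt)
    used = [False] * len(g_segs)
    tp = 0
    for lab, s, e in p_segs:
        best, best_j = 0.0, -1
        for j, (glab, gs, ge) in enumerate(g_segs):
            if glab != lab or used[j]:
                continue
            inter = max(0, min(e, ge) - max(s, gs))
            union = max(e, ge) - min(s, gs)
            iou = inter / union
            if iou > best:
                best, best_j = iou, j
        if best_j >= 0 and best >= k:
            tp += 1
            used[best_j] = True
    fp, fn = len(p_segs) - tp, len(g_segs) - tp
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def edit_score(pred, gt):
    """Normalized segmental edit score: Levenshtein distance on segment labels."""
    p = [lab for lab, _, _ in segments(pred)]
    g = [lab for lab, _, _ in segments(gt)]
    m, n = len(p), len(g)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 100 * (1 - d[m][n] / max(m, n))
```

Note the design distinction these metrics capture: frame accuracy rewards getting most frames right even with over-segmentation, while the segmental F1 and edit scores penalize fragmented predictions, which is why action segmentation papers report both.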

Related Papers

Self-supervised pretraining of vision transformers for animal behavioral analysis and neural encoding (2025-07-13)
HopaDIFF: Holistic-Partial Aware Fourier Conditioned Diffusion for Referring Human Action Segmentation in Multi-Person Scenarios (2025-06-11)
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis (2025-06-09)
From Play to Replay: Composed Video Retrieval for Temporally Fine-Grained Videos (2025-06-05)
EPFL-Smart-Kitchen-30: Densely annotated cooking dataset with 3D kinematics to challenge video and language models (2025-06-02)
Spatio-Temporal Joint Density Driven Learning for Skeleton-Based Action Recognition (2025-05-29)
SoccerChat: Integrating Multimodal Data for Enhanced Soccer Game Understanding (2025-05-22)
Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes (2025-05-21)