Myunggi Lee, Seungeui Lee, Sungjoon Son, Gyu-tae Park, Nojun Kwak
Spatio-temporal representations of frame sequences play an important role in action recognition. Previously, using optical flow as temporal information in combination with a set of RGB images that contain spatial information has greatly improved performance on action recognition tasks. However, this approach is computationally expensive and requires a two-stream (RGB and optical flow) framework. In this paper, we propose MFNet (Motion Feature Network), which contains motion blocks that make it possible to encode spatio-temporal information between adjacent frames in a unified network that can be trained end-to-end. The motion block can be attached to any existing CNN-based action recognition framework at only a small additional cost. We evaluated our network on two action recognition datasets (Jester and Something-Something) and achieved competitive performance on both datasets by training the networks from scratch.
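
To make the idea of encoding information between adjacent frames concrete, here is a minimal sketch in plain Python. It is not the paper's motion block, whose internal design the abstract does not describe; it assumes a simple element-wise difference between consecutive per-frame feature vectors as a stand-in for motion features. The function names `adjacent_frame_motion` and `fuse_appearance_and_motion` are hypothetical and chosen only for illustration.

```python
def adjacent_frame_motion(features):
    """Return element-wise differences between consecutive frame features.

    `features` is a list of T equal-length lists of floats, one per frame.
    The result has T - 1 entries, one per pair of adjacent frames.
    NOTE: plain differencing is an assumption for illustration only; it is
    not the motion block proposed in the paper.
    """
    if len(features) < 2:
        return []
    motion = []
    for prev, nxt in zip(features, features[1:]):
        if len(prev) != len(nxt):
            raise ValueError("all frames must have the same feature length")
        motion.append([b - a for a, b in zip(prev, nxt)])
    return motion


def fuse_appearance_and_motion(features):
    """Concatenate each frame's appearance features with its motion features.

    The last frame has no successor, so the output has T - 1 rows. Both kinds
    of information then live in one representation, so a single network could
    consume them instead of two separate streams.
    """
    motion = adjacent_frame_motion(features)
    return [appearance + delta for appearance, delta in zip(features, motion)]
```

Under these assumptions, a clip of T frames yields T - 1 fused rows, each holding a frame's own features followed by its change relative to the next frame.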
| Task | Dataset | Metric | Value | Model |
|---|---|---|---|---|
| Activity Recognition | Something-Something V1 | Top 1 Accuracy | 43.9 | Motion Feature Net |
| Activity Recognition | Jester (Gesture Recognition) | Val | 96.68 | MFNet |
| Action Recognition | Something-Something V1 | Top 1 Accuracy | 43.9 | Motion Feature Net |
| Action Recognition | Jester (Gesture Recognition) | Val | 96.68 | MFNet |
| Action Recognition In Videos | Jester (Gesture Recognition) | Val | 96.68 | MFNet |
| Action Recognition In Videos | Something-Something V1 | Top 1 Accuracy | 43.9 | Motion Feature Net |