Existing online video processing methods, such as online action detection, focus on frame-level understanding to achieve high responsiveness. However, they have a fundamental limitation: they lack instance-level understanding of videos, which makes them difficult to apply to higher-level vision tasks. Instance-level action detection, known as Temporal Action Localization (TAL), in turn has limitations when applied to online settings. In this work, we introduce a new task, named Online Temporal Action Localization (OnTAL), that aims to detect action instances in videos in an online setting. To tackle this problem, we propose a 2-Pass End/Start detection Network (2PESNet) that detects action instances by effectively finding the start and end of each action instance. Additionally, we propose a two-stage action end detection method to further improve performance. Extensive experiments on THUMOS’14 and ActivityNet v1.3 demonstrate that our model achieves both accuracy and responsiveness when predicting action instances from streaming videos.
Publication status: Published - 2022 Nov
Bibliographical note
Funding Information:
This work was conducted by the Center for Applied Research in Artificial Intelligence (CARAI) grant funded by the Defense Acquisition Program Administration (DAPA) and the Agency for Defense Development (ADD) (UD190031RD).
All Science Journal Classification (ASJC) codes
- Signal Processing
- Computer Vision and Pattern Recognition
- Artificial Intelligence