TY - JOUR
T1 - Skeleton-based human action recognition using LSTM and depthwise separable convolutional neural network
AU - Le, Hoangcong
AU - Lu, Cheng Kai
AU - Hsu, Chen Chien
AU - Huang, Shao Kang
N1 - Publisher Copyright: © The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2025.
PY - 2025/4
Y1 - 2025/4
N2 - In the field of computer vision, the task of human action recognition (HAR) represents a challenge, due to the complexity of capturing nuanced human movements from video data. To address this issue, researchers have developed various algorithms. In this study, a novel two-stream architecture is developed that combines LSTM with a depthwise separable convolutional neural network (DSConV) and skeleton information, with the aim of enhancing the accuracy of HAR. The 3D coordinates of each joint in the skeleton are extracted using the Mediapipe library, and the 2D coordinates are obtained using MoveNet. The proposed method comprises two streams, called the temporal LSTM module and the joint-motion module, and was developed to overcome the limitations of prior two-stream RNN models, such as the vanishing gradient problem and the difficulty of effectively extracting temporal-spatial information. A performance evaluation on the benchmark datasets of JHMDB (73.31%), Florence-3D Action (97.67%), SBU Interaction (95.2%), and Penn Action (94.0%) showcases the effectiveness of the proposed model. A comparison with state-of-the-art methods demonstrates the superior performance of the approach on these datasets. This study contributes to advancing the field of HAR, with potential applications in surveillance and robotics.
KW - 3D and 2D skeleton
KW - Human action recognition
KW - LSTM and DSConV architecture
KW - Overlapping technique
UR - http://www.scopus.com/inward/record.url?scp=85217821696&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217821696&partnerID=8YFLogxK
U2 - 10.1007/s10489-024-06082-w
DO - 10.1007/s10489-024-06082-w
M3 - Article
AN - SCOPUS:85217821696
SN - 0924-669X
VL - 55
JO - Applied Intelligence
JF - Applied Intelligence
IS - 5
M1 - 298
ER -