TY - GEN
T1 - A vision-based human action recognition system for moving cameras through deep learning
AU - Chang, Ming Jen
AU - Hsieh, Jih Tang
AU - Fang, Chiung Yao
AU - Chen, Sei Wang
N1 - Publisher Copyright:
© 2019 Association for Computing Machinery.
PY - 2019/11/27
Y1 - 2019/11/27
N2 - This study presents a vision-based human action recognition system using a deep learning technique. The system can recognize human actions successfully when the camera of a robot is moving toward the target person from various directions. Therefore, the proposed method is useful for the vision system of indoor mobile robots. The system uses three types of information to recognize human actions, namely, information from color videos, optical flow videos, and depth videos. First, Kinect 2.0 captures color videos and depth videos simultaneously using its RGB camera and depth sensor. Second, the histogram of oriented gradient features is extracted from the color videos, and a support vector machine is used to detect the human region. Based on the detected human region, the frames of the color video are cropped and the corresponding frames of the optical flow video are obtained using the Farnebäck method (https://docs.opencv.org/3.4/d4/dee/tutorial-optical-flow.html). The number of frames of these videos is then unified using a frame sampling technique. Subsequently, these three types of videos are input into three modified 3D convolutional neural networks (3D CNNs) separately. The modified 3D CNNs can extract the spatiotemporal features of human actions and recognize them. Finally, these recognition results are integrated to output the final recognition result of human actions. The proposed system can recognize 13 types of human actions, namely, drink (sit), drink (stand), eat (sit), eat (stand), read, sit down, stand up, use a computer, walk (horizontal), walk (straight), play with a phone/tablet, walk away from each other, and walk toward each other. The average human action recognition rate of 369 test human action videos was 96.4%, indicating that the proposed system is robust and efficient.
AB - This study presents a vision-based human action recognition system using a deep learning technique. The system can recognize human actions successfully when the camera of a robot is moving toward the target person from various directions. Therefore, the proposed method is useful for the vision system of indoor mobile robots. The system uses three types of information to recognize human actions, namely, information from color videos, optical flow videos, and depth videos. First, Kinect 2.0 captures color videos and depth videos simultaneously using its RGB camera and depth sensor. Second, the histogram of oriented gradient features is extracted from the color videos, and a support vector machine is used to detect the human region. Based on the detected human region, the frames of the color video are cropped and the corresponding frames of the optical flow video are obtained using the Farnebäck method (https://docs.opencv.org/3.4/d4/dee/tutorial-optical-flow.html). The number of frames of these videos is then unified using a frame sampling technique. Subsequently, these three types of videos are input into three modified 3D convolutional neural networks (3D CNNs) separately. The modified 3D CNNs can extract the spatiotemporal features of human actions and recognize them. Finally, these recognition results are integrated to output the final recognition result of human actions. The proposed system can recognize 13 types of human actions, namely, drink (sit), drink (stand), eat (sit), eat (stand), read, sit down, stand up, use a computer, walk (horizontal), walk (straight), play with a phone/tablet, walk away from each other, and walk toward each other. The average human action recognition rate of 369 test human action videos was 96.4%, indicating that the proposed system is robust and efficient.
KW - 3D convolutional neural network
KW - Color information
KW - Deep learning
KW - Depth information
KW - Human action recognition
KW - Indoor mobile robots
KW - Moving camera
KW - Optical flow
KW - VGG net
UR - http://www.scopus.com/inward/record.url?scp=85079084674&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85079084674&partnerID=8YFLogxK
U2 - 10.1145/3372806.3372815
DO - 10.1145/3372806.3372815
M3 - Conference contribution
AN - SCOPUS:85079084674
T3 - ACM International Conference Proceeding Series
SP - 85
EP - 91
BT - Proceedings of 2019 2nd International Conference on Signal Processing and Machine Learning, SPML 2019
PB - Association for Computing Machinery
T2 - 2nd International Conference on Signal Processing and Machine Learning, SPML 2019
Y2 - 27 November 2019 through 29 November 2019
ER -