TY - GEN
T1 - An Implementation of Job Running Backup Function in User-PC Computing System
AU - Htet, Hein
AU - Funabiki, Nobuo
AU - Kamoyedji, Ariel
AU - Zhou, Xudong
AU - Xiang, Xu
AU - Sugawara, Shinji
AU - Kao, Wen Chung
N1 - Publisher Copyright: © 2022 IEEE.
PY - 2022
Y1 - 2022
AB - We have studied the User-PC Computing (UPC) system, based on the master-worker model, as a low-cost and high-performance distributed computing platform. Docker container technology is adopted to run various application programs, or jobs, on the heterogeneous PC environments of workers. Some jobs, such as physics simulations and neural networks, require long CPU time, which increases the probability that a worker fails while running them. Automatic backup of the job running state and migration to another worker are therefore essential to reduce the job completion delay. In this paper, we implement the job running backup function in the UPC system. Checkpoint-Restore in Userspace (CRIU) is periodically applied to capture the running state of a job at a worker. When the master detects a failure, it automatically migrates the job to another worker. To evaluate the function, we conducted experiments using the testbed UPC system with 14 jobs and six workers of different specifications, and confirmed that the proposal successfully resumes the job from the interrupted point at another worker.
KW - CRIU
KW - Docker
KW - Podman
KW - UPC system
KW - periodic checkpoint
UR - http://www.scopus.com/inward/record.url?scp=85137076496&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137076496&partnerID=8YFLogxK
U2 - 10.1109/ICCCI55554.2022.9850241
DO - 10.1109/ICCCI55554.2022.9850241
M3 - Conference contribution
AN - SCOPUS:85137076496
T3 - 2022 4th International Conference on Computer Communication and the Internet, ICCCI 2022
SP - 156
EP - 161
BT - 2022 4th International Conference on Computer Communication and the Internet, ICCCI 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 4th International Conference on Computer Communication and the Internet, ICCCI 2022
Y2 - 1 July 2022 through 3 July 2022
ER -