This study proposes a multi-modal perception approach that enables a robotic arm to perform flexible automation and simplifies the complicated coding process of controlling a robotic arm. A depth camera is used to detect faces and hand gestures, recognizing the operator's identity and commands. In addition, the kinematics of the robotic arm, associated with the positions of the manipulated objects, can be derived from human demonstrations and object detection. In the experiments, the proposed multi-modal perception system first recognizes the operator. The operator then demonstrates a task, generating learning data with the assistance of gesture commands. Afterward, the robotic arm reproduces the demonstrated task. While imitating the task, the robotic arm can also be guided by the operator's gesture commands.