This study compared three display modes of video captions on mobile devices (non-caption, full-caption, and target-word) for their effects on the English listening comprehension and vocabulary acquisition of fifth graders. During the one-month experiment, the students' English listening comprehension and vocabulary acquisition were evaluated weekly. The results showed that the target-word group achieved vocabulary acquisition as satisfactory as that of the full-caption group, and both groups outperformed the non-caption group. Moreover, the visual-style students in the target-word and full-caption groups acquired vocabulary more effectively than those in the non-caption group. In terms of listening comprehension, the students in all three groups made remarkable progress, with no significant difference among the groups.