Automatic mispronunciation detection plays a crucial role in a computer-assisted pronunciation training (CAPT) system. The main purpose of mispronunciation detection is to judge whether a non-native speaker's pronunciations are correct. In general, the process of mispronunciation detection can be divided into two parts: 1) a front-end feature extraction module that generates pronunciation detection features from an input speech segment and its associated reference acoustic models; and 2) a back-end classification module that determines the correctness of the segment's pronunciation according to the output of a classifier that takes the segment's pronunciation detection features as input. The main contributions of this work are three-fold. First, we investigate the use of two state-of-the-art acoustic models, based on deep neural networks (DNN) and convolutional neural networks (CNN), respectively, and compare their effectiveness for extracting discriminative pronunciation detection features. Second, we experiment with different types of classification methods and propose a novel integration of DNN- and CNN-based decision scores at the back-end. Third, we provide an extensive set of empirical evaluations of the aforementioned two modules and their associated methods on a recently compiled corpus for learning Mandarin Chinese as a second language. The experimental results demonstrate the effectiveness of our approach in comparison with several existing baselines.
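
The two-stage pipeline and the back-end score integration described above can be sketched as follows. This is a minimal illustration only: the function names, the linear fusion rule, and the threshold are assumptions for exposition, not the paper's actual method, whose fusion scheme and classifier details are given in the body of the work.

```python
def fuse_scores(dnn_score, cnn_score, alpha=0.5):
    """Linearly interpolate DNN- and CNN-based decision scores.

    alpha weights the DNN score; (1 - alpha) weights the CNN score.
    The linear rule here is an illustrative assumption, not the
    paper's specific integration scheme.
    """
    return alpha * dnn_score + (1.0 - alpha) * cnn_score


def detect_mispronunciation(dnn_score, cnn_score, alpha=0.5, threshold=0.0):
    """Flag a segment as mispronounced when the fused decision score
    falls below a decision threshold (hypothetical convention: higher
    scores indicate more native-like pronunciation)."""
    return fuse_scores(dnn_score, cnn_score, alpha) < threshold
```

For example, a segment whose DNN- and CNN-based scores are both strongly negative would be flagged as mispronounced under this convention, while a segment with high scores from both models would be accepted; the weight `alpha` trades off how much each acoustic model's evidence contributes to the final decision.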