Mispronunciation detection and diagnosis (MDD) aims to detect erroneous pronunciation segments in an L2 learner's articulation and subsequently provide informative diagnostic feedback. Most existing neural methods follow a dictation-based modeling paradigm that detects pronunciation errors and returns diagnostic feedback simultaneously by aligning the phone sequence recognized from an L2 learner's utterance with the corresponding canonical phone sequence of a given text prompt. However, the main downside of these methods is that the dictation process and the alignment process are largely carried out independently of each other. In view of this, we present a novel end-to-end neural method, dubbed PeppaNet, built on a unified structure that jointly models the dictation process and the alignment process. PeppaNet learns to directly predict the pronunciation correctness of each canonical phone of the text prompt and in turn provides the corresponding diagnostic feedback. In contrast to conventional dictation-based methods, which rely mainly on a free-phone recognition process, PeppaNet employs an effective selective gating mechanism that simultaneously incorporates phonetic, phonological and acoustic cues to generate corrections that are more appropriate and more phonetically related to the canonical pronunciations. Extensive experiments conducted on the L2-ARCTIC benchmark dataset show the merits of our proposed method in comparison to several recent state-of-the-art methods.
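To make the idea of a selective gating mechanism concrete, the following is a minimal NumPy sketch of one plausible form of gated fusion between acoustic and phonetic evidence, followed by a per-phone correctness head. All names, dimensions, and parameter shapes here are illustrative assumptions for exposition; this is not the paper's actual PeppaNet architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes (assumptions, not from the paper).
T = 5   # number of canonical phones in the text prompt
d = 8   # embedding dimensionality

# Hypothetical per-phone representations: acoustic features aligned to each
# canonical phone position, and embeddings of the canonical phones themselves.
acoustic = rng.standard_normal((T, d))
phonetic = rng.standard_normal((T, d))

# Hypothetical gate parameters.
W_g = rng.standard_normal((2 * d, d))
b_g = np.zeros(d)

# Selective gate: a per-dimension weight in (0, 1) deciding how much acoustic
# versus phonetic (canonical) evidence passes into the fused representation.
g = sigmoid(np.concatenate([acoustic, phonetic], axis=1) @ W_g + b_g)
fused = g * acoustic + (1.0 - g) * phonetic

# Hypothetical detection head: a probability that each canonical phone
# was pronounced correctly.
w_out = rng.standard_normal(d)
p_correct = sigmoid(fused @ w_out)
```

In such a design, the gate lets the model lean on canonical phonetic information when the acoustic evidence is ambiguous, which is one way corrections can stay phonetically close to the intended pronunciation rather than drifting to arbitrary free-recognition outputs.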