The attention-based encoder-decoder modeling paradigm has achieved impressive success on a wide variety of speech and language processing tasks. For automatic speech recognition (ASR), this paradigm takes advantage of the ability of neural networks to learn a direct and streamlined mapping from an input sequence to an output sequence, without any prior knowledge such as audio-text alignments or pronunciation lexicons. An ASR model built on this paradigm, however, inevitably faces inadequate generalization, especially when it is not trained with large amounts of speech data. In view of this, we propose in this paper a decoder-masking-based training approach for end-to-end (E2E) ASR models, taking inspiration from speech input augmentation (viz. SpecAugment) and masked language modeling (viz. BERT). During the training phase, we randomly replace some portions of the decoder's historical input with the symbol [mask], which encourages the decoder to robustly output the correct token even when parts of its decoding history are masked. The proposed approach is instantiated with a top-of-the-line transformer-based E2E ASR model. Extensive experiments conducted on two benchmark datasets (viz. Librispeech960h and TedLium2) suggest that our approach is effective relative to some existing E2E ASR systems.
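The decoder masking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `mask_decoder_history`, the independent per-token masking, the default rate of 0.15, and the choice to leave the start-of-sequence token unmasked are all assumptions made for illustration, since the text only states that "some portions" of the decoding history are replaced with [mask].

```python
import random


def mask_decoder_history(tokens, mask_id, mask_prob=0.15, keep_first=True, seed=None):
    """Return a copy of the decoder input with some tokens replaced by mask_id.

    Hypothetical helper: masks each position independently with probability
    mask_prob. The masking granularity and rate are assumptions, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    masked = list(tokens)  # copy, so the caller's sequence is not modified
    # Assumption: the start-of-sequence token at index 0 is never masked.
    start = 1 if keep_first else 0
    for i in range(start, len(masked)):
        if rng.random() < mask_prob:
            masked[i] = mask_id
    return masked


# Illustrative token ids (all hypothetical).
SOS, EOS, MASK = 1, 2, 3
decoder_input = [SOS, 12, 7, 31, 5]   # teacher-forced decoding history
targets = [12, 7, 31, 5, EOS]         # prediction targets stay unmasked

noisy_input = mask_decoder_history(decoder_input, mask_id=MASK, mask_prob=0.3, seed=0)
```

In such a setup, only the decoder's input history is corrupted during training; the targets are left intact, so the loss still asks for the correct token at every position. At inference time no masking would be applied.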