Extractive text or speech summarization endeavors to select representative sentences from a source document and assemble them into a concise summary, so as to help people to browse and assimilate the main theme of the document efficiently. The recent past has seen a surge of interest in developing deep learning- or deep neural network-based supervised methods for extractive text summarization. This paper presents a continuation of this line of research for speech summarization and its contributions are three-fold. First, we exploit an effective framework that integrates two convolutional neural networks (CNNs) and a multilayer perceptron (MLP) for summary sentence selection. Specifically, CNNs encode a given document-sentence pair into two discriminative vector embeddings separately, while MLP in turn takes the two embeddings of a document-sentence pair and their similarity measure as the input to induce a ranking score for each sentence. Second, the input of MLP is augmented by a rich set of prosodic and lexical features apart from those derived from CNNs. Third, the utility of our proposed summarization methods and several widely-used methods are extensively analyzed and compared. The empirical results seem to demonstrate the effectiveness of our summarization method in relation to several state-of-the-art methods.