Extractive text or speech summarization selects indicative sentences from a source document and assembles them into a succinct summary, helping readers browse the document and grasp its main theme efficiently. A recent trend is the development of supervised, deep learning based methods for extractive summarization. This paper extends and contextualizes this line of research for spoken document summarization, and its contributions are at least threefold. First, we propose a neural summarization framework that can flexibly incorporate extra acoustic/prosodic and lexical features, and in which the ROUGE evaluation metric is embedded into the training objective function and optimized with reinforcement learning. Second, we investigate disparate ways of integrating acoustic features into this framework. Third, we extensively compare and evaluate our proposed summarization methods against several widely used state-of-the-art ones. A series of empirical experiments suggests that our summarization methods are effective.