Popularity prediction of social media becomes a more attractive issue in recent years. It consists of multi-type data sources such as image, meta-data, and text information. In order to effectively predict the popularity of a specified post in the social network, fusing multi-feature from heterogeneous data is required. In this paper, a popularity prediction framework for social media based on multi-modal feature mining is presented. First, we discover image semantic features by extracting their image descriptions generated by image captioning. Second, an effective text-based feature engineering is used to construct an effective word-to-vector model. The trained word-to-vector model is used to encode the text information and the semantic image features. Finally, an ensemble regression approach is proposed to aggregate these encoded features and learn the final regressor. Extensive experiments show that the proposed method significantly outperforms other state-of-the-art regression models. We also show that the multi-modal approach could effectively improve the performance in the social media prediction challenge.