In natural language processing, neural networks are becoming increasingly deep and complex. Recent state-of-the-art models are deep language representation models such as BERT, ELMo, and GPT. These developments have led to the conviction that previous-generation, shallower neural networks for language understanding are obsolete. In this work, however, we demonstrate how to use these large models to distill knowledge into a shallow student network, which can then be pruned aggressively to yield a lightweight network 90\% smaller than the original student while maintaining its performance. Specifically, we propose to distill knowledge from BERT, a state-of-the-art language representation model, into a single-layer LSTM, and to apply pruning to this student network as well, resulting in a sparse, fast network that retains BERT's knowledge and outperforms a conventionally trained LSTM. We demonstrate our method across multiple datasets and natural language processing tasks.
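As a minimal sketch of the kind of objective commonly used for this style of distillation (a standard formulation, not necessarily the exact objective of this work; $\alpha$, $z^{(T)}$, and $z^{(S)}$ are illustrative notation for an assumed mixing weight and the teacher and student logits, respectively):
\begin{equation}
\mathcal{L} \;=\; \alpha \, \mathcal{L}_{\mathrm{CE}}\big(y, \operatorname{softmax}(z^{(S)})\big) \;+\; (1 - \alpha) \, \big\lVert z^{(T)} - z^{(S)} \big\rVert_2^2 ,
\end{equation}
where the cross-entropy term fits the hard labels $y$ and the squared-error term pushes the student's logits toward the teacher's. After distillation, one common choice for the pruning step is magnitude-based weight pruning, which removes the smallest-magnitude parameters until the target sparsity (here, 90\%) is reached.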