Task 016/017 - End-to-End Performance Benchmark / Neuromorphic Design Flow

Event Date: August 12, 2021
Time: 11:00 am (ET) / 8:00 am (PT)
Amrit Nagarajan, Purdue University
Specialized Transformers: Faster, Smaller, and More Accurate NLP Models
Abstract: Transformers have greatly advanced the state-of-the-art in Natural Language Processing (NLP) in recent years, but are especially demanding in their computation and storage requirements. Transformers are created by first pre-training a very large language model on a large dataset in an unsupervised or self-supervised manner, and subsequently fine-tuning the pre-trained model for different downstream tasks. We observe that this design process leads to models that are often highly over-parameterized for the downstream task at hand. Based on this insight, we introduce a Specialization framework to create optimized Transformer models for a given downstream task. Unlike previous approaches that trade off accuracy for efficiency, our framework is primarily accuracy-driven and actually improves the accuracy of the downstream task by systematically identifying and pruning elements of the Transformer that are irrelevant to the task and hence hinder performance. We also replace the dense soft-attention in selected layers with sparse hard-attention to help the model focus on the relevant parts of the input. In effect, our framework leads to models that are not only faster and smaller, but also more accurate. The large number of parameters contained in Transformers presents a challenge in the form of a large pruning design space. Further, the traditional iterative prune-retrain approach is not applicable to Transformers, since the fine-tuning data is often very small and re-training quickly leads to overfitting. To address these challenges, we propose a hierarchical, re-training-free pruning method with model- and task-specific heuristics. Our experiments on three state-of-the-art pre-trained models and 10 downstream tasks show that Specialized models are consistently more accurate (by up to 4.5%), while also being up to 2.7x faster and up to 3.2x smaller than their baseline counterparts. In addition, we demonstrate that Specialization can be combined with previous efforts such as distillation or quantization to achieve further benefits. Our framework does not require any additional training, integrates into the existing fine-tuning process, and can be applied in a plug-and-play manner to any Transformer model while fine-tuning for any downstream task.
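
To make the "sparse hard-attention" idea mentioned in the abstract concrete, the sketch below shows one common way to sparsify scaled dot-product attention: each query keeps only its top-k highest-scoring keys and masks the rest out before the softmax. This is an illustrative PyTorch example only; the function name `topk_hard_attention` and the parameter `top_k` are assumptions for this sketch, and the talk's actual hard-attention mechanism and pruning heuristics may differ.

```python
import torch
import torch.nn.functional as F

def topk_hard_attention(q, k, v, top_k=8):
    """Scaled dot-product attention where each query attends only to its
    top_k highest-scoring keys; all other positions are masked to -inf
    before the softmax, producing a sparse ("hard") attention pattern.

    q, k, v: tensors of shape (batch, heads, seq_len, head_dim).
    """
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5          # (B, H, L, L)

    # Keep only the top-k scores per query; mask everything else to -inf.
    k_eff = min(top_k, scores.size(-1))
    topk_vals, topk_idx = scores.topk(k_eff, dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, topk_idx, topk_vals)

    attn = F.softmax(masked, dim=-1)                      # zeros outside the top-k
    return attn @ v

# Tiny usage example with random tensors.
if __name__ == "__main__":
    q = torch.randn(1, 2, 16, 32)
    k = torch.randn(1, 2, 16, 32)
    v = torch.randn(1, 2, 16, 32)
    out = topk_hard_attention(q, k, v, top_k=4)
    print(out.shape)  # torch.Size([1, 2, 16, 32])
```

Because the masking is applied only at inference over already-computed attention scores, a drop-in sparsification of this kind requires no re-training, which is in the spirit of the re-training-free approach described above.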
 
Bio: Amrit Nagarajan is a PhD student in the School of Electrical and Computer Engineering at Purdue University, working as a Research Assistant under the supervision of Prof. Anand Raghunathan. He received his B.E. degree from Anna University, Chennai, India, in 2018. His research interests include approximate computing and hardware-aware software optimizations for efficient deep learning inference.