Title: Towards High Performance Model Inference and Training: From Algorithm to Hardware
Ph.D. Candidate: Shaoyi Huang
Major Advisor: Dr. Caiwen Ding
Co-major Advisor: Dr. Omer Khan
Associate Advisor: Dr. Dongkuan Xu, Dr. Yuan Hong, Dr. Jinbo Bi
Date/Time: Monday, July 8th, 2024, 2:00 PM
Location: HBL1102
Meeting link: https://uconn-cmr.webex.com/meet/shh20007
Abstract
In recent years, significant advancements in artificial intelligence have been driven by the development of Deep Neural Networks (DNNs) and Transformer-based models, including BERT, GPT-3, and other Large Language Models (LLMs). These technologies have catalyzed innovations in fields such as autonomous driving, recommendation systems, and chatbot applications. These models are increasingly designed with deeper, more complex structures and demand ever larger computational resources. As computational demands escalate, model sparsification has emerged as a promising method to reduce model size and computational load during execution. Given the evolution of high-performance computing platforms, particularly advanced GPUs, achieving end-to-end DNN runtime speedup through model sparsification is an attractive but difficult goal, because exploiting sparsity often requires changes to matrix storage formats and kernel configurations.
In this proposal, I will present my work on model inference and training acceleration at both the algorithm and hardware levels. It focuses on three innovative aspects: (1) an advanced sparse progressive pruning method, which shows for the first time that reducing the risk of overfitting can improve the effectiveness of pruning on language models; (2) a novel self-attention architecture with attention-specific primitives and an attention-aware pruning design for inference acceleration of Transformer-based models; (3) our recent work on sparse training via weight importance exploitation and weight coverage exploration, which unlocks the sparsity potential and enables different CNN and GNN models to achieve extremely high sparsity, along with its application to spiking neural networks.
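For readers unfamiliar with model sparsification, the sketch below illustrates generic progressive magnitude pruning in PyTorch. It is only an illustration of the general idea, under the assumption of unstructured magnitude-based pruning of a single linear layer; it does not reproduce the specific pruning, attention-aware, or sparse-training algorithms proposed in this work.

    # Generic sketch of progressive magnitude pruning (illustration only;
    # not the specific algorithms proposed in this proposal).
    import torch
    import torch.nn as nn

    def magnitude_prune(layer: nn.Linear, sparsity: float) -> None:
        """Zero out the smallest-magnitude weights of a linear layer in place."""
        with torch.no_grad():
            w = layer.weight
            k = int(sparsity * w.numel())
            if k == 0:
                return
            # Threshold = k-th smallest absolute weight; weights at or below it are pruned.
            threshold = torch.kthvalue(w.abs().flatten(), k).values
            mask = (w.abs() > threshold).float()
            w.mul_(mask)

    # Progressively increase sparsity over several pruning steps;
    # a real pipeline would fine-tune the model between steps.
    layer = nn.Linear(512, 512)
    for target in [0.5, 0.7, 0.9]:
        magnitude_prune(layer, target)
        density = (layer.weight != 0).float().mean().item()
        print(f"target sparsity {target:.0%} -> remaining density {density:.2%}")

In practice, realizing runtime speedup from such sparsity further requires sparse storage formats and kernels suited to the resulting nonzero patterns, which is the hardware-level concern addressed in this proposal.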