Stop Wasting my FLOPS: Improving the Efficiency of Deep Learning Models
Katharopoulos, Angelos
PhD thesis, École Polytechnique Fédérale de Lausanne, 2022
Keywords: attention, autoregressive transformers, efficient deep learning, importance sampling, transformers
PDF: https://publications.idiap.ch/attachments/papers/2023/Katharopoulos_THESIS_2022.pdf
DOI: https://doi.org/10.5075/epfl-thesis-8607

Abstract

Deep neural networks have completely revolutionized the field of machine learning by achieving state-of-the-art results on various tasks ranging from computer vision to protein folding. However, their application is hindered by their large computational and memory requirements. In this thesis, we propose methods for improving the efficiency of deep neural networks.

Firstly, we tackle the sample inefficiency of neural network training with an importance sampling algorithm suitable for deep neural networks. This algorithm allows us to focus computation on datapoints that are going to provide useful gradients for training our models and ignore the ones that will have negligible gradients. We show that our algorithm can improve the performance of various neural networks when compared to uniform sampling under a fixed computational budget.

Secondly, we design a model that is suitable for processing large input images with a fraction of the computational and memory requirements of traditional approaches. We achieve this by sampling from a data-dependent attention distribution in order to only process a portion of the input in high resolution. We demonstrate that our model can learn both the attention and the features in an end-to-end fashion using only single image-wise labels for supervision.

Subsequently, we shift our attention to transformer architectures and introduce a kernelized formulation for self-attention that reduces its quadratic complexity to linear with respect to the input sequence's length.
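The kernelized formulation replaces softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV), so the key–value summary is computed once and reused for every query. A minimal NumPy sketch, assuming the common elu(x)+1 feature map (function and variable names are illustrative, not the thesis code):

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map, one common choice
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """Kernelized attention with O(N) cost in the sequence length N.

    softmax(Q K^T) V is approximated by
    phi(Q) (phi(K)^T V) / (phi(Q) sum_j phi(k_j)).
    """
    Qp = elu_feature_map(Q)          # (N, d)
    Kp = elu_feature_map(K)          # (N, d)
    KV = Kp.T @ V                    # (d, d_v): computed once, shared by all queries
    Z = Qp @ Kp.sum(axis=0) + eps    # (N,): per-query normalizer
    return (Qp @ KV) / Z[:, None]
```

Because the attention weights still normalize to one per query, feeding a constant value matrix returns that constant, which is a quick sanity check on any implementation of this idea.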
Furthermore, we uncover the relationship between autoregressive transformers and recurrent neural networks and show that our formulation enables up to 3 orders of magnitude faster autoregressive inference.

Finally, we develop clustered attention, a method that can approximate softmax transformers with reduced computation. This is achieved by grouping elements of the input using clustering. We showcase that our formulation provides a better trade-off between performance and computation in comparison to the original transformer architecture. In addition, we demonstrate that clustered attention can approximate pretrained transformer models without any fine-tuning and with minimal loss in performance.
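The grouping idea can be illustrated as follows: cluster the N queries into C ≪ N groups, run softmax attention once per cluster centroid, and broadcast each centroid's result back to the cluster's members. This is a hedged sketch with plain k-means (the thesis uses a more refined clustering scheme); all names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def clustered_attention(Q, K, V, n_clusters=4, n_iters=10, seed=0):
    """Approximate softmax attention: one attention row per query cluster
    instead of one per query, reducing computation when n_clusters << N."""
    N, d = Q.shape
    rng = np.random.default_rng(seed)
    # Plain k-means on the queries (a simplification of the thesis's clustering).
    centroids = Q[rng.choice(N, size=n_clusters, replace=False)]
    for _ in range(n_iters):
        assign = np.argmin(((Q[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            members = Q[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # Softmax attention computed once per centroid, not once per query.
    A = softmax(centroids @ K.T / np.sqrt(d), axis=-1)   # (C, N)
    out_c = A @ V                                        # (C, d_v)
    return out_c[assign]                                 # broadcast back to the queries
```

When all queries in a cluster are identical, the centroid equals the queries and the approximation becomes exact, which matches the intuition that clustered attention works well when nearby queries produce similar attention distributions.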