THESIS Mohammadshahi_THESIS_2023/IDIAP Modeling Structured Data in Attention-based Models Mohammadshahi, Alireza EXTERNAL http://publications.idiap.ch/attachments/papers/2023/Mohammadshahi_THESIS_2023.pdf PUBLIC 2023 EPFL https://infoscience.epfl.ch/record/303812 URL Natural language processing has improved significantly with the development of Transformer-based models, which employ the self-attention mechanism and pre-training strategies. However, these models still present several obstacles. A notable issue is their inability to encode structured data, e.g. graphs, which is crucial for tasks involving structured knowledge processing. Also, the considerable size of pre-trained Transformer-based models poses challenges for real-world deployment, prompting a growing interest in applying compression methods. Understanding the impact of compression requires examining several aspects, including attention patterns in compressed models for different languages, and gender and semantic biases. In this thesis, we present an approach for extending Transformer-based models to encode graphs in the attention mechanism, and examine how these models behave after compression. Our first contribution is the Graph-to-Graph Transformer (G2GTr) model, which modifies the self-attention mechanism of the Transformer to condition on graphs in addition to the input sequences. This mechanism incorporates graph relations as continuous embeddings, unlike previous works that use a discrete model structure or pre-defined discrete attention heads. Inputting these embeddings into the self-attention mechanism, which is applied to every pair of tokens, gives an explicit representation of relations. In this way, each attention head can easily learn to attend only to tokens in a given relation, while also having the ability to learn additional structures in conjunction with other tokens.
We then apply the G2GTr model to encode the partially constructed graph in the transition-based dependency parsing task. Our second contribution is Recursive Non-autoregressive G2GTr, a graph prediction model that exploits the complete graph-to-graph capabilities of the G2GTr model to recursively refine the predicted output graph. Although this model is non-autoregressive and predicts all graph edges simultaneously, it can capture any between-edge dependencies by conditioning on the previously predicted graph, similar to an auto-regressive model. Our third contribution is the Syntax-aware G2GTr model, which encodes syntactic knowledge in the semantic role labelling (SRL) task. It conditions on the sentence's dependency structure and predicts SRL graphs. We show empirically that our model surpasses the performance of prior comparable models. Our fourth contribution is to demonstrate the effects of compression methods on multilingual neural machine translation (MNMT) models. We analyse different aspects of compressed models, including attention patterns and gender and semantic biases. We evaluate the impact of compression on different language pairs using the FLORES-101 benchmark. We also consider the MT-Gender and DiBiMT benchmarks, which allow us to assess different types of biases that could be present in the data and the MNMT model. Our fifth contribution is SMaLL-100, a Shallow Multilingual Machine Translation Model for Low-Resource Languages covering 100 languages, which is a distilled version of M2M-100 (12B). We focus on low and very low-resource language pairs, given the absence of a reasonably-sized model that delivers satisfactory performance across a substantial number of low-resource languages. We evaluate SMaLL-100 on various low-resource benchmarks and show significant improvement in both efficiency and performance.
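As an illustration of the graph-conditioned attention described in the first contribution, the following is a minimal single-head sketch in plain Python. It assumes one common way of feeding relation embeddings into attention: adding a per-pair relation embedding to the key and to the value of each token pair. This is an assumption made for illustration and may differ from the exact formulation in the thesis. The function name `graph_attention`, the integer relation ids, and the embedding tables `rel_key_emb` and `rel_val_emb` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def add(u, v):
    return [a + b for a, b in zip(u, v)]


def graph_attention(queries, keys, values, relations, rel_key_emb, rel_val_emb):
    """Single-head attention conditioned on a graph (illustrative sketch).

    queries, keys, values: lists of n vectors of size d (already projected).
    relations[i][j]: hypothetical integer id of the relation between
        token i and token j (0 means "no relation").
    rel_key_emb, rel_val_emb: hypothetical tables mapping a relation id
        to a d-dimensional embedding (assumed design, see lead-in).
    Returns (outputs, weights), each a list with one entry per query token.
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score every pair (i, j): the key is shifted by the relation embedding.
        scores = [
            dot(q, add(keys[j], rel_key_emb[relations[i][j]])) / math.sqrt(d)
            for j in range(len(keys))
        ]
        w = softmax(scores)
        # Weighted sum of values, each shifted by its relation embedding.
        out = [0.0] * d
        for j, w_ij in enumerate(w):
            v = add(values[j], rel_val_emb[relations[i][j]])
            out = [o + w_ij * x for o, x in zip(out, v)]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights


# Toy example: 3 tokens, d = 2, one edge from token 0 to token 2 (relation id 1).
queries = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
keys = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relations = [[0, 0, 1], [0, 0, 0], [0, 0, 0]]
rel_key_emb = {0: [0.0, 0.0], 1: [10.0, 0.0]}
rel_val_emb = {0: [0.0, 0.0], 1: [0.0, 0.0]}

outputs, weights = graph_attention(
    queries, keys, values, relations, rel_key_emb, rel_val_emb
)
```

In this toy setting, token 0 places almost all of its attention on token 2, the only token it is related to, while token 1, which has no relations, attends uniformly. Because the relation enters as a continuous embedding rather than a hard mask, a trained head would remain free to combine it with content-based attention.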