CONF Nastase_CLIC-IT2024-2_2024/IDIAP Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement Nastase, Vivi Jiang, Chunyang Samo, Giuseppe Merlo, Paola cross-lingual diagnostic studies of deep learning models Multilingual syntactic information synthetic structured data Tenth Italian Conference on Computational Linguistics 2024

</datafield>

<subfield code="a">Nastase_CLIC-IT2024-2_2024/IDIAP</subfield>

</datafield>

<subfield code="a">Exploring syntactic information in sentence embeddings through multilingual subject-verb agreement</subfield>

</datafield>

<subfield code="a">Nastase, Vivi</subfield>

</datafield>

<subfield code="a">Jiang, Chunyang</subfield>

</datafield>

<subfield code="a">Samo, Giuseppe</subfield>

</datafield>

<subfield code="a">Merlo, Paola</subfield>

</datafield>

<subfield code="a">cross-lingual</subfield>

</datafield>

<subfield code="a">diagnostic studies of deep learning models</subfield>

</datafield>

<subfield code="a">Multilingual</subfield>

</datafield>

<subfield code="a">syntactic information</subfield>

</datafield>

<subfield code="a">synthetic structured data</subfield>

</datafield>

<subfield code="a">Tenth Italian Conference on Computational Linguistics</subfield>

</datafield>

</datafield>

<subfield code="a">In this paper, our goal is to investigate to what degree multilingual pretrained language models capture cross-linguistically valid abstract linguistic representations. We take the approach of developing curated synthetic data on a large scale, with specific properties, and using them to study sentence representations built using pretrained language models. We use a new multiple-choice task and datasets, Blackbird Language Matrices (BLMs), to focus on a specific grammatical structural phenomenon -- subject-verb agreement across a variety of sentence structures -- in several languages. Finding a solution to this task requires a system detecting complex linguistic patterns and paradigms in text representations. Using a two-level architecture that solves the problem in two steps -- detect syntactic objects and their properties in individual sentences, and find patterns across an input sequence of sentences -- we show that despite having been trained on multilingual texts in a consistent manner, multilingual pretrained language models have language-specific differences, and syntactic structure is not shared, even across closely related languages.</subfield>

</datafield>

</record>

</collection>