CONF Sarfjoo_INTERSPEECH_2020/IDIAP Interspeech 2020 3815-3819

</datafield>

<subfield code="a">Sarfjoo_INTERSPEECH_2020/IDIAP</subfield>

</datafield>

<subfield code="a">Supervised domain adaptation for text-independent speaker verification using limited data</subfield>

</datafield>

<subfield code="a">Sarfjoo, Seyyed Saeed</subfield>

</datafield>

<subfield code="a">Madikeri, Srikanth</subfield>

</datafield>

<subfield code="a">Motlicek, Petr</subfield>

</datafield>

<subfield code="a">Marcel, Sébastien</subfield>

</datafield>

<subfield code="a">batch norm</subfield>

</datafield>

<subfield code="a">speaker recognition</subfield>

</datafield>

<subfield code="a">speaker verification</subfield>

</datafield>

<subfield code="a">supervised adaptation</subfield>

</datafield>

<subfield code="a">transfer learning</subfield>

</datafield>

<subfield code="i">EXTERNAL</subfield>

<subfield code="u">http://publications.idiap.ch/attachments/papers/2020/Sarfjoo_INTERSPEECH_2020.pdf</subfield>

<subfield code="x">PUBLIC</subfield>

</datafield>

<subfield code="a">Interspeech</subfield>

</datafield>

</datafield>

</datafield>

<subfield code="u">http://www.interspeech2020.org/uploadfile/pdf/Thu-1-7-4.pdf</subfield>

</datafield>

<subfield code="a">To adapt a speaker verification (SV) system to a target domain with limited data, this paper investigates transfer learning from a model pre-trained on source-domain data. To that end, layer-by-layer adaptation with transfer learning from the initial and final layers of the pre-trained model is investigated. We show that the model adapted from the initial layers outperforms the model adapted from the final layers. Based on this evidence, and inspired by work in the image recognition field, we hypothesize that low-level convolutional neural network (CNN) layers characterize domain-specific components, while high-level CNN layers are domain-independent and have more discriminative power. To adapt these domain-specific components, angular margin softmax (AMSoftmax) is applied to the CNN-based implementation of the x-vector architecture. In addition, to reduce over-fitting on the limited target data, transfer learning on the batch norm layers is investigated. Mean shift and covariance estimation in batch norm allow the represented components of the target domain to be mapped to the source domain. Using TDNN and E-TDNN versions of the x-vectors as baseline models, the adapted models outperformed the baselines on the development set of NIST SRE 2018, with relative improvements of 11.0% and 13.8%, respectively.</subfield>

</datafield>

</record>

</collection>