<?xml version="1.0" encoding="UTF-8"?>
<collection xmlns="http://www.loc.gov/MARC21/slim">
	<record>
		<datafield tag="980" ind1=" " ind2=" ">
			<subfield code="a">REPORT</subfield>
		</datafield>
		<datafield tag="970" ind1=" " ind2=" ">
			<subfield code="a">Kabil_Idiap-RR-10-2022/IDIAP</subfield>
		</datafield>
		<datafield tag="245" ind1=" " ind2=" ">
			<subfield code="a">SPARSE AUTOENCODERS TO ENHANCE SPEECH RECOGNITION</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Kabil, Selen Hande</subfield>
		</datafield>
		<datafield tag="700" ind1=" " ind2=" ">
			<subfield code="a">Bourlard, Hervé</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">Deep neural network</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">high-dimensional sparse representations</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">sparse autoencoder</subfield>
		</datafield>
		<datafield tag="653" ind1="1" ind2=" ">
			<subfield code="a">speech recognition</subfield>
		</datafield>
		<datafield tag="856" ind1="4" ind2="0">
			<subfield code="i">EXTERNAL</subfield>
			<subfield code="u">http://publications.idiap.ch/attachments/reports/2020/Kabil_Idiap-RR-10-2022.pdf</subfield>
			<subfield code="x">PUBLIC</subfield>
		</datafield>
		<datafield tag="088" ind1=" " ind2=" ">
			<subfield code="a">Idiap-RR-10-2022</subfield>
		</datafield>
		<datafield tag="260" ind1=" " ind2=" ">
			<subfield code="c">2022</subfield>
			<subfield code="b">Idiap</subfield>
		</datafield>
		<datafield tag="771" ind1="2" ind2=" ">
			<subfield code="d">August 2022</subfield>
		</datafield>
		<datafield tag="500" ind1=" " ind2=" ">
			<subfield code="a">Submitted to ICASSP 2021</subfield>
		</datafield>
		<datafield tag="520" ind1=" " ind2=" ">
			<subfield code="a">Starting from a state-of-the-art speech recognition system as baseline, we investigate the use of sparse autoencoders to further process the baseline acoustic model outputs (senone likelihoods) to obtain even more reliable senone and phone likelihoods with better classification performance. Inspired by dictionary learning, we use shallow overcomplete sparse autoencoders to project the senone likelihoods into a high-dimensional sparse space, which is expected to increase the separability among different speech units while suppressing distortions. On the Augmented Multiparty Interaction (AMI) dataset, with both Individual Head-mounted Microphone (IHM) and Single Distant Microphone (SDM) conditions, we show that classifiers trained on the high-dimensional sparse representations from the sparse autoencoders improve senone and phone classification performance compared to the baseline, especially in the noisy (SDM) condition. As for word recognition, we obtain promising improvements when a weighted combination of the senone likelihoods from the baseline acoustic model and from our senone classifier is used for decoding.</subfield>
		</datafield>
	</record>
</collection>