CONF
ajmera2001art/IDIAP
Robust HMM-Based Speech/Music Segmentation
Ajmera, Jitendra
McCowan, Iain A.
Bourlard, Hervé
EXTERNAL
https://publications.idiap.ch/attachments/reports/2001/ajmera2002icassp.pdf
PUBLIC
https://publications.idiap.ch/index.php/publications/showcite/ajmera-rr-01-33
Related documents
ICASSP
2002
Orlando, Florida
1746-1749
IDIAP-RR 01-33
In this paper we present a new approach towards high-performance speech/music segmentation on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, the local probability density function (PDF) estimators trained on clean microphone speech are used as a channel model, at the output of which the entropy and ``dynamism'' are measured and integrated over time through a 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints. The parameters of the HMM are trained using the EM algorithm in a completely unsupervised manner. Different experiments, including a variety of speech and music styles, as well as different segment durations of speech and music signals (real data distribution, mostly speech, or mostly music), illustrate the robustness of the approach, which in each case achieves a frame-level accuracy greater than 94\%.
REPORT
ajmera-rr-01-33/IDIAP
Robust HMM-Based Speech/Music Segmentation
Ajmera, Jitendra
McCowan, Iain A.
Bourlard, Hervé
EXTERNAL
https://publications.idiap.ch/attachments/reports/2001/rr01-33.pdf
PUBLIC
Idiap-RR-33-2001
2001
IDIAP
Martigny, Switzerland
ICASSP, Orlando, Florida, 2002
In this paper we present a new approach towards high-performance speech/music segmentation on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, the local probability density function (PDF) estimators trained on clean microphone speech are used as a channel model, at the output of which the entropy and ``dynamism'' are measured and integrated over time through a 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints. The parameters of the HMM are trained using the EM algorithm in a completely unsupervised manner. Different experiments, including a variety of speech and music styles, as well as different segment durations of speech and music signals (real data distribution, mostly speech, or mostly music), illustrate the robustness of the approach, which in each case achieves a frame-level accuracy greater than 94\%.