ARTICLE ajmeraspcom/IDIAP
Speech/Music Discrimination using Entropy and Dynamism Features in a HMM Classification Framework
Ajmera, Jitendra; McCowan, Iain A.; Bourlard, Hervé
EXTERNAL https://publications.idiap.ch/attachments/reports/2001/rr01-26.pdf
PUBLIC https://publications.idiap.ch/index.php/publications/showcite/ajmera-rr-01-26
Speech Communication, Vol. 40, pp. 351-363, 2003
Related documents: IDIAP-RR 01-26

In this paper, we present a new approach towards high-performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, the (local) Probability Density Function (PDF) estimators trained on clean microphone speech (as used in a standard large-vocabulary speech recognition system) are used as a channel model, at the output of which the entropy and "dynamism" will be measured and integrated over time through a 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints. Indeed, in the case of entropy, it is clear that, on average, the entropy at the output of the local PDF estimators will be larger for speech signals than for non-speech signals presented at their input. In our case, local probabilities will be estimated by a multilayer perceptron (MLP), as used in hybrid HMM/MLP systems, thus guaranteeing the use of "real" probabilities in the estimation of the entropy. The 2-state speech/non-speech HMM will thus take these two-dimensional features (entropy and "dynamism"), whose distributions will be modeled through (two-dimensional) multi-Gaussian densities or an MLP, and whose parameters are trained through a Viterbi algorithm.

Different experiments, including different speech and music styles, as well as different (a priori) distributions of the speech and music signals (real data distribution, mostly speech, or mostly music), will illustrate the robustness of the approach, always resulting in a correct segmentation performance higher than 90%. Finally, we will show how a confidence measure can be used to further improve the segmentation results, and also discuss how this may be used to extend the technique to the case of speech/music mixtures.

REPORT ajmera-rr-01-26/IDIAP
Speech/Music Discrimination using Entropy and Dynamism Features in a HMM Classification Framework
Ajmera, Jitendra; McCowan, Iain A.; Bourlard, Hervé
EXTERNAL https://publications.idiap.ch/attachments/reports/2001/rr01-26.pdf
PUBLIC
Idiap-RR-26-2001, IDIAP, Martigny, Switzerland, 2001
Published in Speech Communication, Vol. 40, 2003
Abstract: as in the article entry above.
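Note for readers of this record: the two features named in the abstract are derived from the frame-level MLP posteriors. Assuming the standard definitions from the hybrid HMM/MLP literature (our notation; consult the linked PDF for the paper's exact formulation), with $P(q_k|x_n)$ the MLP posterior of phone class $q_k$ given acoustic vector $x_n$ and $K$ the number of classes, the per-frame entropy and dynamism are

$$H_n = -\sum_{k=1}^{K} P(q_k|x_n)\,\log P(q_k|x_n), \qquad D_n = \frac{1}{K}\sum_{k=1}^{K}\bigl(P(q_k|x_n) - P(q_k|x_{n-1})\bigr)^2.$$

For speech, posterior mass spreads over competing phone classes and changes rapidly between frames, so both quantities tend to be larger than for music; in practice both are typically averaged over a sliding window before classification.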
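As a concrete illustration of that feature extraction, here is a minimal Python/NumPy sketch (ours, not code from the paper; function name and windowing choices are hypothetical) that computes the two features from a matrix of posterior probabilities:

import numpy as np

def entropy_dynamism(posteriors, eps=1e-12):
    """Per-frame entropy and dynamism from MLP posteriors.

    posteriors: (n_frames, n_classes) array; each row is assumed to
    sum to one (softmax outputs of the phone-class MLP).
    """
    p = np.clip(posteriors, eps, 1.0)
    # Entropy of the posterior distribution at each frame: higher for
    # speech, where probability mass spreads over competing phones.
    entropy = -np.sum(p * np.log(p), axis=1)
    # Dynamism: mean squared change of the posterior vector between
    # consecutive frames; speech posteriors fluctuate more than music.
    diff = np.diff(posteriors, axis=0)
    dynamism = np.mean(diff ** 2, axis=1)
    # Align lengths (the first frame has no predecessor).
    dynamism = np.concatenate([[dynamism[0]], dynamism])
    return entropy, dynamism

# Example: 100 frames, 40 phone classes of random softmax outputs.
rng = np.random.default_rng(0)
logits = rng.normal(size=(100, 40))
post = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
H, D = entropy_dynamism(post)
print(H.shape, D.shape)  # (100,) (100,)

The resulting (H, D) pairs form the two-dimensional feature vectors whose class-conditional distributions the abstract describes modeling with multi-Gaussian densities or an MLP.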
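The minimum duration constraint on the 2-state HMM is commonly enforced by expanding each state into a left-to-right chain of tied sub-states, so that a class switch is only reachable after a fixed number of frames. The following sketch decodes under that assumption (hypothetical topology and transition probability, not the paper's exact configuration):

import numpy as np

def viterbi_min_duration(log_lik, min_dur, p_stay=0.99):
    """2-class Viterbi decoding with a minimum-duration constraint.

    Each class (0 = speech, 1 = non-speech) is expanded into a chain
    of `min_dur` tied sub-states; a switch is only possible from the
    last sub-state, so every decoded segment lasts >= min_dur frames.
    log_lik: (n_frames, 2) per-class log-likelihoods.
    """
    n, _ = log_lik.shape
    S = 2 * min_dur
    logA = np.full((S, S), -np.inf)
    for c in range(2):
        base = c * min_dur
        for i in range(min_dur - 1):
            logA[base + i, base + i + 1] = 0.0   # advance within chain
        last = base + min_dur - 1
        logA[last, last] = np.log(p_stay)                 # stay in class
        logA[last, (1 - c) * min_dur] = np.log(1 - p_stay)  # switch
    # Each sub-state emits with its class's log-likelihood.
    emit = np.repeat(log_lik, min_dur, axis=1)
    delta = np.full((n, S), -np.inf)
    psi = np.zeros((n, S), dtype=int)
    delta[0, 0] = emit[0, 0]               # start of speech chain
    delta[0, min_dur] = emit[0, min_dur]   # start of non-speech chain
    for t in range(1, n):
        scores = delta[t - 1][:, None] + logA
        psi[t] = np.argmax(scores, axis=0)
        delta[t] = scores[psi[t], np.arange(S)] + emit[t]
    # Backtrack and map sub-states back to class labels.
    path = np.zeros(n, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(n - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]
    return path // min_dur

# Example: decode 200 frames of hypothetical per-class log-likelihoods.
rng = np.random.default_rng(1)
labels = viterbi_min_duration(rng.normal(size=(200, 2)), min_dur=30)
print(labels[:10])

Duplicating states this way keeps the decoder a plain HMM while guaranteeing the minimum segment length, which is why it is the usual way such constraints are realized in segmentation systems of this kind.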