Idiap Research Institute
Advancing Neural Representations for Paralinguistic Analysis: From Speech Emotion to Parkinson's Disease Assessment
Type of publication: Thesis
Citation: Purohit_THESIS_2026
Year: 2026
Month: February
School: EPFL
Address: Lausanne
URL: https://infoscience.epfl.ch/ha...
DOI: 10.5075/epfl-thesis-11387
Abstract: Paralinguistics comes from the Greek preposition para, meaning "alongside". It refers to the study of information in speech that goes beyond words, capturing cues such as emotion, personality, gender, and health. Rather than focusing on "what" is said, it focuses more on "how" it is said. It examines speech through two complementary temporal dimensions: states, which represent short-term affective variations (such as emotion or stress), and traits, which reflect long-term speaker characteristics, including gender, age, or pathological conditions like Parkinson's disease. Stable traits influence how transient states are expressed and perceived, forming a continuum that underlies paralinguistic analysis. Traditional Speech Emotion Recognition (SER) approaches rely on either (a) suprasegmental modeling of handcrafted acoustic descriptors or (b) direct modeling of long-duration speech signals (typically 4 to 6 seconds) using deep neural networks. This thesis departs from these paradigms by introducing a short-segment modeling strategy, showing that emotion-relevant information can be effectively captured from 250 ms speech waveform segments using an end-to-end Convolutional Neural Network (CNN). Across multiple emotion corpora, the proposed model achieved performance comparable to utterance-level systems and outperformed handcrafted features extracted over the same 250 ms duration. Relevance-signal-based interpretability analysis revealed that the CNN learns emotion-relevant cepstral features, confirming the strength of data-driven short-segment modeling. Building on this, a phonetically aware neural modeling framework was proposed to explore whether phonetic information captures emotional cues. By leveraging neural features that encode phonetic information, this approach consistently outperformed traditional acoustic features across benchmark datasets. These results highlight the importance of phonetic information in emotion modeling. 
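The short-segment strategy above can be illustrated with a minimal NumPy sketch (not the thesis code): a waveform is chopped into 250 ms windows, and each window passes through a stand-in for the CNN's first layer. The sampling rate, filter count, kernel size, and stride are illustrative assumptions, and the filters are random rather than learned.

```python
import numpy as np

def split_into_segments(waveform, sr=16000, seg_ms=250):
    """Chop a raw waveform into non-overlapping 250 ms segments."""
    seg_len = int(sr * seg_ms / 1000)          # 4000 samples at 16 kHz
    n_full = len(waveform) // seg_len
    return waveform[: n_full * seg_len].reshape(n_full, seg_len)

def toy_conv1d_features(segment, kernel_size=400, stride=160, n_filters=8):
    """Stand-in for a first CNN layer: random filters, ReLU, global max-pool."""
    rng = np.random.default_rng(0)
    filters = rng.standard_normal((n_filters, kernel_size)) * 0.01
    n_frames = (len(segment) - kernel_size) // stride + 1
    out = np.empty((n_filters, n_frames))
    for t in range(n_frames):
        window = segment[t * stride : t * stride + kernel_size]
        out[:, t] = filters @ window
    return np.maximum(out, 0.0).max(axis=1)    # ReLU + global max-pooling

sr = 16000
speech = np.random.default_rng(1).standard_normal(sr * 3)   # 3 s of fake audio
segments = split_into_segments(speech, sr=sr)
features = np.stack([toy_conv1d_features(s) for s in segments])
print(segments.shape, features.shape)  # (12, 4000) (12, 8)
```

Each 250 ms segment yields its own fixed-length feature vector, so segment-level emotion predictions can later be aggregated over an utterance.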
Extending these findings from transient states to persistent traits, the study examined Speech Foundation Models (SFMs) for detecting neurological conditions, focusing on Parkinson's disease (PD). In low-resource clinical scenarios, parameter-efficient adaptation strategies such as layer selection and Low-Rank Adaptation (LoRA) were introduced. We observed that the layer selection method matched the performance of full fine-tuning while requiring significantly fewer parameters. Notably, the application of LoRA to the Whisper model surpassed other methods, suggesting that models pretrained for task-specific speech recognition are conducive to efficient adaptation for PD speech detection. To further explore the interaction between states and traits, the research addressed comorbid depression detection in PD, a challenging task due to overlapping vocal characteristics. In this low-data setting, large SFMs failed to generalize well, whereas interpretable handcrafted acoustic features with robust feature selection proved more effective. Analysis showed that depression manifests through different acoustic markers: while non-PD depression is dominated by source-related features, PD-related depression reflects both source and system cues. Overall, this work advances the understanding of how paralinguistic information is encoded in neural representations, bridging interpretability and scalability toward the development of robust, explainable models for paralinguistic states and traits inference.
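The parameter-efficiency argument behind LoRA can be made concrete with a small NumPy sketch (not the thesis implementation): a pretrained weight matrix stays frozen while only a low-rank update B·A is trained. The layer width (768) and rank (4) are illustrative assumptions.

```python
import numpy as np

class LoRALinear:
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, d_out, d_in, r=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in))      # frozen
        self.A = rng.standard_normal((r, d_in)) * 0.01   # trainable
        self.B = np.zeros((d_out, r))                    # trainable, zero-init
        self.scale = alpha / r

    def __call__(self, x):
        # y = W x + (alpha / r) * B A x ; only A and B receive gradients
        return self.W @ x + self.scale * (self.B @ (self.A @ x))

    def trainable_params(self):
        return self.A.size + self.B.size

layer = LoRALinear(d_out=768, d_in=768, r=4)
full = layer.W.size               # parameter count under full fine-tuning
lora = layer.trainable_params()   # parameter count under rank-4 LoRA
print(full, lora, round(100 * lora / full, 2))  # 589824 6144 1.04
```

Because B is zero-initialized, the adapted layer starts out identical to the pretrained one, and tuning updates only about 1% of the parameters, which is what makes the approach attractive in low-resource clinical settings.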
Main Research Program: AI for Everyone
Keywords: comorbid depression, low-resource learning, machine learning, neural representations, paralinguistics, parameter-efficient adaptation, Parkinson’s disease, signal processing, Speech Emotion Recognition, Speech Foundation Models, transfer learning
Authors: Purohit, Tilak
Attachments
  • Purohit_THESIS_2026.pdf