The research presented in this thesis addresses the application of deep neural networks and digital signal processing algorithms to pathological voice detection. Novel methods are presented, including: a deep acoustic recurrent model that combines frame-based cepstral and spectral features with a bidirectional long short-term memory (Bi-LSTM) network; a 10-layer convolutional neural network (CNN) model that takes the spectrogram of the speech as input; transfer learning, which brings state-of-the-art models from image recognition to pathological voice detection using time-frequency representations as input; and a novel CNN model that combines data augmentation with scalograms of the speech as input.
The deep acoustic recurrent model explores the use of frame-based cepstral features with a recurrent neural network (RNN). Two novel cepstrum-based features are proposed: the Second Peak Perturbation (SPP) and the standard deviation of the cepstrum (CepStd). These cepstral features are validated to improve classification performance on three databases. In addition, traditional acoustic analysis is compared with the proposed deep acoustic recurrent model, showing that frame-based cepstral features yield better overall performance with the deep recurrent model than with traditional classifiers.
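To illustrate the kind of cepstral analysis these features build on, the following is a minimal sketch. It assumes the real cepstrum is computed as the inverse FFT of the log-magnitude spectrum of a frame; the precise definitions of SPP and CepStd are given in the thesis itself, so the `cepstrum_std` function below is only an illustrative CepStd-style feature, and the signal is synthetic.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum of a frame: inverse FFT of the log-magnitude spectrum."""
    spectrum = np.fft.fft(frame)
    log_mag = np.log(np.abs(spectrum) + 1e-12)  # small offset avoids log(0)
    return np.fft.ifft(log_mag).real

def cepstrum_std(frame):
    """Standard deviation of the cepstral coefficients (a CepStd-style feature)."""
    return float(np.std(real_cepstrum(frame)))

# Example: one 25 ms frame of a synthetic voiced-like signal at 16 kHz
sr = 16000
t = np.arange(int(0.025 * sr)) / sr
frame = np.sin(2 * np.pi * 150 * t) \
    + 0.1 * np.random.default_rng(0).standard_normal(t.size)
print(cepstrum_std(frame))
```

In a frame-based pipeline, such scalar features would be computed per frame and the resulting sequence fed to the Bi-LSTM.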
A 10-layer convolutional neural network is proposed, with the spectrogram of the speech as input. This is the first model to apply a time-frequency representation in deep learning for pathological voice detection. The experimental results show that it is an effective and efficient model for detecting pathological speech. However, it exhibits some overfitting, a common problem caused by the small dataset size. To address this issue, transfer learning with state-of-the-art CNNs from the image recognition field is applied to pathological voice detection. The results show that transfer learning improves test accuracy, but the overfitting problem remains severe.
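The 2-D input to such a CNN is a log-magnitude spectrogram. A minimal numpy sketch of how one can be computed via the short-time Fourier transform (the window length, hop size, and sample rate here are illustrative choices, not the thesis's settings):

```python
import numpy as np

def log_spectrogram(signal, frame_len=512, hop=128):
    """Log-magnitude STFT spectrogram, usable as a 2-D 'image' input to a CNN.

    Returns an array of shape (freq_bins, time_frames).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + 1e-10).T

# Example: 0.25 s of a synthetic 200 Hz tone at 16 kHz
sr = 16000
t = np.arange(4000) / sr
sig = np.sin(2 * np.pi * 200 * t)
S = log_spectrogram(sig)
print(S.shape)  # (257, 28): 512-point rFFT bins x frames
```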
Finally, the concept of data augmentation is explored and a novel CNN model called the R-Net is proposed. This method uses the continuous wavelet transform to obtain scalograms of the speech onset, combined with data augmentation within a CNN framework. The model significantly reduces overfitting and improves test performance by 15% to 20% on the most challenging SVD database, validating the effectiveness of data augmentation for small-dataset problems.
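A scalogram is the magnitude of the continuous wavelet transform across a set of scales. The sketch below implements a direct Morlet-wavelet CWT in numpy and a simple noise-copy augmentation; both the wavelet parameters and the augmentation scheme are illustrative assumptions, not the R-Net's actual configuration, which is described in the thesis.

```python
import numpy as np

def morlet(t, scale, w=5.0):
    """Complex Morlet wavelet evaluated at times t (seconds) for a given scale."""
    x = t / scale
    return np.exp(1j * w * x) * np.exp(-0.5 * x**2) / np.sqrt(scale)

def scalogram(signal, scales, sr):
    """|CWT| of the signal: rows are scales, columns are time samples."""
    n = len(signal)
    t = (np.arange(n) - n // 2) / sr
    out = np.empty((len(scales), n))
    for i, s in enumerate(scales):
        wavelet = np.conj(morlet(t, s))[::-1]          # correlation kernel
        out[i] = np.abs(np.convolve(signal, wavelet, mode="same"))
    return out

def augment(signal, n_copies=4, noise_std=0.005, rng=None):
    """Illustrative augmentation: noisy copies of the same onset segment."""
    if rng is None:
        rng = np.random.default_rng(0)
    return [signal + noise_std * rng.standard_normal(signal.size)
            for _ in range(n_copies)]

# Example: scalogram of a short synthetic onset at 8 kHz
sr = 8000
t = np.arange(2000) / sr
sig = np.sin(2 * np.pi * 200 * t)
scales = np.array([5.0 / (2 * np.pi * f) for f in (100, 200, 400)])
S = scalogram(sig, scales, sr)
copies = augment(sig)
print(S.shape, len(copies))  # (3, 2000) 4
```

Each augmented copy would be converted to its own scalogram, multiplying the effective training-set size seen by the CNN.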
Date of Award: 15 Apr 2021
Awarding Institution: University of Strathclyde
Sponsors: University of Strathclyde
Supervisors: John Soraghan & Anja Lowit