Deep learning applied to speech processing: development of novel models and techniques

Author:
  1. Carofilis Vasco, Roberto Andrés
Supervised by:
  1. Enrique Alegre Gutiérrez (Director)
  2. Laura Fernández Robles (Director)

Defence university: Universidad de León

Defence date: 20 December 2023

Committee:
  1. Luis Fernando d'Haro Enríquez (Chair)
  2. Víctor González Castro (Secretary)
  3. Kenneth Camilleri (Committee member)

Type: Thesis

Abstract

This thesis proposes and evaluates new machine learning techniques and models for several tasks in the field of speech processing. It mainly addresses the identification of speakers, languages, and accents through descriptor proposals based on different sound representations. In addition, it presents a new transfer learning technique built on a new descriptor, and two new deep learning architectures based on complementary audio representations.

The transfer learning technique relies on a descriptor we call Grad-Transfer, which is derived from the model interpretability method Gradient-weighted Class Activation Mapping (Grad-CAM). Grad-CAM generates a heatmap of the zones of the input data that most influence a given model prediction. In developing Grad-Transfer, we demonstrate experimentally, using the BIRCH and k-means clustering algorithms, that the heatmaps generated by Grad-CAM retain part of the knowledge that a spectrogram-fed deep learning speech processing model acquires during training. We exploit this property of Grad-CAM to formulate a new technique that transfers knowledge from a pretrained model to an untrained one through the Grad-Transfer descriptor, which summarizes and reuses that knowledge (a code sketch of this pipeline follows the abstract). Several Grad-Transfer-based models, including Gaussian Naive Bayes, Support Vector Machine, and Passive Aggressive classifiers, were evaluated on the accent identification task using the Voice Cloning Toolkit dataset. Experimental results show a performance increase of up to 23.58% for models fed with Grad-Transfer descriptors and spectrograms compared to models fed with spectrograms alone. This demonstrates the ability of Grad-Transfer to improve the performance of speech processing models and opens the door to new implementations for similar tasks.

In addition, new transfer learning approaches based on embedding generation models were evaluated. Embeddings are generated by machine learning models trained for a specific task on large datasets; by exploiting the knowledge already acquired, such models can be reused for new tasks where little data is available. This thesis proposes a new deep learning architecture, Mel and Wave Embeddings for Human Voice Tasks (MeWEHV), capable of generating robust embeddings for speech processing. MeWEHV combines embeddings generated by a pretrained wave encoder fed with raw audio and deep features extracted from Mel Frequency Cepstral Coefficients (MFCCs) using convolutional neural networks. We demonstrate the complementarity of the two representations and exploit it through neural layers specifically designed for their combination (see the sketch below).

We evaluated the performance of MeWEHV on three tasks: language identification, accent identification, and speaker identification. For the first task, we used the VoxForge and Common Language datasets. For accent identification, we used the Latin American Spanish Corpora and Common Voice datasets. Finally, for speaker identification, we used the VoxCeleb1 dataset and created YouSpeakers204, a new publicly available dataset for English speaker identification. YouSpeakers204 contains 19,607 audio clips from 204 speakers with six different accents, allowing other researchers to work with a highly balanced dataset and to build new models that are robust to multiple accents.
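A minimal sketch of the Grad-Transfer pipeline described above, assuming a PyTorch spectrogram CNN and a scikit-learn classifier. The hook-based Grad-CAM below is the standard formulation; the descriptor construction (flattened heatmaps) and the layer name `conv4` are illustrative assumptions, not the exact thesis method:

```python
import torch
import torch.nn.functional as F
from sklearn.naive_bayes import GaussianNB

def grad_cam(model, conv_layer, x, class_idx):
    """Grad-CAM heatmap for one input: weight the chosen conv layer's
    activations by the spatially pooled gradients of the class score."""
    acts, grads = {}, {}
    h1 = conv_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = conv_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    score = model(x)[0, class_idx]       # forward pass, pick the class score
    model.zero_grad()
    score.backward()                     # gradients flow back to conv_layer
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # pooled gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted sum + ReLU
    return cam.squeeze(0)                                # (H, W) heatmap

# Grad-Transfer, sketched: reuse the heatmaps of a pretrained spectrogram
# model as descriptors for a lightweight, untrained classifier.
# `cnn.conv4`, `spectrograms`, and `labels` are hypothetical placeholders.
# descs = torch.stack([grad_cam(cnn, cnn.conv4, s.unsqueeze(0), y)
#                      for s, y in zip(spectrograms, labels)])
# clf = GaussianNB().fit(descs.flatten(1).numpy(), labels.numpy())
```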
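Likewise, a sketch of the kind of two-branch fusion MeWEHV describes, assuming torchaudio's pretrained WAV2VEC2_BASE bundle as the wave encoder; the layer sizes and the fusion head here are illustrative choices, not the thesis architecture:

```python
import torch
import torch.nn as nn
import torchaudio

class WaveMfccFusion(nn.Module):
    def __init__(self, n_classes, wav_dim=768, n_mfcc=40):
        super().__init__()
        bundle = torchaudio.pipelines.WAV2VEC2_BASE   # pretrained wave encoder
        self.encoder = bundle.get_model()
        for p in self.encoder.parameters():           # freeze the encoder
            p.requires_grad = False
        self.mfcc = torchaudio.transforms.MFCC(
            sample_rate=bundle.sample_rate, n_mfcc=n_mfcc)
        self.mfcc_cnn = nn.Sequential(                # deep features from MFCCs
            nn.Conv1d(n_mfcc, 128, 5, padding=2), nn.ReLU(),
            nn.Conv1d(128, 128, 5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten())
        self.head = nn.Sequential(                    # fuses the two branches
            nn.Linear(wav_dim + 128, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, wave):                          # wave: (B, T) raw audio
        with torch.no_grad():
            feats, _ = self.encoder.extract_features(wave)
        wav_emb = feats[-1].mean(dim=1)               # (B, wav_dim), pooled
        mfcc_emb = self.mfcc_cnn(self.mfcc(wave))     # (B, 128)
        return self.head(torch.cat([wav_emb, mfcc_emb], dim=1))
```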
MeWEHV significantly improved the performance of the most advanced state-of-the-art models on all evaluated datasets, with improvements of up to 88.27% in speaker identification, 14.86% in language identification, and 20.38% in accent identification. This was achieved at a low additional computational cost of only 1.04M extra parameters, between 0.33% and 1.09% more than the pretrained models used as baselines.

Finally, a second architecture based on embedding generation models, the Squeeze-and-Excitation for Embeddings Network (SaEENet), is proposed. SaEENet employs 1D depthwise separable convolution layers and GRU layers, and introduces, for the first time, the use of squeeze-and-excitation blocks for audio embedding weighting. Squeeze-and-excitation allows the model to assign higher or lower relevance to each embedding generated from a small audio segment, discarding information from voiceless segments or segments with irrelevant content (a sketch follows the abstract). For the same architecture, we also present experimental results with three variations of squeeze-and-excitation blocks, identifying the most useful ones for the evaluated tasks. SaEENet outperforms MeWEHV and similar state-of-the-art models on language identification, accent identification, and speaker identification, achieving improvements of up to 0.90%, 1.41%, and 4.01%, respectively, with 31.73% fewer trainable parameters than MeWEHV.

Overall, this thesis presents several advances in the areas of speaker, language, and accent identification, and proposes new techniques and models that use transfer learning to improve on the evaluated state-of-the-art models.
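As an illustration of the segment-weighting idea behind SaEENet's squeeze-and-excitation blocks, here is a minimal sketch assuming per-segment embeddings of shape (batch, segments, dim); the three SE variants actually evaluated in the thesis differ from this simplified gating:

```python
import torch
import torch.nn as nn

class SegmentSE(nn.Module):
    """Gates each per-segment embedding by a learned relevance weight,
    so segments with little or no voice content can be down-weighted."""
    def __init__(self, dim, reduction=8):      # reduction is an assumption
        super().__init__()
        self.excite = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, 1), nn.Sigmoid())

    def forward(self, x):                      # x: (B, T, D) segment embeddings
        gate = self.excite(x)                  # (B, T, 1) relevance per segment
        return x * gate                        # re-weighted embeddings

emb = torch.randn(2, 50, 256)                  # e.g. 50 segments, 256-d each
weighted = SegmentSE(256)(emb)                 # voiceless segments suppressed
```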