Learning and controlling the source-filter representation of speech with a variational autoencoder by Prof. Simon Leglaive – Audio-visual machine perception & interaction for robots

Simon Leglaive is a tenured Assistant Professor at CentraleSupélec and a researcher in the AIMAC team of the IETR laboratory, a CNRS joint research unit in Rennes, France. He received the Engineering degree from Télécom Paris (Paris, France) and the M.Sc. degree in acoustics, signal processing and computer science applied to music (ATIAM) from Sorbonne University (Paris, France) in 2014. He received the Ph.D. degree from Télécom Paris in the field of audio signal processing in 2017. He was then a post-doctoral researcher at Inria Grenoble Rhône-Alpes, in the Perception team. His research focuses on audio signal processing and machine learning. He is mainly interested in Bayesian approaches to problems that involve estimating latent signals from noisy and/or incomplete observations. His recent work focuses on weakly-supervised methods with deep-learning-based generative models, mainly dynamical variational autoencoders.

Title: Learning and controlling the source-filter representation of speech with a variational autoencoder

Abstract: Understanding and controlling latent representations in deep generative models is a challenging yet important problem for analyzing, transforming and generating various types of data. In speech processing, inspired by the anatomical mechanisms of phonation, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors, among which the fundamental frequency f0 and the formants are of primary importance. In this work, we show that the source-filter model of speech production naturally arises in the latent space of a variational autoencoder (VAE) trained in an unsupervised manner on a dataset of natural speech signals. Using only a few seconds of labeled speech signals generated with an artificial speech synthesizer, we experimentally illustrate that f0 and the formant frequencies are encoded in orthogonal subspaces of the VAE latent space and we develop a weakly-supervised method to accurately and independently control these speech factors of variation within the learned latent subspaces. Without requiring additional information such as text or human-labeled data, this results in a deep generative model of speech spectrograms that is conditioned on f0 and the formant frequencies, and which is applied to the transformation of speech signals.
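
To make the idea of controlling one factor inside its own latent subspace more concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the method of the talk: no VAE is trained, the "latent vectors" are random toy data built so that the two subspaces are orthogonal by construction, and the names fit_subspace and move_in_subspace, the dimensions and the target coordinates are all hypothetical. It only shows the linear algebra: estimate a subspace from latents in which a single factor varies, then replace the component of a latent vector inside that subspace while leaving the rest unchanged.

```python
import numpy as np


def fit_subspace(latents, rank):
    """Return an orthonormal basis (dim x rank) of the top principal directions."""
    centered = latents - latents.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T


def move_in_subspace(z, basis, target_coords):
    """Replace the component of z inside the subspace; keep everything else."""
    return z - basis @ (basis.T @ z) + basis @ target_coords


# Toy stand-in for encoded synthetic speech (assumption: random data, no VAE).
rng = np.random.default_rng(0)
dim = 16
q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))  # random orthonormal basis
u_f0_true, u_formant_true = q[:, :2], q[:, 2:5]

# "Trajectories" in which only one factor varies at a time.
z_f0 = rng.normal(size=(200, 2)) @ u_f0_true.T
z_formant = rng.normal(size=(200, 3)) @ u_formant_true.T

u_f0 = fit_subspace(z_f0, rank=2)
u_formant = fit_subspace(z_formant, rank=3)

# Largest absolute inner product between the two estimated bases.
cross = np.abs(u_f0.T @ u_formant).max()

# Move an arbitrary latent vector to hypothetical target f0 coordinates.
z = rng.normal(size=dim)
target = np.array([1.0, -0.5])
z_new = move_in_subspace(z, u_f0, target)
```

In this toy setting, cross is numerically zero and the formant coordinates of z_new equal those of z, which is the sense in which a factor can be modified independently of the others. With a real model, the orthogonality would have to be measured rather than assumed, and a mapping from a factor value (for example f0 in Hz) to subspace coordinates would still need to be learned from the labeled synthetic data.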

Preprint: https://arxiv.org/pdf/2204.07075.pdf