Unsupervised Audio Source Separation Using Differentiable Parametric Source Models by Prof. Gaël Richard

Gaël Richard received the State Engineering degree from Telecom Paris, France, in 1990, and the Ph.D. degree and Habilitation from the University of Paris-Saclay in 1994 and 2001, respectively. After the Ph.D., he spent two years at Rutgers University, Piscataway, NJ, in the Speech Processing Group of Prof. J. Flanagan, where he explored innovative approaches to speech production. From 1997 to 2001, he worked successively for Matra, Bois d'Arcy, France, and for Philips, Montrouge, France. He then joined Telecom Paris, where he is now a Full Professor in audio signal processing. He is also the Executive Director of the Hi! PARIS interdisciplinary center on Artificial Intelligence and Data Analytics. He is a co-author of over 250 papers and a co-inventor on 11 patents. His research interests lie mainly in speech and audio signal processing and include topics such as signal representations, source separation, machine learning methods for audio/music signals, and music information retrieval. In 2020, he received the Grand Prize of IMT-Académie des sciences for his research contributions in science and technology. He is a Fellow of the IEEE and the current Chair of the IEEE SPS Technical Committee on Audio and Acoustic Signal Processing. In 2022, he was awarded an Advanced ERC grant from the European Union for a project on machine listening and artificial intelligence for sound.

Title: Unsupervised Audio Source Separation Using Differentiable Parametric Source Models

Abstract: Supervised deep learning approaches to underdetermined audio source separation achieve state-of-the-art performance but require a dataset of mixtures along with their corresponding isolated source signals. Such datasets can be extremely costly to obtain for musical mixtures. This raises a need for unsupervised methods. We propose a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a differentiable parametric source-filter model. A neural network is trained to reconstruct the observed mixture as a sum of the sources by estimating the source models' parameters given their fundamental frequencies. At test time, soft masks are obtained from the synthesized source signals. The experimental evaluation on a vocal ensemble separation task shows that the proposed method outperforms learning-free methods based on nonnegative matrix factorization and a supervised deep learning baseline. Integrating domain knowledge in the form of source models into a data-driven method leads to high data efficiency: the proposed approach achieves good separation quality even when trained on less than three minutes of audio. This work makes powerful deep-learning-based separation usable in scenarios where training data with ground truth is expensive or nonexistent.
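
To make the pipeline in the abstract concrete, the following is a minimal, self-contained PyTorch sketch of the general idea, not the authors' implementation (see the preprint for the actual model). It synthesizes each source with a differentiable harmonic source-filter model driven by a fundamental frequency track, fits only the observed mixture (no isolated-source targets), and derives soft masks from the synthesized signals. All names, the toy f0 tracks, and the free amplitude parameters standing in for the neural network are hypothetical choices for illustration.

```python
# Hypothetical sketch: differentiable source-filter synthesis + unsupervised
# mixture fitting + soft masking. Not the paper's code; toy data throughout.
import torch

SR = 16000          # sample rate (assumed)
N_HARMONICS = 40    # harmonics per source (assumed)

def harmonic_source(f0, amps, sr=SR):
    """Differentiable harmonic oscillator bank.
    f0:   (T,) fundamental frequency in Hz per sample
    amps: (T, N_HARMONICS) per-harmonic amplitudes (the 'filter' part;
          in the paper this would come from a neural network)
    """
    phase = 2 * torch.pi * torch.cumsum(f0, dim=0) / sr       # instantaneous phase
    k = torch.arange(1, N_HARMONICS + 1)                      # harmonic indices
    partials = torch.sin(phase[:, None] * k[None, :])         # (T, H)
    audible = (f0[:, None] * k[None, :]) < (sr / 2)           # suppress aliasing
    return (partials * amps * audible).sum(dim=-1)            # (T,)

# Unsupervised objective: only the mixture is observed. Free parameters stand
# in for the network that would map f0 tracks to amplitude envelopes.
T, n_sources = SR * 2, 4
f0s = [torch.full((T,), 220.0 * (i + 1)) for i in range(n_sources)]  # toy f0s
amps = [torch.rand(T, N_HARMONICS, requires_grad=True) for _ in range(n_sources)]
mixture = torch.randn(T)                                             # toy mix
win = torch.hann_window(1024)

opt = torch.optim.Adam(amps, lr=1e-3)
for step in range(100):
    sources = [harmonic_source(f0s[i], amps[i]) for i in range(n_sources)]
    recon = torch.stack(sources).sum(dim=0)
    # magnitude-spectrogram reconstruction loss on the mixture only
    loss = torch.nn.functional.l1_loss(
        torch.stft(recon, 1024, window=win, return_complex=True).abs(),
        torch.stft(mixture, 1024, window=win, return_complex=True).abs())
    opt.zero_grad(); loss.backward(); opt.step()

# Test time: soft masks from the synthesized sources, to be applied to the
# mixture STFT before inverting back to the time domain.
S = torch.stack([torch.stft(s.detach(), 1024, window=win,
                            return_complex=True).abs() for s in sources])
masks = S / (S.sum(dim=0, keepdim=True) + 1e-8)               # sums to 1 per bin
```

Masking the mixture rather than outputting the raw synthesized signals lets the separation keep details (breathiness, reverberation) that a purely harmonic model cannot represent, which is why the abstract distinguishes the synthesis stage from the test-time masking stage.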

Preprint: https://arxiv.org/pdf/2201.09592.pdf
