Speaker: Prof. Sharon Gannot, Bar-Ilan University
When/where: 10:30 (CET), March 16, 2023, Grand Amphi at Inria Grenoble (655 Avenue de l’Europe, 38334 Montbonnot)
Abstract: In recent years, deep neural networks (DNNs) have been widely employed in various audio processing tasks. In this talk, we will explore recent work by our team that addresses two important audio processing tasks in adverse acoustic conditions: speech dereverberation and speaker extraction.
The first part of the talk presents dereverberation algorithms. We start by exploring a new paradigm for single-microphone speech dereverberation. Motivated by the recent success of fully-convolutional networks (FCNs) in many image processing applications, we investigate their applicability to enhancing the speech signal represented as short-time Fourier transform (STFT) images. Specifically, we present a Unet architecture, an encoder-decoder network with skip connections. We then extend this structure to the multi-microphone setting. While most existing DNN architectures can only deal with fixed and position-specific microphone arrays, in this work we present a DNN architecture that can be applied to scene-agnostic problems, namely when the number of microphones is unknown and the array constellation is arbitrary. To this end, our approach harnesses recent advances in deep learning on set-structured data to design an architecture that enhances the reverberant log-spectrum. We demonstrate the performance of the method in adverse reverberant conditions.
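To make the Unet idea concrete, the sketch below traces the shape of an encoder-decoder pass with skip connections over an STFT magnitude "image" (frequency x time). It is a toy illustration, not the authors' network: average pooling and nearest-neighbour upsampling stand in for learned (strided and transposed) convolution layers, and the skip connection is a simple average instead of a learned fusion.

```python
import numpy as np

def encoder_block(x):
    # Downsample by 2 in both axes via average pooling
    # (stand-in for a learned strided-convolution layer).
    f, t = x.shape
    return x.reshape(f // 2, 2, t // 2, 2).mean(axis=(1, 3))

def decoder_block(x, skip):
    # Upsample by 2 (nearest neighbour, stand-in for a transposed conv),
    # then fuse with the matching encoder features via the skip connection.
    up = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)
    return 0.5 * (up + skip)

def toy_unet(spec):
    """Toy Unet pass over an STFT magnitude array (freq x time)."""
    e1 = encoder_block(spec)        # first encoder level
    e2 = encoder_block(e1)          # bottleneck
    d1 = decoder_block(e2, e1)      # decode with skip from e1
    out = decoder_block(d1, spec)   # decode with skip from the input
    return out
```

The skip connections are what let the decoder recover fine spectro-temporal detail that the downsampling path discards; the output has the same shape as the input spectrogram, so the network can act as a mask or mapping on the reverberant STFT.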
The second part of the talk is dedicated to the speaker extraction problem. We present a Siamese-Unet architecture for single-microphone speaker extraction in clean and noisy conditions. The Siamese encoders are applied in the frequency domain to infer the embeddings of the noisy and reference spectra, respectively. The concatenated representations are then fed into the decoder to estimate the real and imaginary components of the desired speaker, which are then inverse-transformed to the time domain. The model is trained with the scale-invariant signal-to-distortion ratio (SI-SDR) loss to exploit time-domain information. The time-domain loss is also regularized with a frequency-domain loss to preserve speech patterns. This scheme outperforms state-of-the-art methods at low reverberation levels. If time permits, we will also present a new extension of the speaker extraction algorithm. The new scheme comprises two modules: the first focuses on the extraction task, and the second on dereverberation and residual interference and noise reduction. We demonstrate the applicability of the new scheme at high reverberation levels.
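The SI-SDR training objective mentioned above has a standard closed form: project the estimate onto the target, and measure the energy ratio between the scaled target and the residual, in dB. The snippet below is a minimal reference implementation of that standard definition (not the authors' training code); the `eps` guard is a common numerical-stability convention.

```python
import numpy as np

def si_sdr(target, estimate, eps=1e-8):
    """Scale-invariant signal-to-distortion ratio in dB.

    target, estimate: 1-D time-domain signals of equal length.
    """
    target = np.asarray(target, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    # Optimal scaling factor projecting the estimate onto the target.
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target        # scaled target component
    noise = estimate - projection      # residual distortion
    return 10.0 * np.log10(
        (np.dot(projection, projection) + eps) / (np.dot(noise, noise) + eps)
    )
```

Because the target is optimally rescaled before the ratio is taken, the metric is invariant to the overall gain of the estimate, which is why it is preferred over plain SDR as a time-domain training loss (the loss itself is the negated SI-SDR).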
The talk is accompanied by audio examples demonstrating the performance of the discussed methods.
Biography: Sharon Gannot received the B.Sc. degree (summa cum laude) from the Technion-Israel Institute of Technology, Haifa, Israel, in 1986, and the M.Sc. (cum laude) and Ph.D. degrees from Tel-Aviv University, Tel Aviv, Israel, in 1995 and 2000, respectively, all in electrical engineering. In 2001, he held a postdoctoral position with the Department of Electrical Engineering, KU Leuven, Leuven, Belgium. From 2002 to 2003, he held a research and teaching position with the Faculty of Electrical Engineering, Technion-Israel Institute of Technology. He is currently a Full Professor at the Faculty of Engineering, Bar-Ilan University, Israel, where he heads the Acoustic Signal Processing Laboratory and the Data Science Program. He also serves as the Faculty Vice Dean.
Dr. Gannot has co-authored more than 300 publications in journals, conference proceedings, and book chapters. His research interests include statistical signal processing and machine learning algorithms with applications to single- and multi-microphone speech processing. He was also selected to present tutorials and keynote addresses at many of the leading conferences in the field.
Dr. Gannot has held many editorial positions in leading journals in the field, including Senior Area Chair for the IEEE Transactions on Audio, Speech, and Language Processing (2013–2017, and since 2020). He served as Chair of the Audio and Acoustic Signal Processing (AASP) technical committee of the IEEE SPS (2017–2018). Currently, he chairs the Data Science Initiative of the IEEE SPS. He also served as General Co-Chair of IWAENC 2010 and of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) 2013, and will serve as General Co-Chair of Interspeech 2024, to be held in Jerusalem, Israel.
Dr. Gannot is a co-recipient of thirteen best paper awards and a recipient of the 2022 EURASIP Group Technical Achievement Award. He is an IEEE Fellow (Class of 2021) for contributions to acoustical modeling and statistical learning in speech enhancement.