The devil is in the deeptails – ML3RI: Multi-modal multi-person low-level learning for robot interactions

We are very pleased to organise the seminar series: the devil is in the deeptails.

The main objective of these seminars is to bring together researchers to discuss both the scientific ideas and the best engineering practices required to push forward high-quality research in deep learning applied to computer vision and audio processing.

In practice, there will be a series of presentations, typically associated with existing and recent research paper(s). These presentations will couple a high-level perspective (given by a not-so-young researcher) with an engineering perspective (given by a young researcher).

To join the mailing list, send an empty e-mail with the following subject: “subscribe deeptails FirstName LastName” to sympa_inria@inria.fr.

If you would like to contribute a seminar, contact us: deeptails-org@inria.fr.

List of Seminars:

Date	Title	Presenters	Materials
June 28th, 2021, 10AM (CET)	Divide and Enhance: Mixtures of Gaussians/Phonemes/Experts in Speech Enhancement	Shlomo Chazan and Sharon Gannot	Slides – Video
Abstract: For decades, most speech enhancement algorithms were model-based: an underlying statistical model governed the enhancement procedure, usually, but not solely, applied in the short-time Fourier transform (STFT) domain. In this talk, we explore a line of work that starts with a data-driven approach rooted in statistical modelling and gradually digs deeper towards a full-fledged deep neural network (DNN) approach that still keeps its tail in speech modelling. An early attempt to harness a data-driven paradigm was proposed by Burshtein and Gannot in 1999. In this contribution, the log-spectrum of the clean speech signal is modelled as a Mixture of Gaussians (MoG). The model parameters are inferred from a clean speech database in an unsupervised manner, using the expectation-maximization procedure. The enhanced speech signal is analytically obtained using the statistical model. The main attribute of this algorithm, nicknamed MixMax, was the ability to apply the most suitable ‘enhancer’ to each speech class. In 2016, we proposed a hybrid approach, merging the generative MoG model and the discriminative DNN. The unsupervised training procedure was replaced by a phoneme-based classification, using a phoneme-labeled database. Most importantly, a hybrid scheme is proposed by adopting a DNN for the phoneme classification task. The discriminative DNN maintains the continuity of the speech and the generative phoneme-based MoG preserves the speech spectral structure. This concept was further developed in another 2016 paper, completely abandoning the MoG model. The new Mixture of Phonemes (MoP) framework comprises a set of phoneme-specific DNNs (pDNNs), together with an additional phoneme-classification DNN (cDNN), responsible for phoneme classification. Concurrently, each of the pDNNs estimates a phoneme-specific speech presence probability (pSPP).
Finally, in our recent 2021 paper, recognizing the redundancy of the phoneme structure in speech enhancement tasks, we substitute the phoneme classes by a Mixture of Deep Experts (MoDE). Our novel architecture comprises a set of DNNs, each of which is an ‘expert’ in a different spectral pattern, reminiscent of the phoneme structure of speech signals. A gating DNN controls the weights assigned to each expert’s output given a speech segment. The entire network is trained in a self-supervised manner, where each expert is “encouraged” to specialize in a different enhancement task. The talk is accompanied by sound examples, demonstrating the ability of the proposed scheme to significantly suppress the noise level while maintaining low speech distortion.
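
The gating idea described above can be illustrated with a minimal sketch. It is not the authors' MoDE implementation: it assumes the gating network's raw scores are turned into weights with a softmax and that the expert outputs are combined by a weighted sum per frequency bin. The function names and the numbers are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw gating scores into positive weights that sum to 1 (assumed normalisation)."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_experts(expert_outputs, gate_scores):
    """Weighted sum of the experts' outputs for one speech segment.

    expert_outputs: list of K lists, one estimate per frequency bin (hypothetical values).
    gate_scores:    list of K raw scores, standing in for the gating DNN's output.
    """
    weights = softmax(gate_scores)
    n_bins = len(expert_outputs[0])
    return [
        sum(w * out[b] for w, out in zip(weights, expert_outputs))
        for b in range(n_bins)
    ]

# Two hypothetical experts that disagree on bins 0 and 1 and agree on bin 2.
expert_outputs = [[0.9, 0.1, 0.5], [0.2, 0.8, 0.5]]
gate_scores = [2.0, 0.0]  # the gate favours the first expert
mixed = combine_experts(expert_outputs, gate_scores)
```

Because the weights sum to one, the combined value in each bin stays between the individual experts' values, and bins on which the experts agree are left unchanged.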

May 28th, 2021, 3PM (CET)	Learning high-level reasoning in vision, language and robotics	Corentin Kervadec and Christian Wolf (Corentin’s co-supervisor with Grigory Antipov and Moez Baccouche, Orange)	Slides 1 – Slides 2 – Video
Abstract: An important subgoal of AI is the creation of intelligent agents, which require high-level reasoning capabilities, situation awareness, and the capacity to robustly take the right decisions at the right moments. An exact definition of the term reasoning is difficult, but we define it as the opposite of exploiting spurious biases and short-cuts in training data picked up by low-level statistics, which could lead to dramatic losses in generalization beyond the training data. In this talk we will cover the automatic learning of reasoning capabilities through large-scale training of deep neural networks from data, and we target different situations, such as robotics and embodied computer vision (robot navigation to solve visual tasks) as well as vision and language reasoning (visual question answering). We explore this problem in a holistic way and study it from various angles of attack: what are the tasks which lead to the emergence of reasoning? How can we evaluate agents and measure reasoning vs. bias exploitation? How can we x-ray neural models and visualize their internal behavior? What are the bottlenecks in learning reasoning? Can we structure neural networks with inductive bias to improve the emergence of reasoning?

April 28th, 2021, 3PM (CET)	Speech reconstruction from silent videos using a vocoder	Daniel Michelsanti and Zheng-Hua Tan	Slides – Video – Demo
Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence results in extremely low speech intelligibility for untrained lip readers. In this talk, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion, and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which exhibits an improvement over previous video-to-speech approaches.

March 25th, 2021, 3PM (CET)	Towards Observational Imitation Learning	Edoardo Cetin and Oya Celiktutan	Slides 1 – Slides 2 – Video
Abstract: Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities; however, they generally depend on access to a full set of optimal states and actions taken with the agent’s actuators and from the agent’s point of view. In this seminar, we introduce our new algorithm – called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL) – with the purpose of bypassing such constraints. Our algorithm enables autonomous agents to learn directly from high-dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. This latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the task being demonstrated. This makes it possible to obtain a shared feature space in which imitation succeeds while the differences between the expert’s and the agent’s domains are disregarded. Empirically, our algorithm is able to efficiently imitate in a diverse range of control problems including balancing, manipulation and locomotion tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment.

January 29th, 2021, 4:30PM (CET)	GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes	Enric Corona and Francesc Moreno-Noguer	Slides 1 – Slides 2 – Video – Paper – Code – Data
Abstract: The rise of deep learning has brought remarkable progress in estimating hand geometry from images where the hands are part of the scene. In this talk we describe a new problem not explored so far: predicting how a human would grasp one or several objects, given a single RGB image of these objects. This is a problem with enormous potential in, e.g., augmented reality, robotics or prosthetic design. In order to predict feasible grasps, we need to understand the semantic content of the image, its geometric structure and all potential interactions with a physical hand model. To this end, we introduce a generative model that jointly reasons at all these levels and 1) regresses the 3D shape and pose of the objects in the scene; 2) estimates the grasp types; and 3) refines the 51 DoF of a 3D hand model to minimize a graspability loss. To train this model we build the YCB-Affordance dataset, which contains more than 133k images of 21 objects in the YCB-Video dataset. We have annotated these images with more than 28M plausible 3D human grasps according to a 33-class taxonomy. A thorough evaluation on synthetic and real images shows that our model can robustly predict realistic grasps, even in cluttered scenes with multiple objects in close contact.

January 27th, 2021, 11AM (CET)	Towards Generalization Across Depth for Monocular 3D Object Detection	Andrea Simonelli and Elisa Ricci	Slides – Video
Abstract: While expensive LiDAR and stereo camera rigs have enabled the development of successful 3D object detection methods, monocular RGB-only approaches lag far behind. This work advances the state of the art by introducing MoVi-3D, a novel, single-stage deep architecture for monocular 3D object detection. MoVi-3D builds upon a novel approach which leverages geometrical information to generate, both at training and test time, virtual views where the object appearance is normalized with respect to distance. These virtually generated views facilitate the detection task as they significantly reduce the visual appearance variability associated with objects placed at different distances from the camera. As a consequence, the deep model is relieved of learning depth-specific representations and its complexity can be significantly reduced. In particular, in this work we show that, thanks to our virtual view generation process, a lightweight, single-stage architecture suffices to set new state-of-the-art results on the popular KITTI3D benchmark.
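
To give an intuition for why normalizing appearance with respect to distance reduces variability, here is a small sketch based only on the standard pinhole camera model. It is an illustration under stated assumptions, not the MoVi-3D view generation procedure: the function names, focal length, object height and reference depth are all hypothetical.

```python
def projected_height_px(focal_px, object_height_m, depth_m):
    """Pinhole model: image height (pixels) of an object of given metric height at a given depth."""
    return focal_px * object_height_m / depth_m

def virtual_view_scale(depth_m, reference_depth_m):
    """Resize factor that makes an object at depth_m look as if it were at reference_depth_m.

    This rescaling rule is an assumption made for illustration only.
    """
    return depth_m / reference_depth_m

focal_px = 700.0          # hypothetical focal length in pixels
car_height_m = 1.5        # hypothetical object height in metres
reference_depth_m = 20.0  # hypothetical reference depth for the virtual views

normalized_heights = []
for depth_m in (10.0, 20.0, 40.0):
    raw = projected_height_px(focal_px, car_height_m, depth_m)
    scale = virtual_view_scale(depth_m, reference_depth_m)
    normalized_heights.append(raw * scale)  # apparent height after rescaling
```

After rescaling, the object has the same apparent height at every depth, so a detector looking at such views would not need to model the same object at many different image scales.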

July 9th, 2020, 2PM (CET)	How to Train your Deep Multi-Object Tracker	Yihong Xu and Xavier Alameda-Pineda	Slides – Video – Code
Abstract: The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain sub-modules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP). As these measures are not differentiable, the choice of appropriate loss functions for end-to-end training of multi-object tracking methods is still an open research problem. In this paper, we bridge this gap by proposing a differentiable proxy of MOTA and MOTP, which we combine in a loss function suitable for end-to-end training of deep multi-object trackers. As a key ingredient, we propose a Deep Hungarian Net (DHN) module that approximates the Hungarian matching algorithm. DHN allows estimating the correspondence between object tracks and ground truth objects to compute differentiable proxies of MOTA and MOTP, which are in turn used to optimize deep trackers directly. We experimentally demonstrate that the proposed differentiable framework improves the performance of existing multi-object trackers, and we establish a new state of the art on the MOTChallenge benchmark. Our code is publicly available here.
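
For readers unfamiliar with the measures mentioned above, the sketch below shows the non-differentiable quantities that a differentiable proxy has to replace: a hard one-to-one matching between tracks and ground-truth objects, and the standard MOTA formula, 1 - (FN + FP + IDSW) / GT. It is not the DHN or the authors' code. The matching is done by brute force over permutations, which finds the same minimum-cost assignment as the Hungarian algorithm on a small square matrix, and all numbers are hypothetical.

```python
from itertools import permutations

def optimal_assignment(cost):
    """Exact minimum-cost one-to-one matching for a small square cost matrix.

    Brute force over permutations; the Hungarian algorithm finds the same optimum efficiently.
    Returns a list of (track_index, ground_truth_index) pairs.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(enumerate(best))

def mota(false_negatives, false_positives, id_switches, num_gt):
    """Multi-Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

# Hypothetical distances between two tracks (rows) and two ground-truth objects (columns).
cost = [[0.1, 0.9],
        [0.8, 0.2]]
matches = optimal_assignment(cost)

# Hypothetical error counts accumulated over a sequence with 20 ground-truth objects.
score = mota(false_negatives=2, false_positives=1, id_switches=1, num_gt=20)
```

Both steps involve discrete choices (an argmin over assignments and integer error counts), which is why gradients cannot flow through them directly.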