The devil is in the deeptails

We are very pleased to organise the seminar series: the devil is in the deeptails.

The main objective of these seminars is to bring together researchers to discuss both the scientific ideas and the best engineering practices required to push forward high-quality research in deep learning applied to computer vision and audio processing.

In practice, there will be a series of presentations, typically associated with one or more recent research papers. These presentations will couple a high-level perspective (given by a not-so-young researcher) with an engineering perspective (given by a young researcher).

To join the mailing list, send an empty e-mail with the following subject: “subscribe deeptails FirstName LastName” to

If you want to contribute with a seminar, contact us:

List of Seminars:

Date Title Presenters Materials
April 28th, 2021, 3PM (CET) Speech reconstruction from silent videos using a vocoder Daniel Michelsanti and Zheng-Hua Tan Slides – Video
Abstract: Both acoustic and visual information influence human perception of speech. For this reason, the lack of audio in a video sequence results in extremely low speech intelligibility for untrained lip readers. In this talk, we present a way to synthesise speech from the silent video of a talker using deep learning. The system learns a mapping function from raw video frames to acoustic features and reconstructs the speech with a vocoder synthesis algorithm. To improve speech reconstruction performance, our model is also trained to predict text information in a multi-task learning fashion, and it is able to simultaneously reconstruct and recognise speech in real time. The results in terms of estimated speech quality and intelligibility show the effectiveness of our method, which improves over previous video-to-speech approaches.
March 25th, 2021, 3PM (CET) Towards Observational Imitation Learning Edoardo Cetin and Oya Celiktutan Slides 1 – Slides 2 – Video
Abstract: Human beings are able to understand objectives and learn by simply observing others perform a task. Imitation learning methods aim to replicate such capabilities; however, they generally depend on access to a full set of optimal states and actions taken with the agent’s actuators and from the agent’s point of view. In this seminar, we introduce our new algorithm, called Disentangling Generative Adversarial Imitation Learning (DisentanGAIL), with the purpose of bypassing such constraints. Our algorithm enables autonomous agents to learn directly from high-dimensional observations of an expert performing a task, by making use of adversarial learning with a latent representation inside the discriminator network. This latent representation is regularized through mutual information constraints to incentivize learning only features that encode information about the completion levels of the task being demonstrated. The result is a shared feature space in which imitation can succeed while disregarding the differences between the expert’s and the agent’s domains. Empirically, our algorithm is able to imitate efficiently in a diverse range of control problems, including balancing, manipulation and locomotion tasks, while being robust to various domain differences in terms of both environment appearance and agent embodiment.
January 29th, 2021, 4:30PM (CET) GanHand: Predicting Human Grasp Affordances in Multi-Object Scenes Enric Corona and Francesc Moreno-Noguer Slides 1 – Slides 2 – Video – Paper – Code – Data
Abstract: The rise of deep learning has brought remarkable progress in estimating hand geometry from images where the hands are part of the scene. In this talk we describe a new problem not explored so far: predicting how a human would grasp one or several objects, given a single RGB image of these objects. This is a problem with enormous potential in, for example, augmented reality, robotics or prosthetic design. In order to predict feasible grasps, we need to understand the semantic content of the image, its geometric structure and all potential interactions with a physical hand model. To this end, we introduce a generative model that jointly reasons at all these levels and 1) regresses the 3D shape and pose of the objects in the scene; 2) estimates the grasp types; and 3) refines the 51 DoF of a 3D hand model to minimize a graspability loss. To train this model we build the YCB-Affordance dataset, which contains more than 133k images of 21 objects in the YCB-Video dataset. We have annotated these images with more than 28M plausible 3D human grasps according to a 33-class taxonomy. A thorough evaluation on synthetic and real images shows that our model can robustly predict realistic grasps, even in cluttered scenes with multiple objects in close contact.
January 27th, 2021, 11AM (CET) Towards Generalization Across Depth for Monocular 3D Object Detection Andrea Simonelli and Elisa Ricci Slides – Video
Abstract: While expensive LiDAR and stereo camera rigs have enabled the development of successful 3D object detection methods, monocular RGB-only approaches lag far behind. This work advances the state of the art by introducing MoVi-3D, a novel, single-stage deep architecture for monocular 3D object detection. MoVi-3D builds upon a novel approach which leverages geometrical information to generate, both at training and test time, virtual views where the object appearance is normalized with respect to distance. These virtually generated views facilitate the detection task as they significantly reduce the visual appearance variability associated with objects placed at different distances from the camera. As a consequence, the deep model is relieved of learning depth-specific representations and its complexity can be significantly reduced. In particular, in this work we show that, thanks to our virtual view generation process, a lightweight, single-stage architecture suffices to set new state-of-the-art results on the popular KITTI3D benchmark.
July 9th, 2020, 2PM (CET) How to Train your Deep Multi-Object Tracker Yihong Xu and Xavier Alameda-Pineda Slides – Video – Code
Abstract: The recent trend in vision-based multi-object tracking (MOT) is heading towards leveraging the representational power of deep learning to jointly learn to detect and track objects. However, existing methods train only certain sub-modules using loss functions that often do not correlate with established tracking evaluation measures such as Multi-Object Tracking Accuracy (MOTA) and Precision (MOTP). As these measures are not differentiable, the choice of appropriate loss functions for end-to-end training of multi-object tracking methods is still an open research problem. In this paper, we bridge this gap by proposing a differentiable proxy of MOTA and MOTP, which we combine in a loss function suitable for end-to-end training of deep multi-object trackers. As a key ingredient, we propose a Deep Hungarian Net (DHN) module that approximates the Hungarian matching algorithm. DHN allows estimating the correspondence between object tracks and ground truth objects to compute differentiable proxies of MOTA and MOTP, which are in turn used to optimize deep trackers directly. We experimentally demonstrate that the proposed differentiable framework improves the performance of existing multi-object trackers, and we establish a new state of the art on the MOTChallenge benchmark. Our code is publicly available here.
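As a rough illustration of why these measures resist gradient-based training, the sketch below computes MOTA and a distance-based MOTP for a single frame. It is not the presenters' code and does not implement DHN. The function names, the distance threshold, and the toy distance matrix are all assumptions made for illustration. Identity switches are ignored because only one frame is considered, and the brute-force search stands in for the Hungarian algorithm, which solves the same assignment problem in polynomial time. The hard matching step and the integer error counts are the non-differentiable parts that a differentiable proxy would need to replace.

```python
from itertools import permutations


def match_frame(dist, threshold):
    """Brute-force one-to-one matching between ground truth (rows) and tracks (columns).

    Prefers more matched pairs, then a smaller total distance.
    Pairs farther apart than `threshold` are left unmatched.
    Only practical for tiny matrices; hypothetical helper for illustration.
    """
    n_gt = len(dist)
    n_tr = len(dist[0]) if dist else 0
    if n_gt <= n_tr:
        candidates = ([(i, p[i]) for i in range(n_gt)]
                      for p in permutations(range(n_tr), n_gt))
    else:
        candidates = ([(p[j], j) for j in range(n_tr)]
                      for p in permutations(range(n_gt), n_tr))
    best_key, best_pairs = None, []
    for pairs in candidates:
        kept = [(i, j) for i, j in pairs if dist[i][j] <= threshold]
        key = (-len(kept), sum(dist[i][j] for i, j in kept))
        if best_key is None or key < best_key:
            best_key, best_pairs = key, sorted(kept)
    return best_pairs


def frame_scores(dist, threshold):
    """Single-frame MOTA and distance-based MOTP (identity switches ignored)."""
    if not dist:
        raise ValueError("at least one ground-truth object is required")
    pairs = match_frame(dist, threshold)
    n_gt, n_tr = len(dist), len(dist[0])
    false_negatives = n_gt - len(pairs)   # ground-truth objects with no track
    false_positives = n_tr - len(pairs)   # tracks with no ground-truth object
    mota = 1.0 - (false_negatives + false_positives) / n_gt
    # Mean distance over matched pairs; lower is better in this convention.
    motp = (sum(dist[i][j] for i, j in pairs) / len(pairs)
            if pairs else float("nan"))
    return mota, motp, pairs


# Toy example: 3 ground-truth objects, 2 tracks (assumed values).
dist = [[0.10, 0.90],
        [0.80, 0.20],
        [0.70, 0.95]]
mota, motp, pairs = frame_scores(dist, threshold=0.5)
print(pairs, round(mota, 3), round(motp, 3))
```

Here two of the three ground-truth objects are matched, so one false negative remains and MOTA is 1 - 1/3. Note that benchmark implementations often define MOTP through bounding-box overlap rather than distance; the distance form is used here only to keep the sketch short.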
