Scope – Audio-visual machine perception & interaction for robots

In short

Enabling socially aware robot behavior for interactions with humans. Emphasis on unsupervised and weakly supervised learning with audio-visual data, Bayesian inference, deep learning, and reinforcement learning. Challenging proof-of-concept demonstrators.

Scientific objectives and context

Develop robots that explore populated spaces, understand human behavior, and engage in multimodal dialog with several users. These tasks require audio and visual cues (e.g. clean speech signals, eye-gaze, head-gaze, facial expressions, lip movements, head movements, hand and body gestures) to be robustly retrieved from the raw sensor data. These cues cannot be reliably extracted by a static robot that listens to, looks at and communicates with people from a distance, because of acoustic reverberation and noise, overlapping audio sources, bad lighting, limited image resolution, narrow camera field of view, and visual occlusions. We will investigate audio and visual perception and communication at close range, e.g. face-to-face dialog: the robot should learn how to collect clean data (e.g. frontal faces, signals with high speech-to-noise ratios) and how to react appropriately to human verbal and non-verbal solicitations. We plan to demonstrate these skills with a companion robot that assists and entertains the elderly in healthcare facilities.

Research directions

We propose to investigate the coupling between dynamic Bayesian networks (DBNs) and deep neural networks (DNNs). The rationale is to combine the flexibility of DBNs in modeling complex variable dependencies with the ability of DNNs to learn goal- and data-driven representations. However, exact inference in DBNs with heterogeneous (continuous and discrete) latent variables and with intermittently available observed variables (possibly living in different mathematical spaces) is computationally intractable. We therefore plan to primarily investigate variational approximations of DBNs and to thoroughly study tractable and time-efficient solvers for the joint estimation of DBN parameters and DNN weights. Data-driven robot behaviors and human-robot interactions (HRIs) will be learned online in the framework of deep reinforcement learning (DRL). The task of learning optimal decisions from sequences of observations, actions and rewards, in the presence of complex latent spaces, is far from trivial, and it requires very large annotated datasets. As human annotation is seemingly infeasible for online HRI, we plan to develop learning from simulated interactions based on data generation methods.
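
As a point of reference, a variational approximation of this kind is usually formulated through the evidence lower bound. The sketch below uses generic notation that is an assumption and not taken from this project: x_{1:T} denotes the observed sequence, z_{1:T} the latent variables, \theta the model parameters (which could include DNN weights), and q the approximate posterior.

```latex
% Generic evidence lower bound (ELBO); all symbols are illustrative assumptions.
\begin{align}
\log p_\theta(x_{1:T})
  &= \mathcal{L}(q,\theta)
   + \mathrm{KL}\!\left( q(z_{1:T}) \,\middle\|\, p_\theta(z_{1:T} \mid x_{1:T}) \right)
   \;\ge\; \mathcal{L}(q,\theta), \\
\mathcal{L}(q,\theta)
  &= \mathbb{E}_{q}\!\left[ \log p_\theta(x_{1:T}, z_{1:T}) \right]
   - \mathbb{E}_{q}\!\left[ \log q(z_{1:T}) \right].
\end{align}
```

Since the KL term is non-negative, maximizing \mathcal{L} alternately over q and \theta tightens the bound and fits the parameters at the same time. Restricting q to a simpler family, for instance one that factorizes over the discrete and continuous latent variables, is what can make each update tractable.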

Motivation and expected impact

We propose to go well beyond state-of-the-art human-machine communication techniques that use hand-held, hands-free, or table-top devices and that assume single-source speech signals and close-range frontal views of people. To date, these techniques are unable to address challenging perception and interaction problems, e.g. with whom, when, where, and how should the robot hold a conversation? The coupling between high-level decision making, based on DBNs, DNNs, and DRL, and low-level sensor-based robot control, using audio and visual feedback, is completely novel.