Speech recognition is now used in many applications, such as virtual assistants, which collect, process, and store personal speech data on centralized servers, raising serious concerns about the privacy of their users. Embedded speech recognition frameworks have recently been introduced to address privacy issues during the recognition phase: in this case, a (pre-trained) speech recognition model is shipped to the user’s device so that the processing can be done locally, without users sharing their data. However, speech recognition technology still has limited performance in adverse conditions (e.g., noisy environments, reverberated speech, strong accents) and needs to be improved. This can only be achieved using large speech corpora that are representative of the actual users and of the various usage conditions. There is therefore a strong need to share speech data for improved training that benefits all users, while keeping speaker identity and voice characteristics private. It is also becoming clear that users should have better control over their data, so that they can decide not to transmit data whose semantic content is sensitive.
In this context, the ANR project DEEP-PRIVACY proposes a new paradigm based on a distributed, personalized, and privacy-preserving approach for speech processing, with a focus on machine learning algorithms for speech recognition. To this end, we propose to rely on a hybrid approach: the device of each user does not share its raw speech data and runs some private computations locally, while some cross-user computations are done by communicating through a server (or a peer-to-peer network). To satisfy privacy requirements at the acoustic level, the information communicated to the server should not expose sensitive speaker information. The project addresses the above challenges from the theoretical, methodological and empirical standpoints through two major scientific objectives.
The first objective is to learn privacy-preserving representations of the speech signal that disentangle the features that expose private information (to be kept on the user’s device) from the generic features useful for the task of interest (which satisfy some notion of privacy and can thus be shared with servers). For speech recognition, these representations correspond respectively to the speaker-specific information (to be protected) and the phonetic / linguistic information (to be shared) carried by the speech signal. We will explore several directions, all based on deep learning approaches. Besides traditional speech and speaker recognition measures, we will also use a formal notion of privacy to assess their performance.
The second objective concerns distributed algorithms and personalization, through the design of efficient distributed algorithms that operate in a setting where sensitive user data is kept on-device, with global components running on servers and personalized components running on personal devices. The personalized components allow for better speaker-adapted processing and recognition. Data transferred to servers should contain useful information for learning / updating global components (here, speech recognition models), while preserving user privacy. We will study the convergence guarantees of distributed training algorithms and investigate how much speaker information is carried by the data exchanged during training. Moreover, personalized components make it possible to introduce speaker-specific transforms and to adapt some model parameters to the speaker. We will also consider a peer-to-peer framework, as an alternative to servers, for data sharing and model training.
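
The split between global and personalized components described above can be illustrated with a minimal sketch. It is not the project's algorithm: it assumes a federated-averaging-style scheme, and it replaces the speech recognition model with a toy linear model y = w * x + b_u, where the weight w stands in for a shared global component and the bias b_u for a speaker-specific component. All names (local_update, federated_round, run_demo, the user identifiers) and the synthetic data are hypothetical. Each device trains on its own data, sends only its updated copy of w to the server, and keeps b_u on-device.

```python
import random


def local_update(w_global, b_personal, data, lr=0.05, steps=5):
    """Run a few gradient steps on-device; raw data never leaves this function."""
    w, b = w_global, b_personal
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error of the toy model y = w * x + b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def federated_round(w_global, personal, datasets):
    """One communication round: devices train locally, the server averages w."""
    shared_updates = []
    for user, data in datasets.items():
        w_user, personal[user] = local_update(w_global, personal[user], data)
        shared_updates.append(w_user)  # only the shared weight is transmitted
    return sum(shared_updates) / len(shared_updates)


def make_user_data(rng, w_true, b_user, n=20):
    """Synthetic, noiseless data for one user (an assumption for this sketch)."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, w_true * x + b_user) for x in xs]


def run_demo(rounds=200, seed=0):
    rng = random.Random(seed)
    # Hypothetical users: same global weight (3.0), different personal biases.
    true_biases = {"user_a": -1.0, "user_b": 0.5, "user_c": 2.0}
    datasets = {u: make_user_data(rng, 3.0, b) for u, b in true_biases.items()}
    personal = {u: 0.0 for u in datasets}  # personalized components, on-device
    w_global = 0.0                         # global component, on the server
    for _ in range(rounds):
        w_global = federated_round(w_global, personal, datasets)
    return w_global, personal


if __name__ == "__main__":
    w, personal = run_demo()
    print("global weight:", round(w, 4))
    print("personal biases:", {u: round(b, 4) for u, b in personal.items()})
```

In this toy setting the server only ever sees the shared weight, yet that weight is still computed from each user's data, which is why the project also asks how much speaker information the exchanged updates carry. The sketch makes no privacy guarantee of its own.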