
Detailed record

SoundLip

Enabling Word and Sentence-level Lip Interaction for Smart Devices

Article written by: Zhang, Qian; Zhao, Run; Yu, Yinggang; Wang*, Dong

Abstract: As a natural and convenient interaction modality, voice input has become indispensable to smart devices (e.g. mobile phones and smart appliances). However, voice input is strongly constrained by its surroundings and may cause privacy leakage in public areas. In this paper, we present SoundLip, an end-to-end interaction system enabling users to interact with smart devices via silent voice input. The key insight is to use inaudible acoustic signals to capture the lip movements of users as they issue commands. Previous works have treated lip reading as a naive classification task and thus can only recognize individual words. In contrast, our proposed system enables lip reading at both the word and sentence levels, which is more suitable for daily use. We exploit the built-in speakers and microphones of smart devices to emit acoustic signals and listen to their reflections, respectively. To better abstract representations from the multi-frequency, multi-modality acoustic signals, we design a hierarchical convolutional neural network (HCNN) that serves as the front-end and recognizes individual word commands. For sentence-level recognition, we then exploit a multi-task encoder-decoder network to bypass explicit temporal segmentation and output sentences in an end-to-end manner. We evaluate SoundLip on 20 individual words and 70 sentences from 12 participants. Our system achieves an accuracy of 91.2% at the word level and a word error rate of 7.1% at the sentence level in both user-independent and environment-independent settings. Given its innovative solution and promising performance, we believe that SoundLip makes a significant contribution to the advancement of silent voice input technology.
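For reference, the word error rate (WER) reported at the sentence level is the standard metric: the word-level edit distance between the recognized sentence and the reference, divided by the reference length. A minimal stdlib-only sketch (this is the conventional definition, not the paper's own evaluation code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over words / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / max(len(ref), 1)

# One substitution in a four-word reference gives a WER of 0.25.
print(wer("turn on the light", "turn off the light"))  # 0.25
```

A reported WER of 7.1% thus means roughly one word-level error per fourteen reference words across the evaluated sentences.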


Language: English