Looking to Listen: Audio-Visual Speech Separation

“We present a model for isolating and enhancing the speech of desired speakers in a video. The input is a video (frames + audio track) with one or more people speaking, where the speech of interest is interfered by other speakers and/or background noise. Both audio and visual features are extracted and fed into a joint audio-visual speech separation model. The output is a decomposition of the input audio track into clean speech tracks, one for each person detected in the video. This allows us to then compose videos where speech of specific people is enhanced while all other sound is suppressed. Our model was trained using thousands of hours of video segments from our new dataset, AVSpeech, which we plan to release publicly.”
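The description above amounts to a mask-based separation pipeline: a mixture spectrogram and per-speaker visual (face) embeddings are fused by a joint network that predicts one spectrogram mask per visible speaker, and each mask is applied to the mixture to recover that speaker's track. The sketch below illustrates that idea only; the class name AudioVisualSeparator, the layer choices, the feature dimensions, and the tensor shapes are assumptions for illustration and do not reproduce the published architecture.

```python
import torch
import torch.nn as nn

class AudioVisualSeparator(nn.Module):
    """Conceptual sketch of joint audio-visual speech separation.

    Shapes are illustrative, not taken from the paper:
      audio:  (batch, time, freq)            mixture magnitude spectrogram
      visual: (batch, speakers, time, vdim)  per-speaker face embeddings
    Returns:
      masks:  (batch, speakers, time, freq)  one spectrogram mask per speaker
    """

    def __init__(self, freq_bins=257, vdim=512, hidden=400):
        super().__init__()
        self.audio_enc = nn.Linear(freq_bins, hidden)    # audio feature stream
        self.visual_enc = nn.Linear(vdim, hidden)        # visual feature stream
        self.fusion = nn.LSTM(2 * hidden, hidden,
                              batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden, freq_bins)

    def forward(self, audio, visual):
        batch, speakers, _, _ = visual.shape
        a = self.audio_enc(audio)                         # (batch, time, hidden)
        masks = []
        for s in range(speakers):
            v = self.visual_enc(visual[:, s])             # (batch, time, hidden)
            fused, _ = self.fusion(torch.cat([a, v], dim=-1))
            masks.append(torch.sigmoid(self.mask_head(fused)))
        return torch.stack(masks, dim=1)                  # (batch, speakers, time, freq)

# Toy usage: a two-speaker mixture, separated by masking the mixture spectrogram.
mix = torch.randn(1, 100, 257).abs()        # fake mixture spectrogram
faces = torch.randn(1, 2, 100, 512)         # fake embeddings for two detected faces
model = AudioVisualSeparator()
masks = model(mix, faces)                   # (1, 2, 100, 257)
separated = masks * mix.unsqueeze(1)        # one masked spectrogram per speaker
```

In a real system, each masked spectrogram would then be inverted back to a waveform (e.g. with the mixture's phase), giving the per-speaker clean speech tracks the description refers to.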

The work is described in the paper “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation”.

 

 
