Extracting audio from visual information

Submitted by Koen on Tuesday 12 August 2014

I'm not going to copy the whole article, but this is the intro:

"Researchers at MIT, Microsoft, and Adobe have developed an algorithm that can reconstruct an audio signal by analyzing minute vibrations of objects depicted in video. In one set of experiments, they were able to recover intelligible speech from the vibrations of a potato-chip bag photographed from 15 feet away through soundproof glass. In other experiments, they extracted useful audio signals from videos of aluminum foil, the surface of a glass of water, and even the leaves of a potted plant."

You can read the full article here.

Of course, similar things have been possible for many years, using so-called "laser microphones".
Here, a laser measures the subtle movements of window glass in response to sound produced inside a room (the video below shows how laser beams can carry sound signals and how that can be used to intercept sound in rooms with windows). There are also devices that use a laser beam and smoke or vapor to detect sound vibrations in free air: sound pressure waves disturb the smoke, which in turn changes the amount of laser light reaching a photodetector.

What's interesting is that this technique uses "normal" high-speed cameras. The researchers also did experiments with standard 60 fps video recordings, and although "audio reconstruction wasn’t as faithful as that with the high-speed camera, it may still be good enough to identify the gender of a speaker in a room; the number of speakers; and even, given accurate enough information about the acoustic properties of speakers’ voices, their identities."

Apparently they are making use of techniques from earlier work on algorithms that amplify very small variations in video images (see here for more info on that), which was also used to detect heart rate from video images without any contact devices.
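
To get a feel for why a tiny periodic variation in video can be measured at all, here is a minimal sketch in Python. It is not the researchers' algorithm. It assumes each frame has already been reduced to a single number (say, the mean brightness of a patch of skin), and it uses a synthetic signal instead of real video. The function names, the 0.7 to 3.0 Hz search band and the signal amplitudes are all my own assumptions for illustration.

```python
import math


def synthetic_brightness(fps, seconds, pulse_hz, amplitude=0.05):
    """Hypothetical stand-in for the mean brightness of each video frame.

    A real pipeline would average the pixels of a region in every frame.
    Here we fake it: a constant level, a tiny pulse, and a slow lighting drift.
    """
    n = int(fps * seconds)
    return [
        100.0
        + amplitude * math.sin(2 * math.pi * pulse_hz * i / fps)
        + 0.02 * math.sin(2 * math.pi * 0.3 * i / fps)
        for i in range(n)
    ]


def dominant_frequency(samples, fps, f_min, f_max, step=0.01):
    """Return the frequency in [f_min, f_max] (Hz) with the most spectral power.

    This is a brute-force Fourier scan over candidate frequencies, which is
    slow but needs nothing outside the standard library.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # remove the constant brightness level

    best_f, best_power = f_min, -1.0
    steps = int(round((f_max - f_min) / step))
    for k in range(steps + 1):
        f = f_min + k * step
        re = sum(c * math.cos(2 * math.pi * f * i / fps) for i, c in enumerate(centered))
        im = sum(c * math.sin(2 * math.pi * f * i / fps) for i, c in enumerate(centered))
        power = re * re + im * im
        if power > best_power:
            best_f, best_power = f, power
    return best_f


if __name__ == "__main__":
    fps = 30
    # Assumed example: a 1.2 Hz pulse hidden in a brightness of about 100.
    signal = synthetic_brightness(fps=fps, seconds=10, pulse_hz=1.2)
    f = dominant_frequency(signal, fps=fps, f_min=0.7, f_max=3.0)
    print("estimated pulse: %.2f Hz (about %d beats per minute)" % (f, round(f * 60)))
```

The variation here is 0.05 on a brightness of about 100, far too small to see by eye, yet it stands out clearly once you look at the signal over time in the frequency domain. Restricting the search to a plausible band keeps the slow lighting drift from being mistaken for the pulse.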

Goes to show that video and sound are actually more closely connected to each other than one would think ;-)

