New glasses can ‘hear’ what you lip sync — and tell your phone

The lip-reading device enables voice commands without the voice

Voice commands are a hands-free way to interact with smartphones, but in public places they’re not very private. Soon, people may have the option to go hands-free and voice-free.

Westend61/Getty Images

“Siri, text mom.”

“Alexa, play ‘Flowers’ by Miley Cyrus.”

Voice commands are convenient. But not at a deafening concert, in a quiet library or when you can't use your voice at all. New frames for eyeglasses that read the wearer's lips now offer a solution.

Lip-reading involves tracking facial movements to determine what someone is saying. Many lip-reading devices point a camera at the user’s face. Others rely on sensors stuck in or around the speaker’s mouth. Neither approach is suitable for daily use, says Ruidong Zhang. He studies information science at Cornell University in Ithaca, N.Y.

Ruidong Zhang built the EchoSpeech prototype on an inexpensive, off-the-shelf pair of glasses. On his right, two tiny microphones peek out from underneath the lens. On his left, two speakers hang down a bit more. In future versions, this equipment could be totally hidden inside the frames.

Dave Burbank/Cornell University

His team built the new lip-reading tech on a pair of eyeglasses. It uses acoustics — sound — to recognize silent speech. Zhang presented this work April 19 at the ACM Conference on Human Factors in Computing Systems in Hamburg, Germany.

Today, voice commands aren’t private, says Pattie Maes. She’s an expert in human-computer interactions and artificial intelligence (AI). She works at the Massachusetts Institute of Technology in Cambridge. Developing “silent, hands-free and eyes-free approaches” could make digital interactions more accessible while keeping them confidential, she says.

Maes wasn’t involved in the new work, but she has developed other types of silent speech interfaces. She’s eager to see how this one compares in areas such as usability, privacy and accuracy. “I am excited to see this novel acoustic approach,” she says.

Hearing silent speech

“Imagine the sonar system that whales or submarines use,” says Zhang. They send a sound into their environment and listen for echoes. From those echoes, they locate objects in their surroundings.

“Our approach is similar, but not exactly the same,” Zhang explains. “We’re not just interested in locating something. Instead, we’re trying to track subtle moving patterns.”

Zhang calls the new tech EchoSpeech. It consists of two small speakers under one lens of a pair of glasses, two small microphones under the other lens, and a circuit board attached to one of the side arms.

When EchoSpeech is switched on, its speakers play high-pitched sounds. People can’t hear these. But the sound waves still reverberate in every direction. Some travel around the user’s lips and mouth. While speaking, the user’s facial movements change the paths of those sound waves. That, in turn, changes the echo patterns picked up by the microphones.
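
To picture how that works, here is a minimal Python sketch of the general echo-sensing idea: play a short, near-ultrasonic chirp, record what comes back, and cross-correlate the recording with the chirp to build an "echo profile." When the echo paths around the mouth shift, the profile shifts too. The sample rate, chirp frequencies and the toy two-path "mouth" model below are all illustrative assumptions, not the researchers' actual signal processing.

```python
# Minimal sketch of active acoustic sensing (not the EchoSpeech code itself).
# Idea: emit a short, near-ultrasonic chirp, record the echoes, and
# cross-correlate the recording with the chirp. Peaks in the result form an
# "echo profile." When the mouth moves, echo delays and strengths change,
# so the profile changes. All parameters here are illustrative assumptions.
import numpy as np

FS = 48_000                       # assumed sample rate (Hz)
t = np.arange(0, 0.01, 1 / FS)    # 10-ms probe signal

# Linear chirp sweeping 17-20 kHz, near the edge of human hearing.
chirp = np.sin(2 * np.pi * (17_000 * t + (3_000 / (2 * 0.01)) * t**2))

def simulate_recording(delays_s, gains):
    """Toy stand-in for the microphone: a sum of delayed, scaled chirp copies."""
    rec = np.zeros(len(chirp) + FS // 100)
    for d, g in zip(delays_s, gains):
        start = int(d * FS)
        rec[start:start + len(chirp)] += g * chirp
    return rec + 0.01 * np.random.randn(len(rec))   # a little noise

def echo_profile(recording):
    """Cross-correlate the recording with the probe chirp to locate echoes."""
    return np.correlate(recording, chirp, mode="valid")

# Mouth closed vs. mouth open, modeled as slightly different echo paths.
closed = echo_profile(simulate_recording([0.0002, 0.0005], [1.0, 0.6]))
opened = echo_profile(simulate_recording([0.0002, 0.0006], [1.0, 0.4]))

# The change between successive profiles is what tracks facial movement.
print("profile change:", np.linalg.norm(opened - closed))
```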

These patterns are sent to the wearer's smartphone over Bluetooth. Using AI, an EchoSpeech app then unravels the echo patterns. It matches each pattern to a command, which the smartphone then carries out.
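
That matching step can be imagined as a small classifier that takes a short window of echo profiles and scores it against each known command. The PyTorch sketch below is an assumption about what such a model might look like; the command list, input sizes and network layers are placeholders, not the team's published design.

```python
# Hedged sketch of the recognition step: a tiny classifier that maps a
# window of echo-profile frames to one of a fixed set of commands.
# The article only says an AI model on the phone matches echo patterns to
# about 31 commands; everything specific below is an assumption.
import torch
import torch.nn as nn

COMMANDS = ["play", "pause", "next", "hey siri"]   # illustrative subset
WINDOW_FRAMES, PROFILE_BINS = 60, 64               # assumed input size

classifier = nn.Sequential(                        # small 1-D CNN over time
    nn.Conv1d(PROFILE_BINS, 32, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1),                       # pool over the time axis
    nn.Flatten(),
    nn.Linear(32, len(COMMANDS)),
)

# One fake "utterance": WINDOW_FRAMES echo profiles of PROFILE_BINS values each.
echo_window = torch.randn(1, PROFILE_BINS, WINDOW_FRAMES)
scores = classifier(echo_window)
predicted = COMMANDS[scores.argmax(dim=1).item()]
print("predicted command:", predicted)             # untrained, so arbitrary
```

In practice such a model would be trained on labeled recordings of users silently mouthing each command, so that distinct echo patterns map to distinct outputs.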

To test this tech, 24 people took turns wearing the glasses. They gave silent commands while sitting or walking. EchoSpeech performed well in both cases, even with loud background noises. Overall, it was about 95 percent accurate.

The prototype cost less than $100 to build, and Zhang says frames could likely be engineered to hide the electronics in future versions. Need prescription lenses? No problem. Just pop them into the EchoSpeech frames.

In this video, Cornell University researchers demonstrate how EchoSpeech uses facial movements to recognize silent speech. They’re now trying to increase its vocabulary using tools from speech-recognition AI.

Enhancing personal communication

EchoSpeech currently recognizes 31 voice commands, from “play” to “hey, Siri.” It also recognizes numbers that are three to six digits long. But those aren’t limits, Zhang says. He thinks future versions could recognize a much larger vocabulary. “If people can learn to read lips efficiently, then so can AI,” he says.

If so, users could write personal text messages via silent speech. In a noisy restaurant, they could use that approach to send messages to friends who are hard of hearing or far away, instead of trying to yell over the noise or type their words. And those who have lost their voices could participate in conversations face-to-face. Their facial movements could be interpreted in real time and their words texted to their friends’ smartphones.  

EchoSpeech was designed to interpret silent speech, but it might also help recreate voices. People who’ve had their vocal cords removed have been contacting Zhang’s team. They want to know if this interface could read their lips and then speak out loud for them.

He’s now exploring whether EchoSpeech could do this in a person’s own voice. Echo patterns for the same word are slightly different among speakers. The differences could reflect the specific vocal qualities of the speaker, if they can be untangled.   

People without voices often use text-to-voice programs that sound robotic. The message “doesn’t have your emotion, doesn’t have your tone, doesn’t have your speech style,” Zhang notes. Right now, he says, “We’re trying to maintain that information to get an actual living voice.”

This is one in a series presenting news on technology and innovation, made possible with generous support from the Lemelson Foundation.