Google is bringing real-time audio transcription to a smartphone app, drawing on the big data resources and real-time analytics of the cloud.
Automated, real-time transcription technology has been around for some time, introduced to the mass market several decades ago by companies such as Dragon Systems. Open and closed captioning systems have been seen on television programs as far back as 1973.
Such technology has been nothing short of revolutionary for the daily lives of deaf or hard-of-hearing people. The challenge with such systems is that they are not readily available for more informal or spontaneous discussions. Google has been active in this space, and recently unveiled technology that promises to extend real-time transcription to everyday life.
The service, Live Transcribe, is a free Android app that makes real-world conversations more accessible by bringing the power of automatic captioning into everyday, conversational use. The service runs on Google Cloud and captions conversations in real time, supporting over 70 languages and more than 80 percent of the world’s population. “You can launch it with a single tap from within any app, directly from the accessibility icon on the system tray,” relates Sagar Savla, product manager for machine perception at Google, in a recent post.
The user experience (UX) was an important part of the project, he says. Google partnered with Gallaudet University on user experience research to ensure core user needs were satisfied while maximizing the potential of the technology. Hardware was the first consideration, with the decision to develop a smartphone app due to the “sheer ubiquity of these devices and the increasing capabilities they have.”
Confidence in the transcription being delivered is another critical UX factor. As anyone who has watched closed captioning can attest, words and phrases are sometimes garbled into confusing dialogue. The team considered but abandoned the idea of a color-coded text system for displaying confidence, as this would be a distraction, Savla says. “Instead, Live Transcribe focuses on better presentation of the text and supplementing it with other auditory signals besides speech.”
Dealing with noisy rooms or environments was another challenge for the development team. “Known as the cocktail party problem, understanding a speaker in a noisy room is a major challenge for computers,” says Savla. To address this, the team added an indicator consisting of two concentric circles that helps visualize the volume of user speech relative to background noise — with an inner circle representing the environmental noise level, and an outer circle representing the fidelity of the speaker’s voice. “This gives users instant feedback on how well the microphone is receiving the incoming speech from the speaker, allowing them to adjust the placement of the phone,” he adds.
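The idea of comparing speech volume to ambient noise can be sketched with simple signal levels. The following Python snippet is illustrative only — Google has not published Live Transcribe’s implementation — and uses a basic RMS-energy measure with a slow-moving noise-floor estimate to produce a speech-to-noise ratio, the kind of value that could drive a visual indicator like the two concentric circles:

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one audio frame (a list of samples)."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

class LevelIndicator:
    """Tracks an ambient-noise estimate and reports how far the current
    frame's level rises above it. Purely a sketch of the concept; the
    names and smoothing constant here are assumptions, not Google's code."""

    def __init__(self, alpha=0.05):
        self.noise_floor = None  # slow-moving estimate of background noise
        self.alpha = alpha       # smoothing factor for loud frames

    def update(self, frame):
        level = rms(frame)
        if self.noise_floor is None or level < self.noise_floor:
            # Quiet frames pull the noise floor down immediately...
            self.noise_floor = level
        else:
            # ...while loud frames raise it only slowly, so brief speech
            # is not mistaken for a noisier room.
            self.noise_floor += self.alpha * (level - self.noise_floor)
        # Ratio near 1 means the frame blends into the noise; a large
        # ratio means the speaker stands out clearly above it.
        return level / max(self.noise_floor, 1e-9)
```

Feeding the indicator a quiet frame followed by a loud one yields a ratio near 1 and then a much larger value, giving the user the instant microphone-placement feedback Savla describes.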
Network connectivity and data consumption were other issues the Google research team sought to alleviate. “Live Transcribe combines the results of extensive user experience research with seamless and sustainable connectivity to speech processing servers,” Savla relates. To minimize network consumption, the team added “an on-device neural network-based speech detector, built on our previous work with AudioSet. This network is an image-like model which detects speech and automatically manages network connections to the cloud engine, minimizing data usage over long periods of use.”
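The role of that on-device detector — gating the cloud connection so audio is streamed only while speech is present — can be sketched as follows. This is a minimal Python illustration of the concept, not Google’s model: a fixed energy threshold stands in for the neural network, and the `hangover` parameter (frames to keep streaming after speech stops) is an assumption added to avoid dropping connections mid-sentence:

```python
def frame_energy(frame):
    """Mean squared amplitude of one audio frame (a list of samples)."""
    return sum(s * s for s in frame) / len(frame)

class SpeechGate:
    """Keeps a (hypothetical) cloud connection open only while speech-like
    audio is detected, mimicking the data-saving behavior described above.
    Threshold-based detection is a stand-in for the neural network."""

    def __init__(self, threshold=0.01, hangover=5):
        self.threshold = threshold  # energy above this counts as speech
        self.hangover = hangover    # quiet frames tolerated before closing
        self.quiet_frames = 0
        self.streaming = False

    def process(self, frame):
        if frame_energy(frame) >= self.threshold:
            self.quiet_frames = 0
            self.streaming = True       # would open or keep the connection
        else:
            self.quiet_frames += 1
            if self.quiet_frames > self.hangover:
                self.streaming = False  # would close it, saving data
        return self.streaming
```

During long stretches of silence the gate closes the connection entirely, which is how an on-device detector can cut network usage over extended sessions.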
For future releases, the Google team plans to enhance on-device recognition, speaker separation, and speech enhancement.