Monday, May 30, 2011

A human touch for computer-generated voices


ISRAEL MINISTRY OF FOREIGN AFFAIRS

30 May 2011
Israel's Vivotext develops the world's first artificial text-to-speech system that actually sounds like a real person is talking.
  
A former concert pianist, Gershon Silbert applied his musical skills to technology to create Vivotext.
By David Halevi
That voice you hear when you call the phone company's information line most likely does not belong to a person. It belongs to a computer, as do the voices giving us information and instructions on GPS devices, websites, toys and games, cell phones, remote controls and many other places.

All use a technology called text-to-speech, or TTS, which uses computer-generated voices to read text from a computer document.

The sad truth is that TTS voices aren't pleasant to listen to, says the CEO of Israeli startup Vivotext, Gershon Silbert. His company has developed the world's first artificial TTS system that actually sounds human. "The market for artificial voices based on TTS technology is huge. TTS is used in so many areas, but many people find the voices hard to handle because they sound so mechanical," says Silbert.

Natural voices are rated highly by consumers
The cold-sounding, unemotional voices produced by products like AT&T Natural Voices Text-to-Speech (an oxymoron if there ever was one, in Silbert’s opinion), Nuance Realspeak and Loquendo Synthetic Speech, among others, are a real turnoff for consumers. Studies show that products with more natural-sounding artificial voices are rated far more highly by consumers than are standard TTS-voiced products.

The problem for manufacturers is that currently the only way to produce human-sounding voices, with all their emotion and inflection, is to use actual human beings - a much more expensive proposition.

"That's one reason the audiobook market is still so small," says Silbert. A mere two percent of books are converted to audiobook format, because the only ones that really sell are those read by humans (often celebrities, for many bestsellers). "If there were a cheaper, automated way to produce human-sounding TTS, publishers would pounce on it," he adds.

And that is exactly what Vivotext offers, says Silbert. "Our patented TTS technology is largely based on the field of music performance analysis, as well as on research in the areas of phonetics, syntax, lexicography, and digital signal processing [DSP]. We have an extensive collection of voice sample libraries that can be adjusted to add or emphasize a wide array of expressiveness."

That expressiveness - the ability to adjust emotion and feeling - is what makes Vivotext voices sound more human, and far more appropriate for almost any TTS-based product, says Silbert.

Applying music principles to speech
The music performance analysis technology that Silbert developed for Vivotext takes the principles used to convert printed music scores into actual music, and applies them to text-to-speech - resulting in a much more human-sounding, emotionally expressive computer-generated voice.

"Just as variations in tempo, articulation and dynamics contribute to the effectiveness of a musical performance, speech attributes such as pitch, duration and amplitude - known as prosody - are at the core of effective TTS, and are critical to conveying the phonemic, syntactic and pragmatic content of words and sentences," says Silbert.

Silbert knows about these things because he was, for many years, an internationally acclaimed pianist. His best-received work, Bach's "Goldberg Variations on Piano," was produced in 1994. Now he is applying his expertise in music performance and production to making computer voices sound more human. Using a method called MOR (music objects recognition), "we can produce voices with highly intelligible enunciation and natural flow in a variety of speaking styles," says Silbert.

With the Vivotext system, programmers can use simple menu choices and desktop selection tools to add or subtract levels of emotion in a TTS sequence. The basic Vivotext system evaluates the context of a piece of text using phonetic, semantic and syntactic analysis - determining, for example, whether the text is a statement or question, part of a conversation or an informational request.

Sadness, longing, or enthusiasm
The initial evaluation also takes into account punctuation such as question marks, and adds that to the mix. Finally, the programmer can use menu choices to emphasize emotions - happy, sad, longing, enthusiastic, etc. - to produce a voice that sounds remarkably human.
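
For readers curious how such a pipeline might be wired together, here is a minimal sketch in Python. It is not Vivotext's method, which the article does not describe in detail. It only illustrates the general idea of reading a sentence's punctuation and then letting a chosen emotion adjust the three prosody attributes Silbert names: pitch, duration and amplitude. The names Prosody, EMOTION_PRESETS, sentence_type and plan_prosody, and all the numeric values, are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prosody:
    """Relative prosody settings; 1.0 means 'neutral' for each attribute."""
    pitch: float      # higher value = higher voice
    duration: float   # higher value = slower speech
    amplitude: float  # higher value = louder speech


# Hypothetical emotion presets. The numbers are illustrative assumptions,
# not values taken from Vivotext or any other real TTS product.
EMOTION_PRESETS = {
    "neutral": Prosody(1.0, 1.0, 1.0),
    "happy": Prosody(1.15, 0.95, 1.1),
    "sad": Prosody(0.9, 1.2, 0.85),
    "enthusiastic": Prosody(1.2, 0.9, 1.25),
}


def sentence_type(text: str) -> str:
    """Classify a sentence by its final punctuation mark (a toy rule)."""
    stripped = text.strip()
    if stripped.endswith("?"):
        return "question"
    if stripped.endswith("!"):
        return "exclamation"
    return "statement"


def plan_prosody(text: str, emotion: str = "neutral") -> Prosody:
    """Combine the punctuation cue with the selected emotion preset."""
    preset = EMOTION_PRESETS[emotion]
    # Assumption: questions get a slightly raised pitch; everything else
    # keeps the preset's pitch unchanged.
    pitch_boost = 1.1 if sentence_type(text) == "question" else 1.0
    return Prosody(
        pitch=round(preset.pitch * pitch_boost, 3),
        duration=preset.duration,
        amplitude=preset.amplitude,
    )


if __name__ == "__main__":
    print(plan_prosody("Where is the station?", "neutral"))
    print(plan_prosody("I miss you.", "sad"))
```

A real system would work at the level of individual phonemes and syllables rather than whole sentences, and would feed the resulting settings into a speech synthesizer. The sketch stops at the planning step.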

Vivotext is based in and funded by the Mofet B'Yehuda incubator, located south of Jerusalem. Funding deals with several independent investors are in the works. The company's management team consists of Silbert, CTO Dr. Yossef Ben-Ezra and chairman Samuel H. Solomon.

The market for Vivotext is huge, and because there's nothing quite like it in the TTS world, the industry is beginning to take notice as Vivotext nears its goal of building a complete sample library. The company has several deals pending - one with a large US toy manufacturer and another with a major US audiobook publisher.

"We get a lot of inquiries at trade shows and industry forums, and we hear the same thing everywhere - how warm and human-like our voices are," says Silbert. "Anyone who uses artificial voices in their products absolutely loves what we are doing."