Synthesizing Your Voice

A recent article in the “Science and technology” section of The Economist magazine discussed advances in the effort to synthesize the human voice. There has long been a small cottage industry dedicated to this work, which used to be called voice banking. It was primarily designed to benefit people who were at risk of losing their ability to speak, either due to illness or to impending surgery.

People in this situation would spend one or more days in a studio recording a variety of phrases, each with several different emotional deliveries and contexts. Sort of like “three in a row” on steroids. But in this case, there would be the added urgency of knowing that at some point in the future, the result of this effort would be the entire range of verbal expression available to them. That would probably be a strong motivator, even for someone with no training in acting.

Voice Banking Gets Disrupted

This industry, like so many others, is now being disrupted by technology. Companies in Europe, Asia and the US are developing software that can produce a synthetic version of a person’s voice based on a much smaller number of examples of that person’s speech…as few as 50 to 150 samples of spoken phrases or sentences.

Even the name of the practice has been modernized; now it is generally known as voice cloning rather than voice banking. Voice cloning software retrieves and reassembles the individual phonemes in the spoken samples to create words “on the fly”. So rather than being restricted to the specific phrases (and deliveries) recorded using the old method, this new approach could in theory be used to “say” anything.
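For the technically curious, here is a toy Python sketch of the "retrieve and reassemble" idea described above. It is only an illustration under loud assumptions: the phoneme labels, the tiny clip bank, the pronunciation table, and the `synthesize` function are all hypothetical, and the "audio" is just short lists of numbers standing in for recorded snippets. Real voice cloning software is surely far more sophisticated than this.

```python
# Toy sketch of phoneme-by-phoneme assembly. All names and data are hypothetical.

# Hypothetical "voice bank": each phoneme maps to a short clip, represented
# here as a list of sample values cut from the speaker's recordings.
PHONEME_CLIPS = {
    "HH": [0.1, 0.2],
    "AH": [0.3, 0.4, 0.3],
    "L": [0.2, 0.1],
    "OW": [0.5, 0.4, 0.2],
}

# Hypothetical pronunciation table: word -> sequence of phonemes.
PRONUNCIATIONS = {
    "hello": ["HH", "AH", "L", "OW"],
    "low": ["L", "OW"],
}


def synthesize(word):
    """Build 'audio' for a word by joining the clips of its phonemes in order."""
    phonemes = PRONUNCIATIONS[word.lower()]
    audio = []
    for phoneme in phonemes:
        audio.extend(PHONEME_CLIPS[phoneme])
    return audio


# "low" was never recorded as a word, yet it can be assembled from stored pieces.
print(synthesize("low"))
```

The point of the sketch is the design choice itself: once the pieces are stored per phoneme rather than per phrase, any word whose phonemes are in the bank can be assembled, whether or not the speaker ever recorded it.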

Like many new technologies, voice cloning relies on computing power and clever software to create its magic. These initial efforts are reportedly understandable but still largely recognizable as non-human speech, and work is underway to make them more lifelike. The potential applications could be as disparate as making a workplace robot speak in the boss’s voice (which sounds fairly creepy), or allowing a mother to personalize her child’s favorite toy with her own voice. Clearly, given the explosive growth in the “internet of things”, there is a nearly unlimited number of ways to use this technology.

Unfortunately, there are also plenty of potentially harmful ways to use it. Even now, some of the software can clone a voice based on just a few minutes of recording retrieved online. In other words, the cloned voice could belong to anyone: a stranger, a celebrity, or a politician. Any device or account that requires voice-id might be vulnerable to hacking, and disinformation campaigns might have a potentially powerful new tool.

Will Voice Cloning Compete with Voiceover?

Driverless technology is already perceived as a long-term threat to the trucking industry. Is voice cloning a similar threat to our industry?

Probably not at present, since the technology is still a few steps away from being convincing to the human ear. But we know that the software engineers will continue to make improvements, just as the engineers at IBM did in order to allow the Deep Blue computer to beat Garry Kasparov (the reigning world chess champion at the time) in their second match in 1997, after losing to him in 1996.

So it seems a good bet that, at some point in the future, some parts of the voiceover business may be disrupted by voice cloning. Maybe some IVR applications, or maybe small local businesses with tight ad budgets. The temptation to use a cloned publicly available voice might be difficult to resist, at least for some. Maybe it will create a new branch of intellectual property law, as celebrities attempt to rein in the unauthorized use of their voices.

The Challenge For Voice Actors

The humanists among us (that includes me) will root for Homo sapiens in this “man vs. machine” contest. And the technologists can counter with software to detect the telltale digital signature of audio produced by voice cloning. Such software is currently being developed, and it is conceptually akin to the de-clicker software used in our industry, only with a different target and purpose. It is another chapter in the technological cat-and-mouse game.

But ultimately, I think the question for the voiceover industry is this: “Can a machine learn copy connection, empathy, passion, irony, or any other emotional value required for a good read?” That is quite a different challenge from the tactical/strategic one that allowed Deep Blue to beat Kasparov. I am betting that the answer will remain “no” for quite a while. But even if it is just a far-off blip on the voiceover threat radar, we should use voice cloning as a reminder to always commit to the script, whatever it is.
