Your voice is yours alone – as unique to you as your fingerprints, eyeballs and DNA.
Unfortunately, that doesn’t mean it can’t be spoofed. And that reality could undermine one of the promised security benefits of multi-factor authentication, which requires “something you are,” along with something you have or you know. In theory, even if attackers can steal passwords, they can’t turn into you.
But given the march of technology, that is no longer a sure thing. Fingerprints are no longer an entirely hack-proof method of authentication – they can be spoofed.
That will soon be true of your voice as well.
The risk goes well beyond recent warnings from the Federal Communications Commission (FCC) and Better Business Bureau (BBB) about spam callers trying to get a victim to say the word “yes,” which they record and then use to authorize fraudulent credit card or utility charges, or to “prove” that the victim owes them money for services never ordered.
This technology is aimed at “cloning” an individual’s voice accurately enough to make him or her say anything you want. The potential risks are obvious: If your phone requires your voice to unlock it, an attacker with some audio of your voice could do it.
It is not perfect yet. But it is already remarkably close. A demonstration last fall at Adobe Max 2016 of the company’s VoCo, nicknamed “Photoshop for voice,” turned a recording of a man saying, “I kissed my dogs and my wife,” into “I kissed Jordan three times.” The audience went crazy.
The pitch for the product: “With a 20-minute voice sample, VoCo can make anyone say anything.”
More recently, researchers from the University of Montreal’s Institute for Learning Algorithms laboratory announced that they are seeking investors for their voice imitation software, Lyrebird, which they say will be able to mimic any voice from as little as a minute of audio recording.
According to Scientific American, the Lyrebird technology relies on “artificial neural networks – which use algorithms designed to help them function like a human brain – that rely on deep-learning techniques to transform bits of sound into speech.”
The researchers say the system can then adapt to any voice based on only a one-minute sample of someone’s speech.
The exciting – or ominous – implication is that, as Scientific American put it, after learning the “pronunciation of characters, phonemes and words in any voice … it can extrapolate to generate completely new sentences and even add different intonations and emotions.”
Once perfected, there are numerous possibilities for mischief – well beyond simply creating comedic videos spoofing the voices of your favorite celebrities. Besides undermining voice-based verification, leading to identity theft or other fraud – Santander Bank was running ads just this past week on voice verification – it could eliminate the use of voice or video recordings as evidence in court.
Lyrebird itself, in a brief ethics statement on its website, acknowledges that its product “could potentially have dangerous consequences such as misleading diplomats, fraud and more generally any other problem caused by stealing the identity of someone else.”
The statement adds that “by releasing our technology publicly and making it available to everyone, we want to ensure that there will be no such risks.”
The technology is getting mixed responses from the security community. Bruce Schneier, CTO of IBM Resilient, author and encryption guru, told Scientific American that fake audio clips have become “the new reality.”
On his own blog, Schneier wrote: “Imagine the social engineering implications of an attacker on the telephone being able to impersonate someone the victim knows. I don't think we're ready for this.”
But that got a bit of pushback on his comment thread. One reader argued, “As a species ‘we are never ready’ for what comes along, we learn to adapt through experience, it's probably our strongest survival skill.”
Another commenter, noting that this concern is not new, cited a report from 2003 about a professor at Oregon Health & Science University’s OGI School of Science & Engineering questioning whether audiotapes periodically released by the late terrorist mastermind Osama bin Laden were real.
“Because voice transformation technologies are increasingly available, it is becoming harder to detect whether a voice has been faked,” said Jan van Santen, a mathematical psychologist at the university.
But, of course, the audio quality of those recordings was notoriously poor. The quality of voice imitation now coming from Adobe’s VoCo, WaveNet from Alphabet (the parent company of Google) and Lyrebird is orders of magnitude better, and is expected to improve further in the next year or two.
Even so, authentication experts say voice can still be a credible factor in confirming identity – as long as it is not the only factor.
“If the sole determination of identity is voice, we are in trouble,” said James Stickland, CEO of Veridium.
But, if it is one element of what he called “an ensemble” that includes possession (something you have, like a token) and knowledge (password), voice can still “play an integral role” in authentication.
If, as Schneier wrote, we are not ready for voice spoofing technology, that is because “most people still segment possession, knowledge-based and biometric authentication,” Stickland said. “The future of authentication combines all of these and more.”
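Stickland’s “ensemble” idea can be illustrated with a minimal sketch. It is not Veridium’s implementation; the `FactorResult` type, the `authenticate` function and the two-factor threshold are all assumptions made for illustration. The point it shows is that a perfectly cloned voice satisfies only one kind of factor, and that two biometrics still count as a single kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactorResult:
    """Outcome of one check. `kind` is a hypothetical label for the factor type."""
    kind: str      # "biometric", "possession" or "knowledge"
    passed: bool


def authenticate(results, required_kinds=2):
    """Accept only if checks from at least `required_kinds` distinct factor types passed."""
    passed_kinds = {r.kind for r in results if r.passed}
    return len(passed_kinds) >= required_kinds


# A cloned voice on its own covers only the biometric factor.
spoof_only = [FactorResult("biometric", True)]

# Voice plus a token the user physically holds covers two factor types.
voice_and_token = [
    FactorResult("biometric", True),
    FactorResult("possession", True),
]

print(authenticate(spoof_only))       # False
print(authenticate(voice_and_token))  # True
```

Counting distinct factor types, rather than the number of passed checks, is the design choice that keeps a spoofed voice and a spoofed face from adding up to a successful login.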
Brett McDowell, executive director of the FIDO (Fast IDentity Online) Alliance, agrees that, “voice recognition is vulnerable to a presentation attack; where the adversary records a sample of the targeted user's physical characteristics and uses that to produce an imposter copy or ‘spoof’ of that user's biometrics.”
He also agreed that it is relatively easy for hackers to get biometric information. “We leave fingerprints on most everything we touch, and both our images and voices are easily recorded without our knowledge or permission,” he said.
But he said biometrics can still be an effective layer of security if they are, as FIDO standards specify, “limited in scope to only the first of a two-step process that also requires physical possession of the authorized user's personal device.”
That would mean an end to what George Avetisov, CEO of HYPR, terms “centralized authentication,” where a biometric identifier is stored in a database and “an individual's information is compared against a whole library of others’ similar information at each authentication request.”
With a decentralized system – the one recommended by FIDO – “there is no central storage of biometric data,” he said. When users “enroll” a biometric, like voice, “they do so locally and it is encrypted and stored on-device.
“The hackers would not only have to re-create the voice of the target, they would also have to have physical access to the person’s mobile device, which is exponentially more difficult and economically infeasible,” he said.
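The two-step process McDowell and Avetisov describe can be sketched roughly as follows. This is a simplified illustration, not the FIDO protocol: the `Device` and `Server` classes are hypothetical, the voice match is an exact byte comparison standing in for real speaker verification, and an HMAC over a shared key stands in for the public-key signature a real FIDO authenticator would use (so that the server would hold no secret at all).

```python
import hashlib
import hmac
import secrets


class Device:
    """Hypothetical phone: the voice template and the key stay on the device."""

    def __init__(self, enrolled_template: bytes):
        self._template = enrolled_template
        self._key = secrets.token_bytes(32)

    def registration_key(self) -> bytes:
        # Simplification: real FIDO registers a public key instead of a shared secret.
        return self._key

    def respond(self, challenge: bytes, voice_sample: bytes):
        # Step 1: local biometric check (placeholder for a real fuzzy matcher).
        if not hmac.compare_digest(voice_sample, self._template):
            return None
        # Step 2: prove possession of this device by keying the server's challenge.
        return hmac.new(self._key, challenge, hashlib.sha256).digest()


class Server:
    """Stores one verification key per user and no biometric data."""

    def __init__(self):
        self._keys = {}

    def register(self, user: str, key: bytes) -> None:
        self._keys[user] = key

    def new_challenge(self) -> bytes:
        return secrets.token_bytes(16)

    def verify(self, user: str, challenge: bytes, response) -> bool:
        if response is None or user not in self._keys:
            return False
        expected = hmac.new(self._keys[user], challenge, hashlib.sha256).digest()
        return hmac.compare_digest(expected, response)


phone = Device(b"alice-voiceprint")
server = Server()
server.register("alice", phone.registration_key())

challenge = server.new_challenge()
genuine = server.verify("alice", challenge, phone.respond(challenge, b"alice-voiceprint"))

# The attacker has cloned the voice perfectly but holds a different device.
attacker_phone = Device(b"alice-voiceprint")
spoofed = server.verify("alice", challenge, attacker_phone.respond(challenge, b"alice-voiceprint"))

print(genuine, spoofed)
```

In this sketch the cloned voice passes the attacker’s own local check, yet the server still rejects the response because it was not produced with the key held on the enrolled phone.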
The “warehousing” of personally identifiable information (PII), he said, needs to end, since it can result, and has resulted, in “a catastrophic data breach such as in the OPM (Office of Personnel Management) case,” in which the private data of more than 21 million current and former federal workers was compromised.
In short, experts say that if your voice data is stored in a central database rather than locally on your device, it will become relatively easy for hackers to get into your device, access your bank account and more.
Indeed, Stickland said the technology will likely reach the point where an attacker could even carry on a credible conversation using a target’s voice. “It’s called phishing and it happens every day over email and phone,” he said.
Avetisov agreed, saying voice spoofing will even be able to mimic individual speech characteristics like patterns, cadence and phrasing.
“Machine learning and artificial intelligence is advancing at an astonishing pace and it's only a matter of time before minor imperfections in such a system are identified and resolved,” he said.
But McDowell said machine learning can help the good guys as well. “We are in a new arms race between hackers who are trying to defeat biometrics with higher resolution spoofs, and the biometrics industry that keeps innovating both the sensitivity of their sensors as well as their PAD (Presentation Attack Detection) capabilities,” he said.
Those can include “having users blink when using a face recognition system or having them say a passphrase when using a voice recognition system, or having the fingerprint sensor read below the skin for characteristics that cannot be spoofed by a fake fingerprint.”
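One way the spoken-passphrase check might work is a randomly chosen phrase that expires quickly, so a recording made in advance is useless. The sketch below is an assumption-laden illustration, not any vendor’s PAD implementation: the word list, the ten-second lifetime and the function names are invented, and the transcript is assumed to come from a separate speech recognizer. It addresses replayed recordings only; a system that can synthesize arbitrary speech in real time would need additional checks.

```python
import secrets
import time

# Illustrative word list; a real system would draw from a much larger vocabulary.
WORDS = ["amber", "river", "falcon", "candle", "marble", "harbor", "violet", "copper"]


def new_challenge(n_words=3, ttl_seconds=10.0, now=None):
    """Create a random phrase the user must speak before it expires."""
    issued = time.monotonic() if now is None else now
    phrase = " ".join(secrets.choice(WORDS) for _ in range(n_words))
    return {"phrase": phrase, "expires": issued + ttl_seconds}


def check_response(challenge, transcript, now=None):
    """Accept only the exact phrase, ignoring case and extra spaces, inside the time window."""
    current = time.monotonic() if now is None else now
    if current > challenge["expires"]:
        return False
    normalized = " ".join(transcript.lower().split())
    return normalized == challenge["phrase"]


challenge = new_challenge(now=100.0)
in_time = check_response(challenge, challenge["phrase"].upper(), now=105.0)
too_late = check_response(challenge, challenge["phrase"], now=111.0)
old_recording = check_response(challenge, "yes", now=101.0)

print(in_time, too_late, old_recording)
```

The `now` parameter exists only so the timing can be exercised deterministically; in normal use the functions fall back to the monotonic clock.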
Those capabilities can be enhanced by “behavioral biometrics,” Stickland said, which include “a multitude of measurements about our behaviors,” including location, “how you hold your phone, the pace of your walking etc. Sophisticated machine and deep learning algorithms embedded in your phone will learn your movement, speech and other behavior patterns.”