What happens when you focus on 'everything but the words'
If you’ve ever taken a look at your interactions with a voice assistant like Alexa or Google Assistant, you’ll have noticed that you can not only see (and delete) transcripts of your commands and questions but also play back the actual voice recordings.
It’s an odd thing to experience, listening back to the songs you were requesting six months ago. Much more so than simply reading the transcripts. You get a real sense of ‘you’ on that day: maybe you were laughing with your family and it bleeds into the recording, or maybe you sound bored out of your skull.
That’s just the start, according to Rébecca Kleinberger, a research assistant at MIT Media Lab’s Opera of the Future group. For her PhD, she is combining research in neurology, physiology, music, voice coaching and more to look at “the subconscious clues we express each time we talk”, to make people more aware of their own voices and to build new experiences around the voice.
Over six years of researching our voices, she has studied the physical vibrations our bodies produce and how they relate to vibration therapy; why we don’t like the sound of our own voice; how we map our muscles to produce sounds (our ‘vocal posture’); and how deep learning can be used for real-time speaker identification, even across people speaking multiple languages.
With the rise of smart speakers in our homes, recording our voices every time we say those magic wake words, we thought it was worth examining what we convey through voice to humans and machines. And Kleinberger says that whether we realise it or not, “The human brain has evolved to be extremely good at analysing all those hidden elements from the voice. If they are detectable via machines, it means that somehow our brain detects them from other people’s voices.”
She suggests thinking about these unconscious elements as “the pheromones of the voice, acoustic pheromones almost, that influence us in many ways without us being aware.”
The signals your voice sends out
Who you’re talking to
Our voice changes depending on who we’re talking to and the context, so much so that researchers, and their algorithms, can tell if you’re talking to your mother, your boss or your friend. They can even detect the age of the person you’re talking to.
“Even when you’re trying to speak normally, when you speak to a young child, your voice changes,” says Kleinberger. “The prosody, the musicality that you use in your voice is very different to when you talk to adults.”
Your mood
We know that Amazon’s Alexa teams are already directing research into voice analysis that can detect, for example, when the person speaking is in a rush and needs information quickly. Google, too, sees emotional AI guided by the user’s mood as part of its future.
And it’s possible. In 2014, a team of computer scientists at the University of Michigan launched a smartphone app called Priori that was designed to monitor phone calls to detect early signs of mood changes in people suffering from bipolar disorder. “These pilot study results give us preliminary proof of the concept that we can detect mood states in regular phone calls by analysing broad features and properties of speech, without violating the privacy of those conversations,” the project lead Zahi Karam said.
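To make that a little more concrete, here’s a rough sketch, in Python with the librosa audio library, of the kind of content-free analysis Karam is describing. It isn’t Priori’s actual code and the file name is made up, but it shows how pitch, loudness and pace can be summarised from a recording without transcribing a single word:

```python
# A minimal sketch of content-free voice analysis, loosely in the spirit of
# tools like Priori (not the project's actual code). It summarises broad
# acoustic properties: pitch, loudness and pace, with no transcription.
import numpy as np
import librosa

def acoustic_summary(path):
    y, sr = librosa.load(path, sr=None, mono=True)

    # Fundamental frequency (pitch) track; NaN wherever no voicing is found.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )

    # Frame-level loudness (root-mean-square energy).
    rms = librosa.feature.rms(y=y)[0]

    # Rough proxy for speaking rate: detected speech onsets per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration = len(y) / sr

    return {
        "pitch_mean_hz": float(np.nanmean(f0)),
        "pitch_std_hz": float(np.nanstd(f0)),         # pitch variability
        "voiced_ratio": float(np.mean(voiced_flag)),  # fraction of frames with voicing
        "loudness_std": float(np.std(rms)),
        "onsets_per_sec": len(onsets) / duration,
    }

if __name__ == "__main__":
    print(acoustic_summary("call_clip.wav"))  # hypothetical recording
```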
How long your relationship will last
This one’s pretty crazy. In her TED Talk on why we don’t like our own voices, which I urge you to watch or listen to, Kleinberger points out that machine analysis of conversations between married couples can be used to predict if and when you will divorce.
Last year a team at the University of Southern California published a study showing that AI analysis of pitch, variation of pitch and intonation in conversations between 134 couples taking part in therapy had 79.3% accuracy at predicting marital outcomes, i.e. whether or not the relationship would last. That’s actually slightly higher than human experts, who were correct 75.6% of the time.
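For a sense of how a study like that is put together, here’s a toy sketch (not the USC team’s actual features or model) of the general recipe: reduce each couple’s conversations to a handful of prosodic numbers, then train a simple classifier and measure its cross-validated accuracy. The data below is random stand-in data, so it scores around chance; the real study fed in features extracted from the therapy recordings.

```python
# A toy version of the pipeline behind studies like USC's couples work:
# one row of prosodic summary features per couple, one label for whether
# the relationship lasted, and a simple classifier evaluated by
# cross-validation. The numbers here are random placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 134 couples, six prosodic features each
# (e.g. mean pitch, pitch variability, intonation range per partner).
X = rng.normal(size=(134, 6))
y = rng.integers(0, 2, size=134)  # 1 = couple still together at follow-up

clf = LogisticRegression(max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")  # ~0.5 on random data
```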
If you’re on your period… or pregnant
“I find the link between hormone levels and the voice fascinating,” says Kleinberger. “We know that it has an effect. We know that even our brains consciously detect it without us really being able to understand clearly. It is purely acoustic information that gives us clues about the speaker’s hormone level. I think this could have tremendous consequences, good or bad in terms of detection, ethics and spying.”
Multiple studies by Nathan Pipitone and Gordon Gallup have shown that listeners can detect where a female speaker is in her menstrual cycle, by asking male participants to rate voices in terms of attractiveness. Of the first, carried out in 2008 at the University at Albany, they wrote: “Results showed a significant increase in voice attractiveness ratings as the risk of conception increased across the menstrual cycle in naturally cycling women.”
Then there’s the fact, as Kleinberger says in her TED Talk, that one day voice assistants could know you’re pregnant before you do. Again, this is very much based on real research.
Studies in 2012 at the Hospital Italiano de Buenos Aires, Argentina, and in 2008 at the Beirut Medical Center, Lebanon, compared the voices of pregnant women with those of control groups. The Beirut study found both similarities and differences: “There were no significant differences in the incidence of vocal symptoms in pregnant women versus controls. However, vocal fatigue was more prevalent in the pregnant group. With respect to the acoustic parameters, there was a significant decrease in the MPT (maximum phonation time) at term.”
Dr. Alexa will hear you now
One person you might actually want to share your day-to-day conversations with is your doctor. The medical practice of listening to the body has been around for thousands of years – and when it comes to voice, it’s in the middle of an update.
“The word auscultation was used by the ancient Greeks,” says Kleinberger. “It’s this notion of understanding the body by the sound coming from the body. So basically when the doctor puts the stethoscope to listen to your heart, that’s a type of auscultation. Using the voice as an auscultation tool has been used for heart disease and lung disease for a long time, listening to the breathiness of the patient’s voice. But now we’re really starting to do machine modulated auscultation.”
Recent research into which illnesses we can detect via voice has included studies into depression and Parkinson’s. Max Little, a mathematician and now an associate professor at Aston University, found that by analysing people’s voices from a 30-second phone call, algorithms could pick up subtle turbulences and changes of texture. These sound natural to the human ear, but Little was able to use them to detect Parkinson’s disease early on, with accuracy rates around 99%. Kleinberger explains that it is very difficult for a person suffering from Parkinson’s to hold a conversation, as their voice gets tired and is not as easy to control as a non-sufferer’s.
When it comes to depression, the conventional wisdom is that people simply speak more slowly. “It’s actually more complex than that but it still has to do with the tempo,” says Kleinberger. “For the work I’ve done on depression, what’s interesting is the variation of tempo from one word to another, and of individual syllables within those words.”
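Here’s a rough illustration of that tempo-variation idea, again a hedged sketch in Python with librosa rather than Kleinberger’s research code: it treats detected onsets as a crude stand-in for syllables and measures not just how fast someone speaks, but how much the gaps between those events fluctuate.

```python
# A crude illustration of tempo variability: onset detection stands in for
# syllable boundaries, and the coefficient of variation of the gaps between
# onsets captures how irregular the speaking pace is. Illustrative only,
# not a clinical measure.
import numpy as np
import librosa

def tempo_variability(path):
    y, sr = librosa.load(path, sr=None, mono=True)
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    gaps = np.diff(onsets)  # time between successive speech events

    return {
        "events_per_sec": len(onsets) / (len(y) / sr),  # overall tempo
        "gap_mean_s": float(np.mean(gaps)),
        "gap_cv": float(np.std(gaps) / np.mean(gaps)),  # irregularity of pace
    }

print(tempo_variability("speech_sample.wav"))  # hypothetical recording
```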
We know that Silicon Valley’s ambitions when it comes to health know no bounds. Alphabet has its own Verily Life Sciences spin-off building glucose-monitoring contact lenses and medical-grade health watches. Amazon, meanwhile, appears to have a secret health and wellness team within its Alexa division, under the suitably vague name of Alexa Domains.
In theory, an always-on, ambient smart home assistant that knows we’re sick before we do and can prompt us to seek out medical attention, with specific reasons, could save millions of lives.
“With home devices, what can they do with the voice recordings that might be either good or bad for us? I’m not sure,” says Kleinberger. “If they help us, say ‘maybe you should go see a doctor for a check up’ – maybe that could be good. If they use it for company profits, maybe that’s slightly less good. That’s why informing the public of the potential, whether that’s what happens today or in two, five or ten years – I think it’s so important to know. We talk about data all the time; well, you have a lot of data in your voice. It’s not only what you say.”
The future is mimicry
Indeed, when we talk about privacy in the smart home in 2018, we focus purely on the content of these voice recordings and transcripts: what they might be worth to advertisers, and how they can be combined with the other information Amazon, Google and Apple collect about us from services like Google Maps, Gmail, Amazon Prime, iTunes, Google search, iOS and Android.
Add to that a growing public suspicion that everyone from Facebook and Instagram to Amazon is listening in on our conversations via smartphone microphones (the tech companies in question deny it), and patents illustrating that Amazon has designs on all our conversations, not just those beginning with a wake word.
The other half of the puzzle when it comes to voice assistants, says Kleinberger, is the voices of Alexa, Assistant and Siri. We know that tech companies choose friendly, mostly female voices that are “not too high pitched but not too low pitched” and “not too dominating”. Right now we don’t speak to them like we speak to other humans – I have a very specific ‘instructing voice’ I use for Alexa and Google Assistant. It’s slightly louder, slightly sterner, verging on patronising and I usually look at the smart speaker or device when I’m talking to check it’s caught it.
“Most of the research and most of the population shows that we use a different voice when we talk to the machine,” she says. “I haven’t done research on this and it’s not proven but I suspect that it’s closer to how a rude, snobbish person would talk to a waiter in a restaurant. Would there be an interest for companies to go further and pass the uncanny valley? It’s a little bit of a Turing test in terms of vocal texture. What would it take to create an answering voice that is enough to make us consider this technology as human?”
This is a growing area of research, beginning with child-directed speech and moving on to robot- and machine-directed speech; Kleinberger has recently been spending time at the San Diego Zoo to study animal-directed and cross-species speech. But, she says, it’s not clear how a voice assistant that sounds much more like a human, and that could analyse our emotions in real time via our voice, would benefit… humans.
“It’s going in the direction of mimicry. As humans, we do unconscious mimicry of other people’s vocal parameters all the time. Accents and stuttering are contagious. Would that be a good thing or a bad thing if machines start doing that?
“I think pretty soon technology is going to go in this direction. If people are aware that it’s happening, it doesn’t have to be a bad thing. Maybe it will render some of this technology slightly less frustrating. If your voice is obviously upset, the machine detects that and instead of saying ‘oh, you sound upset’, if the machine starts speaking in the same vocal mode as you, is that going to help or not? If it goes into more of a manipulation mode, it’s a tricky question but an interesting one.”
In the end, it comes down to the balance of power between you and the multi-billion dollar company that developed the voice assistant you’re talking to. These companies know the power of the signals our voices send out, so the more we’re aware of what humans and machines can detect from our voices, and broadcast from their own, the better.