“I hear voices,” or does Siri have a face?
We hear voices all the time: in the subway, in navigation apps and on our smartphones. There is no doubt that the voices in the subway belong to real people, but the question of who voices virtual assistants and robots may soon stop having such a clear answer.
Voice actors need not fear for their jobs just yet, though: even the BB-8 droid from Star Wars was voiced with the help of Bill Hader, a former cast member of the famous American show Saturday Night Live on NBC. More on all of this in today’s article.
Almost everyone has heard the American version of Siri, but few people realize that the voice belongs to a real person: professional voice actress Susan Bennett. The actress herself had no idea while recording that her voice would end up speaking from every pocket. The recording was made for a text-to-speech company whose technology Apple later used for Siri.
In 2005, Susan spent 20 hours a week in the recording studio, and they were very stressful hours: she had to take frequent breaks, drink a lot of water and recite absolute nonsense made up of unrelated words. For the sounds to be combined later into natural-sounding words, every possible combination of sounds in the language has to be recorded. The update of the recordings in 2011 took a full four months, even though the “voice of Siri” worked only two hours a day.
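Why every combination of sounds has to be recorded can be illustrated with a small sketch. It assumes a simplified scheme in which a word is built from pairs of adjacent sounds; the three-sound inventory, the clip file names and all function names are hypothetical and are not taken from any real Siri or ScanSoft tooling.

```python
from itertools import product

# Hypothetical toy inventory of sounds; a real language has several dozen.
PHONEMES = ["s", "i", "r"]

def all_pairs(phonemes):
    """Every ordered pair of sounds the voice actor would need to cover."""
    return [a + "-" + b for a, b in product(phonemes, repeat=2)]

def pairs_for(word_phonemes):
    """Split a sequence of sounds into the adjacent pairs needed to build it."""
    return [a + "-" + b for a, b in zip(word_phonemes, word_phonemes[1:])]

def synthesize(word_phonemes, recorded):
    """Concatenate recorded clips; fail if a needed pair was never recorded."""
    needed = pairs_for(word_phonemes)
    missing = [p for p in needed if p not in recorded]
    if missing:
        raise KeyError(f"no recording for: {missing}")
    return [recorded[p] for p in needed]

# Pretend each pair of sounds maps to one audio clip (names are made up).
recorded = {p: f"clip_{p}.wav" for p in all_pairs(PHONEMES)}

print(len(recorded))                                # 9 pairs for 3 sounds
print(synthesize(["s", "i", "r", "i"], recorded))   # clips for s-i, i-r, r-i
```

Even this toy inventory needs nine clips, and the number grows with the square of the inventory size, which gives a feel for why the sessions were so long and the scripts so nonsensical.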
Susan Bennett herself talks more about Siri and how the recording went in her TED talk.
The actress is concerned about how poorly voice actors’ rights are protected: their voices can be used for any purpose, and they receive no additional payment even for commercial use like this.
The British male version of Siri, Daniel, was voiced by TV and radio host Jon Briggs, who also did not know his voice would be used for Siri until he saw the commercial on TV. He, too, recorded his voice for ScanSoft in 2005. The company later merged with Nuance, which worked with Apple to develop Siri. Jon recorded 5,000 sentences in three weeks, but unlike Susan, he is quite satisfied with the fee he received.
Women versus men
The actress who records the voice for Google Now, by contrast, prefers not to show her face, but the recording process itself has been shown on camera.
The actress notes that the process is quite difficult, since she has to speak at the same tempo and in the same timbre throughout. The voice must not change at any point in the recording, and the intonation has to stay correct. At Google this is monitored by a team made up of a linguist and a stage-speech specialist, which ultimately makes the speech sound more natural.
With Microsoft’s Cortana the situation is completely different: the image and name of the virtual assistant were borrowed from the Halo series of games. The voice work therefore went to the same actress who voices the character of that name in the video games. Jen Taylor knew exactly what the recordings would be used for, made no secret of her role, and even played Cortana in the miniseries “Halo 4: Forward Unto Dawn” in 2012.
Most virtual assistants speak in a female voice or have female names. Some even see this as digital sexism. However, research shows that users themselves choose a female voice more often. People think it sounds friendlier, while a male voice is perceived as more aggressive.
This, of course, is not always the case: intonation and timbre play an important role. How differently male voices can be perceived is clear from Mark Zuckerberg’s home virtual assistant. The assistant is called Jarvis, and with Morgan Freeman’s voice it comes across as a very courteous and well-mannered system.
We ride, we ride, we ride
Even more people encounter synthesized voices when using navigation apps. The male voice of Yandex.Navigator was recorded by a professional announcer, while the female version was recorded by a company employee. The recording took only 3 hours, and the text fit on 4 sheets, which is very little compared with the voice work for virtual assistants.
The navigator builds its sentences from separate words, but whole phrases had to be spoken during recording so that the result sounds more natural. For the Olympics, Vasily Utkin was invited to voice the navigator; he spent several hours in the studio and recorded 160 phrases. Only 120 are used in the navigator, but the creators promised to swap some of them from time to time to add variety to the trip. Vasily even came up with some of the phrases himself.
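How a navigator might assemble a spoken instruction from separately stored pieces can be sketched roughly as follows. This is only an illustration under simple assumptions: the fragment keys, the English wording and the function names are all hypothetical, and plain text stands in for the audio clips cut from the recorded phrases.

```python
# Hypothetical fragment library; text stands in for recorded audio clips.
FRAGMENTS = {
    "in": "In",
    "100": "one hundred",
    "200": "two hundred",
    "500": "five hundred",
    "meters": "meters",
    "turn_left": "turn left",
    "turn_right": "turn right",
}

def build_prompt(distance_m, maneuver):
    """Return the ordered fragment keys for one spoken instruction."""
    keys = ["in", str(distance_m), "meters", maneuver]
    unknown = [k for k in keys if k not in FRAGMENTS]
    if unknown:
        # Nothing was recorded for this piece, so the prompt cannot be spoken.
        raise KeyError(f"no recorded fragment for: {unknown}")
    return keys

def render(keys):
    """Join the fragments in order, as a player would queue the clips."""
    return " ".join(FRAGMENTS[k] for k in keys)

print(render(build_prompt(200, "turn_left")))
```

A small library like this covers many instructions with few recordings, which fits the modest size of the navigator script compared with a full assistant voice.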
Subway announcements have their own peculiarities. For example, the first recordings of the modern metro voices were made more than 20 years ago, which means they were recorded on reels of tape. The actors therefore had no room for error: if a mistake was made, everything had to be re-recorded from the beginning. Even now, adding new information to a recording means re-recording the announcements for the entire line.
And it is not only Siri that has a face, but the Moscow metro too. In fact, it has three: the actors and radio and TV presenters Yulia Romanova-Kutina, Sergey Kulikovskikh and Alexey Rossoshansky. For various holidays, celebrities or children are brought in to voice the announcements. Ordinary people can also influence what exactly the voices say in the subway. For example, after activists objected to the phrase “Please vacate the cars”, it was changed to “Please exit the car.”
In the near future, however, speech synthesis will work very differently thanks to a system developed at Google. WaveNet does not assemble speech from fragments of recorded human voices: the program generates the sound wave itself, modeling it with convolutional neural networks.
Besides voice, it can even imitate music. For now the technology is still quite expensive, since training the networks and processing the recordings takes a lot of resources and time, but already 50% of the listeners in the test group took WaveNet’s speech for a human voice. In the future it will be possible to imitate the voice and intonation of any person, although training will still require voice recordings of real people.
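For readers curious about what “convolutional” means here, the published WaveNet design is built from stacked dilated causal convolutions, in which each output sample depends only on current and earlier samples. The sketch below shows just that one building block in plain Python; the fixed weights, the four-layer stack and the function names are illustrative assumptions, and there is no training and none of the real model’s other components.

```python
def causal_dilated_conv(x, weights, dilation):
    """y[t] = w_past * x[t - dilation] + w_now * x[t]; earlier samples count as 0."""
    w_past, w_now = weights
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t - dilation >= 0 else 0.0
        out.append(w_past * past + w_now * x[t])
    return out

def receptive_field(dilations, kernel_size=2):
    """How many input samples can influence a single output sample."""
    return 1 + (kernel_size - 1) * sum(dilations)

# Assumed toy stack: the dilation doubles at every layer.
DILATIONS = [1, 2, 4, 8]

def stack(x, dilations=DILATIONS, weights=(0.5, 0.5)):
    """Apply the layers one after another (illustrative fixed weights)."""
    for d in dilations:
        x = causal_dilated_conv(x, weights, d)
    return x

# A single impulse at t=5 affects outputs from t=5 onward, never earlier ones.
signal = [0.0] * 5 + [1.0] + [0.0] * 24
response = stack(signal)
affected = [t for t, v in enumerate(response) if v != 0.0]

print(receptive_field(DILATIONS))   # 16
print(affected[0], affected[-1])    # 5 20
```

Doubling the dilation lets four small layers cover 16 samples, which hints at how such a network can take a long stretch of audio into account while still producing the wave one sample at a time.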