Google software engineers have announced Tacotron 2, a new version of the company’s text-to-speech (TTS) synthesis system that comes the closest yet to replicating human speech.
In a Google Research Blog post, engineers Jonathan Shen and Ruoming Pang explain that their approach generates human-like speech from text using neural networks trained only on speech examples and their corresponding text transcripts.
Tacotron 2, which draws on earlier Tacotron and WaveNet projects, maps a sequence of letters to a sequence of features that encode the audio.
These features, an 80-dimensional audio spectrogram with frames computed every 12.5 milliseconds, capture not only the pronunciation of words but also various subtleties of human speech, including volume, speed, and intonation.
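The numbers quoted above pin down a specific feature geometry. As a rough illustration only (this is not Google's code, and `spectrogram_shape` is a hypothetical helper), the frame rate and feature dimensions work out like this:

```python
# A minimal sketch (not Google's implementation) of the feature geometry
# described above: 80 spectrogram bins per frame, one frame every 12.5 ms,
# for audio at a 24 kHz sample rate.
SAMPLE_RATE = 24_000      # waveform sample rate (Hz)
FRAME_SHIFT_MS = 12.5     # one spectrogram frame every 12.5 ms
N_FEATURES = 80           # 80-dimensional spectrogram frames

# Number of audio samples the analysis advances between frames.
hop_length = int(SAMPLE_RATE * FRAME_SHIFT_MS / 1000)  # 300 samples

def spectrogram_shape(num_samples: int) -> tuple:
    """(frames, feature bins) of the feature matrix for a waveform."""
    num_frames = 1 + num_samples // hop_length
    return (num_frames, N_FEATURES)

# One second of 24 kHz audio yields 81 frames of 80 features each.
print(hop_length)                      # 300
print(spectrogram_shape(SAMPLE_RATE))  # (81, 80)
```

So each second of speech is summarized by roughly 80 compact feature frames, which the second stage of the system then has to expand back into 24,000 waveform samples.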
The features are then converted to a 24 kHz waveform. So does it work?
The authors, writing on behalf of the Google Brain and Machine Perception Teams, provide a series of audio samples so you can decide for yourself.
“In an evaluation where we asked human listeners to rate the naturalness of the generated speech, we obtained a score that was comparable to that of professional recordings,” the blog says. However, the quest for human-like TTS is not over yet:
“While our samples sound great, there are still some difficult problems to be tackled. For example, our system has difficulties pronouncing complex words (such as ‘decorum’ and ‘merlot’), and in extreme cases it can even randomly generate strange noises.
“Also, our system cannot yet generate audio in realtime. Furthermore, we cannot yet control the generated speech, such as directing it to sound happy or sad. Each of these is an interesting research problem on its own.”
A full description of the new system can be found here.