This AI can clone your voice just by listening for 5 seconds

A discussion of a cutting-edge AI that clones human voices from minimal data

Photo by Owen Beard on Unsplash

This post covers a fairly recent improvement in the field of AI-based voice cloning. Existing methods can clone a particular voice when hours and hours of recordings of it are at our disposal. This recent breakthrough does the same with a minuscule amount of data: only five seconds of audio. The generated output has a timbre strikingly similar to the original voice, and the system can synthesize sounds and consonants that do not exist in the original audio sample, constructing them on its own. You can listen to some generated samples here.

Here is what the interface looks like:

Image source can be accessed here (Reference[2])

The detailed diagram of the architecture is given below. The method relies on the following three components.

Description of the architecture used as seen in the paper

1. The Speaker Encoder

It is a neural network trained on thousands of speakers that squeezes the information learned from the training data into a compressed representation. In other words, it learns the essence of human speech from a multitude of speakers. It uses the training audio to pick up the intricacies of human speech, but this training needs to be done only once. After that, five seconds of speech is enough to replicate the voice of a previously unseen speaker.
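To make the idea of a compressed representation more concrete, here is a minimal sketch. It assumes the representation is a fixed-length vector (an embedding) and that clips from the same speaker should land close together under cosine similarity. The vectors below are made-up toy numbers, not real encoder outputs, and the function names are only illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so only its direction matters."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


# Toy 4-dimensional "embeddings" (assumption: invented values for illustration).
speaker_a_clip1 = [0.9, 0.1, 0.3, 0.0]
speaker_a_clip2 = [0.8, 0.2, 0.4, 0.1]
speaker_b_clip1 = [0.0, 0.9, 0.1, 0.7]

same = cosine_similarity(speaker_a_clip1, speaker_a_clip2)
diff = cosine_similarity(speaker_a_clip1, speaker_b_clip1)
print(f"same speaker: {same:.3f}, different speakers: {diff:.3f}")
```

In this toy setup the two clips from speaker A score much higher than the pair from different speakers, which is the property a speaker encoder is trained to produce.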

2. The Synthesizer

It takes as input the text we want our synthesized voice to say, together with the speaker encoder's representation of the target voice, and returns a Mel Spectrogram. A Mel Spectrogram is a concise representation of one’s voice and intonation. This part of the network is implemented using Google’s Tacotron 2 technique.
The diagram below shows examples of Mel Spectrograms for male and female speakers. On the left we have spectrograms of reference recordings of the voice we want to replicate, and on the right we have the piece of text that we want our synthesized voice to say and its corresponding synthesized spectrogram.

Mel Spectrogram for training and synthesized data as seen in the paper
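For readers unfamiliar with the "Mel" part of a Mel Spectrogram, the sketch below shows one commonly used formula for converting between hertz and the mel scale, and how evenly spaced mel bands end up unevenly spaced in hertz. This is only an illustration: other variants of the formula exist, and the band count and frequency range here are arbitrary assumptions, not the settings used in the paper.

```python
import math


def hz_to_mel(freq_hz):
    """Convert frequency in Hz to mels (one common variant of the formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_centers(n_bands, f_min_hz, f_max_hz):
    """Center frequencies (Hz) of bands spaced evenly on the mel scale."""
    mel_min, mel_max = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (mel_max - mel_min) / (n_bands + 1)
    return [mel_to_hz(mel_min + step * (i + 1)) for i in range(n_bands)]


# Assumption: 8 bands between 0 Hz and 8 kHz, chosen only for readability.
centers = mel_band_centers(n_bands=8, f_min_hz=0.0, f_max_hz=8000.0)
for index, center in enumerate(centers):
    print(f"band {index}: {center:.0f} Hz")
```

The bands are packed tightly at low frequencies and spread out at high ones, roughly mirroring how human hearing resolves pitch. That is why a Mel Spectrogram can be a compact yet perceptually useful description of a voice.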

3. The Neural Vocoder

Finally, to listen to the learned voices we need to output a waveform. This is done by the Neural Vocoder component, which is implemented using DeepMind’s WaveNet technique.
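The way the three components hand data to one another can be sketched as follows. Every function here is a hypothetical placeholder that returns dummy values; none of them is a real model or the authors' code, and the sizes are arbitrary. The point is only the data flow: the encoder sees the reference audio, the synthesizer sees the text plus the speaker representation, and the vocoder sees only the spectrogram.

```python
from typing import List

Waveform = List[float]
Embedding = List[float]
MelSpectrogram = List[List[float]]  # frames x mel bands

# Assumptions: arbitrary toy sizes, not values from the paper.
EMBEDDING_DIM = 8
N_MEL_BANDS = 4
SAMPLES_PER_FRAME = 3


def speaker_encoder(reference_audio: Waveform) -> Embedding:
    """Placeholder: summarize a short clip as a fixed-length vector."""
    level = sum(abs(s) for s in reference_audio) / max(len(reference_audio), 1)
    return [level] * EMBEDDING_DIM


def synthesizer(text: str, speaker: Embedding) -> MelSpectrogram:
    """Placeholder: one dummy spectrogram frame per character of text."""
    bias = speaker[0]
    return [[bias + (ord(ch) % 7) / 7.0] * N_MEL_BANDS for ch in text]


def vocoder(mel: MelSpectrogram) -> Waveform:
    """Placeholder: expand every frame into a few audio samples."""
    return [frame[0] for frame in mel for _ in range(SAMPLES_PER_FRAME)]


def clone_voice(reference_audio: Waveform, text: str) -> Waveform:
    speaker = speaker_encoder(reference_audio)  # runs on the reference clip
    mel = synthesizer(text, speaker)            # text + speaker -> spectrogram
    return vocoder(mel)                         # spectrogram -> waveform


five_second_clip = [0.0, 0.5, -0.5, 0.25]  # stand-in for real audio samples
audio = clone_voice(five_second_clip, "hello")
print(f"generated {len(audio)} dummy samples")
```

Note that the text never reaches the speaker encoder and the reference audio never reaches the vocoder directly, which is what lets the same trained components be reused for a new voice.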

Measuring Similarity and Naturalness

Our goal is to output speech that sounds like the target person but says something very different from the input sample, in a natural manner. From the table below we can see that swapping the training and the test data drastically changes the naturalness and similarity of the synthesized voices. The paper describes in detail how to work around these difficulties. The authors report their results as a Mean Opinion Score (MOS), a standard subjective metric in which human listeners rate the samples; here it indicates how well a cloned voice sample would pass as authentic human speech.

Similarity and Naturalness metrics as seen in the paper
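As a rough illustration of how a Mean Opinion Score is computed, the sketch below averages listener ratings and attaches an approximate 95% confidence interval. The ratings are invented for the example, and the 1 to 5 scale and the normal-approximation interval are assumptions made here, not details taken from the paper's evaluation.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean rating, half-width of an approximate 95% interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.mean(ratings)
    standard_error = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * standard_error  # normal approximation


# Assumption: made-up listener ratings on a 1 (bad) to 5 (excellent) scale.
naturalness_ratings = [4, 5, 3, 4, 4, 5, 4, 3]
mos, half_width = mean_opinion_score(naturalness_ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```

A higher score means listeners found the samples more natural (or, for the similarity test, closer to the target speaker), and the interval gives a sense of how much the raters disagreed.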


This technology holds a lot of promise for the future, such as generating voices for people who have lost theirs to degenerative diseases. It might also be used unscrupulously to clone the voices of authority figures and world leaders for malicious purposes. The only way out of this would be to have proper techniques to detect whether a voice is natural or synthesized.


[1] Ye Jia et al., Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis (2018), NeurIPS 2018.


Disclaimer: I was not a part of the project and am merely providing commentary on the topic to the best of my understanding. Full credit for the research goes to the authors mentioned in the references, and I highly recommend checking out their paper.