Note: HiFi-GAN is used as the vocoder. For the best experience, listen to the audio samples with headphones.

Abstract

Voiced by our Grad-TTS model:

Recently, denoising diffusion probabilistic models and generative score matching have shown high potential in modelling complex data distributions, while stochastic calculus has provided a unified point of view on these techniques, allowing for flexible inference schemes. In this paper we introduce Grad-TTS, a novel text-to-speech model with a score-based decoder that produces mel-spectrograms by gradually transforming noise predicted by the encoder and aligned with the text input by means of Monotonic Alignment Search. The framework of stochastic differential equations helps us generalize conventional diffusion probabilistic models to the case of reconstructing data from noise with different parameters, and allows us to make this reconstruction flexible by explicitly controlling the trade-off between sound quality and inference speed. Subjective human evaluation shows that Grad-TTS is competitive with state-of-the-art text-to-speech approaches in terms of Mean Opinion Score.

Illustration of the mel-spectrogram reconstruction process over 50 refinement steps:

Note: the Grad-TTS-50 model is considered. This section shows intermediate mel-spectrogram states from the score-based decoder to illustrate how reverse diffusion works.

Text: Here are the match lineups for the Colombia Haiti match.

Decoder state

Mel-spectrogram sample

Vocoded audio

Encoder outputs (mean of terminal distribution)

After sampling from terminal distribution

After 10 iterations

After 20 iterations

After 30 iterations

After 40 iterations

After 50 iterations
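The refinement steps shown above can be sketched as a backward Euler solve of the probability-flow ODE, starting from a sample of the terminal distribution N(µ, I). The function and parameter names, the linear noise schedule values, and the `score_fn` signature below are illustrative assumptions, not the official Grad-TTS API:

```python
import numpy as np

def reverse_diffusion(mu, score_fn, n_steps=50, beta_0=0.05, beta_1=20.0, rng=None):
    """Euler sketch of Grad-TTS-style reverse diffusion (probability-flow ODE).

    mu       : encoder outputs = mean of the terminal distribution N(mu, I)
    score_fn : estimate of the data log-density gradient, called as score_fn(x, mu, t)
    n_steps  : number of refinement iterations (e.g. 50 for Grad-TTS-50)
    """
    rng = rng or np.random.default_rng()
    x = mu + rng.standard_normal(mu.shape)       # sample the terminal distribution
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - (i + 0.5) * h                  # integrate backward from t=1 to t=0
        beta_t = beta_0 + t * (beta_1 - beta_0)  # linear noise schedule (assumed)
        # One backward Euler step of dX = 0.5 * (mu - X - score) * beta_t * dt
        x = x - 0.5 * h * beta_t * (mu - x - score_fn(x, mu, t))
    return x
```

Each iteration nudges the current state along the estimated score, which is why the intermediate spectrograms above grow progressively sharper as the step count increases.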

Comparison of generalized DPM vs. standard DPM:

Our generalized DPM framework, which uses the text encoder outputs µ as the mean of the decoder's terminal distribution, reduces the number of reverse diffusion steps (backward ODE solver iterations) needed for high-quality mel-spectrogram generation. To show the difference, we trained an additional Grad-TTS model that reconstructs mel-spectrograms from the standard normal terminal distribution N(0, I).

Note: the Grad-TTS decoder network, which models the gradient of the data log-density, is still conditioned on the text encoder outputs for both models.
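The intuition behind the speed-up can be illustrated numerically: a terminal sample from N(µ, I) starts, on average, much closer to the target µ than a sample from N(0, I), so the reverse trajectory has less ground to cover. This toy sketch (with a hypothetical spectrogram size and a stand-in encoder output) makes the gap explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 80 * 200                   # hypothetical flattened mel-spectrogram size
mu = rng.normal(1.0, 1.0, dim)   # stand-in for text encoder outputs

# Terminal samples for the two models:
x_general = mu + rng.standard_normal(dim)   # generalized DPM: N(mu, I)
x_standard = rng.standard_normal(dim)       # standard DPM:    N(0, I)

# Mean squared distance the reverse trajectory must cover to reach mu:
d_general = np.mean((x_general - mu) ** 2)    # ~ 1 per coordinate
d_standard = np.mean((x_standard - mu) ** 2)  # ~ 1 + E[mu^2] per coordinate
```

With the same number of solver steps, the N(µ, I) model therefore spends its step budget on fine detail rather than on closing this initial gap.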

Text:

Does the quick brown fox jump over the lazy dog?

There were others less successful.

Grad-TTS-10 with N(µ, I)

Grad-TTS-10 with N(0, I)

Grad-TTS-20 with N(0, I)

Grad-TTS-50 with N(0, I)

Sampling mel-spectrograms with different temperature:

Note: Grad-TTS-10 model is considered.

Text:

It was said in last year's Democratic platform.

When he appeared before the Commission, Michael Paine lifted the blanket.

𝜏 = 1.0

𝜏 = 1.5

𝜏 = 3.0
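The temperature 𝜏 can be understood as shrinking the variance of the terminal distribution around the encoder outputs, e.g. sampling from N(µ, 𝜏⁻¹I). The sketch below follows that convention; the exact scaling used in a given implementation may differ:

```python
import numpy as np

def sample_terminal(mu, tau=1.0, rng=None):
    """Hypothetical sketch: draw the terminal sample with temperature tau.

    Larger tau concentrates the sample around the encoder outputs mu,
    here via N(mu, I / tau), typically trading diversity for smoothness.
    """
    rng = rng or np.random.default_rng()
    return mu + rng.standard_normal(mu.shape) / np.sqrt(tau)
```

With 𝜏 = 1.0 the model samples from the full terminal distribution, while 𝜏 = 3.0 starts the reverse diffusion much closer to the encoder's average prediction.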

Controllable speech tempo:

Text: Scientists at the CERN laboratory say they have discovered a new particle.

1.5×

1.25×

1×

0.75×

0.5×
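Tempo control of this kind is typically implemented by rescaling the per-token durations predicted by the duration predictor before the encoder outputs are expanded to frame level. A minimal sketch, with hypothetical names and the assumption of a log-domain duration predictor:

```python
import numpy as np

def scale_durations(log_durations, speed=1.0):
    """Sketch: control speech tempo by rescaling predicted token durations.

    speed > 1 shortens durations (faster speech); speed < 1 lengthens them.
    """
    durations = np.exp(log_durations)   # duration predictor works in log-domain (assumed)
    durations = durations / speed       # e.g. speed=1.5 -> 1.5x faster
    # Round to integer frame counts, keeping at least one frame per token
    return np.clip(np.round(durations), 1, None).astype(int)
```

The total number of generated mel frames then shrinks or grows in proportion to the chosen speed factor, as in the 0.5×–1.5× samples above.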

Examples from MOS evaluation:

Note: typically, only one Grad-TTS model is trained. In the notation Grad-TTS-N, N denotes the total number of timesteps used during inference to synthesize samples. The larger N is, the more accurately the reverse diffusion trajectories are expected to be restored, resulting in better sound quality.

Text:

In all these lines the facts are drawn together by a strong thread of unity.

After the construction and action of the machine had been explained, the doctor asked the governor what kind of men he had commanded at Goree.

After a few years of active exertion the Society was rewarded by fresh legislation.