img

Notice détaillée

Conventional and contemporary approaches used in text to speech synthesis: a review

Article Ecrit par: Kaur, Navdeep ; Singh, Parminder ;

Résumé: Nowadays speech synthesis or text to speech (TTS), an ability of system to produce human like natural sounding voice from the written text, is gaining popularity in the field of speech processing. For any TTS, intelligibility and naturalness are the two important measures for defining the quality of a synthesized sound which is highly dependent on the prosody modeling using acoustic model of synthesizer. The purpose of this review survey is firstly to study and analyze the various approaches used traditionally (articulatory synthesis, formant synthesis, concatenative speech synthesis and statistical parametric techniques based on hidden Markov model) and recently (statistical parametric based on deep learning approaches) for acoustic modeling with their pros and cons. The approaches based on deep learning to build the acoustic model has significantly contributed to the advancement of TTS as models based on deep learning are capable of modelling the complex context dependencies in the input data. Apart from these, this article also reviews the TTS approaches for generating speech with different voices and emotions to makes the TTS more realistic to use. It also addresses the subjective and objective metrics used to measure the quality of the synthesized voice. Various well known speech synthesis systems based on autoregressive and non-autoregressive models such as Tacotron, Deep Voice, WaveNet, Parallel WaveNet, Parallel Tacotron, FastSpeech by global tech-giant Google, Facebook, Microsoft employed the architecture of deep learning for end-to-end speech waveform generation and attained a remarkable mean opinion score (MOS).


Langue: Anglais