Generating high quality synthetic speech using GAN as vocoder post-processing

  • Лэюань Шэн

Student thesis: Master's Thesis


In speech synthesis and enhancement systems, Mel spectrograms must be generated precisely as acoustic representations. However, the generated spectrograms are over-smoothed, which prevents them from producing high-quality synthesized speech. To address this issue, and inspired by image-to-image translation, a learning-based post-filter is proposed that combines pix2pixHD and ResUnet to reconstruct super-resolution Mel spectrograms for high-quality speech synthesis. The resulting super-resolution spectrogram network generates enhanced spectrograms, which in turn produce high-quality synthesized speech. Our proposed model increases the mean opinion score (MOS) by 0.44 and 0.17 over the baseline scores of 3.29 and 3.84 when using the Griffin-Lim and WaveNet vocoders, respectively.
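As a small worked check of the reported figures, the sketch below adds each stated MOS increase to its baseline to obtain the implied score after post-filtering. The dictionary layout and the helper name `improved_mos` are illustrative assumptions, not part of the thesis. Only the four numbers come from the abstract.

```python
# Baseline MOS and reported increase per vocoder, as stated in the abstract.
# The data layout and function name are hypothetical, chosen for illustration.
REPORTED = {
    "Griffin-Lim": {"baseline": 3.29, "increase": 0.44},
    "WaveNet": {"baseline": 3.84, "increase": 0.17},
}


def improved_mos(baseline: float, increase: float) -> float:
    """Return the MOS implied by a baseline plus its reported increase."""
    # Round to two decimals to match the precision used in the abstract.
    return round(baseline + increase, 2)


results = {
    vocoder: improved_mos(v["baseline"], v["increase"])
    for vocoder, v in REPORTED.items()
}

for vocoder, score in results.items():
    print(f"{vocoder}: {score:.2f}")
```

Under these figures the implied scores are 3.73 for Griffin-Lim and 4.01 for WaveNet.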
Date of Award: Jun 2019
Original language: English
Awarding Institution
  • Programming Section
Supervisor: Евгений Николаевич Павловский
