Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition

Olga Yakovenko, Ivan Bondarenko

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Abstract

For many Automatic Speech Recognition (ASR) tasks, spectrogram audio features give better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use because of the high dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for the spoken commands in the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.
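
The abstract does not spell out the model's architecture or training loss. As a rough illustration only, the sketch below shows the latent-variable step that standard VAEs share: sampling an embedding with the reparameterization trick and penalizing it with a KL term against a standard normal prior. The function names, the squared-error reconstruction term, and the use of plain Python lists are assumptions made for this example, not details taken from the paper; only the embedding size of 13 comes from the abstract.

```python
import math
import random

LATENT_DIM = 13  # embedding size quoted in the abstract for 25 ms fragments


def sample_latent(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var):
    """Squared reconstruction error plus the KL regularizer (assumed loss form)."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return reconstruction + kl_to_standard_normal(mu, log_var)


# Hypothetical encoder output: a posterior that already equals the prior.
rng = random.Random(0)
mu = [0.0] * LATENT_DIM
log_var = [0.0] * LATENT_DIM
z = sample_latent(mu, log_var, rng)
print(len(z), kl_to_standard_normal(mu, log_var))
```

In a full model, `mu` and `log_var` would be produced by a convolutional encoder from a spectrogram fragment and `x_hat` by a decoder applied to `z`; both networks are omitted here.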

Original language: English
Title of host publication: Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings
Editors: Wil M. van der Aalst, Vladimir Batagelj, Alexey Buzmakov, Dmitry I. Ignatov, Anna Kalenkova, Michael Khachay, Olessia Koltsova, Andrey Kutuzov, Sergei O. Kuznetsov, Irina A. Lomazova, Natalia Loukachevitch, Ilya Makarov, Amedeo Napoli, Alexander Panchenko, Panos M. Pardalos, Marcello Pelillo, Andrey V. Savchenko, Elena Tutubalina
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 115-126
Number of pages: 12
ISBN (Print): 9783030712136
Publication status: Published - 2021
Event: 9th International Conference on Analysis of Images, Social Networks, and Texts, AIST 2020 - Virtual, Online
Duration: 15 Oct 2020 to 16 Oct 2020

Publication series

Name: Communications in Computer and Information Science
Volume: 1357 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 9th International Conference on Analysis of Images, Social Networks, and Texts, AIST 2020
City: Virtual, Online
Period: 15.10.2020 to 16.10.2020

Keywords

  • Audio feature representation
  • Speech recognition
  • Variational autoencoder

OECD FOS+WOS

  • 1.02 COMPUTER AND INFORMATION SCIENCES
  • 1.01 MATHEMATICS
