Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems

Olga Yakovenko, Ivan Bondarenko

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Research › peer-review

Abstract

For many Automatic Speech Recognition (ASR) tasks, spectrogram audio features show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use because of the high dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.

Original language: English
Title of host publication: Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings
Editors: Carlos Martín-Vide, Miguel A. Vega-Rodríguez, Miin-Shen Yang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 54-66
Number of pages: 13
ISBN (Electronic): 978-3-030-63000-3
ISBN (Print): 9783030629991
DOIs
Publication status: Published - 2020
Event: 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020 - Taoyuan, Taiwan, Province of China
Duration: 7 Dec 2020 → 9 Dec 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12494 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020
Country: Taiwan, Province of China
City: Taoyuan
Period: 07.12.2020 → 09.12.2020

Keywords

  • Audio feature representation
  • Speech recognition
  • Variational Autoencoder
