TY - GEN
T1 - Convolutional Variational Autoencoders for Spectrogram Compression in Automatic Speech Recognition
AU - Yakovenko, Olga
AU - Bondarenko, Ivan
N1 - Publisher Copyright:
© 2021, Springer Nature Switzerland AG.
PY - 2021
Y1 - 2021
AB - For many Automatic Speech Recognition (ASR) tasks, audio features in the form of spectrograms show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use due to the high dimensionality of the feature space. The following paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.
KW - Audio feature representation
KW - Speech recognition
KW - Variational autoencoder
UR - http://www.scopus.com/inward/record.url?scp=85107369094&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-71214-3_10
DO - 10.1007/978-3-030-71214-3_10
M3 - Conference contribution
AN - SCOPUS:85107369094
SN - 9783030712136
T3 - Communications in Computer and Information Science
SP - 115
EP - 126
BT - Recent Trends in Analysis of Images, Social Networks and Texts - 9th International Conference, AIST 2020, Revised Supplementary Proceedings
A2 - van der Aalst, Wil M.
A2 - Batagelj, Vladimir
A2 - Buzmakov, Alexey
A2 - Ignatov, Dmitry I.
A2 - Kalenkova, Anna
A2 - Khachay, Michael
A2 - Koltsova, Olessia
A2 - Kutuzov, Andrey
A2 - Kuznetsov, Sergei O.
A2 - Lomazova, Irina A.
A2 - Loukachevitch, Natalia
A2 - Makarov, Ilya
A2 - Napoli, Amedeo
A2 - Panchenko, Alexander
A2 - Pardalos, Panos M.
A2 - Pelillo, Marcello
A2 - Savchenko, Andrey V.
A2 - Tutubalina, Elena
PB - Springer Science and Business Media Deutschland GmbH
T2 - 9th International Conference on Analysis of Images, Social Networks, and Texts, AIST 2020
Y2 - 15 October 2020 through 16 October 2020
ER -