Convolutional Variational Autoencoders for Audio Feature Representation in Speech Recognition Systems

Olga Yakovenko, Ivan Bondarenko

Research output: Publications in books, reports, collections, conference proceedings; article in conference proceedings; scientific; peer-reviewed

Abstract

For many Automatic Speech Recognition (ASR) tasks, spectrogram audio features show better results than Mel-frequency Cepstral Coefficients (MFCC), but in practice they are hard to use because of the high dimensionality of the feature space. This paper presents an alternative approach to generating a compressed spectrogram representation, based on Convolutional Variational Autoencoders (VAE). A Convolutional VAE model was trained on a subsample of the LibriSpeech dataset to reconstruct short fragments of audio spectrograms (25 ms) from a 13-dimensional embedding. The trained model for a 40-dimensional (300 ms) embedding was used to generate features for a corpus of spoken commands from the GoogleSpeechCommands dataset. Using the generated features, an ASR system was built and compared to a model with MFCC features.

Original language: English
Title of host publication: Theory and Practice of Natural Computing - 9th International Conference, TPNC 2020, Proceedings
Editors: Carlos Martín-Vide, Miguel A. Vega-Rodríguez, Miin-Shen Yang
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 54-66
Number of pages: 13
ISBN (electronic): 978-3-030-63000-3
ISBN (print): 9783030629991
DOI
Status: Published - 2020
Event: 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020 - Taoyuan, Taiwan, Province of China
Duration: 7 Dec 2020 - 9 Dec 2020

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 12494 LNCS
ISSN (print): 0302-9743
ISSN (electronic): 1611-3349

Conference

Conference: 9th International Conference on Theory and Practice of Natural Computing, TPNC 2020
Country/Territory: Taiwan, Province of China
City: Taoyuan
Period: 07.12.2020 - 09.12.2020
