Paper

End-to-end attention-based large vocabulary speech recognition

Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) Systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures were trained to model sequences of characters [1,2]. To our knowledge, all these approaches relied on Connectionist Temporal Classification [3] modules. We investigate an alternative method for sequence modelling based on an attention mechanism that allows a Recurrent Neural Network (RNN) to learn alignments between sequences of input frames and output labels. We show how this setup can be applied to LVCSR by integrating the decoding RNN with an n-gram language model and by speeding up its operation by constraining selections made by the attention mechanism and by reducing the source sequence lengths by pooling information over time. Recognition accuracies similar to other HMM-free RNN-based approaches are reported for the Wall Street Journal corpus.

2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)Published 2016-03-01Paper linkPDF

Authors: Dzmitry Bahdanau · Jan Chorowski · Dmitriy Serdyuk · Philemon Brakel · Yoshua Bengio

Topics

Relevant entities

People

Related coverage

Linked coverage will appear here.

Related events

Linked events will appear here.

Related discussions

Related discussion nodes will appear here.