-
【10】Towards End-to-End Speech Recognitionwith Recurrent Neural Networks.pdf下载
资源介绍
This paper presents a speech recognition system
that directly transcribes audio data with text,
without requiring an intermediate phonetic representation.
The system is based on a combination
of the deep bidirectional LSTM recurrent neural
network architecture and the Connectionist Temporal
Classification objective function. A modification
to the objective function is introduced
that trains the network to minimise the expectation
of an arbitrary transcription loss function.
This allows a direct optimisation of the word error
rate, even in the absence of a lexicon or language
model. The system achieves a word error
rate of 27.3% on the Wall Street Journal corpus
with no prior linguistic information, 21.9% with
only a lexicon of allowed words, and 8.2% with a
trigram language model. Combining the network
with a baseline system further reduces the error
rate to 6.7%.