Deep Learning Models for English Speech Recognition System
Authors: Harika Thokala and Dr. Manisha N Rathod
Abstract:
This paper presents a comparative analysis of deep learning models for English speech recognition, including Convolutional Neural Networks (CNN), Long Short-Term Memory networks (LSTM), and Transformer networks. The models were trained on a variety of English speech datasets spanning multiple noise levels and accents, using MFCC and spectrogram features. Experimental results indicate that the Transformer model performs best, achieving a Word Error Rate (WER) of 8.9% and a Character Error Rate (CER) of 4.1%, outperforming the LSTM (WER: 11.3%, CER: 5.6%) and the CNN (WER: 14.8%, CER: 7.2%). The Transformer also remains the most robust under heavy noise, with a WER of 16.4% compared to 21.6% for the LSTM and 27.3% for the CNN. These results indicate that attention-based models offer substantially better recognition accuracy and robustness under real-world conditions.
Keywords:
Speech Recognition, Deep Learning, CNN, LSTM, Transformer.
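The WER and CER figures reported in the abstract are both edit-distance-based metrics: WER is computed over word sequences and CER over character sequences. The following is a minimal illustrative sketch of how such metrics are typically computed via Levenshtein distance; it is a generic implementation for clarity, not the evaluation code used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions,
    deletions, and substitutions each cost 1)."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def wer(reference, hypothesis):
    """Word Error Rate: edit distance over word sequences,
    normalized by the number of reference words."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: edit distance over character sequences,
    normalized by the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, a hypothesis that substitutes one word in a three-word reference yields a WER of 1/3. In practice, published results often use established tooling for these metrics rather than a hand-rolled implementation.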