Service Items

Speech recognition AI solution

Speech recognition, also known as automatic speech recognition (ASR), converts the lexical content of human speech into computer-readable input such as keystrokes, binary codes, or character sequences. It differs from speaker recognition and speaker verification, which attempt to identify or confirm who is speaking rather than what is being said.


A speech recognition system generally works in two stages: training and decoding. In training, an acoustic model is trained on a large amount of labeled speech data. In decoding, speech data from outside the training set is converted into text using the acoustic model together with a language model. The quality of the acoustic model directly affects recognition accuracy.
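The decoding stage can be illustrated with a minimal sketch. It assumes a handful of candidate transcriptions that already carry acoustic log-scores, and a toy unigram language model; all names (`UNIGRAM_PROB`, `lm_logprob`, `decode`, `lm_weight`) and all numbers are hypothetical and chosen only for illustration. Real decoders search a much larger hypothesis space.

```python
import math

# Toy unigram language model P(word). The values are invented for illustration.
UNIGRAM_PROB = {
    "recognize": 0.02,
    "speech": 0.03,
    "wreck": 0.001,
    "a": 0.05,
    "nice": 0.01,
    "beach": 0.005,
}


def lm_logprob(words):
    """Log-probability of a word sequence under the toy unigram model."""
    return sum(math.log(UNIGRAM_PROB[w]) for w in words)


def decode(acoustic_scores, lm_weight=1.0):
    """Pick the hypothesis with the best combined score.

    acoustic_scores maps each candidate word sequence (a tuple) to a
    hypothetical acoustic log-score; the language model score is added
    with a tunable weight.
    """
    return max(
        acoustic_scores,
        key=lambda words: acoustic_scores[words] + lm_weight * lm_logprob(words),
    )


# Two acoustically similar candidates with made-up acoustic log-scores.
acoustic_scores = {
    ("recognize", "speech"): -12.0,
    ("wreck", "a", "nice", "beach"): -11.5,
}

best = decode(acoustic_scores)
print(" ".join(best))
```

In this sketch the acoustic score alone slightly favors the second candidate, and the language model is what tips the decision toward the more plausible word sequence.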


Speech recognition technology has been in development for decades, but the major breakthrough came in 2009, when Hinton introduced deep learning into the field. In essence, the traditional Gaussian mixture model (GMM) is replaced by a deep neural network (DNN). The traditional GMM pipeline extracts speech features through multiple processing steps (as shown in the left figure), while the DNN learns speech features directly (as shown in the right figure), which eliminates much of that work. A DNN does not need to make assumptions about the distribution of the speech data, and does not need to split the features into separate streams for fitting. Its input can splice adjacent speech frames into a single vector that carries temporal structure, so more context is used at the frame level. As a result, the relative recognition error rate fell by more than 20%, a larger gain than the combined improvements of many preceding years. The key change is that the hand-crafted features modeled by the GMM are replaced with richer features learned by the DNN. Since then, within the deep learning framework, better models such as RNNs and LSTMs, along with more training data, have further improved results. Deep learning has enabled speech recognition accuracy to reach 99%, which is high enough for use in practical scenarios beyond experimental tests, and the technology is now widely deployed commercially.
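Frame splicing, the step that gives a DNN its frame-level context, can be sketched in a few lines. This is a minimal illustration rather than any particular toolkit's implementation: the function name `splice_frames`, the `left` and `right` context sizes, and the choice to pad the edges by repeating the first and last frame are all assumptions.

```python
def splice_frames(frames, left=2, right=2):
    """Concatenate each frame with its neighbors into one context vector.

    frames is a list of equal-length feature vectors (lists of floats).
    Edges are padded by repeating the first or last frame (an assumption
    made for this sketch).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp to a valid frame index
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Four toy 2-dimensional frames; the values are arbitrary.
frames = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1]]
spliced = splice_frames(frames, left=1, right=1)

for vector in spliced:
    print(vector)
```

With one frame of context on each side, every 2-dimensional frame becomes a 6-dimensional input vector, so the network sees how the features change over time instead of a single isolated frame.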