Hong Kong researchers present a phonetic-semantic pre-training model for robust speech recognition

Automatic speech recognition (ASR) has become one of the most common forms of human-machine interaction, thanks to the proliferation of Internet of Things (IoT) devices. However, ASR remains difficult in real-world scenarios because of background noise and speaker variation such as tone, accent, and gender. The speech signal picked up by a microphone is usually distorted by background noise, and speakers may pronounce words differently from the norm reflected in the lexicon and training data, which leads the ASR system to misclassify them.

In Chinese ASR, moreover, many words share identical pronunciations (homophones), which makes it harder to pick the correct word from the candidates, especially when the speech is contaminated by noise.
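To make the ambiguity concrete, here is a minimal sketch in Python. The lexicon is a toy example assumed for illustration (it is not from the paper), and the `candidates` helper is a hypothetical name; the point is only that a single toned syllable can map to several characters, so the sound alone cannot settle which one was spoken.

```python
# Toy pronunciation lexicon (illustrative entries only, not from the paper):
# toned pinyin syllable -> characters that share that pronunciation.
LEXICON = {
    "shi4": ["是", "事", "市", "试"],
    "ma3": ["马", "码"],
}


def candidates(syllable):
    """Return every character in the toy lexicon with this pronunciation."""
    return LEXICON.get(syllable, [])


# One recognized syllable, four possible characters: context must decide.
print(candidates("shi4"))
```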

The acoustic model (AM) converts the speech waveform into a sequence of phones. The language model (LM) then decodes the AM output into a natural-language sentence. Together, these two components make up the bulk of a conventional ASR pipeline. For Chinese words that sound alike, a traditional LM is not robust enough to cope with noisy AM output, and a second LM decoding pass is unlikely to correct a flawed first pass. Some existing work proposes training the ASR system on intentionally noisy speech data to address these problems.

By studying failure cases of typical ASR systems, the researchers observed that most misclassified words can be recovered from context, provided the LM can extract semantics from the contaminated input. Conventional two-pass LM decoding, however, does not use context information when handling the phone sequence, and it propagates errors from one stage to the next. It is therefore better to convert the AM output directly into a sentence using the full context.

The researchers found that using a Transformer can make the LM more resilient to noisy phone sequences. Pre-training also helps: as the recent success of pre-trained models on several natural language processing (NLP) tasks suggests, it provides a more powerful LM that can convert a corrupted phone sequence into the intended sentence.

A new study by the Hong Kong University of Science and Technology and WeBank proposes training a decoder model that converts a phone sequence into a sentence by mapping phones to words.

The proposed framework combines novel phone-perturbation heuristics, self-supervised training techniques, and massive amounts of unpaired text data to pre-train its models. The researchers first apply data augmentation that simulates acoustic-model output errors during training, so that the model becomes more robust to them. They then feed the phone sequences generated by the acoustic model on different speech datasets into the pre-trained model and compare the results with those of a traditional ASR system.
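The article does not spell out the perturbation heuristics, so the following Python sketch only illustrates the general idea of corrupting clean phone sequences to imitate acoustic-model mistakes. Everything in it is an assumption made for illustration: the `CONFUSABLE` table, the `perturb_phones` function, the 80/20 split between substitution and deletion, and the pinyin-style phone labels. It is not the authors' method.

```python
import random

# Hypothetical confusion table (illustrative pairs only): each phone maps to
# phones that an acoustic model might plausibly mistake it for.
CONFUSABLE = {
    "zh": ["z"],
    "z": ["zh"],
    "sh": ["s"],
    "s": ["sh"],
    "n": ["l"],
    "l": ["n"],
    "in": ["ing"],
    "ing": ["in"],
}


def perturb_phones(phones, error_rate, rng):
    """Return a new list with simulated substitution and deletion errors.

    Each phone is corrupted with probability `error_rate`. A corrupted phone
    that has an entry in CONFUSABLE is usually substituted (assumed 80% of
    the time); otherwise it is deleted.
    """
    noisy = []
    for phone in phones:
        if rng.random() >= error_rate:
            noisy.append(phone)  # keep the phone unchanged
            continue
        if phone in CONFUSABLE and rng.random() < 0.8:
            noisy.append(rng.choice(CONFUSABLE[phone]))  # substitution
        # Falling through without appending models a deletion.
    return noisy


rng = random.Random(0)  # seeded so the corruption is reproducible
clean = ["zh", "ong", "g", "uo"]
print(perturb_phones(clean, 0.3, rng))
```

Pairs of (perturbed phones, original sentence) built this way could serve as training data for a model that has to recover the intended text from imperfect input. Because the text side needs no audio, this kind of augmentation fits the unpaired-text setting described above.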

Experiments on synthetically perturbed speech datasets with varying signal-to-noise ratios (SNRs) demonstrate the proposed framework's robustness to environmental noise. The team also ran experiments on two independent real-world datasets and compared the models against several industry-standard ASR baselines. The results show that the models outperformed the latest ASR pipelines, achieving relative character error rate (CER) reductions of 28.63% and 26.38%, respectively.
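For readers unfamiliar with the metric, the sketch below shows the standard way character error rate and a relative reduction are computed: CER is the Levenshtein edit distance between the reference and the hypothesis, divided by the reference length. The function names are chosen for illustration, and the numbers in the usage line are made up; they are not results from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (two-row dynamic program)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline_cer, model_cer):
    """Relative CER reduction of a model over a baseline, as a fraction."""
    return (baseline_cer - model_cer) / baseline_cer


# Made-up example: a baseline CER of 10% improved to 7.5%.
print(relative_reduction(0.10, 0.075))
```

A relative reduction is measured against the baseline's own error rate, so a figure such as 28.63% means the error rate fell by that fraction of the baseline CER, not by 28.63 percentage points.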

In future work, the researchers plan to improve noise-robust LM pre-training by exploring more effective Phonetic-Semantic Pre-training (PSP) techniques with larger unpaired datasets.

This article is a research summary written by Marktechpost Staff based on the research paper 'A Phonetic-Semantic Pre-Training Model for Robust Speech Recognition'. All credit for this research goes to the researchers on this project.
