Speech and Language laboratory

Section 2.1 Speech and Language laboratory

The speech and language research group in SCSE was founded in 2007 by Chng Eng Siong and Prof Li Haizhou (now at CUHK-Shenzhen, China). The group is now located in the HESL Lab (N4-B2b-05) in SCSE. We also founded the AISG Speech Lab, which has been funded by NRF since 2018.

Subsection 2.1.1 Research Focus

Our research interests are primarily speech and language processing, and classification using machine learning (ML):

ASR and LLM
1. Using LLM to improve ASR by generative error correction: see Hyporadise
2. Code-switch multi-lingual speech recognition: see Audio to Byte
3. Robust large-vocabulary continuous speech recognition (LVCSR): joint end-to-end ASR with a speech enhancement module, wav2vec 2.0, speaker extraction
4. Speech enhancement: speaker extraction, denoising, feature enhancement, overlapping speech extraction
5. Faster decoding with end-to-end models and real-time Android-based decoders
6. Transfer learning: adapting large trained acoustic models (16 kHz) to 8 kHz models
Classification
1. Noisy audio event and scene classification, audio captioning: DCASE
2. Speaker identification and speaker diarization: diarization, VAD, and speaker extraction issues; see Microsoft diarization approach
3. Deep fake detection (and generation): Link
Towards speech understanding: some aspects of NLP such as topic detection, named entity recognition (NER), and text normalization. See a demo of our ASR for ATC speech with NER highlighting: ATC with NER

Relevant work in this research area includes sequence-to-sequence models, which have been widely studied in machine translation. The problems we are keen on include:

Code-switch end-to-end ASR and adaptation -> how to improve the model in a given target environment (speaker, noise, type of dialogue, etc.). Code-switch End-to-end
Classification -> what type of sound is this? Audio Scene and Event Analysis
Speaker ID -> who spoke (including under overlapping conditions) and when (diarization).
Speech enhancement -> speaker extraction and dereverberation.
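
The "who spoke and when, under overlapping conditions" problem above can be made concrete with a small sketch. It is only an illustration, not the group's system: it assumes a diarizer has already produced labelled time segments, and the `Segment` type and `overlapping_regions` helper are hypothetical names introduced here. It sweeps over segment boundaries and reports the regions where two or more different speakers are active at once.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Segment:
    """One diarization segment (hypothetical format): a speaker label and a time span in seconds."""
    speaker: str
    start: float
    end: float


def overlapping_regions(segments: List[Segment]) -> List[Tuple[float, float, Tuple[str, ...]]]:
    """Return (start, end, speakers) for every region where 2+ different speakers are active."""
    # Build boundary events; kind 0 = segment end, kind 1 = segment start.
    events = []
    for seg in segments:
        events.append((seg.start, 1, seg.speaker))
        events.append((seg.end, 0, seg.speaker))
    # At equal times, process ends before starts so touching segments do not count as overlap.
    events.sort(key=lambda e: (e[0], e[1]))

    active = {}  # speaker -> number of currently open segments
    regions = []
    prev_time = None
    for time, kind, speaker in events:
        # The interval [prev_time, time) has a constant set of active speakers.
        if prev_time is not None and time > prev_time and len(active) >= 2:
            speakers = tuple(sorted(active))
            if regions and regions[-1][1] == prev_time and regions[-1][2] == speakers:
                # Extend the previous region when the speaker set did not change.
                regions[-1] = (regions[-1][0], time, speakers)
            else:
                regions.append((prev_time, time, speakers))
        if kind == 1:
            active[speaker] = active.get(speaker, 0) + 1
        else:
            active[speaker] -= 1
            if active[speaker] == 0:
                del active[speaker]
        prev_time = time
    return regions


if __name__ == "__main__":
    # Made-up example segments, not real data.
    demo = [
        Segment("A", 0.0, 5.0),
        Segment("B", 4.0, 8.0),
        Segment("A", 7.5, 10.0),
    ]
    for start, end, speakers in overlapping_regions(demo):
        print(f"{start:.1f}s to {end:.1f}s: {', '.join(speakers)}")
```

In practice, regions flagged this way are the ones where speaker extraction or overlapped-speech handling would be needed before speaker identification.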

Subsection 2.1.2 Demos

Some of our previous work:

YouTube recordings of our code-switch speech recognition in action:
1. Recognizing English/Mandarin code-switch speech using our LVCSR system (June 2018).
2. Comparing our system against Google and Siri (Sep 2018).
Source separation: separating Hillary Clinton's and Trump's voices in a YouTube recording, from Chenglin's demo slides (Oct 2018)
Speech indexing using our MAGOR system (code-switch English/Mandarin and Malay system)
A demo of our ASR for ATC speech with NER highlighting: ATC with NER

Subsection 2.1.3 Our recent demos using our speech engine

2020 FYP demos:

Deploying a speech recognition system on a highly available and scalable Kubernetes cluster: YouTube
Chatbot framework using Dialogflow and various Q&A modules (2020 demo): YouTube, and a live demo: Demo

Subsection 2.1.4 Some of our recent work in git

PhD student Hou Nana's work at NTU (2018~2021), single-channel speech enhancement, github
PhD student Xu Chenglin's work at NTU (2015~2020), single-channel speech separation/extraction, github
Intern GeMeng's work (intern from Tianjin, 2020~2021), speech separation tutorial, github
Intern Shangeth's work (intern from BITS, Aug 2020 to June 2021), accent, age, and height classification, PDF link
MSAI student Samuel Samsudin (2020~2021), emotion detection, github repository, kaggle IEMOCAP
Language identification by EEE PhD student Liu Hexin (2021), github link
Intern Shashank Shirol's work (Jan to June 2020), using GANs to create noisy speech, github repository