PORTFOLIO | Dongying LIU
A Study of Pitch Range Estimation Based on Neural Network
Partners: Tiantian Li, Jingyi Wang, Jiaxing Jiang
My role: team leader, programmer
May 2018 to July 2019
Research Background
What is pitch range:
- Pitch range is the span between the highest and lowest pitch a person can produce when speaking; it is determined physiologically and varies from person to person.
Four tones in Mandarin and their relationship with pitch range:
- There are four distinctive tones in Mandarin Chinese.
- Each tone moves within its own sub-range of the speaker's personal pitch range.
Motivation & Idea
- As students at Beijing Language and Culture University, which hosts many international learners of Chinese, we noticed a common problem in teaching Chinese as a foreign language: mispronunciation of the four Mandarin tones.
- Plenty of apps help foreign learners study Chinese. However, existing techniques cannot offer users a correct pronunciation matched to their actual pitch range. For example, when a high-pitched female learner pronounces a tone incorrectly, an app can only play a correct example recorded by a man with a deep voice, which is of little help when she tries to imitate it and correct her pronunciation.
Traditional Method
About 4 h of acoustic signal
Manual computation
Complicated and time-consuming
Our Method
A single short-term acoustic signal of about 300 ms
Neural network
Accurate and fast
Goal
- Find the neural network with the best performance on pitch range estimation.
- Produce a detailed system design.
- Build a precise API for the best-performing model.
Technical Workflow
Step 1: Dataset preparation
Based on the AISHELL open-source Mandarin Speech Corpus and the Mandarin Speech Corpus Recorded by Japanese Chinese Language Learners (made by the BLCU Speech Acquisition and Intelligent Technology Lab), I used Praat to extract the ceiling, floor, average, and variance of every recording, and used Kaldi on a Linux server to extract the Fbank features. I then used Python to arrange them into the proper form for the training and testing datasets.
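The arrangement step above can be sketched as follows. This is an illustrative example, not the project's actual script; the helper name, array shapes, and toy values are assumptions.

```python
# Sketch: pairing Kaldi Fbank features with Praat pitch statistics
# to form one training example. Shapes and values are assumed.
import numpy as np

def build_example(fbank_frames, pitch_stats):
    """fbank_frames: (n_frames, n_filters) Fbank features from Kaldi.
    pitch_stats: (ceiling, floor, average, variance) measured with Praat."""
    X = np.asarray(fbank_frames, dtype=np.float32)
    y = np.asarray(pitch_stats, dtype=np.float32)
    return X, y

# Toy data: 30 frames of 40-dim Fbank features plus one pitch-range target.
X, y = build_example(np.random.randn(30, 40), (310.0, 95.0, 180.0, 55.0))
print(X.shape, y.shape)  # (30, 40) (4,)
```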
The prepared dataset for one utterance
Step 2: Building and optimizing the DNN, RNN, and LSTM
My main role was building and optimizing the DNN, RNN, and LSTM models with Keras, based on TensorFlow.
- Increase the number of hidden layers and neurons
- Use regularization to prevent overfitting
- Try different activation functions
- Adjust the learning rate
- Add early stopping
- ...
While applying these techniques to optimize the DNN, RNN, and LSTM separately, I used TensorBoard to visualize the training process, diagnose problems, and improve accuracy. I tried nearly 90 different combinations of hyperparameters.
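A search of roughly 90 combinations can be enumerated as a simple grid; the specific values below are assumptions for illustration, not the project's actual search space.

```python
# Sketch: enumerating hyperparameter combinations with itertools.
# The grid values are assumed; 3 * 3 * 2 * 5 = 90 combinations.
from itertools import product

grid = {
    "hidden_layers": [1, 2, 3],
    "neurons": [128, 256, 512],
    "dropout": [0.5, 0.8],
    "learning_rate": [1e-2, 3e-3, 1e-3, 1e-4, 1e-5],
}
combos = list(product(*grid.values()))
print(len(combos))  # 90
```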
The Best Performed LSTM Model
Optimal Models Comparison
- After comparing the models and choosing the optimal one, I enlarged the dataset to further optimize the chosen LSTM (2 hidden layers, 512 neurons, 0.8 dropout). Eventually, the best-performing model reached 98% accuracy.
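The chosen configuration can be sketched in Keras as below. The input shape, the number of output targets, and the interpretation of "0.8 dropout" as a Keras dropout rate (rather than a keep probability) are all assumptions for illustration.

```python
# Sketch of the chosen LSTM (2 hidden layers, 512 neurons, 0.8 dropout).
# Input shape and output targets are assumed, not the project's values.
from tensorflow import keras
from tensorflow.keras import layers

N_FRAMES = 30  # e.g. ~300 ms of speech at a 10 ms frame shift (assumed)
N_FBANK = 40   # Fbank filterbank channels per frame (assumed)

model = keras.Sequential([
    layers.Input(shape=(N_FRAMES, N_FBANK)),
    layers.LSTM(512, return_sequences=True),
    layers.Dropout(0.8),
    layers.LSTM(512),
    layers.Dropout(0.8),
    layers.Dense(2),  # e.g. pitch-range floor and ceiling (assumed targets)
])
model.compile(optimizer="adam", loss="mse")
```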
- To prove that short-time acoustic signals processed with a deep neural network are capable of estimating a speaker's pitch range, I fed Fbank features covering 50 ms, 100 ms, and 300 ms into the LSTM.
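For a sense of how much input each duration provides, the frame counts can be worked out from standard Fbank framing; the 25 ms window and 10 ms shift below are common defaults and an assumption here, not values stated in the project.

```python
# Sketch: number of Fbank frames per input duration, assuming
# a 25 ms analysis window and a 10 ms frame shift (common defaults).
def n_frames(duration_ms, window_ms=25, shift_ms=10):
    if duration_ms < window_ms:
        return 0
    return 1 + (duration_ms - window_ms) // shift_ms

for d in (50, 100, 300):
    print(d, n_frames(d))  # 50 -> 3, 100 -> 8, 300 -> 28
```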
The graphs show that:
- Compared with the initial model, the accuracy of the optimized model is improved.
- As speech duration increases, the prediction error decreases continuously, but the differences remain within an acceptable range.