PORTFOLIO | Dongying LIU
A Study of Pitch Range Estimation Based on Neural Network
Partners: Tiantian Li, Jingyi Wang, Jiaxing Jiang
My role: team leader, programmer
May 2018 to July 2019
Research Background
What is pitch range:
- Pitch range is the span between the highest and lowest pitch a person can produce when speaking; it is determined physiologically and varies from person to person.
Four tones in Mandarin and their relationship with pitch range:
- Mandarin Chinese has four distinctive lexical tones.
- Each tone has its own region of variation within the speaker's personal pitch range.

Motivation & Idea
- As students of Beijing Language and Culture University, which hosts many international students learning Chinese, we noticed a common problem in teaching Chinese as a foreign language: mispronunciation of the four Mandarin tones.
- Plenty of apps help foreign learners study Chinese, but for technical reasons they cannot offer a reference pronunciation matched to the user's actual pitch range. For example, when a girl with a high-pitched voice mispronounces a tone, the app can only offer a correct example read by a man with a bass voice, which is little help when she imitates it to correct her pronunciation.

Traditional Method
- About 4 h of acoustic signal
- Manual computation
- Complicated and time-consuming
Our Method
- A single short-term acoustic signal of about 300 ms
- Neural network
- Accurate and fast
Goal
- Find the neural network with the best performance on pitch range estimation.
- Produce a detailed system design.
- Build a precise API around the best-performing model.
Technical Workflow

Step 1: Dataset preparation
Based on the AISHELL open-source Mandarin Speech Corpus and the Mandarin Speech Corpus Recorded by Japanese Chinese Language Learners (built by the BLCU Speech Acquisition and Intelligent Technology Lab), I used Praat to extract the ceiling, floor, mean, and variance of every recording, and used Kaldi on a Linux server to extract the Fbank features. I then used Python to arrange them into properly formatted training and testing datasets.


The prepared dataset for one speech
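As a minimal sketch of this final arrangement step (array shapes, the 40-dim Fbank features, and the 80/20 split are illustrative assumptions, not the lab's actual layout), each utterance's Fbank matrix is paired with its four Praat statistics and the corpus is split into training and testing sets:

```python
import numpy as np

# Toy corpus standing in for the real one: 10 utterances,
# each 30 frames of 40-dimensional Fbank features.
rng = np.random.default_rng(0)
features = [rng.random((30, 40)).astype(np.float32) for _ in range(10)]

# Praat statistics per utterance: ceiling, floor, mean (Hz), variance.
labels = [np.array([290.0, 110.0, 180.0, 35.0], dtype=np.float32)
          for _ in range(10)]

X = np.stack(features)   # inputs,  shape (10, 30, 40)
Y = np.stack(labels)     # targets, shape (10, 4)

# Simple 80/20 split into training and testing sets.
split = int(0.8 * len(X))
X_train, X_test = X[:split], X[split:]
Y_train, Y_test = Y[:split], Y[split:]
```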
Step 2: Building and optimizing DNN, RNN, and LSTM models
My main role was building and optimizing the DNN, RNN, and LSTM models with Keras on top of TensorFlow:
- Increase the number of hidden layers and neurons
- Use regularization to prevent overfitting
- Try different activation functions
- Adjust the learning rate
- Add early stopping
- ...
While applying these techniques to optimize the DNN, RNN, and LSTM separately, I used TensorBoard to visualize the training process, diagnose problems, and improve accuracy. I tried nearly 90 different hyperparameter combinations.
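The tuning knobs above can be sketched in Keras roughly as follows. This is an illustrative reconstruction, not the project's actual code: the 30-frame/40-dim input, the Adam optimizer, and the MSE loss on the four Praat statistics are assumptions; the 2 x 512 layout with 0.8 dropout mirrors the configuration that ultimately won out.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, callbacks

def build_lstm(n_frames=30, n_mels=40, units=512, dropout=0.8, lr=1e-3):
    """A tunable LSTM regressor over a window of Fbank frames."""
    model = models.Sequential([
        layers.Input(shape=(n_frames, n_mels)),
        layers.LSTM(units, return_sequences=True, dropout=dropout),
        layers.LSTM(units, dropout=dropout),
        # Regression targets: ceiling, floor, mean, variance of pitch.
        layers.Dense(4),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(lr), loss="mse")
    return model

# Early stopping and TensorBoard logging used during tuning:
cbs = [
    callbacks.EarlyStopping(monitor="val_loss", patience=5,
                            restore_best_weights=True),
    callbacks.TensorBoard(log_dir="logs"),
]
model = build_lstm()
# model.fit(X_train, Y_train, validation_data=(X_val, Y_val), callbacks=cbs)
```

Varying `units`, `dropout`, `lr`, and the layer count while watching the TensorBoard curves is what the roughly 90 hyperparameter combinations amounted to.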


The best-performing LSTM model



Comparison of Optimal Models
- After comparing the candidates and choosing the optimal model, I enlarged the dataset to further optimize the chosen LSTM (2 hidden layers, 512 neurons, 0.8 dropout). The best-performing model eventually reached 98% accuracy.
- To prove that short-time acoustic signals processed with a deep neural network are capable of estimating a speaker's pitch range, I fed Fbank features covering 50 ms, 100 ms, and 300 ms into the LSTM.
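As a sketch, the three input durations can be cut from one utterance's Fbank matrix like this (assuming Kaldi's default 10 ms frame shift; the 40-dim features are an assumption):

```python
import numpy as np

FRAME_SHIFT_MS = 10  # Kaldi's default frame shift

def window_frames(fbank, duration_ms):
    """Return the first `duration_ms` worth of Fbank frames."""
    n_frames = duration_ms // FRAME_SHIFT_MS
    return fbank[:n_frames]

fbank = np.random.rand(100, 40)   # 1 s of 40-dim Fbank frames
for ms in (50, 100, 300):
    window = window_frames(fbank, ms)
    print(ms, window.shape)       # 5, 10, and 30 frames respectively
```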



The graphs show that:
- Compared with the initial model, the accuracy of the optimized model is improved.
- As speech duration increases, the prediction error decreases continuously, but the differences remain within an acceptable range.
To prove that the model can assist the teaching of Chinese as a foreign language, we also fed in the test dataset recorded by Japanese Chinese learners (built by the BLCU Speech Acquisition and Intelligent Technology Lab). The results are shown in the following figure:

Step 3: API Development & Detailed System Design
Based on Python, I built a precise API that loads our optimal model and estimates the pitch range from a single speech sample.
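The API idea can be sketched as follows; the function name and the output keys are illustrative, not the project's real interface. Given a trained model and one utterance's Fbank matrix, it returns the estimated pitch range statistics:

```python
import numpy as np

def estimate_pitch_range(model, fbank):
    """Estimate pitch range statistics for one speech sample.

    model: a trained Keras model (or any object with .predict).
    fbank: (n_frames, n_mels) Fbank matrix of one utterance.
    """
    pred = model.predict(fbank[np.newaxis, ...])[0]   # batch of one
    keys = ("ceiling", "floor", "mean", "variance")   # Hz / Hz^2
    return dict(zip(keys, map(float, pred)))

# In the real system the model would come from something like:
# model = tensorflow.keras.models.load_model("best_lstm.h5")
```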
