
A Study of Pitch Range Estimation Based on Neural Networks

Partners: Tiantian Li, Jingyi Wang, Jiaxing Jiang

My role: team leader, programmer

May 2018 to July 2019

Research Background


What is pitch range:

  • Pitch range is the span between the lowest and the highest pitch a person can reach when speaking. It is determined physiologically and varies from person to person.

Four tones in Mandarin and their relationship with pitch range:

  • There are four unique tones in Mandarin Chinese.

  • Each tone has its own pitch contour, and that contour is realized relative to the speaker's personal pitch range.
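One common way to express a tone relative to a speaker's pitch range is to map the fundamental frequency onto a five-level scale using logarithms, so that the speaker's floor maps to 0 and the ceiling maps to 5. The sketch below illustrates that idea only; the function name and the example frequencies are assumptions made for illustration and are not taken from this project.

```python
import math


def t_value(f0_hz: float, floor_hz: float, ceiling_hz: float) -> float:
    """Map a pitch value onto a 0-5 scale relative to a speaker's pitch range.

    Assumption: a log-scaled normalization is used, so equal musical
    intervals cover equal distances on the scale.
    """
    if not 0 < floor_hz < ceiling_hz:
        raise ValueError("expected 0 < floor_hz < ceiling_hz")
    if f0_hz <= 0:
        raise ValueError("f0_hz must be positive")
    span = math.log10(ceiling_hz) - math.log10(floor_hz)
    return 5 * (math.log10(f0_hz) - math.log10(floor_hz)) / span


# Hypothetical speakers: the same 200 Hz sits at very different
# relative heights depending on each speaker's own range.
low_voice = t_value(200.0, floor_hz=80.0, ceiling_hz=220.0)
high_voice = t_value(200.0, floor_hz=160.0, ceiling_hz=400.0)
```

With these made-up ranges, 200 Hz lies near the top of the first speaker's range and near the bottom of the second speaker's, which is why a reference pronunciation needs to be matched to the learner's own pitch range.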

声调英1.jpg
Motivation & Idea


  • As students of Beijing Language and Culture University, where many of our international friends are learning Chinese, we noticed a common problem in teaching Chinese as a foreign language: learners often mispronounce the four Mandarin tones.

  • There are plenty of apps that help learners study Chinese. However, because of technical limitations, they cannot offer users a reference pronunciation matched to their actual pitch range. For example, when a girl with a high-pitched voice pronounces a tone incorrectly, the app can only play the correct example read by a man with a bass voice, which is of little help when she tries to imitate it and correct her pronunciation.

语音项目_英文.jpg

Traditional Method

  • About 4 h of acoustic signals

  • Manual computation

  • Complicated and time-consuming

Our Method

  • Only one short-term acoustic signal of about 300 ms

  • Neural network

  • Accurate and fast

Goal


  • Find the neural network that performs best at pitch range estimation.

  • Produce a detailed system design.

  • Build a precise API for the best-performing model.

Technical Workflow


结果-技术路线英.jpg
Step 1: Dataset Preparation


Based on the AISHELL open-source Mandarin speech corpus and the Mandarin speech corpus recorded by Japanese learners of Chinese (made by the BLCU Speech Acquisition and Intelligent Technology Lab), I used Praat to extract the pitch ceiling, floor, mean, and variance of every recording, and used Kaldi on a Linux server to extract the Fbank features. I then used Python to arrange them into the format required for the training and test datasets.
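The arrangement step can be sketched roughly as follows. This is a minimal illustration and not the project's actual script: it assumes the pitch track is available as a list of F0 values in Hz with 0 marking unvoiced frames, that the Fbank features use Kaldi's default 10 ms frame shift (so 30 frames cover about 300 ms), and that population variance is the intended variance. The function names `pitch_labels` and `make_examples` are hypothetical.

```python
from statistics import mean, pvariance


def pitch_labels(f0_hz):
    """Summarize one recording's pitch track (0 = unvoiced frame)."""
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("recording has no voiced frames")
    return {
        "ceiling": max(voiced),
        "floor": min(voiced),
        "mean": mean(voiced),
        "variance": pvariance(voiced),  # assumption: population variance
    }


def make_examples(fbank_frames, labels, window=30):
    """Cut a recording into non-overlapping windows of `window` frames.

    Every window from the same recording shares that recording's labels,
    because pitch range is a property of the speaker, not of the segment.
    """
    target = [labels["ceiling"], labels["floor"],
              labels["mean"], labels["variance"]]
    examples = []
    for start in range(0, len(fbank_frames) - window + 1, window):
        examples.append((fbank_frames[start:start + window], target))
    return examples


# Toy input: 65 frames of 40-dimensional features, made up for illustration.
toy_frames = [[0.0] * 40 for _ in range(65)]
toy_labels = pitch_labels([0.0, 200.0, 220.0, 180.0, 0.0])
toy_examples = make_examples(toy_frames, toy_labels)
```

Trailing frames that do not fill a whole window are dropped in this sketch; padding them would be an equally reasonable choice.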

语料库英.png
数据集格式英.png

The prepared dataset for one speech recording

Step 2: DNN, RNN, LSTM building and optimizing

My main role was building and optimizing the DNN, RNN, and LSTM models with Keras on top of TensorFlow.

  • Increase the number of hidden layers and neurons

  • Use regularization to prevent overfitting

  • Try different activation functions

  • Adjust the learning rate

  • Add early stopping

  • ...
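Of the techniques listed above, early stopping is the easiest to show in isolation. The sketch below reproduces the usual patience-based rule in plain Python so the logic is visible; in the project itself this would more likely be handled by a Keras callback. The function name, the patience value, and the loss values are assumptions chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the loss has failed to improve by more than
    `min_delta` for `patience` consecutive epochs. `stop_epoch` is None
    if that never happens.
    """
    best_loss = float("inf")
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up validation losses: improvement stalls after epoch 2.
stop_epoch, best_epoch = early_stopping_epoch(
    [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6], patience=3
)
```

Here training would stop at epoch 5 and the weights from epoch 2 would be kept, so the later improvement at epoch 6 is never reached. That trade-off is why the patience value is itself worth tuning.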

While applying these techniques to optimize the DNN, RNN, and LSTM separately, I used TensorBoard to visualize the training process, diagnose problems, and improve accuracy. I tried nearly 90 different combinations of hyperparameters.
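A search over hyperparameter combinations like this can be organized as a simple grid. The sketch below is only an illustration: the search space values are hypothetical and do not reproduce the combinations actually tried, and `run_name` is a hypothetical helper that gives each run a distinct label, which is convenient for keeping runs apart in a visualization tool's log directory.

```python
from itertools import product

# Hypothetical search space, chosen only for illustration.
SEARCH_SPACE = {
    "hidden_layers": [1, 2, 3],
    "units": [128, 256, 512],
    "dropout": [0.2, 0.5, 0.8],
    "learning_rate": [1e-2, 1e-3],
}


def iter_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    names = list(space)
    for values in product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def run_name(config):
    """Build a unique, readable label for one configuration."""
    return "_".join(f"{key}-{value}" for key, value in sorted(config.items()))


configs = list(iter_configs(SEARCH_SPACE))
names = [run_name(config) for config in configs]
```

Each configuration would then be trained and logged under its own name, so the training curves of different runs can be compared side by side.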


The Best-Performing LSTM Model

LSTM2.png
LSTM1.png


三个最优模型对比.gif

Comparison of the Optimal Models

  • After comparing the models and choosing the best one, I expanded the dataset to further optimize the chosen LSTM (2 hidden layers, 512 neurons, 0.8 dropout). Eventually, the best-performing model reached 98% accuracy.

  • To show that short-time acoustic signals processed with a deep neural network are sufficient to estimate a speaker's pitch range, I fed Fbank features covering 50 ms, 100 ms, and 300 ms of speech into the LSTM.
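Comparing durations comes down to truncating each feature sequence to a fixed number of frames. The sketch below assumes Kaldi's default 10 ms frame shift and ignores the slight extra context of the 25 ms analysis window; both function names are hypothetical.

```python
def frames_for_duration(duration_ms, frame_shift_ms=10):
    """Number of feature frames that cover roughly `duration_ms` of speech.

    Assumption: one frame per `frame_shift_ms`, edge effects ignored.
    """
    if duration_ms <= 0 or frame_shift_ms <= 0:
        raise ValueError("durations must be positive")
    return duration_ms // frame_shift_ms


def truncate_to_duration(fbank_frames, duration_ms, frame_shift_ms=10):
    """Keep only the leading frames that span `duration_ms`."""
    n_frames = frames_for_duration(duration_ms, frame_shift_ms)
    if len(fbank_frames) < n_frames:
        raise ValueError("recording is shorter than the requested duration")
    return fbank_frames[:n_frames]


# The three durations compared in the experiment.
frame_counts = {ms: frames_for_duration(ms) for ms in (50, 100, 300)}
```

Under these assumptions the three conditions correspond to input sequences of 5, 10, and 30 frames.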

ceiling.gif
floor.gif
mean.gif

The graphs show that:

  • Compared with the initial model, the optimized model is more accurate.

  • As speech duration increases, the prediction error decreases steadily, but the differences between durations stay within an acceptable range.
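The error metric behind these graphs is not stated here. As one plausible choice, the sketch below computes the mean absolute error in Hz between predicted and reference values. The function name and the numbers in the usage lines are made up for illustration and are not results from this project.

```python
def mean_absolute_error(predicted, actual):
    """Average absolute difference between paired predictions and targets."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one pair is required")
    total = sum(abs(p - a) for p, a in zip(predicted, actual))
    return total / len(predicted)


# Made-up pitch ceilings (Hz) for three hypothetical test recordings.
reference_ceiling = [310.0, 255.0, 402.0]
predicted_ceiling = [300.0, 260.0, 410.0]
ceiling_error = mean_absolute_error(predicted_ceiling, reference_ceiling)
```

Computing the same quantity separately for the ceiling, floor, and mean, and for each input duration, would give one point per curve in a comparison of this kind.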

To show that the model can assist in teaching Chinese as a foreign language, we also evaluated it on the test dataset recorded by Japanese learners of Chinese (made by the BLCU Speech Acquisition and Intelligent Technology Lab). The results are shown in the following figure:

泛化能力.gif

Step 3: API Making & Detailed System Design


I built a precise API in Python. It loads our optimal model and estimates the pitch range from a single speech recording.
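The shape of such an API might look like the sketch below. It is not the project's actual interface: the class and field names are hypothetical, the model is represented by any callable that maps a window of Fbank frames to three numbers (ceiling, floor, mean), and the 30-frame input length assumes about 300 ms of speech at a 10 ms frame shift. In practice the callable would wrap the trained LSTM once it has been loaded.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Frames = Sequence[Sequence[float]]


@dataclass
class PitchRange:
    ceiling_hz: float
    floor_hz: float
    mean_hz: float


class PitchRangeEstimator:
    """Thin wrapper that turns one recording's features into a PitchRange."""

    def __init__(self, model: Callable[[Frames], Sequence[float]],
                 n_frames: int = 30):
        self.model = model        # assumption: returns (ceiling, floor, mean)
        self.n_frames = n_frames  # assumption: about 300 ms of input

    def estimate(self, fbank_frames: Frames) -> PitchRange:
        if len(fbank_frames) < self.n_frames:
            raise ValueError("recording is too short for this model")
        window = fbank_frames[: self.n_frames]
        ceiling, floor, mean = self.model(window)
        return PitchRange(float(ceiling), float(floor), float(mean))


# Stand-in model for demonstration only; it ignores its input.
def dummy_model(window: Frames) -> Sequence[float]:
    return (320.0, 150.0, 220.0)


estimator = PitchRangeEstimator(dummy_model)
result = estimator.estimate([[0.0] * 40 for _ in range(45)])
```

Keeping the model behind a plain callable means the surrounding system can be designed and tested without loading the network itself.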

结果-系统设计英.jpg