
Goal:

The goal of this project is to benchmark machine learning techniques when used to classify sound files.

Supervised Machine Learning Techniques used:

- Gaussian Naive Bayes
- Decision Tree
- K Nearest Neighbors

Unsupervised Machine Learning Techniques used:

- K Means

The dataset used can be found in the repository and is called MNIST_Spoken_Digits.rar. It contains spoken digits from 0 to 9: 3000 .wav files in total, with 300 .wav files per digit, recorded by 6 different speakers. Each sound file is transformed into a spectrogram prior to classification, so this project applies image recognition techniques to the task of classifying sounds.

(Figure: example spectrogram of a spoken digit)

Benchmarking

Cross-validation combined with confusion matrices served as the main benchmarks. K Means has its own way of being evaluated since it is an unsupervised technique; this is discussed later.

Cross Validation:

10-fold cross-validation was used in each test. This means the dataset was randomly divided into 10 subsets and 10 tests were run. In each test, one subset served as the test data while the remaining 9 subsets served as the training data.
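The procedure above can be sketched with scikit-learn's cross-validation utilities. The random feature matrix below is only a stand-in for the flattened spectrograms (the real input would come from MNIST_Spoken_Digits.rar), and the choice of Gaussian Naive Bayes as the example estimator is an assumption for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for 3000 flattened spectrograms with 64 "pixel" features.
rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 64))
y = np.repeat(np.arange(10), 300)  # 300 samples per spoken digit

# 10-fold CV: each fold serves once as test data, the other 9 as training data.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
pred = cross_val_predict(GaussianNB(), X, y, cv=cv)
accuracy = (pred == y).mean()
```

`cross_val_predict` returns one out-of-fold prediction per sample, which is exactly what is needed to build a single summed confusion matrix afterwards.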

Confusion Matrix:

The confusion matrix serves as a visual summary of how well each machine learning technique performed. In my confusion matrices, the predicted labels run along the x-axis while the true labels run along the y-axis. Any number off the diagonal of the matrix represents incorrect classifications. (Note: since there are 10 tests, each confusion matrix shown is the sum of all 10 per-fold confusion matrices.)
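Summing the per-fold confusion matrices can be done as below. Again the data is a synthetic stand-in for the spectrograms, and Gaussian Naive Bayes is used only as a placeholder estimator:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 16))        # stand-in spectrogram features
y = np.repeat(np.arange(10), 30)      # digits 0-9, 30 samples each

# Accumulate one confusion matrix per fold; the figures below show the total.
total = np.zeros((10, 10), dtype=int)
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    model = GaussianNB().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    total += confusion_matrix(y[test_idx], pred, labels=list(range(10)))
```

Passing `labels=list(range(10))` keeps every matrix 10×10 even if a fold happens to miss a class, so the fold-wise sums line up.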

Gaussian Naive Bayes:

Gaussian Naive Bayes is a model-based supervised machine learning technique. It assumes that, within each class (each of the spoken digits from 0 to 9), every feature independently follows a Gaussian distribution (or equivalently, normal distribution). The training data is used to compute the maximum likelihood estimates (MLE) of the parameters of each class's Gaussian distributions.

(Figure: Gaussian Naive Bayes confusion matrix)

Overall Accuracy: 65.59%


Decision Tree:

Decision trees classify data points by splitting on their features (or equivalently, the columns in this case). Generally, decision trees perform better when individual features are highly correlated with the class being predicted. This means decision trees generally do not perform well for image recognition tasks such as these spectrogram tests: each pixel of the spectrogram is a feature, but each pixel on its own carries very little information about the classification of the overall data point.

Even though the results were expected to be poor, the goal of this project is to benchmark each technique, so this particular experiment serves to verify the expected result.
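The "each pixel carries little information" point can be seen in a fitted tree's feature importances, which record how much each single pixel-feature contributed to the splits. The random data below is a stand-in for the spectrograms:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 64))      # stand-in for 64 spectrogram pixels
y = np.repeat(np.arange(10), 30)    # digits 0-9

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# One importance per pixel-feature; when no single pixel is strongly
# predictive, importance is smeared thinly across many pixels.
importances = tree.feature_importances_
```

With informative features, a few importances would dominate; on pixel data like this, the mass tends to spread across many weak splits, which is the failure mode described above.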

(Figure: Decision Tree confusion matrix)

Overall Accuracy: 63.76%


K Nearest Neighbors:
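K nearest neighbors classifies a spectrogram by majority vote among the k closest training spectrograms. A minimal sketch with scikit-learn follows; the value k=5 and the synthetic stand-in data are assumptions, since the report does not state the parameters used:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 16))     # stand-in training spectrograms
y_train = np.repeat(np.arange(10), 20)   # digits 0-9

# Each query is labeled by the majority digit among its 5 nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
pred = knn.predict(rng.normal(size=(5, 16)))
```

Because KNN compares whole feature vectors rather than splitting on single pixels, it copes far better with spectrogram inputs than the decision tree, consistent with its higher accuracy below.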

(Figure: K Nearest Neighbors confusion matrix)

Overall Accuracy: 90.86%


K Means:

K Means was included mostly out of curiosity, as there is no direct way to compare it with the supervised machine learning techniques.
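Since K Means produces unlabeled clusters rather than digit predictions, it needs a clustering-specific score. One common option (an assumption here; the report does not name its metric) is the adjusted Rand index, which measures agreement between cluster assignments and the true digits. The data is again a synthetic stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 16))      # stand-in spectrogram features
y = np.repeat(np.arange(10), 30)    # true digits, used only for scoring

# Cluster into 10 groups, one hoped-for cluster per digit.
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# 1.0 = clusters match the digits perfectly; ~0.0 = no better than chance.
score = adjusted_rand_score(y, clusters)
```

The index is invariant to cluster relabeling, which sidesteps the problem that K Means cluster IDs have no inherent correspondence to the digits 0-9.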

(Figure: K Means clustering results)

Conclusion

(Figure: concluding results)

This doesn’t nullify the findings, but it does suggest that K nearest neighbors may be recognizing a speaker’s voice rather than the spoken digit itself. Further research is needed to confirm whether this is the case.