Evonik Digital Research

A joint research project with the University of Duisburg-Essen

Sensing Abnormal Situations with Audio

In this part of the work, a deep convolutional neural network was optimized and trained for sound classification; this is a supervised learning task. We converted audio clips into spectrograms and fed them into the network to distinguish between explosions and non-explosions. The work was divided into three steps:

  1. Data collection and labeling:
    For the non-explosive audio, serving as a contrast to the explosion sounds, we collected 2,000 clips of general audio material covering different kinds of sound, such as people talking, animal sounds, wind, traffic, and so on. For the explosive audio, we collected 450 high-bit-rate explosion clips.
  2. Converting audio clips into spectrograms:
    We generated spectrogram images from these datasets as input for the neural network (see the first sketch after this list).
  3. Training the neural network:
    First, the spectrogram data prepared in step 2 were used to train the last layer of the convolutional neural network. Then the remaining layers were trained with a relatively large learning rate. Finally, a small learning rate was used to fine-tune the parameters of the whole network (see the second sketch after this list).
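
The following is a minimal sketch of step 2, assuming librosa and matplotlib; the file names, sampling rate, mel parameters, and image size are illustrative assumptions, not the settings used in the project.

```python
# Sketch: convert one audio clip into a mel-spectrogram image (assumed tooling).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def audio_to_spectrogram(wav_path, png_path, sr=22050, n_mels=128):
    """Save a mel-spectrogram image of the clip for the CNN."""
    y, sr = librosa.load(wav_path, sr=sr)
    # Mel-scaled power spectrogram, converted to decibels.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)
    fig, ax = plt.subplots(figsize=(2.24, 2.24), dpi=100)  # roughly 224x224 px
    librosa.display.specshow(S_db, sr=sr, ax=ax)
    ax.set_axis_off()                                      # image only, no axes
    fig.savefig(png_path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)

audio_to_spectrogram("explosion_001.wav", "explosion_001.png")  # hypothetical files
```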
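The staged schedule in step 3 could look like the sketch below, assuming a TensorFlow/Keras setup with an ImageNet-pretrained MobileNetV2 backbone; the report does not name the actual architecture, framework, learning rates, or epoch counts, so all of these are illustrative assumptions.

```python
# Sketch of the three-stage training schedule (assumed framework and backbone).
from tensorflow import keras

# Assumed layout: spectrograms/explosion/*.png and spectrograms/other/*.png
train_ds = keras.utils.image_dataset_from_directory(
    "spectrograms/", image_size=(224, 224), batch_size=32)

base = keras.applications.MobileNetV2(
    include_top=False, pooling="avg", input_shape=(224, 224, 3))
model = keras.Sequential([
    keras.layers.Rescaling(1.0 / 255),            # raw pixel values to [0, 1]
    base,
    keras.layers.Dense(2, activation="softmax"),  # explosion / non-explosion
])

def compile_and_fit(model, lr, epochs):
    """Recompile with a new learning rate, then train for a few epochs."""
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(train_ds, epochs=epochs)

# Stage 1: freeze the backbone and train only the new last layer.
base.trainable = False
compile_and_fit(model, lr=1e-3, epochs=5)

# Stage 2: unfreeze and train the remaining layers with a relatively large rate.
base.trainable = True
compile_and_fit(model, lr=1e-4, epochs=5)

# Stage 3: fine-tune all parameters with a small learning rate.
compile_and_fit(model, lr=1e-5, epochs=3)
```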

After training, the classification accuracy of the network on the training set was about 91%. We used 50 explosion clips and 50 non-explosion clips as a test set; none of these clips were used during training, so they were entirely new to the network.

The confusion matrix shows the quality of the detection: of the 50 explosive sound clips in the test set, 39 were classified correctly; of the 50 non-explosive clips, all 50 were classified correctly.
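
As an illustration, the reported counts can be arranged into a confusion matrix with scikit-learn; the label vectors below are reconstructed from the counts above, not taken from the original experiment logs.

```python
# Sketch: rebuild the test-set confusion matrix from the reported counts.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1] * 50 + [0] * 50)             # 1 = explosion, 0 = non-explosion
y_pred = np.array([1] * 39 + [0] * 11 + [0] * 50)  # 39 explosions caught, 11 missed

cm = confusion_matrix(y_true, y_pred, labels=[1, 0])
print(cm)
# [[39 11]
#  [ 0 50]]
```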