Visualization of the Keyword Recording

The following shows five keyword commands in the time domain and in the time-frequency (spectrogram) domain.
Bed
Up
Down
Yes
No
A few observations stand out from these pictures.
First, within each category, the recordings share a similar frequency pattern: bands of concentrated energy called formants. Even when two recordings look less alike (such as the following two “Yes” commands), the formants are still visible.
Second, these recordings were all made at a 16,000 Hz sampling rate, and our spectrum analysis has a frequency resolution of 50 Hz. The human ear, however, can resolve frequency differences as small as 3.6 Hz in the 1,000–2,000 Hz octave [2]. In other words, human hearing picks up much finer frequency detail than this analysis does. A good question, though, is: “Do we need that much precision to distinguish each command?” Hope there are more results next week.
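To make the 50 Hz figure concrete, here is a minimal sketch in plain Python. It assumes that "frequency resolution" means the spacing between DFT bins (sampling rate divided by window length) and that a plain rectangular window is used; the actual analysis code in [0] may differ. The function names and the 1,000 Hz test tone are made up for illustration.

```python
import cmath
import math

SAMPLE_RATE_HZ = 16_000      # sampling rate stated in the post
FREQ_RESOLUTION_HZ = 50      # frequency resolution stated in the post


def window_length(sample_rate_hz, resolution_hz):
    """Samples per analysis window needed for a given DFT bin spacing."""
    return round(sample_rate_hz / resolution_hz)


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0..N/2 (fine for one short frame)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


# 16,000 / 50 = 320 samples, i.e. a 20 ms window.
n_window = window_length(SAMPLE_RATE_HZ, FREQ_RESOLUTION_HZ)
window_ms = 1000 * n_window / SAMPLE_RATE_HZ

# Hypothetical test signal: a pure 1,000 Hz tone, exactly one window long.
tone_hz = 1_000
frame = [math.sin(2 * math.pi * tone_hz * i / SAMPLE_RATE_HZ)
         for i in range(n_window)]

spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE_HZ / n_window

# Window length that would give the 3.6 Hz spacing quoted for human hearing.
n_window_ear = window_length(SAMPLE_RATE_HZ, 3.6)

print(f"window: {n_window} samples ({window_ms:.0f} ms), bins: {len(spectrum)}")
print(f"peak at bin {peak_bin} = {peak_hz:.0f} Hz")
print(f"window for 3.6 Hz spacing: {n_window_ear} samples")
```

Under these assumptions, a 3.6 Hz bin spacing would need a window of more than a quarter of a second, which is a large share of a one-word command. That is the trade-off behind the question above: finer frequency detail costs time detail.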
The dataset was produced by Google and is available at the link here.
References:
- [0] my GitHub code
- [1] Speech representation and data exploration
- [2] Psychoacoustics
