MNIST Dataset

- October 03, 2018

The MNIST dataset is a well-known collection of handwritten digits, and many people use it for their first pattern-recognition exercise. I started with this GitHub tutorial and this reference website.

There are 60,000 samples in the training set and 10,000 samples in the test set. Each sample is a 28x28-pixel image with values ranging from 0 to 255, and each digit is size-normalized to fit a 20x20-pixel box centered in the image. A sample is shown below.

Note that the number of samples per digit is not the same across digits in the training set (i.e. there are not exactly 6,000 samples of each digit), nor in the test set.
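The class imbalance can be checked with a quick count over the label vector. This is a minimal sketch: the function name `count_per_digit` and the toy labels are assumptions for illustration, and the real MNIST labels would be passed in instead.

```python
import numpy as np

def count_per_digit(labels):
    """Return an array of length 10 whose n-th entry counts the samples labeled n."""
    labels = np.asarray(labels, dtype=np.int64)
    return np.bincount(labels, minlength=10)

# Toy labels standing in for the real MNIST label vector (assumption).
toy_labels = [0, 1, 1, 2, 2, 2, 9]
counts = count_per_digit(toy_labels)
print(counts)
```

Applied to the actual training labels, the ten counts should sum to 60,000 without all being equal to 6,000.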

In this post, we try to use 8 kernels to build a filter for the 10 digits. Let the probability vector be $[p_0\ p_1\ \dots\ p_9]^T$, with one entry for each digit $n$ from 0 to 9. Each entry can be written as \[p_n(t)=P\{X=n|\ y(1),y(2),\dots,y(t-1),y(t)\},\ \forall\ n\in\{0,1,\dots,9\},\] where $y(t)$ is an observed value. For now, we assume $y(t)$ can be modeled as \[y(t)=h_k(n,t)+N_k(t),\] where $h_k(n,t)$ are the filters and $N_k(t)$ is zero-mean Gaussian noise with variance ${\sigma_k}^2(t)$: \[N_k(t)\sim N(0,{\sigma_k}^2(t)).\] Note that time $t$ here is an integer in the interval $[l,L)$, which is discussed further below.

First, let's build the filters $h_k(n,t)$. Each filter $h_k$ corresponds to a different kernel $w_k$, but all of them share the same time window of $l=4$ pixels. Here we try 8 kernels, listed below:

$w_1(t)=\sin\left(\dfrac{\pi x}{L}\right)$	$w_3(t)=\sin\left(\dfrac{2\pi x}{L}\right)$	$w_5(t)=\sin\left(\dfrac{3\pi x}{L}\right)$	$w_7(t)=\sin\left(\dfrac{4\pi x}{L}\right)$
$w_2(t)=\sin\left(\dfrac{\pi y}{L}\right)$	$w_4(t)=\sin\left(\dfrac{2\pi y}{L}\right)$	$w_6(t)=\sin\left(\dfrac{3\pi y}{L}\right)$	$w_8(t)=\sin\left(\dfrac{4\pi y}{L}\right)$
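The eight kernels in the table can be sketched as sampled sine profiles. This is only an illustration under assumptions: the function name `make_kernels` is hypothetical, and sampling each kernel at integer pixel positions $0,\dots,L-1$ is a choice made here, not something the table specifies.

```python
import numpy as np

def make_kernels(L=28, n_freq=4):
    """Return a list of (axis_name, profile) pairs for kernels w_1..w_8.

    Following the table, odd-numbered kernels vary along x and
    even-numbered kernels vary along y. Each profile is sin(m*pi*u/L)
    sampled at u = 0, ..., L-1 (the sampling grid is an assumption).
    """
    u = np.arange(L)
    kernels = []
    for m in range(1, n_freq + 1):
        profile = np.sin(m * np.pi * u / L)
        kernels.append(("x", profile))  # w_{2m-1}
        kernels.append(("y", profile))  # w_{2m}
    return kernels

kernels = make_kernels()
for index, (axis, profile) in enumerate(kernels, start=1):
    print(f"w_{index}: varies along {axis}, {profile.size} samples")
```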

The $x$ and $y$ directions are shown in the first sample picture, and $L$ is the side length of the digit image, which is 28. A GIF animation of how kernels 1 and 3 act on a digit sample is shown below.


Note that we normalize the pixel values from 0~255 to 0~1. Also, as the GIF shows, time $t$ runs along the $x$ direction, since $w_1(t)$ and $w_3(t)$ are functions of $y$. The filters $h_k$ are in fact the result of convolving each digit image with each kernel function. Take $h_1(n,t)$ for example: \[h_1(n,t)=\int_{t-l}^t\int_0^L\text{img}(x,y)\sin\left(\dfrac{\pi y}{L}\right)\ dy\ dx.\]
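A discrete version of this double integral might look like the sketch below. Several things here are assumptions rather than the post's actual code: the name `windowed_response`, the array layout `img[y, x]`, and the choice to approximate the integral over $[t-l,t]$ by a sum over the $l$ columns $x=t-l,\dots,t-1$.

```python
import numpy as np

def windowed_response(img, m=1, l=4):
    """Discrete approximation of the windowed integral with kernel sin(m*pi*y/L).

    img: 2-D array of raw pixel values in 0..255, indexed as img[y, x] (assumption).
    Returns an array of length L - l with the response for t = l, ..., L-1.
    """
    img = np.asarray(img, dtype=np.float64) / 255.0   # normalize pixels to 0..1
    L = img.shape[0]
    w = np.sin(m * np.pi * np.arange(L) / L)          # kernel sampled along y
    column_score = w @ img                            # sum over y for every column x
    csum = np.concatenate(([0.0], np.cumsum(column_score)))
    t = np.arange(l, L)
    return csum[t] - csum[t - l]                      # sum over columns t-l .. t-1

# Toy input: a uniformly bright 28x28 image (not a real digit).
toy_img = np.full((28, 28), 255)
print(windowed_response(toy_img).shape)
```

With $L=28$ and $l=4$ this gives 24 values per kernel, one for each integer $t$ in $[l,L)$.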

Next, we compute each $h_k(n,t)$ for the first 100 samples of each digit and take the mean and variance over those samples to build the final filters $h_k(n,t)$, shown below:

However, by assumption the noise $N_k(t)$ is a function of time $t$ only; it does not depend on the digit $n$. So we further average the variance over all digits at each time $t$.
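The averaging steps can be sketched as follows. This is a minimal illustration, assuming the per-sample responses are already stacked into one array; the name `build_filters`, the array layout, and the use of the population variance (`ddof=0`) are assumptions.

```python
import numpy as np

def build_filters(responses):
    """Build mean filters and a digit-independent noise variance.

    responses: array of shape (10, S, K, T) = (digit, sample, kernel, time step),
               e.g. S = 100 samples per digit (layout is an assumption).
    Returns h with shape (10, K, T) and sigma2 with shape (K, T).
    """
    responses = np.asarray(responses, dtype=np.float64)
    h = responses.mean(axis=1)               # mean over samples, per digit
    var_per_digit = responses.var(axis=1)    # variance over samples, per digit
    sigma2 = var_per_digit.mean(axis=0)      # average the variance over the 10 digits
    return h, sigma2

# Random numbers standing in for real filter responses (assumption).
rng = np.random.default_rng(0)
toy = rng.normal(size=(10, 100, 8, 24))
h, sigma2 = build_filters(toy)
print(h.shape, sigma2.shape)
```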

Once $h_k(n,t)$ and $N_k(t)$ are constructed, we can calculate $p_n(t)$ using Bayes' rule, assuming the observations are conditionally independent given the digit: \[\begin{align*} p_n(t)&=P\{X=n|y(1),y(2),\dots,y(t-1),y(t)\}\\\\ &=\dfrac{P\{y(t)|X=n\}\ P\{X=n|y(1),y(2),\dots,y(t-1)\}}{P\{y(t)|y(1),y(2),\dots,y(t-1)\}}\\\\ &\propto P\{y(t)|X=n\}\ p_n(t-1).\end{align*}\] Note that $P\{y(t)|y(1),y(2),\dots,y(t-1)\}$ does not depend on the digit $n$, so we can write the unnormalized probability $\tilde p_n(t)$ as: \[\tilde p_n(t)=P\{y(t)|X=n\}\ \tilde p_n(t-1),\] where, dropping the Gaussian normalization constant (which does not depend on $n$ because ${\sigma_k}^2(t)$ does not), \[P\{y(t)|X=n\}=\prod_k\text{exp}\left(-\dfrac{(y(t)-h_k(n,t))^2}{2{\sigma_k}^2(t)}\right).\] The relation between $p_n(t)$ and $\tilde p_n(t)$ is: \[p_n(t)=\dfrac{\tilde p_n(t)}{\sum\limits_{m=0}^9\tilde p_m(t)}.\] Furthermore, to simplify the code, we take $\mu_n(t)=\ln\tilde p_n(t)$, which gives \[\mu_n(t)=-\sum_k\dfrac{(y(t)-h_k(n,t))^2}{2{\sigma_k}^2(t)}+\mu_n(t-1).\]
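The recursion for $\mu_n(t)$ and the final normalization can be sketched in a few lines. This is an illustration under assumptions, not the post's actual code: the name `posterior_trace` is hypothetical, each kernel $k$ is assumed to supply its own observation $y_k(t)$, and the recursion is started from a uniform prior ($\mu_n=0$ before the first step). Subtracting the maximum of $\mu_n(t)$ before exponentiating is a standard trick to avoid underflow and does not change $p_n(t)$.

```python
import numpy as np

def posterior_trace(y, h, sigma2):
    """Run the log-domain recursion and return p_n(t) for every time step.

    y:      (K, T) observations, one row per kernel (assumption)
    h:      (10, K, T) mean filters h_k(n, t)
    sigma2: (K, T) noise variances sigma_k^2(t)
    Returns an array of shape (T, 10) whose rows sum to 1.
    """
    y = np.asarray(y, dtype=np.float64)
    h = np.asarray(h, dtype=np.float64)
    sigma2 = np.asarray(sigma2, dtype=np.float64)
    # Per-step increment: -sum_k (y - h)^2 / (2 sigma^2), shape (10, T).
    step = -((y[None] - h) ** 2 / (2.0 * sigma2[None])).sum(axis=1)
    mu = np.cumsum(step, axis=1)              # mu_n(t) = increment + mu_n(t-1)
    mu = mu - mu.max(axis=0, keepdims=True)   # stabilize before exponentiating
    p = np.exp(mu)
    p = p / p.sum(axis=0, keepdims=True)      # normalize over the 10 digits
    return p.T

# Toy filters and an observation equal to the filter of digit 3 (assumption).
rng = np.random.default_rng(1)
toy_h = rng.normal(size=(10, 8, 24))
toy_sigma2 = np.ones((8, 24))
p = posterior_trace(toy_h[3], toy_h, toy_sigma2)
print(p[-1].argmax())
```

The predicted digit at time $t$ is then the index of the largest entry in row $t$ of the returned array.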

The following are 10 test samples under evaluation:

The accuracy table:


Hanson.hsChang
