How regularization affects the critical points in linear networks
$\newcommand{\transpose}{\intercal}$ $\DeclareMathOperator*{\minimize}{minimize}$ Let $X_0\in\mathbb{R}^n$ be a random input vector with distribution $p_X$ and covariance matrix $\Sigma_{X_0}=\mathbb{E}[X_0{X_0}^\transpose]$. Assume the input-output model takes the linear form \[Z=RX_0+\xi,\] where $Z\in\mathbb{R}^n$ is the output and $\xi\in\mathbb{R}^n$ is the noise. The noise $\xi$ is assumed to have distribution $p_\xi$ and to be independent of the input $X_0$, so that $\mathbb{E}[\xi{X_0}^\transpose]=0$.

The problem is to use i.i.d. input-output samples $\{({X_0}^{(k)},Z^{(k)})\}_{k=1}^K$ to learn the weights of a linear feed-forward neural network \[\dfrac{dX_t}{dt}=A_tX_t\] so that it matches the input-output relation $R$. Here $A_t$ are the network weights, $t\in[0,T]$ is the continuous layer index with $T$ the network depth, and $K$ is the total number of training samples. Consider the following regularized form of the optimization problem: \[\begin{align} ...
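To make the setup concrete, here is a minimal numerical sketch of the data model and of the continuous-depth network $\dfrac{dX_t}{dt}=A_tX_t$. The dimensions, the standard normal choice of $p_X$ and $p_\xi$, the noise level, and the forward-Euler discretization with piecewise-constant weights are all illustrative assumptions, not part of the original formulation.

```python
import numpy as np

# Illustrative choices (not specified in the post): dimension, sample count,
# depth T, number of Euler steps, and noise level.
n, K, T = 3, 1000, 1.0
num_steps = 50
sigma_xi = 0.1
rng = np.random.default_rng(0)

# Ground-truth input-output map R (drawn at random here for illustration).
R = rng.standard_normal((n, n))

# Draw i.i.d. samples (X0^(k), Z^(k)) from the linear model Z = R X0 + xi,
# with xi independent of X0.
X0 = rng.standard_normal((K, n))             # inputs, p_X taken as standard normal
xi = sigma_xi * rng.standard_normal((K, n))  # noise, p_xi taken as standard normal
Z = X0 @ R.T + xi

def forward(X0, A_list, dt):
    """Propagate inputs through the linear ODE network dX_t/dt = A_t X_t
    using forward Euler with piecewise-constant weights:
    X_{t+dt} = (I + dt * A_t) X_t."""
    X = X0
    for A in A_list:
        X = X @ (np.eye(A.shape[0]) + dt * A).T
    return X

# One weight matrix per Euler step (small random initialization).
dt = T / num_steps
A_list = [0.01 * rng.standard_normal((n, n)) for _ in range(num_steps)]

# Empirical (unregularized) least-squares loss over the K training samples.
X_T = forward(X0, A_list, dt)
loss = np.mean(np.sum((X_T - Z) ** 2, axis=1))
print(f"empirical loss at initialization: {loss:.4f}")
```

Under this discretization, the network realizes the product $\prod_t (I + \Delta t\, A_t)$, which plays the role of the learned end-to-end map that should approximate $R$; the regularized objective below adds a penalty on the weights $A_t$ to this empirical loss.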