Jan 24, 2016

[ML] Neural Networks

Neural networks:

Idea:
There could be more than 2 features, i.e. x1, x2, x3, ..., xn.
We could have ϴ0, ϴ1, ..., ϴv, where 'v' is the number of
    terms in the polynomial.
 
Think about letting a computer recognize a picture of a car.

A 50 x 50 pixel image -> 2500 pixels

x = [
pixel 1 intensity   (0-255)
pixel 2 intensity   (0-255)
...
pixel 2500 intensity    (0-255)]

So, if we still want to include all the quadratic features, i.e. every
product xi * xj, there would be about 3 million features. How's that?!

Way too many features for the previous learning algorithms.
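
A quick sanity check of that figure (a rough sketch, counting every quadratic term xi * xj with i <= j):

    # Count the quadratic features of a 50 x 50 image (2500 pixels).
    n = 2500
    quadratic_features = n * (n + 1) // 2   # every product xi * xj with i <= j
    print(quadratic_features)               # 3126250, roughly 3 million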

Here's where neural networks come in...

-------
NN:
Algorithms that try to mimic the brain.
They diminished in the late 90's; however, thanks to distributed/powerful computing,
NNs have come alive again.

Auditory cortex:
Used for hearing. However, if we cut off the hearing input and reroute
the input signal from seeing, the auditory cortex will start learning to see.

Idea:
Plug any sensor into the brain, and the brain will start to learn.

---
NN model representation I:

Neuron:
input wire: Dendrite
output wire: Axon


-----
Artificial neuron design:
Neuron model: logistic unit

Input:
   x0(bias unit, always equal to 1),  x1, x2, x3 (as features)

processing:
    1 neuron

output:
    h(x)
 
Sigmoid (logistic) activation function.
i.e:
    g(z) = 1 / ( 1+e^(-z) )
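
As a minimal sketch (Python/NumPy, not from the original notes), one logistic unit looks like this:

    import numpy as np

    def g(z):
        """Sigmoid (logistic) activation function."""
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(theta, x):
        """One logistic unit: prepend the bias x0 = 1, then h(x) = g(theta . x)."""
        x = np.concatenate(([1.0], x))   # add the bias unit x0 = 1
        return g(theta @ x)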

----
Now, let's talk about a network of neurons.

input: (layer 1)
    x0(bias unit, always equal to 1),  x1, x2, x3 (as features)
 
processing: (layer 2, aka hidden layer)
    multiple neurons (also with an a0 neuron as the bias unit)

output: (layer 3)
    h(x)


---
So,
ai^(j) = "activation" of unit i in layer j
ϴ^(j) = matrix of weights controlling function mapping from
    layer j to layer j + 1


---
Vectorize the computation:

 a1^(2) = g( z1^(2) )
 where z1^(2) = ϴ10^(1) x0 + ϴ11^(1) x1 + ϴ12^(1) x2 + ϴ13^(1) x3


 
 In general, for each layer j:
     z^(j+1) = ϴ^(j) a^(j)      (with a^(1) = x)
     a^(j+1) = g( z^(j+1) )
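
A minimal sketch of this vectorized forward pass for a 3-layer network (Python/NumPy; the layer sizes and weight shapes here are illustrative assumptions, not from the notes):

    import numpy as np

    def g(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward(x, Theta1, Theta2):
        """Forward propagation: input x -> hidden layer -> output h(x)."""
        a1 = np.concatenate(([1.0], x))      # layer 1: input plus bias x0 = 1
        z2 = Theta1 @ a1                     # z^(2) = ϴ^(1) a^(1)
        a2 = np.concatenate(([1.0], g(z2)))  # layer 2: activations plus bias a0 = 1
        z3 = Theta2 @ a2                     # z^(3) = ϴ^(2) a^(2)
        return g(z3)                         # h(x) = a^(3)

    # e.g. 3 input features and 3 hidden units:
    # Theta1 has shape (3, 4), Theta2 has shape (1, 4)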



 ----
A neural network learning its own features:

Rather than taking the sensor input from layer 1,
    i.e. x1, x2, x3,
the next layer uses a1, a2, a3 as its new input.


-----

Ok, let's unmask layer 1.
The inputs a1, a2, a3 feeding a^(i) are no longer the raw x1, x2, x3; they are
learned from the previous layer via its weights ϴ1^(i-1), ϴ2^(i-1), ϴ3^(i-1). Neat!


-----
A neural network can be composed into different architectures:
i.e. multiple layers, but still
only 1 input layer and 1 output layer; the others are hidden layers.


-----
Examples:

Non-linear classification example: XOR/XNOR



AND function:
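
The diagram for this didn't make it into the notes. As a sketch (the weights are the classic illustrative choice, assumed here), a single logistic unit with ϴ = [-30, 20, 20] computes x1 AND x2, since g(-30) ≈ 0 and g(10) ≈ 1. Reusing the neuron() helper sketched above:

    # AND as one logistic unit; only (1, 1) pushes z above 0.
    theta_and = np.array([-30.0, 20.0, 20.0])
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, round(neuron(theta_and, np.array([x1, x2]))))
    # prints: 0 0 0 / 0 1 0 / 1 0 0 / 1 1 1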



OR function:


-------
The parameters ϴ are also called 'weights'.

Negation:


---
Pipeline them together! (See the combined sketch after the XNOR section below.)

(NOT x1) AND (NOT x2)

---
x1 XNOR x2
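
Combining the gates above gives x1 XNOR x2 = (x1 AND x2) OR ((NOT x1) AND (NOT x2)). A sketch with the classic weight choices assumed (again reusing neuron() from earlier):

    # XNOR built from single logistic units pipelined into a 2-layer network.
    theta_and     = np.array([-30.0,  20.0,  20.0])   # x1 AND x2
    theta_not_not = np.array([ 10.0, -20.0, -20.0])   # (NOT x1) AND (NOT x2)
    theta_or      = np.array([-10.0,  20.0,  20.0])   # x1 OR x2

    def xnor(x1, x2):
        x  = np.array([x1, x2])
        a1 = neuron(theta_and, x)                      # hidden unit 1
        a2 = neuron(theta_not_not, x)                  # hidden unit 2
        return neuron(theta_or, np.array([a1, a2]))    # output unit

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, round(xnor(x1, x2)))
    # prints: 0 0 1 / 0 1 0 / 1 0 0 / 1 1 1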



---

NN representation of Multi-class classification:

e.g. the digits 0-9: there are 10 categories.

example:
Pedestrian
Car
Motorcycle
Truck

Here, h(x) ∈ R^4.

For Pedestrian:
[ 1
  0
  0
  0]

For car:
[ 0
  1
  0
  0]

For Motorcycle:
[ 0
  0
  1
  0]

For truck:
[ 0
  0
  0
  1]



Re-representing y ∈ {1, 2, 3, 4} as:

y^(i) is one of [1;0;0;0], ..., [0;0;0;1]
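
A small sketch of this re-representation (a hypothetical helper, not from the notes):

    import numpy as np

    def one_hot(y, num_classes=4):
        """Re-represent a class label y in {1, ..., num_classes} as a unit vector."""
        v = np.zeros(num_classes)
        v[y - 1] = 1.0
        return v

    print(one_hot(1))   # [1. 0. 0. 0.]  -> Pedestrian
    print(one_hot(3))   # [0. 0. 1. 0.]  -> Motorcycle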

 

Non-linear hypotheses.
