Jan 17, 2016

[ML] Classification

Classification:

Binary classification:
Y ∈ {0, 1}
    0: Negative Class
    1: Positive Class
 
Multiclass classification:
Y ∈ {0, 1, 2, 3, ...}

f(x) ∈ {0, 1}


Linear regression method:
Use h(x) = ϴ[Transpose] x

Threshold the classifier output h(x) at 0.5:

if h(x) >= 0.5, predict "y=1"
else, predict "y=0"

However, if the training set includes examples with a much larger range of x values, the fitted line shifts and the 0.5 threshold no longer separates the classes.

*Using linear regression is not a good idea for a classification problem.
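
For example (a minimal Octave sketch with made-up tumor-size data, just to illustrate the point):

x = [1 2 3 4 5]';                          % tumor sizes
y = [0 0 1 1 1]';                          % 0 = benign, 1 = malignant
theta = [ones(size(x)) x] \ y;             % least-squares linear fit
([ones(size(x)) x] * theta) >= 0.5         % thresholding classifies all 5 examples correctly

x2 = [x; 50];  y2 = [y; 1];                % add one very large malignant tumor
theta2 = [ones(size(x2)) x2] \ y2;         % the outlier flattens and shifts the line
([ones(size(x2)) x2] * theta2) >= 0.5      % now every example is predicted 1,
                                           % so the two benign tumors are misclassified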


Thus, we use:
Logistic Regression: despite the name, it's a classification algorithm
0 <= h(x) <= 1

Logistic Regression Model:
Want 0 <= h(x) <= 1

h(x) = g(ϴ[Transpose] x)   ------------ (1)

Sigmoid function / Logistic function:
g(z) = 1 / (1 + e^(-z) )  ------------ (2)

Thus with (1) and (2)
=>
h(x)= 1 / ( 1  +  e^(-ϴ[Transpose] x)  )  
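
A quick sketch in Octave (anonymous functions at the prompt; the names sigmoid and h are my own):

sigmoid = @(z) 1 ./ (1 + exp(-z));     % g(z) = 1 / (1 + e^(-z)), works elementwise
h = @(theta, x) sigmoid(theta' * x);   % h(x) = g(ϴ[Transpose] x), always in (0, 1)

sigmoid(0)      % 0.5
sigmoid(10)     % ~1
sigmoid(-10)    % ~0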



Interpretation of Hypothesis output:
h(x) = estimated probability that y=1 on input x

Example:
if x = [ x0; x1]  = [1; tumor size]

if h(x) = 0.7 for this input,
it means the model estimates a 70% chance that y = 1, i.e. the tumor is malignant (and therefore a 30% chance that y = 0).

"probability that y=1, given x, parameterized by ϴ"
h(x) = P(y=1 | x;ϴ)
P(y=1 | x;ϴ) + P(y=0 | x;ϴ) = 1

Decision Boundary:
Suppose
predict "y=1" if h(x) >= 0.5
predict "y=0" if h(x) < 0.5

i.e:
When ϴ[Transpose]x >= 0,
We have h(x) = g(ϴ[Transpose]x) >= 0.5

When ϴ[Transpose]x < 0,
We have h(x) = g(ϴ[Transpose]x) < 0.5


Decision Boundary:
h(x) = g( ϴ0 + ϴ1x1 + ϴ2x2)

that is, we predict y = 1, i.e. h(x) >= 0.5,
exactly when ϴ0 + ϴ1x1 + ϴ2x2 >= 0

e.g:
ϴ  = [-3;1;1]
ϴ[Transpose]x  = -3 + x1 + x2 >= 0
x1 + x2 >= 3
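
Quick check of this example boundary in Octave (the x1, x2 values are picked arbitrarily):

sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-3; 1; 1];

sigmoid(theta' * [1; 2; 2])   % x1 + x2 = 4 >= 3  ->  h ≈ 0.73 >= 0.5, predict y = 1
sigmoid(theta' * [1; 1; 1])   % x1 + x2 = 2 <  3  ->  h ≈ 0.27 <  0.5, predict y = 0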





The parameter vector ϴ defines the boundary.
The question is: how do we find ϴ?

----------------------
Cost function:

Ok, for the training set, each input x comes with exactly one known label y.
This is already known.

Feeding x into h(x) gives the predicted y'.
What we want is for the gap between y' and y to be as small as possible.

Since this is a classification problem, the y's domain is {0, 1}



For linear regression, the cost function was defined as:
Cost( h(x), y ) = 1/2 * ( h(x) - y )^2
(Plugging the sigmoid h(x) into this squared cost makes J(ϴ) non-convex, so logistic regression uses a different cost.)

Ok, let's get back to logistic regression:
The cost function is:

Cost( h(x), y ) = -log( h(x) )        if y = 1
Cost( h(x), y ) = -log( 1 - h(x) )    if y = 0

(the lecture plots these as -log(z) and -log(1 - z), where z = h(x))

So, if we predict correctly, i.e. y = 1 and h(x) = 1 (predicted to be True),
    the cost is 0. This is GOOD.

But! if we predict incorrectly, i.e. y = 1 and h(x) = 0,
    the cost becomes infinite!!! This is not a situation we want!
 
This captures the intuition that if h(x) = 0 (predicting P(y=1 | x; ϴ) = 0)
    but y = 1, we penalize the learning algorithm with a very large cost.
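
A small numerical check of the y = 1 branch of the cost, -log( h(x) ), in Octave:

h = [0.99 0.5 0.1 0.001];
-log(h)        % 0.01  0.69  2.30  6.91; the cost blows up as h(x) -> 0 while y = 1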



------------------
Retrospect:
1. First, find the linear regression function.

2. Find a wrapper function that converts the linear regression output so that 0.5 is the dividing point:
        >= 0.5 means class 1, < 0.5 means class 0.

3. (1.) is h(x), (2.) is g(h(x)).

4. Treat the ϴs as the parameters and find ϴ.
    When (1.) >= 0, (2.) >= 0.5, i.e. classify as y = 1.
    When (1.) < 0, (2.) < 0.5, i.e. classify as y = 0.
 
5. Then find the cost function. The cost function captures the gap between the predicted value and the actual value.
 The closer the prediction is to the actual value, the lower the cost.
 The further the prediction deviates from the actual value, the higher the cost, even infinitely high.

6. Finally, find J(ϴ), i.e. the average of the per-example costs over the training set.

------------------
Simplified cost function and gradient descent:

Let's look at J(ϴ):

Try to compress the y=1 and y=0 cost functions into one expression.
easy:
Cost( h(x), y ) = -y * log( h(x) ) - (1-y) * log( 1 - h(x) )

Thus, J(ϴ) becomes (this is _the_ standard cost function for this case):
J(ϴ) = -(1/m) * Σ[i=1..m] [ y(i) * log( h(x(i)) ) + (1 - y(i)) * log( 1 - h(x(i)) ) ]
( it follows from the principle of maximum likelihood estimation:
https://en.wikipedia.org/wiki/Maximum_likelihood )



What we need to do is to minimize  J(ϴ).

Using gradient descent:
The gradient descent update is exactly the same as before, except that
h(x) is now g(ϴ[Transpose] x) instead of ϴ[Transpose] x:

Repeat {
    ϴj := ϴj - alpha * (1/m) * Σ[i=1..m] ( h(x(i)) - y(i) ) * xj(i)
}   (simultaneously update all ϴj)

The number of iterations means how many times we perform the update
ϴ := ϴ - alpha * J(ϴ)'
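
A minimal batch gradient descent sketch in Octave (it assumes X is the m x (n+1) design matrix with a leading column of ones, y is the m x 1 label vector, and alpha / num_iters are chosen by hand):

sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = zeros(size(X, 2), 1);
m = length(y);
for iter = 1:num_iters
    h = sigmoid(X * theta);                          % predictions for all m examples
    theta = theta - alpha * (1/m) * X' * (h - y);    % simultaneous update of every ϴj
end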

-------------------------------
Advanced Optimization:

Optimization algorithm:
  1. Gradient descent
  2. Conjugate gradient
  3. BFGS
  4. L-BFGS
Traits of (2)-(4):
  • don't need to pick the learning rate alpha
  • often faster than gradient descent
  • but they're more complex

Well, if we have time, we could delve into those algorithms; it's fun.
But usually we just use a library to do it, man, we are software engineers,
not math majors :-|


In octave:



octave:
PS1('>> ')
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

help fminunc;

-------------------
theta ∈ R^d (d-dimensional),  d >= 2

For logistic regression:

theta = [ϴ0, ϴ1, ϴ2, ϴ3, ..., ϴn]

function [jVal, gradient] = costFunction(theta)
    jVal = [code to compute J(ϴ)];
    gradient(1) = [code to compute ∂J(ϴ)/∂ϴ0];
    gradient(2) = [code to compute ∂J(ϴ)/∂ϴ1];
    ...
    gradient(n+1) = [code to compute ∂J(ϴ)/∂ϴn];
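
A concrete sketch of that skeleton for logistic regression (my own version; it takes the training data X (m x (n+1), first column all ones) and y (m x 1) as extra arguments, so wrap it in an anonymous function when handing it to fminunc):

function [jVal, gradient] = costFunction(theta, X, y)
    m = length(y);
    h = 1 ./ (1 + exp(-X * theta));                           % h(x) for every example
    jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));    % J(ϴ)
    gradient = (1/m) * X' * (h - y);                          % ∂J/∂ϴj for all j at once
end

[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), zeros(size(X, 2), 1), options);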

---------------------------
Multiclass Classification:
i.e y=1, y=2, y=3, y=4 instead of y=0 and y=1

One-vs-All (one-vs-rest):
Idea:
  1. Pick one class as the positive class, and treat all the rest as negative; train a classifier h_i(x).
  2. Pick another class and repeat step (1.) until every class has its own classifier.
  3. To predict on a new x, pick the class i that maximizes h_i(x). (See the sketch below.)



Suppose we have a multiclass classification problem with k classes ( y ∈ {1, 2, ..., k} );
we end up training k different logistic regression classifiers, one per class.
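
A minimal prediction sketch in Octave, assuming all_theta is a k x (n+1) matrix whose i-th row holds the ϴ of the classifier trained with class i as the positive class:

sigmoid = @(z) 1 ./ (1 + exp(-z));
probs = sigmoid(all_theta * x);        % x is (n+1) x 1 with x0 = 1; one h_i(x) per class
[maxProb, prediction] = max(probs);    % predict the class i whose h_i(x) is largest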
