Jan 17, 2016

[ML] Classification

Classification:

Binary classification:
Y ∈ {0, 1}
    0: Negative Class
    1: Positive Class
 
Multiclass classification:
Y ∈ {0, 1, 2, 3, ...}

f(x) ∈ {0, 1}


Linear regression method:
Use h(x) = ϴ[Transpose] x

Threshold the classifier output h(x) at 0.5:

if h(x) >= 0.5, predict "y=1"
else, predict "y=0"

However, if the training set includes examples with a much larger range of x values, the fitted line shifts and the 0.5 threshold no longer separates the classes.

*Using linear regression is not a good idea for a classification problem.
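
For example (a minimal Octave sketch with made-up tumor-size data, just to illustrate the point):

x = [1 2 3 4 5]';                          % tumor sizes
y = [0 0 1 1 1]';                          % 0 = benign, 1 = malignant
theta = [ones(size(x)) x] \ y;             % least-squares linear fit
([ones(size(x)) x] * theta) >= 0.5         % thresholding classifies all 5 examples correctly

x2 = [x; 50];  y2 = [y; 1];                % add one very large malignant tumor
theta2 = [ones(size(x2)) x2] \ y2;         % the outlier flattens and shifts the line
([ones(size(x2)) x2] * theta2) >= 0.5      % now every example is predicted 1,
                                           % so the two benign tumors are misclassified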


Thus, we use:
Logistic Regression: despite the name, it's a classification algorithm
0 <= h(x) <= 1

Logistic Regression Model:
Want 0 <= h(x) <= 1

h(x) = g(ϴ[Transpose] x)   ------------ (1)

Sigmoid function / Logistic function:
g(z) = 1 / (1 + e^(-z) )  ------------ (2)

Thus with (1) and (2)
=>
h(x)= 1 / ( 1  +  e^(-ϴ[Transpose] x)  )  
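
A quick sketch in Octave (anonymous functions at the prompt; the names sigmoid and h are my own):

sigmoid = @(z) 1 ./ (1 + exp(-z));     % g(z) = 1 / (1 + e^(-z)), works elementwise
h = @(theta, x) sigmoid(theta' * x);   % h(x) = g(ϴ[Transpose] x), always in (0, 1)

sigmoid(0)      % 0.5
sigmoid(10)     % ~1
sigmoid(-10)    % ~0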



Interpretation of Hypothesis output:
h(x) = estimated probability that y=1 on input x

Example:
if x = [ x0; x1]  = [1; tumor size]

if h(x) = 0.7 for this input,
it means the model estimates a 70% chance that y = 1, i.e. the tumor is malignant (and therefore a 30% chance that y = 0).

"probability that y=1, given x, parameterized by ϴ"
h(x) = P(y=1 | x;ϴ)
P(y=1 | x;ϴ) + P(y=0 | x;ϴ) = 1

Decision Boundary:
Suppose
predict "y=1" if h(x) >= 0.5
predict "y=0" if h(x) < 0.5

i.e:
When ϴ[Transpose]x >= 0,
We have h(x) = g(ϴ[Transpose]x) >= 0.5

When ϴ[Transpose]x < 0,
We have h(x) = g(ϴ[Transpose]x) < 0.5


Decision Boundary:
h(x) = g( ϴ0 + ϴ1x1 + ϴ2x2)

that is, we predict y = 1, i.e. h(x) >= 0.5,
exactly when ϴ0 + ϴ1x1 + ϴ2x2 >= 0

e.g:
ϴ  = [-3;1;1]
ϴ[Transpose]x  = -3 + x1 + x2 >= 0
x1 + x2 >= 3
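
Quick check of this example boundary in Octave (the x1, x2 values are picked arbitrarily):

sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = [-3; 1; 1];

sigmoid(theta' * [1; 2; 2])   % x1 + x2 = 4 >= 3  ->  h ≈ 0.73 >= 0.5, predict y = 1
sigmoid(theta' * [1; 1; 1])   % x1 + x2 = 2 <  3  ->  h ≈ 0.27 <  0.5, predict y = 0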





The parameter vector ϴ defines the boundary.
The question is: how do we find ϴ?

----------------------
Cost function:

Ok, for the training set, each input x comes with exactly one known label y.
This is already known.

Feeding x into h(x) gives the predicted y'.
What we want is for the gap between y' and y to be as small as possible.

Since this is a classification problem, the y's domain is {0, 1}



For linear regression, the cost function was defined as:
Cost( h(x), y ) = 1/2 * ( h(x) - y )^2
(Plugging the sigmoid h(x) into this squared cost makes J(ϴ) non-convex, so logistic regression uses a different cost.)

Ok, let's get back to logistic regression:
The cost function is:

Cost( h(x), y ) = -log( h(x) )        if y = 1
Cost( h(x), y ) = -log( 1 - h(x) )    if y = 0

(the lecture plots these as -log(z) and -log(1 - z), where z = h(x))

So, if we predict correctly, i.e. y = 1 and h(x) = 1 (predicted to be True),
    the cost is 0. This is GOOD.

But! if we predict incorrectly, i.e. y = 1 and h(x) = 0,
    the cost becomes infinite!!! This is not a situation we want!
 
This captures the intuition that if h(x) = 0 (predicting P(y=1 | x; ϴ) = 0)
    but y = 1, we penalize the learning algorithm with a very large cost.
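
A small numerical check of the y = 1 branch of the cost, -log( h(x) ), in Octave:

h = [0.99 0.5 0.1 0.001];
-log(h)        % 0.01  0.69  2.30  6.91; the cost blows up as h(x) -> 0 while y = 1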



------------------
Retrospect:
1. First, find the linear regression function.

2. Find a wrapper function that converts the linear regression output so that 0.5 is the dividing point:
        >= 0.5 means class 1, < 0.5 means class 0.

3. (1.) is h(x), (2.) is g(h(x)).

4. Treat the ϴs as the parameters and find ϴ.
    When (1.) >= 0, (2.) >= 0.5, i.e. classify as y = 1.
    When (1.) < 0, (2.) < 0.5, i.e. classify as y = 0.
 
5. Then find the cost function. The cost function captures the gap between the predicted value and the actual value.
 The closer the prediction is to the actual value, the lower the cost.
 The further the prediction deviates from the actual value, the higher the cost, even infinitely high.

6. Finally, find J(ϴ), i.e. the average of the per-example costs over the training set.

------------------
Simplified cost function and gradient descent:

Let's look at J(ϴ):

Try to compress the y=1 and y=0 cost functions into one expression.
easy:
Cost( h(x), y ) = -y * log( h(x) ) - (1-y) * log( 1 - h(x) )

Thus, J(ϴ) becomes (this is _the_ standard cost function for this case):
J(ϴ) = -(1/m) * Σ[i=1..m] [ y(i) * log( h(x(i)) ) + (1 - y(i)) * log( 1 - h(x(i)) ) ]
( it follows from the principle of maximum likelihood estimation:
https://en.wikipedia.org/wiki/Maximum_likelihood )



What we need to do is to minimize  J(ϴ).

Using gradient descent:
The gradient descent update is exactly the same as before, except that
h(x) is now g(ϴ[Transpose] x) instead of ϴ[Transpose] x:

Repeat {
    ϴj := ϴj - alpha * (1/m) * Σ[i=1..m] ( h(x(i)) - y(i) ) * xj(i)
}   (simultaneously update all ϴj)

The number of iterations means how many times we perform the update
ϴ := ϴ - alpha * J(ϴ)'
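
A minimal batch gradient descent sketch in Octave (it assumes X is the m x (n+1) design matrix with a leading column of ones, y is the m x 1 label vector, and alpha / num_iters are chosen by hand):

sigmoid = @(z) 1 ./ (1 + exp(-z));
theta = zeros(size(X, 2), 1);
m = length(y);
for iter = 1:num_iters
    h = sigmoid(X * theta);                          % predictions for all m examples
    theta = theta - alpha * (1/m) * X' * (h - y);    % simultaneous update of every ϴj
end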

-------------------------------
Advanced Optimization:

Optimization algorithm:
  1. Gradient descent
  2. Conjugate gradient
  3. BFGS
  4. L-BFGS
Traits of (2)-(4):
  • don't need to pick the learning rate alpha
  • often faster than gradient descent
  • but they're more complex

Well, if we have time, we could delve into those algorithms; it's fun.
But usually we just use a library to do it, man, we are software engineers,
not math majors :-|


In octave:



octave:
PS1('>> ')
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);

[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);

help fminunc;

-------------------
theta ∈ R^d (d-dimensional),  d >= 2

For logistic regression:

theta = [ϴ0, ϴ1, ϴ2, ϴ3, ..., ϴn]

function [jVal, gradient] = costFunction(theta)
    jVal = [code to compute J(ϴ)];
    gradient(1) = [code to compute ∂J(ϴ)/∂ϴ0];
    gradient(2) = [code to compute ∂J(ϴ)/∂ϴ1];
    ...
    gradient(n+1) = [code to compute ∂J(ϴ)/∂ϴn];
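
A concrete sketch of that skeleton for logistic regression (my own version; it takes the training data X (m x (n+1), first column all ones) and y (m x 1) as extra arguments, so wrap it in an anonymous function when handing it to fminunc):

function [jVal, gradient] = costFunction(theta, X, y)
    m = length(y);
    h = 1 ./ (1 + exp(-X * theta));                           % h(x) for every example
    jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));    % J(ϴ)
    gradient = (1/m) * X' * (h - y);                          % ∂J/∂ϴj for all j at once
end

[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), zeros(size(X, 2), 1), options);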

---------------------------
Multiclass Classification:
i.e y=1, y=2, y=3, y=4 instead of y=0 and y=1

One-vs-All (one-vs-rest):
Idea:
  1. Pick one class as the positive class, and treat all the rest as negative; train a classifier h_i(x).
  2. Pick another class and repeat step (1.) until every class has its own classifier.
  3. To predict on a new x, pick the class i that maximizes h_i(x). (See the sketch below.)



Suppose we have a multiclass classification problem with k classes ( y ∈ {1, 2, ..., k} );
we end up training k different logistic regression classifiers, one per class.
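
A minimal prediction sketch in Octave, assuming all_theta is a k x (n+1) matrix whose i-th row holds the ϴ of the classifier trained with class i as the positive class:

sigmoid = @(z) 1 ./ (1 + exp(-z));
probs = sigmoid(all_theta * x);        % x is (n+1) x 1 with x0 = 1; one h_i(x) per class
[maxProb, prediction] = max(probs);    % predict the class i whose h_i(x) is largest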
