Binary Classification:
Y ∈ {0, 1}
0: Negative Class
1: Positive Class
Multiclass classification:
Y ∈ {0, 1, 2, 3, ...}
(For now, focus on the binary case: the prediction h(x) is one of {0, 1}.)
Linear regression method:
Use h(x) = ϴ[Transpose] x
Threshold the classifier output h(x) at 0.5:
if h(x) >= 0.5, predict "y=1"
else, predict "y=0"
However, once the training data covers a wider range of x, the fitted line shifts and the 0.5 threshold no longer separates the classes.
* Using linear regression is not a good idea for a classification problem.
Thus, we use:
Logistic Regression: It's a classification algorithm
0 <= h(x) <= 1
Logistic Regression Model:
Want 0 <= h(x) <= 1
h(x) = g(ϴ[Transpose] x) ------------ (1)
Sigmoid function / Logistic function:
g(z) = 1 / (1 + e^(-z) ) ------------ (2)
Thus with (1) and (2)
=>
h(x)= 1 / ( 1 + e^(-ϴ[Transpose] x) )
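A minimal Octave sketch of (1) and (2); the function names sigmoid and hypothesis are mine, not from the notes:
function g = sigmoid(z)
  % logistic function g(z) = 1 / (1 + e^(-z)), applied element-wise
  % g(0) = 0.5, g(z) -> 1 as z -> +inf, g(z) -> 0 as z -> -inf, so 0 <= h(x) <= 1 always holds
  g = 1 ./ (1 + exp(-z));
end
function h = hypothesis(theta, x)
  % h(x) = g(ϴ[Transpose] x), where x includes the bias term x0 = 1
  h = sigmoid(theta' * x);
end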
Interpretation of Hypothesis output:
h(x) = estimated probability that y=1 on input x
Example:
if x = [ x0; x1] = [1; tumor size]
and h(x) = 0.7 for this input,
it means the model estimates a 70% chance that y = 1, i.e. the example is positive.
"probability that y=1, given x, parameterized by ϴ"
h(x) = P(y=1 | x;ϴ)
P(y=1 | x;ϴ) + P(y=0 | x;ϴ) = 1
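e.g. continuing the tumor example: if P(y=1 | x;ϴ) = h(x) = 0.7, then P(y=0 | x;ϴ) = 1 - 0.7 = 0.3.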
Decision Boundary:
Suppose
predict "y=1" if h(x) >= 0.5
predict "y=0" if h(x) < 0.5
i.e:
When ϴ[Transpose] x >= 0,
We have h(x) = g(ϴ[Transpose] x) >= 0.5
When ϴ[Transpose] x < 0,
We have h(x) = g(ϴ[Transpose] x) < 0.5
(because g(0) = 0.5 and g(z) increases monotonically with z)
Decision Boundary:
h(x) = g( ϴ0 + ϴ1x1 + ϴ2x2)
that is, if we want
y =1, i.e h(x) >= 0.5
i.e ϴ0 + ϴ1x1 + ϴ2x2 >= 0
e.g:
ϴ = [-3;1;1]
ϴ[Transpose]x = -3 + x1 + x2 >= 0
x1 + x2 >= 3
The parameter ϴ is what defines the boundary.
How do we find ϴ, that is the question.
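A quick Octave check of the ϴ = [-3;1;1] example (the two sample points are made up):
theta = [-3; 1; 1];
x = [1; 2; 2];                 % x1 + x2 = 4 >= 3
1 / (1 + exp(-(theta' * x)))   % ~0.73 >= 0.5, so predict y = 1
x = [1; 1; 1];                 % x1 + x2 = 2 < 3
1 / (1 + exp(-(theta' * x)))   % ~0.27 < 0.5, so predict y = 0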
----------------------
Cost function:
For the training set, each input x comes with a known label y; this mapping is given.
Putting x into h(x) gives the predicted label y'.
What we want is for the gap between y' and y to be as small as possible.
Since this is a classification problem, y's domain is {0, 1}.
In linear regression the cost function was defined as:
Cost( h(x), y ) = 1/2 ( h(x) - y )^2
Back to logistic regression: with the sigmoid inside h(x), that squared cost makes J(ϴ) non-convex, so the cost is defined instead as:
Cost( h(x), y ) = -log( h(x) )       if y = 1
Cost( h(x), y ) = -log( 1 - h(x) )   if y = 0
So, if we predict correctly, i.e. y=1 and h(x) = 1 (predicted to be True),
Cost is 0. This is GOOD.
But! if we predict incorrectly, i.e. y=1 while h(x) = 0,
Cost becomes infinite!!! This is not what we want!
This captures intuition that if h(x) = 0 (predict P(y=1 | x; ϴ) = 0),
but y=1, we'll penalize learning algorithm by a very large cost.
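A small Octave illustration of that intuition for y = 1 (the h(x) values are made up):
-log(1)       % cost = 0     : h(x) = 1 and y = 1, perfectly right, no penalty
-log(0.5)     % cost ~ 0.69  : an unsure prediction gets a moderate penalty
-log(0.001)   % cost ~ 6.9   : h(x) near 0 while y = 1, confidently wrong, huge penalty
% as h(x) -> 0 with y = 1, -log(h(x)) -> infinity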
------------------
Retrospect:
1. First find the linear regression function.
2. Find a wrapper function that converts the linear regression function so that 0.5 is the midpoint:
   >= 0.5 maps to probability of class 1, < 0.5 maps to probability of class 0.
3. (1.) is h(x), (2.) is g(h(x)).
4. Treat the ϴs as the parameters to learn; find ϴ.
   When (1.) >= 0, (2.) >= 0.5, i.e. classify as y = 1.
   When (1.) < 0, (2.) < 0.5, i.e. classify as y = 0.
5. Find the Cost function. The Cost function measures the gap between the predicted value and the actual value:
   the closer the prediction is to the actual value, the lower the cost;
   the further the prediction is from the actual value, the higher the cost, up to infinity.
6. Then find J(ϴ), i.e. the average of all the per-example costs.
------------------
Simplified cost function and gradient descent:
Let's look at J(ϴ):
Try to compress the y=1 and y=0 cost branches into one function.
easy:
Cost( h(x), y ) = -y * log( h(x) ) - (1-y) * log( 1 - h(x) )
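Quick check that this one line really is the two-branch cost above:
y = 1: Cost = -1*log(h(x)) - (1-1)*log(1-h(x)) = -log(h(x))
y = 0: Cost = -0*log(h(x)) - (1-0)*log(1-h(x)) = -log(1-h(x))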
Thus J(ϴ) becomes (this is _the_ standard cost function for logistic regression,
following from the principle of maximum likelihood estimation:
https://en.wikipedia.org/wiki/Maximum_likelihood ):
J(ϴ) = -(1/m) * Σ_{i=1..m} [ y(i) * log( h(x(i)) ) + (1 - y(i)) * log( 1 - h(x(i)) ) ]
What we need to do is to minimize J(ϴ).
Using Gradient descent:
The gradient descent update rule is exactly the same as before, except that
h(x) is now the sigmoid g(ϴ[Transpose] x) instead of ϴ[Transpose] x itself.
As for iterations: one iteration means performing one update step of
ϴ := ϴ - alpha * J(ϴ)'
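A vectorized Octave sketch of that loop; X (an m x (n+1) design matrix with a leading column of ones), y (an m x 1 vector of {0,1} labels), alpha, and num_iters are assumed to exist, and sigmoid is as sketched earlier:
m = length(y);
theta = zeros(size(X, 2), 1);
for iter = 1:num_iters
  h = sigmoid(X * theta);           % m x 1 vector of predictions
  grad = (1/m) * (X' * (h - y));    % J(ϴ)' for every ϴj at once
  theta = theta - alpha * grad;     % ϴ := ϴ - alpha * J(ϴ)'
end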
-------------------------------
Advanced Optimization:
Optimization algorithm:
- Gradient descent
- Conjugate gradient
- BFGS
- L-BFGS
Advantages of the last three (conjugate gradient, BFGS, L-BFGS):
- no need to manually pick the learning rate alpha
- often faster than gradient descent
Disadvantage: they are more complex.
Well, if we have time, we could delve into those algorithms; it's fun.
But usually we just use a library for it, man, we are software engineers,
not mathematics majors :-|
In Octave:
PS1('>> ');                      % just sets the prompt
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2,1);
[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
help fminunc
-------------------
theta ∈ R^d (d-dimensional), d >= 2
For logistic regression:
theta = [ϴ0, ϴ1, ϴ2, ϴ3, ..., ϴn]
function [jVal, gradient] = costFunction(theta)
jVal = [code to compute J(ϴ)];
gradient(1) = [code to compute J(ϴ0)'];
gradient(2) = [code to compute J(ϴ1)'];
...
gradient(n+1) = [code to compute J(ϴn)'];
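As a concrete sketch for logistic regression (vectorized; it assumes X and y are passed in by calling fminunc with an anonymous function, and uses the sigmoid sketched earlier):
function [jVal, gradient] = costFunction(theta, X, y)
  m = length(y);
  h = sigmoid(X * theta);                                   % m x 1 predictions
  jVal = (1/m) * ( -y' * log(h) - (1 - y)' * log(1 - h) );  % J(ϴ)
  gradient = (1/m) * (X' * (h - y));                        % all the J(ϴj)' in one vector
end
With this version, the fminunc call above becomes:
[optTheta, functionVal, exitFlag] = fminunc(@(t) costFunction(t, X, y), initialTheta, options);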
---------------------------
Multiclass Classification:
i.e y=1, y=2, y=3, y=4 instead of y=0 and y=1
One-vs-All (one-vs-rest):
Idea:
- pick one class as the positive class, and treat all the others as the negative class; train a classifier h_i(x) for it.
- pick another class and repeat step (1.) until every class has its own classifier.
- to make a prediction, pick the class i whose h_i(x) is the largest (see the sketch below).
Suppose we have a multiclass classification problem with k classes (y ∈ {1, 2, ..., k});
we end up training k different logistic regression classifiers, one per class.
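A minimal Octave sketch of the prediction step; all_theta is an assumed k x (n+1) matrix whose i-th row is the trained ϴ for classifier i, and x is one example as an (n+1) x 1 column vector with x0 = 1:
probs = sigmoid(all_theta * x);      % k x 1: probs(i) = h_i(x), the estimated P(y = i | x)
[maxProb, prediction] = max(probs);  % predict the class whose classifier is most confident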