Jan 10, 2016

[ML] Normal equation

Normal equation:

Idea:

The design matrix X is laid out as follows:
N is the number of features, so each row has N + 1 entries, because we add x0 = 1 to multiply with ϴ0 (the intercept term).
M is the number of training samples.

        x0  x1  x2 ... xN
R1
R2
R3
.
.
.
RM

Thus, since each sample is a row of X, multiplying X by ϴ gives the predictions:
Xϴ = y
(each row Ri dotted with ϴ yields one y value).
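
To make that concrete, here's a tiny Octave sketch of one row times ϴ (the numbers are made up, N = 2 features plus x0):

# one sample row [x0 x1 x2] times ϴ gives one prediction
row   = [1, 3.5, 2.0];        # 1 x (N+1), with x0 = 1
theta = [0.5; 1.2; -0.3];     # (N+1) x 1
y_hat = row * theta           # 0.5 + 1.2*3.5 + (-0.3)*2.0 = 4.1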

In order to get ϴ, we do:

ϴ = (X[transpose]X)^-1 X[transpose] y
  1. X[transpose]X is a Feature Square Matrix of size [Features]*[Features]; find its inverse.
  2. That inverse times X[transpose] becomes a [Features]*[Training Set] matrix.
  3. y is a [Training Set]*1 matrix.
  4. Thus ϴ is a [Features]*1 matrix.
So the design matrix X will be:
[Training Set]*[Features]

The y vector will be:
[Training Set]*1

And the ϴ vector will be:
[Features]*1
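
A quick Octave sanity check of those sizes (the numbers here are random, just to show the dimensions: m = 4 samples, 3 columns including x0):

X = rand(4, 3);                 # [Training Set] x [Features]
y = rand(4, 1);                 # [Training Set] x 1
size(X' * X)                    # 3 x 3  -> Feature Square Matrix
size(pinv(X' * X) * X')         # 3 x 4  -> [Features] x [Training Set]
size(pinv(X' * X) * X' * y)     # 3 x 1  -> ϴ is [Features] x 1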




Octave for ϴ = (X[transpose]X)^-1 X[transpose] y:
# pinv(X'*X)*X'*y
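
And a slightly fuller sketch of how that one-liner is typically used (the data and variable names below are made up for illustration):

data_x = [2104; 1416; 1534; 852];      # one feature, m = 4 samples
y      = [460; 232; 315; 178];         # targets, [Training Set] x 1
m      = length(y);
X      = [ones(m, 1), data_x];         # prepend the x0 = 1 column -> m x (N+1)

theta = pinv(X' * X) * X' * y;         # normal equation, (N+1) x 1
pred  = X * theta;                     # predictions, should be close to y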


Feature scaling:
When using the normal equation method, it's OK not to do feature scaling.
Reason? Because we are not using gradient descent to iteratively search for
the minimum of J(ϴ); the normal equation solves for ϴ directly, so features
on very different scales don't slow convergence.
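
For reference, this is roughly what feature scaling (mean normalization) looks like; gradient descent benefits from it, the normal equation simply doesn't need it (made-up numbers again):

x        = [2104; 1416; 1534; 852];
mu       = mean(x);
sigma    = std(x);
x_scaled = (x - mu) / sigma;           # roughly zero mean, unit spread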



So, which to choose? Gradient descent or Normal Equation?
    m as the number of training examples, n as the number of features.

Gradient descent:
Disadvantages:
Need to choose the learning rate α.
Needs many iterations. (By iterations we mean repeatedly running over the same 'm'
 examples while tuning ϴ; see the gradient descent sketch after this comparison.)
Advantage:
Works well even when n is large.

Normal equation:
Advantage:
No need to choose learning rate α.
Don't need to iterate.
Disadvantage:
When n is large, computing (X[transpose]X)^-1 is expensive:
    it costs roughly O(n^3), which is SLOW.
    When n >= 10000, consider using gradient descent instead.
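
For contrast, here is a minimal gradient descent sketch for the same linear-regression cost; the data, learning rate α, and iteration count are all made-up choices:

data_x = [2104; 1416; 1534; 852];
y      = [460; 232; 315; 178];
m      = length(y);
X      = [ones(m, 1), (data_x - mean(data_x)) / std(data_x)];   # scaled feature + x0

alpha     = 0.1;                       # learning rate, must be chosen
num_iters = 400;                       # many passes over the same m examples
theta     = zeros(size(X, 2), 1);      # [Features] x 1, start at zero

for iter = 1:num_iters
  # simultaneous update of every theta_j using all m examples
  theta = theta - (alpha / m) * (X' * (X * theta - y));
end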
 
