Tuesday, 17 May 2016

Regression ( Part - 1 )


In machine learning there are two prominent categories of algorithms or techniques:
Regression : predict a real-valued output
Classification : predict a discrete-valued output

What I am gonna discuss in this post is the idea of regression and the various ways in which we can picture what is actually going on. This will be a long post ( relatively :-D )



Regression is a technique that investigates the relationship between two or more variables. We try to find a general pattern and use it to predict values. Regression is a broad concept, but to keep it simple let's first discuss Linear
Regression, which is also known as regression in one variable.


Linear Regression :


Let's say you have an input which, when fed into a function, gives you a certain output. With that function you can predict the output for any given input.

Mathematically it can be defined as follows :

output=F(input)

Our task in regression is to predict the output accurately. For that we first need an approximation of the function F( ), so that when we give that function an input it can predict the desired value for us. One thing that must be noted is that the more accurately our function is approximated, the more accurate our predictions will be.
I have used the word approximated because in real-world data no variable can be strictly said to follow a specific pattern; there can be noise, error and many other ambiguities.

Okay, let's first understand the data we will be dealing with.

Format of data : 

  • a variable X (Input) 
  • a variable y (Output) 
We have various instances of X and their corresponding y's. Our aim is to find a function that best describes this pattern and can also help us predict values for new X's.
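
To make this concrete, here is a minimal sketch in Python (assuming NumPy is available) of what such a dataset might look like. The numbers are made up purely for illustration:

import numpy as np

# hypothetical example: y roughly follows 2 + 3*x with a little noise
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])     # inputs
y = np.array([5.1, 7.9, 11.2, 13.8, 17.1])  # corresponding outputs
n = len(X)                                  # number of data points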


Notation : 

data points : the (x,y) instances in our data
n - number of data points
Let's say we have n data points : (x1,y1), (x2,y2), (x3,y3), ... , (xn,yn)
h(x) - predictor (hypothesis) function

F' - the value output by the predictor h(x) when x is fed into it as input


Now the question of interest is how we can approximate the function.


( You will find this pattern common to most of the algorithms, so better memorise it )
Let's divide our work into tasks for better understanding :
Task 1 : Assume a hypothesis function which we want to approximate.
Task 2 : Find out how well this hypothesis function performs.
Task 3 : Update the approximated function appropriately and iterate till either the dataset is finished or the function converges.


Task 1 : 


Under this section we assume our hypothesis function.
Let's say,
h(x) = theta0 + theta1*x

This is the equation of a straight line, which is another reason it is called linear regression.
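
As a sketch, this hypothesis could be written in Python like so (theta0 and theta1 are the parameters we still have to learn):

def h(x, theta0, theta1):
    # straight line with intercept theta0 and slope theta1
    return theta0 + theta1 * x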

Task 2 : 


Under this section we check how well the hypothesis function performs : 

This can be done by taking the values of X and checking the error between the corresponding outputs that we know and the ones we have predicted.
We use the mean squared error (scaled by 1/2) to check the performance of our predictor function.
Error = J(theta0,theta1) = ((h(x1)-y1)^2 + (h(x2)-y2)^2 + (h(x3)-y3)^2 + ....... + (h(xn)-yn)^2)/(2*n)
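
A minimal Python version of this error (cost) function, assuming the NumPy arrays X and y and the h defined earlier, could be:

def cost(theta0, theta1, X, y):
    # J(theta0, theta1): sum of squared errors divided by 2*n
    n = len(X)
    errors = h(X, theta0, theta1) - y   # h(xi) - yi for every data point
    return np.sum(errors ** 2) / (2 * n)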


Task 3 : 


What we need to do is minimize this error function.
To minimize it, we first differentiate the function after substituting the actual form of h(x), and use the derivatives to update the values of the parameters appropriately.

grad(theta0) = dJ(theta0,theta1)/d(theta0) = (2*(theta0+theta1*(x1)-y1) + 2*(theta0+theta1*(x2)-y2) + 2*(theta0+theta1*(x3)-y3) + ....... + 2*(theta0+theta1*(xn)-yn))/(2*n)

grad(theta1) = dJ(theta0,theta1)/d(theta1) = (2*(theta0+theta1*(x1)-y1)*x1 + 2*(theta0+theta1*(x2)-y2)*x2 + 2*(theta0+theta1*(x3)-y3)*x3 + ....... + 2*(theta0+theta1*(xn)-yn)*xn)/(2*n)
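
In code the two gradients simplify nicely (the 2's cancel). A sketch, again assuming the NumPy arrays and the h defined above:

def gradients(theta0, theta1, X, y):
    # partial derivatives of J with respect to theta0 and theta1
    n = len(X)
    errors = h(X, theta0, theta1) - y
    grad_theta0 = np.sum(errors) / n      # dJ/d(theta0)
    grad_theta1 = np.sum(errors * X) / n  # dJ/d(theta1)
    return grad_theta0, grad_theta1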

We use these as the updates to our parameters theta0 and theta1, scaling each step with the help of alpha, known as the step size.

So our update function becomes


Repeat till convergence {

(theta0)new = (theta0)old - alpha*(grad(theta0))

(theta1)new = (theta1)old - alpha*(grad(theta1))

}

Here alpha is the learning rate. It controls how big a step we take in each update.



Dilemma while selecting alpha, the learning rate:
  • if we choose a small learning rate, convergence is slow
  • if we choose a large alpha, it overshoots and may fail to converge
So we need to select our learning rate alpha carefully.

This is also known as the gradient descent algorithm.
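
Putting the pieces together, a minimal gradient descent loop in Python (using the hypothetical h and gradients helpers above; a fixed number of iterations stands in for a proper convergence check) might look like:

def gradient_descent(X, y, alpha=0.01, iterations=1000):
    theta0, theta1 = 0.0, 0.0              # arbitrary starting guess
    for _ in range(iterations):
        g0, g1 = gradients(theta0, theta1, X, y)
        theta0 = theta0 - alpha * g0       # simultaneous update of both parameters
        theta1 = theta1 - alpha * g1
    return theta0, theta1

# after fitting, a prediction for a new input is just h(x_new, theta0, theta1)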


After convergence we are left with the most appropriate parameters for the hypothesis function, which we can use to predict the outcome.
I will put up the implementation of this technique in my next post ( Languages - R and Python ).




