




Basic Assumptions In A Linear Regression Model



When we build a Supervised Machine Learning Regression Model, there are certain Basic Assumptions that come as part of the Model Building. Many beginners don't get a clear picture of what those are. But it is a Best Practice to know and understand that no model is perfect - everything works within certain limits and boundaries. This keeps us in control when deciding where and when a specific model should (or should not) be used. The pointers below cover the Basic Assumptions used in a Linear Regression Model -

  • The most Basic Assumption in a Linear Regression model comes from the term "Linear" itself. It is assumed that there is a linear and additive relationship between the Target/Dependent/Response/Output variable and the Feature/Independent/Input variable(s) (a minimal fitting sketch follows this list).
  • There should be no Multi-collinearity. What we mean by this is that the Feature variables used in the model should not be correlated with each other. Normally the Variance Inflation Factor (VIF) is the measure used to check Multi-collinearity (see the VIF sketch after this list). If the variables are correlated, it becomes extremely difficult for the model to determine the true effect of each Input variable on the Output variable.
  • Residual Error is the difference between the Actual and Predicted values. There should be no correlation between the Residual Errors - the error for one observation should tell us nothing about the error for the next (see the Durbin-Watson sketch after this list).
  • The Residual Errors should have a property called Homoskedasticity, which means the Residual Errors should have a constant variance. When you plot the Residual Errors against the Predicted Values, it should give a relatively shapeless graph (without any clear patterns), generally symmetrical around the zero line and without particularly large residuals. Heteroskedasticity is the opposite of that (the Residual Error vs Predicted Value graph shows a clear pattern, e.g. a funnel shape), which is not good. We don't want Heteroskedasticity (see the residual-plot sketch after this list).
  • The Residual Error terms must be normally distributed. We normally use a Q-Q plot to check Normality. If the plot is a straight line, it is a happy case. If the plot is a curved or distorted line, then the error terms are not Normally Distributed (see the Q-Q plot sketch after this list).
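
To make the "linear and additive" assumption concrete, here is a minimal sketch that fits an Ordinary Least Squares model with the statsmodels library. The data is synthetic and all variable names are hypothetical - the point is only to show the shape of the workflow:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical synthetic data: the Output variable y depends
# linearly (and additively) on the Input variable x, plus noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)

X = sm.add_constant(x)       # add the intercept column
model = sm.OLS(y, X).fit()   # fit the Linear Regression model
print(model.summary())       # coefficients, R-squared, diagnostics
```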
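
For the Multi-collinearity check, a minimal sketch using statsmodels' variance_inflation_factor. The DataFrame and column names are made up, and x3 is deliberately constructed to be correlated with x1 so the check has something to flag:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=200),
                   "x2": rng.normal(size=200)})
df["x3"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=200)  # correlated on purpose

X = add_constant(df)  # VIF is computed with the intercept included
for i, col in enumerate(X.columns):
    if col != "const":
        # Common rule of thumb: VIF above 5 (or 10) signals Multi-collinearity
        print(col, round(variance_inflation_factor(X.values, i), 2))
```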
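
To check that the Residual Errors are not correlated with each other, the Durbin-Watson statistic is one common test (the original text names no specific test, so this is an illustrative choice). The setup repeats the hypothetical data from the first sketch so this block runs on its own:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Same hypothetical setup as the first sketch
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Durbin-Watson: ~2 means no correlation between consecutive residuals;
# values toward 0 suggest positive, toward 4 negative autocorrelation.
print(durbin_watson(model.resid))
```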
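
For Homoskedasticity, a minimal sketch that draws the Residual vs Predicted plot described above and adds the Breusch-Pagan test - a standard heteroskedasticity test included here as an illustrative extra, since the original text only describes the visual check:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

# Same hypothetical setup as the first sketch
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Visual check: the cloud should be shapeless and symmetric around zero
plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted values")
plt.ylabel("Residual errors")
plt.show()

# Breusch-Pagan: a small p-value suggests Heteroskedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", lm_pvalue)
```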
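
Finally, the Normality check via a Q-Q plot, again as a self-contained sketch on the same hypothetical data:

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Same hypothetical setup as the first sketch
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, 200)
model = sm.OLS(y, sm.add_constant(x)).fit()

# Points hugging the 45-degree line => error terms look Normally Distributed
sm.qqplot(model.resid, line="45", fit=True)
plt.show()
```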
  Additional Read -