01. Linear Regression
Definition of Regression
Regression is a statistical method used in finance, investing, and other disciplines that attempts to determine the strength and character of the relationship between one dependent variable (usually denoted by Y) and a series of other variables (known as independent variables).
Understanding Regression
Regression captures the correlation between variables observed in a data set and assesses whether those correlations are statistically significant.
The two basic types of regression are simple linear regression and multiple linear regression, although there are non-linear regression methods for more complicated data and analysis. Simple linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple linear regression uses two or more independent variables to predict the outcome (while holding all others constant).
Regression is useful to finance and investment professionals, as well as to professionals in other fields. For example, it can help predict a company's sales based on weather, previous sales, GDP growth, or other conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering the cost of capital.
Linear Regression
Linear regression establishes the linear relationship between two variables based on a line of best fit.
Line of Best Fit
Line of best fit refers to a line through a scatter plot of data points that best expresses the relationship between those points. Statisticians typically use the least squares method (sometimes known as ordinary least squares, or OLS) to arrive at the geometric equation for the line, either through manual calculations or by using software.
Key Takeaways
- A line of best fit is a straight line that minimizes the distance between it and some data.
- The line of best fit is used to express a relationship in a scatter plot of different data points.
- It is an output of regression analysis and can be used as a prediction tool for indicators and price movements.
- In finance, the line of best fit is used to identify trends or correlations in market returns between assets or over time.
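The least squares fit described above can be sketched in a few lines. This is a minimal example with made-up data (the x and y values below are hypothetical, chosen to lie roughly on y = 2x + 1), using NumPy's `polyfit`:

```python
import numpy as np

# Hypothetical data: five (x, y) points that roughly follow y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# np.polyfit with degree 1 performs an ordinary least squares fit
# and returns the slope and intercept of the line of best fit.
slope, intercept = np.polyfit(x, y, 1)

print(round(slope, 2), round(intercept, 2))  # -> 1.95 1.15
```

The fitted line (here y ≈ 1.95x + 1.15) can then be used to predict y for new values of x.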
The hypotheses (assumptions) of linear regression
- There must be a linear relationship between the independent variable and the dependent variable
  - Evaluate using a scatter plot or lowess chart
  - If the linearity assumption is not fulfilled, other methods can be used to model the relationship:
    - Spline: piecewise polynomial
    - Quadratic: include a squared term of the predictor
    - Cubic: include squared and cubed terms of the predictor
    - Restricted cubic: constrained to be linear in the two tails
    - Restricted cubic spline: a way of testing the hypothesis that the relationship is not linear, or of summarizing a relationship that is too non-linear to be usefully summarized by a straight line
    - etc.
- Independent observations
  - Evaluate using the Durbin-Watson test
  - Normally, this should be ensured at the design stage
  - For dependent observations, a GEE model or a multilevel model can be used
- There are no significant outliers
  - Evaluate using a boxplot or violin plot
- Equal variance (homoscedasticity)
  - Evaluate using a residual-versus-fitted plot
- Regression residuals are approximately normally distributed
  - Evaluate using a histogram, Q-Q plot, skewness, kurtosis, Shapiro-Wilk test, or Shapiro-Francia test
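Two of the numerical checks above can be sketched in code (the graphical checks are omitted). This is a minimal example with simulated data, using NumPy and SciPy; the sample size, seed, and noise level are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated data from a true linear model y = 2x + 1 + normal noise.
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 1, 100)

slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept
residuals = y - fitted

# Normality of residuals: Shapiro-Wilk test (null hypothesis = normality).
w_stat, p_value = stats.shapiro(residuals)
print("Shapiro-Wilk p-value:", p_value)  # a large p gives no evidence against normality

# Independence: Durbin-Watson statistic; values near 2 suggest no autocorrelation.
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)
print("Durbin-Watson:", dw)
```

With data simulated this way, the residuals should pass the normality test and the Durbin-Watson statistic should land near 2, since the noise really is independent and normal.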
Basic definitions
Residual: The residual for each observation is the difference between the observed value of y (the dependent variable) and the value of y predicted by the model.
Intercept:
In mathematics, an intercept is the point at which a line or curve crosses an axis. The y-intercept is the y-coordinate of the point where a straight line or curve intersects the y-axis. It appears when we write the equation of a line, y = mx + c, where m is the slope and c is the y-intercept.
More info
There are two kinds of intercepts: the x-intercept, where the line crosses the x-axis, and the y-intercept, where the line crosses the y-axis.
Sum of squares(SS):
Sum of squares (SS) is a statistical tool that is used to identify the dispersion of data as well as how well the data can fit the model in regression analysis. The sum of squares got its name because it is calculated by finding the sum of the squared differences.
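The standard decomposition is total SS = regression SS + residual SS. A minimal sketch with the same hypothetical toy data as above:

```python
import numpy as np

# Hypothetical data and an OLS line of best fit.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)           # total dispersion around the mean
ss_regression = np.sum((y_hat - y.mean()) ** 2)  # dispersion explained by the model
ss_residual = np.sum((y - y_hat) ** 2)           # unexplained dispersion

# For OLS, the decomposition SS_total = SS_regression + SS_residual holds.
print(ss_total, ss_regression + ss_residual)
```

The ratio SS_regression / SS_total is the familiar R², the proportion of variance explained by the model.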
Mean Squares (MS)
Mean squares (MS) are the sums of squares divided by their respective degrees of freedom, for both the regression and the residuals:
- Regression MS = Regression SS / Regression df
- Residual MS = Residual SS / Residual df
ŷ: an estimator or an estimated value.
ȳ: the sample mean of a random variable.
Degrees of freedom: Degrees of freedom are the number of independent values in a statistical analysis that are free to vary.
F: The F statistic is used to test the hypothesis that the slope of the independent variable is zero. Mathematically, it can be calculated as F = Regression MS / Residual MS.
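Putting the MS and F definitions together, this is a minimal sketch of the ANOVA quantities for a simple linear regression on the same toy data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])
n = len(x)

slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

ss_regression = np.sum((y_hat - y.mean()) ** 2)
ss_residual = np.sum((y - y_hat) ** 2)

df_regression = 1        # one predictor in simple linear regression
df_residual = n - 2      # n observations minus two estimated coefficients

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

f_stat = ms_regression / ms_residual
print(round(f_stat, 1))  # -> 1521.0
```

A large F relative to the F distribution with (1, n − 2) degrees of freedom is evidence against the hypothesis that the slope is zero.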
Calculation
Linear regression models often use a least-squares approach to determine the line of best fit. The least-squares technique minimizes the sum of squared errors, where each squared error is the square of the vertical distance between a data point and the regression line.
Once this process has been completed (usually done today with software), a regression model is constructed. The general form of each type of regression model is:
Simple linear regression:
Y = β₀ + β₁X + ε
Multiple linear regression:
Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε
where:
- Y = the dependent variable you are trying to predict or explain
- X, X₁, …, Xₚ = the explanatory (independent) variables
- β₀ = the intercept
- β₁, …, βₚ = the slope coefficients
- ε = the regression residual or error term
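Both model forms can be fitted with NumPy alone. This is a minimal sketch on simulated data (the true coefficients, sample size, seed, and noise level are arbitrary choices for illustration); the multiple-regression fit solves the least-squares problem directly via a design matrix:

```python
import numpy as np

rng = np.random.default_rng(42)

# --- Simple linear regression: Y = b0 + b1*X + error ---
x = rng.uniform(0, 10, 50)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, 50)
b1, b0 = np.polyfit(x, y, 1)           # returns slope, then intercept
print("simple:", round(b0, 2), round(b1, 2))

# --- Multiple linear regression: Y = b0 + b1*X1 + b2*X2 + error ---
x1 = rng.uniform(0, 10, 50)
x2 = rng.uniform(0, 10, 50)
y2 = 1.0 + 2.0 * x1 - 3.0 * x2 + rng.normal(0, 0.5, 50)

# Design matrix with a leading column of ones for the intercept term.
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y2, rcond=None)  # [b0, b1, b2]
print("multiple:", np.round(coef, 2))
```

The recovered coefficients should be close to the true values used in the simulation (1 and 2 for the simple model; 1, 2, and −3 for the multiple model), with small deviations due to the added noise.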