## Ordinary Least Squares (OLS)

### Introduction

Regression analysis is a statistical technique used to fit a model, expressed in terms of one or more variables, to some data. In particular, it allows one to analyze the relationship between a dependent variable (also referred to as the regressand) and one or more independent or predictor variables (also referred to as regressors), and to assess how influential each of these is.

There are many types of regression analysis techniques, but one of the most commonly used fits the data to a linear model, in particular using an approach called Least Squares. The Morpheus API currently supports three variations of Linear Least Squares regression, namely Ordinary Least Squares (OLS), Weighted Least Squares (WLS) and Generalized Least Squares (GLS). The following section reviews some OLS regression theory, and provides an example of how to use the Morpheus API to apply this technique.

### Theory

A linear regression model in matrix form can be expressed as:

$$ Y = X \beta + \epsilon \ \ \ \ where \ E[\epsilon]=0 \ \ \ \ Var[\epsilon] = \sigma^2 I $$

Y represents an `nx1` vector of regressands, and X represents an `nxp` design matrix, where `n` is the number of observations in the data and `p` represents the number of parameters to estimate. If the model includes an intercept term, the first column in the design matrix is populated with 1's, and therefore the first entry in the `px1` \(\beta\) vector represents the intercept value. The `nx1` \(\epsilon\) vector represents the error or disturbance term, which is assumed to have a conditional mean of 0 and to be free of any serial correlation (see the section below on the Gauss Markov assumptions).

A least squares regression model estimates \(\beta\) so as to minimize the sum of the squared error, or \(\epsilon^T\epsilon\). We can use the model equation above to express \(\epsilon^T\epsilon\) in terms of Y and X\(\beta\), differentiate this by \(\beta\), set the result to zero since we wish to minimize the squared error, and solve for \(\beta\) as follows:

$$ \begin{align} \epsilon^T\epsilon &= (Y - X \beta)^T(Y - X \beta) \\ &= Y^TY - \beta^TX^TY - Y^TX\beta + \beta^T X^T X \beta \\ &= Y^TY - 2\beta^T X^T Y + \beta^T X^T X \beta \end{align} $$

We now differentiate this expression with respect to \(\beta\) and set it to zero which yields the following:

$$ \frac{d\epsilon^T\epsilon}{d\beta} = -2X^T Y + 2X^T X \beta = 0 $$

This expression can be re-arranged to solve for \(\beta\) and yields the following equation:

$$ \beta = (X^T X)^{-1} X^T Y $$

The value of \(\beta\) can only be **estimated** given some sample data drawn from a population or data generating process, the *true* value is
unknown. We usually refer to the estimate as \(\hat{\beta}\) (beta "hat") in order to distinguish it from the true population value. In addition,
when expressing the model in terms of \(\hat{\beta}\), the stochastic term is referred to as *residuals* rather than *errors*, and while they
are conceptually related, they are not the same thing. Errors represent the deviation of
observations from the unknown population line, while residuals represent the deviations from the estimation line, a subtle difference and easily
confused. In terms of \(\hat{\beta}\) and residuals **u**, the model is expressed in the usual way.

$$ Y = X \hat{\beta} + u \ \ \ \ where \ E[u]=0 \ \ \ \ Var[u] = \sigma^2 I $$

The residuals are of course still assumed to be consistent with the Gauss Markov assumptions, and so the estimator for \(\hat{\beta}\) is:

$$ \hat{\beta} = (X^T X)^{-1} X^T Y $$
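To make the normal equations concrete, the sketch below (plain Java, no Morpheus dependency, made-up data) computes \(\hat{\beta}\) for a single-regressor model with an intercept, where \(X^T X\) is just a 2x2 matrix that can be inverted in closed form.

```java
// A minimal, dependency-free sketch of beta-hat = (X^T X)^(-1) X^T Y for a
// simple regression with an intercept. The data values are made up purely
// for illustration.
public class NormalEquations {

    /** Returns {intercept, slope} for the model y = b0 + b1 * x. */
    static double[] olsFit(double[] x, double[] y) {
        final int n = x.length;
        // Accumulate the entries of X^T X and X^T Y, where the design
        // matrix X has a column of 1's followed by the x values.
        double sx = 0, sxx = 0, sy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sxx += x[i] * x[i];
            sy += y[i]; sxy += x[i] * y[i];
        }
        // X^T X = [[n, sx], [sx, sxx]]; invert the 2x2 and multiply by X^T Y
        final double det = n * sxx - sx * sx;
        final double b0 = (sxx * sy - sx * sxy) / det;
        final double b1 = (n * sxy - sx * sy) / det;
        return new double[] { b0, b1 };
    }

    public static void main(String[] args) {
        // Points generated exactly from y = 2 + 3x, so the fit should recover them
        final double[] x = { 0, 1, 2, 3, 4 };
        final double[] y = { 2, 5, 8, 11, 14 };
        final double[] beta = olsFit(x, y);
        System.out.println("intercept=" + beta[0] + " slope=" + beta[1]);
    }
}
```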

### Solving

We can solve for \(\hat{\beta}\) directly by calculating the right-hand side of the above equation, which is what the Morpheus library does by default. In some situations this can present numerical stability issues, in which case it may be preferable to solve for \(\hat{\beta}\) by factorizing the design matrix X using a QR decomposition, where Q represents an orthogonal matrix such that \(Q^T Q = I\) (and thus \(Q^T = Q^{-1}\)), and R is an upper triangular matrix. The QR decomposition approach is based on the following reasoning.

$$ \begin{align} X^T Y & = X^T X \hat{\beta} \\ \big( QR \big)^T Y & = \big( QR \big)^T \big( QR \big) \hat{\beta} \\ R^T Q^T Y & = R^T Q^T Q R \hat{\beta} \\ R^T Q^T Y & = R^T R \hat{\beta} \\ (R^T)^{-1} R^T Q^T Y & = \big( R^T \big)^{-1} R^T R \hat{\beta} \\ Q^T Y & = R \hat{\beta} \end{align} $$

This can be solved efficiently by backward substitution because the R matrix is upper triangular, and therefore no inversion of the design matrix is required. The Morpheus library also supports estimating \(\hat{\beta}\) using this technique, which can be configured on a case by case basis via the `withSolver()` method on the `DataFrameLeastSquares` interface.
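As a hedged illustration of the reasoning above (plain Java, not the Morpheus implementation), the sketch below factorizes a two-column design matrix with classical Gram-Schmidt and then solves \(R \hat{\beta} = Q^T Y\) by backward substitution, with no matrix inversion.

```java
// A dependency-free sketch of the QR route for X = [1 | x]: factor X = QR
// with classical Gram-Schmidt (adequate for a well-conditioned two-column X),
// then solve R beta = Q^T y by backward substitution.
public class QrSolve {

    /** Solves the least squares problem for the design matrix [1 | x] via QR. */
    static double[] qrFit(double[] x, double[] y) {
        final int n = x.length;
        final double[] q0 = new double[n], q1 = new double[n];
        // first column of X is all 1's; normalize it: r00 = ||a0||, q0 = a0 / r00
        final double r00 = Math.sqrt(n);
        for (int i = 0; i < n; i++) q0[i] = 1.0 / r00;
        // r01 = q0 . a1, then orthogonalize the x column against q0
        double r01 = 0;
        for (int i = 0; i < n; i++) r01 += q0[i] * x[i];
        double r11 = 0;
        for (int i = 0; i < n; i++) {
            q1[i] = x[i] - r01 * q0[i];
            r11 += q1[i] * q1[i];
        }
        r11 = Math.sqrt(r11);
        for (int i = 0; i < n; i++) q1[i] /= r11;
        // form z = Q^T y
        double z0 = 0, z1 = 0;
        for (int i = 0; i < n; i++) { z0 += q0[i] * y[i]; z1 += q1[i] * y[i]; }
        // backward substitution on the upper triangular system R beta = z:
        // [ r00 r01 ] [b0]   [z0]
        // [  0  r11 ] [b1] = [z1]
        final double b1 = z1 / r11;
        final double b0 = (z0 - r01 * b1) / r00;
        return new double[] { b0, b1 };
    }

    public static void main(String[] args) {
        final double[] beta = qrFit(new double[]{0, 1, 2, 3, 4}, new double[]{2, 5, 8, 11, 14});
        System.out.println("intercept=" + beta[0] + " slope=" + beta[1]);
    }
}
```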

### Diagnostics

Just like any other statistical technique, regression analysis is susceptible to sampling error, and it is therefore common to compute the variance of the parameter estimates, as well as their standard errors, which can then be used for inferential purposes. In the context of a linear regression, the null hypothesis \(H_{0}\) is generally that the model parameters are zero. The parameter standard errors can be used to calculate a t-statistic and corresponding p-value in order to decide if \(H_{0}\) can be rejected in favour of the alternative hypothesis, and thus assert that the estimates are statistically significantly different from zero.

The variance of \(\hat{\beta}\) with respect to the true population value can be expressed as follows:

$$ Var(\hat{\beta}) = E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T] $$

We substitute our population model \( Y = X \beta + \epsilon \) into our sample estimator \( \hat{\beta} = (X^T X)^{-1} X^T Y \) as follows:

$$ \begin{align} \hat{\beta} &= (X^T X)^{-1} X^T ( X \beta + \epsilon ) \\ \hat{\beta} &= (X^T X)^{-1} X^T X \beta + (X^T X)^{-1} X^T \epsilon \\ \hat{\beta} &= \beta + (X^T X)^{-1} X^T \epsilon \\ \hat{\beta} - \beta &= (X^T X)^{-1} X^T \epsilon \\ \end{align} $$

With the above expression for \( \hat{\beta} - \beta \) we can now solve for the variance of the OLS estimator as follows:

$$ \begin{align} Var(\hat{\beta}) &= E[(\hat{\beta} - \beta)(\hat{\beta} - \beta)^T] \\ & = E[((X^T X)^{-1} X^T \epsilon) ((X^T X)^{-1} X^T \epsilon)^T] \\ & = E[(X^T X)^{-1} X^T \epsilon \epsilon^T X (X^T X)^{-1} ] \end{align} $$

Given that the design matrix X is non-stochastic and that \( E[\epsilon \epsilon^T] = \sigma^2 I \):

$$ \begin{align} Var(\hat{\beta}) & = (X^T X)^{-1} X^T E[\epsilon \epsilon^T] X (X^T X)^{-1} \\ & = (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} \\ & = \sigma^2 I (X^T X)^{-1} X^T X (X^T X)^{-1} \\ & = \sigma^2 I (X^T X)^{-1} \\ & = \sigma^2 (X^T X)^{-1} \end{align} $$

Other regression diagnostics that are calculated include the coefficient of determination or \(R^2\) which is a number that indicates the proportion of variance in the dependent variable that is explained by the independent variables, and is documented in the table below. The parameter variance estimates as calculated above are used to compute the standard errors and a corresponding t-statistic which can be used for statistical inference.

Quantity | Description |
---|---|
Residual Sum of Squares (RSS) | $$ RSS = \sum_{i=1}^n \big(y_{i} - \hat{y_{i}} \big)^2 = \sum_{i=1}^n u_{i}^2 = u^T u $$ |
Total Sum of Squares (TSS) | $$ TSS = \sum_{i=1}^{n} \big(y_{i} - \overline{y}\big)^2 \, \textrm{where} \; \overline{y} = \frac{1}{n} \sum_{i=1}^n y_{i} $$ |
Explained Sum of Squares (ESS) | $$ ESS = \sum_{i=1}^{n} \big(\hat{y_{i}} - \overline{y}\big)^2 \, \textrm{where} \; \overline{y} = \frac{1}{n} \sum_{i=1}^n y_{i} $$ |
R-Squared | $$ R^2 = 1 - \frac{RSS}{TSS} $$ |
R-Squared (Adjusted) | $$ R^2_{adj} = 1 - \frac{ RSS \big( n - 1 \big) }{ TSS \big( n - p \big)} $$ |
Regression Standard Error | $$ SE = \sqrt{ \frac{RSS}{ n - p }} $$ |
Parameter Variance | $$ Var(\hat{\beta}) = SE^2 ( X^T X )^{-1} $$ |
Parameter Std Errors | $$ SE(\hat{\beta_{i}}) = \sqrt{ Var(\hat{\beta_{i}})} $$ |
Parameter T-statistics | $$ T_{\beta_{i}} = \frac{\hat{ \beta_{i}}}{ SE( \hat{ \beta_{i} } ) } $$ |
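The quantities in the table follow directly from the fitted values. The sketch below (plain Java, made-up data, p = 2 for an intercept plus one slope) computes \(R^2\), adjusted \(R^2\), the regression standard error and the slope's t-statistic from these definitions; note that with an intercept in the model, TSS decomposes exactly as ESS + RSS.

```java
// Dependency-free diagnostics for a simple regression, using
// Var(beta-hat) = SE^2 (X^T X)^(-1) for the slope standard error.
public class Diagnostics {

    /** Returns { R2, R2-adjusted, SE, slope t-stat, TSS, ESS + RSS }. */
    static double[] diagnostics(double[] x, double[] y) {
        final int n = x.length, p = 2;
        double sx = 0, sxx = 0, sy = 0, sxy = 0;
        for (int i = 0; i < n; i++) { sx += x[i]; sxx += x[i]*x[i]; sy += y[i]; sxy += x[i]*y[i]; }
        // beta-hat from the normal equations (closed-form 2x2 inverse)
        final double det = n * sxx - sx * sx;
        final double b0 = (sxx * sy - sx * sxy) / det;
        final double b1 = (n * sxy - sx * sy) / det;
        // sums of squares from the fitted values and the mean of y
        final double yBar = sy / n;
        double rss = 0, tss = 0, ess = 0;
        for (int i = 0; i < n; i++) {
            final double yHat = b0 + b1 * x[i];
            rss += (y[i] - yHat) * (y[i] - yHat);
            tss += (y[i] - yBar) * (y[i] - yBar);
            ess += (yHat - yBar) * (yHat - yBar);
        }
        final double r2 = 1d - rss / tss;
        final double r2Adj = 1d - (rss * (n - 1)) / (tss * (n - p));
        final double se = Math.sqrt(rss / (n - p));
        // slope variance is the lower-right entry of SE^2 (X^T X)^(-1), i.e. SE^2 * n / det
        final double tStat = b1 / Math.sqrt(se * se * n / det);
        return new double[] { r2, r2Adj, se, tStat, tss, ess + rss };
    }

    public static void main(String[] args) {
        final double[] d = diagnostics(new double[]{0, 1, 2, 3, 4}, new double[]{1, 4, 9, 12, 13});
        System.out.println("R2=" + d[0] + " adjR2=" + d[1] + " SE=" + d[2] + " t=" + d[3]);
        // with an intercept in the model, TSS = ESS + RSS up to rounding
        System.out.println("TSS=" + d[4] + " ESS+RSS=" + d[5]);
    }
}
```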

### Gauss Markov Assumptions

The Gauss Markov Theorem states that Ordinary Least Squares is the Best Linear Unbiased Estimator (BLUE) of \(\beta\), conditional on a certain set of assumptions being met. In this context, "best" means that there are no other linear unbiased estimators with a smaller sampling variance than OLS. Unbiased means that the expectation of \( \hat{\beta}\) is equal to the population \(\beta\), or otherwise stated \( E[ \hat{\beta} ] = \beta \). The assumptions that must hold for OLS to be BLUE are as follows:

Assumptions | Description |
---|---|
Assumption 1 | The regression model is linear in parameters, and therefore well specified |
Assumption 2 | The regressors are linearly independent, and therefore do not exhibit perfect multicollinearity |
Assumption 3 | The errors in the regression have a conditional mean of zero |
Assumption 4 | The errors are homoscedastic, which means they exhibit constant variance |
Assumption 5 | The errors are uncorrelated between observations |
Assumption 6 | The errors are normally distributed, and independent and identically distributed (iid); this is not required for OLS to be BLUE, but it is needed for exact finite-sample inference |

**Linear in Parameters**

The first assumption states that the dependent variable is a linear function of the model parameters. This does not imply that there is a linear relationship between the independent and dependent variables; it only requires that the model be linear in its parameters. For example, a model of the form \(y = \alpha + \beta x^2 \) qualifies as being linear in parameters, while \(y = \alpha + \beta^2 x \) does not. If the functional form of a model under investigation is not linear in parameters, it can often be transformed so as to render it linear.
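For example, an exponential model such as \(y = A e^{\beta x}\) (an illustrative model, not drawn from the dataset used later) is not linear in its parameters, but taking logarithms of both sides yields a form that is:

$$ \ln y = \ln A + \beta x $$

which can then be estimated by OLS with regressand \(\ln y\), intercept \(\ln A\) and slope \(\beta\).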

**Linearly Independent**

The second assumption, that there is no perfect multicollinearity between the regressors, is important because if it is violated the OLS estimator cannot be calculated. Another way of expressing this condition is that no independent variable can be an exact linear function of the other independent variables; the design matrix X must therefore have full column rank, so that \(X^T X\) is non-singular and can be inverted.
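To see why perfect multicollinearity is fatal, consider the Gram matrix \(X^T X\): if one column of X is an exact multiple of another, its determinant is zero and no inverse exists. A small sketch with made-up data:

```java
// Illustrates Assumption 2: when one regressor is an exact linear function
// of another, X^T X becomes singular, so (X^T X)^(-1) — and hence beta-hat —
// cannot be computed. The data here are arbitrary.
public class Multicollinearity {

    /** Determinant of X^T X for a two-column design matrix [x1 | x2]. */
    static double gramDeterminant(double[] x1, double[] x2) {
        double a = 0, b = 0, c = 0; // entries of the symmetric 2x2 matrix X^T X
        for (int i = 0; i < x1.length; i++) {
            a += x1[i] * x1[i];
            b += x1[i] * x2[i];
            c += x2[i] * x2[i];
        }
        return a * c - b * b;
    }

    public static void main(String[] args) {
        final double[] x1 = { 1, 2, 3, 4 };
        final double[] x2 = { 2, 4, 6, 8 };   // x2 = 2 * x1: perfectly collinear
        final double[] x3 = { 1, 0, 2, 5 };   // linearly independent of x1
        System.out.println(gramDeterminant(x1, x2)); // 0.0: X^T X is singular
        System.out.println(gramDeterminant(x1, x3)); // non-zero: invertible
    }
}
```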

**Strict exogeneity**

The third assumption above states that the disturbance term averages out to zero for any given instance of X, which implies that no observations of the independent variables convey any information about the error. Mathematically this is stated as \( E[ \epsilon | X ] = 0 \). This assumption is violated when the regressors are correlated with the error term, which can arise as a result of measurement error in the independent variables, or if there is endogeneity in the model.

**Spherical Errors**

The fourth and fifth assumptions relate to the covariance matrix of the error term, and together state that \(E[ \epsilon \epsilon^T | X] = \sigma^2 I \). There are two key concepts embedded in this statement. The first is that the disturbance term has uniform variance of \(\sigma^2\) regardless of the values of the independent variables, which is referred to as homoscedasticity. The second is that the off-diagonal terms of the covariance matrix are zero, meaning there is no serial correlation between errors. Either or both of these assumptions may fail to hold in the real world, in which case a WLS or GLS estimator may prove to be a better unbiased linear estimator.

### Example

The Morpheus API defines an interface called `DataFrameRegression` which exposes a number of methods that support different linear regression techniques, namely OLS, WLS and GLS. There are overloaded methods that take one or more regressors in order to conveniently support simple and multiple linear regression.

The regression interface, which operates on the column data in a `DataFrame`, can be accessed by calling the `regress()` method on the frame. If a regression in the row dimension is required, simply call `transpose()` on the frame before calling `regress()`.

To illustrate an example, consider the same motor vehicle dataset introduced earlier, which can be loaded with the code below. The first few rows of this `DataFrame` are also included for inspection, and in this exercise we are going to be interested in the **EngineSize** and **Horsepower** columns.

```java
static DataFrame<Integer,String> loadCarDataset() {
    return DataFrame.read().csv(options -> {
        options.setResource("http://zavtech.com/data/samples/cars93.csv");
        options.setExcludeColumnIndexes(0);
    });
}
```

Index | Manufacturer | Model | Type | Min.Price | Price | Max.Price | MPG.city | MPG.highway | AirBags | DriveTrain | Cylinders | EngineSize | Horsepower | RPM | Rev.per.mile | Man.trans.avail | Fuel.tank.capacity | Passengers | Length | Wheelbase | Width | Turn.circle | Rear.seat.room | Luggage.room | Weight | Origin | Make |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Acura | Integra | Small | 12.9000 | 15.9000 | 18.8000 | 25 | 31 | None | Front | 4 | 1.8000 | 140 | 6300 | 2890 | Yes | 13.2000 | 5 | 177 | 102 | 68 | 37 | 26.5 | 11 | 2705 | non-USA | Acura Integra |
1 | Acura | Legend | Midsize | 29.2000 | 33.9000 | 38.7000 | 18 | 25 | Driver & Passenger | Front | 6 | 3.2000 | 200 | 5500 | 2335 | Yes | 18.0000 | 5 | 195 | 115 | 71 | 38 | 30 | 15 | 3560 | non-USA | Acura Legend |
2 | Audi | 90 | Compact | 25.9000 | 29.1000 | 32.3000 | 20 | 26 | Driver only | Front | 6 | 2.8000 | 172 | 5500 | 2280 | Yes | 16.9000 | 5 | 180 | 102 | 67 | 37 | 28 | 14 | 3375 | non-USA | Audi 90 |
3 | Audi | 100 | Midsize | 30.8000 | 37.7000 | 44.6000 | 19 | 26 | Driver & Passenger | Front | 6 | 2.8000 | 172 | 5500 | 2535 | Yes | 21.1000 | 6 | 193 | 106 | 70 | 37 | 31 | 17 | 3405 | non-USA | Audi 100 |
4 | BMW | 535i | Midsize | 23.7000 | 30.0000 | 36.2000 | 22 | 30 | Driver only | Rear | 4 | 3.5000 | 208 | 5700 | 2545 | Yes | 21.1000 | 4 | 186 | 109 | 69 | 39 | 27 | 13 | 3640 | non-USA | BMW 535i |
5 | Buick | Century | Midsize | 14.2000 | 15.7000 | 17.3000 | 22 | 31 | Driver only | Front | 4 | 2.2000 | 110 | 5200 | 2565 | No | 16.4000 | 6 | 189 | 105 | 69 | 41 | 28 | 16 | 2880 | USA | Buick Century |
6 | Buick | LeSabre | Large | 19.9000 | 20.8000 | 21.7000 | 19 | 28 | Driver only | Front | 6 | 3.8000 | 170 | 4800 | 1570 | No | 18.0000 | 6 | 200 | 111 | 74 | 42 | 30.5 | 17 | 3470 | USA | Buick LeSabre |
7 | Buick | Roadmaster | Large | 22.6000 | 23.7000 | 24.9000 | 16 | 25 | Driver only | Rear | 6 | 5.7000 | 180 | 4000 | 1320 | No | 23.0000 | 6 | 216 | 116 | 78 | 45 | 30.5 | 21 | 4105 | USA | Buick Roadmaster |
8 | Buick | Riviera | Midsize | 26.3000 | 26.3000 | 26.3000 | 19 | 27 | Driver only | Front | 6 | 3.8000 | 170 | 4800 | 1690 | No | 18.8000 | 5 | 198 | 108 | 73 | 41 | 26.5 | 14 | 3495 | USA | Buick Riviera |

Based on our understanding of the factors that influence the power of an internal combustion engine, one might hypothesize that there is a positive and linear
relationship between the size of an engine and how much horsepower it produces. We can use the car dataset to test this hypothesis. First we can use the Morpheus
library to generate a scatter plot of the data, with **EngineSize** on the x-axis and **Horsepower** on the y-axis as follows:

```java
final String y = "Horsepower";
final String x = "EngineSize";
final DataFrame<Integer,String> frame = loadCarDataset();
final DataFrame<Integer,String> xy = frame.cols().select(y, x);
Chart.create().withScatterPlot(xy, false, x, chart -> {
    chart.title().withText(y + " vs " + x);
    chart.plot().style(y).withColor(Color.RED);
    chart.plot().style(y).withPointsVisible(true).withPointShape(ChartShape.DIAMOND);
    chart.plot().axes().domain().label().withText(x);
    chart.plot().axes().domain().format().withPattern("0.00;-0.00");
    chart.plot().axes().range(0).label().withText(y);
    chart.plot().axes().range(0).format().withPattern("0;-0");
    chart.show(845, 450);
});
```

The scatter plot certainly appears to suggest that there is a positive relationship between **EngineSize** and **Horsepower**. In addition, it seems somewhat
linear, however the dispersion appears to get more significant for larger engine sizes, which would be a violation of one of the Gauss Markov
assumptions, namely of homoscedastic errors. Nevertheless, let us proceed to regress **Horsepower** on
**EngineSize** and see what the results look like. The code below runs a single variable regression and simply prints the model results to standard-out
for inspection.

```java
DataFrame<Integer,String> frame = loadCarDataset();
String regressand = "Horsepower";
String regressor = "EngineSize";
frame.regress().ols(regressand, regressor, true, model -> {
    System.out.println(model);
    return Optional.empty();
});
```

```
==============================================================================================
                                   Linear Regression Results
==============================================================================================
Model:                       OLS        R-Squared:                  0.5360
Observations:                 93        R-Squared(adjusted):        0.5309
DF Model:                      1        F-Statistic:              105.1204
DF Residuals:                 91        F-Statistic(Prob):        1.11E-16
Standard Error:          35.8717        Runtime(millis)                 48
Durbin-Watson:            1.9591
==============================================================================================
   Index    |  PARAMETER  |  STD_ERROR  |  T_STAT   |  P_VALUE   |  CI_LOWER  |  CI_UPPER  |
----------------------------------------------------------------------------------------------
 Intercept  |    45.2195  |    10.3119  |   4.3852  |  3.107E-5  |    24.736  |   65.7029  |
EngineSize  |    36.9633  |     3.6052  |  10.2528  | 7.573E-17  |    29.802  |   44.1245  |
==============================================================================================
```

The regression results yield a slope coefficient of **36.96**, suggesting that for every additional litre of engine capacity, we can expect to add another
36.96 horsepower. While the p-value associated with the slope coefficient suggests that it is statistically
significantly different from zero, it does not tell us about how relevant the parameter is in the regression. In this case we can reasonably surmise
that engine size is relevant given our understanding of how an internal combustion engine works and what factors affect output power. Having said that,
the coefficient of determination is perhaps lower than one might expect, and the heteroscedasticity of the residuals provides some hint that the model
may be incomplete. In particular, omitted-variable bias may be at play here, in the sense that
there are other important factors that influence an engine's horsepower that we have not included.

While the code example above simply prints the model results to standard out, the illustration below demonstrates how to access all the relevant model
outputs via the API. The `ols()` method takes a lambda parameter that consumes the regression model, which is an instance of the `DataFrameLeastSquares` interface, and provides all the relevant hooks to access the model inputs and outputs.

```java
final String regressand = "Horsepower";
final String regressor = "EngineSize";
final DataFrame<Integer,String> frame = loadCarDataset();
frame.regress().ols(regressand, regressor, true, model -> {
    assert (model.getRegressand().equals(regressand));
    assert (model.getRegressors().size() == 1);
    assertEquals(model.getRSquared(), 0.5359992996664269, 0.00001);
    assertEquals(model.getRSquaredAdj(), 0.5309003908715525, 0.000001);
    assertEquals(model.getStdError(), 35.87167658782274, 0.00001);
    assertEquals(model.getFValue(), 105.120393642, 0.00001);
    assertEquals(model.getFValueProbability(), 0, 0.00001);
    assertEquals(model.getBetaValue("EngineSize", Field.PARAMETER), 36.96327914, 0.0000001);
    assertEquals(model.getBetaValue("EngineSize", Field.STD_ERROR), 3.60518041, 0.0000001);
    assertEquals(model.getBetaValue("EngineSize", Field.T_STAT), 10.25282369, 0.0000001);
    assertEquals(model.getBetaValue("EngineSize", Field.P_VALUE), 0.0000, 0.0000001);
    assertEquals(model.getBetaValue("EngineSize", Field.CI_LOWER), 29.80203113, 0.0000001);
    assertEquals(model.getBetaValue("EngineSize", Field.CI_UPPER), 44.12452714, 0.0000001);
    assertEquals(model.getInterceptValue(Field.PARAMETER), 45.21946716, 0.0000001);
    assertEquals(model.getInterceptValue(Field.STD_ERROR), 10.31194906, 0.0000001);
    assertEquals(model.getInterceptValue(Field.T_STAT), 4.3851523, 0.0000001);
    assertEquals(model.getInterceptValue(Field.P_VALUE), 0.00003107, 0.0000001);
    assertEquals(model.getInterceptValue(Field.CI_LOWER), 24.73604714, 0.0000001);
    assertEquals(model.getInterceptValue(Field.CI_UPPER), 65.70288719, 0.0000001);
    System.out.println(model);
    return Optional.of(model);
});
```

Finally, the chart below adds the OLS trendline to the initial scatter plot to get a better sense of how the solution fits the data.

The code to generate this chart is as follows:

```java
final String regressand = "Horsepower";
final String regressor = "EngineSize";
final DataFrame<Integer,String> frame = loadCarDataset();
final DataFrame<Integer,String> xy = frame.cols().select(regressand, regressor);
Chart.create().withScatterPlot(xy, false, regressor, chart -> {
    chart.title().withFont(new Font("Verdana", Font.BOLD, 16));
    chart.title().withText(regressand + " regressed on " + regressor);
    chart.subtitle().withText("Single Variable Linear Regression");
    chart.plot().style(regressand).withColor(Color.RED);
    chart.plot().trend(regressand).withColor(Color.BLACK);
    chart.plot().axes().domain().label().withText(regressor);
    chart.plot().axes().domain().format().withPattern("0.00;-0.00");
    chart.plot().axes().range(0).label().withText(regressand);
    chart.plot().axes().range(0).format().withPattern("0;-0");
    chart.show();
});
```

### Unbiasedness

An Ordinary Least Squares estimator is said to be unbiased in the sense that
if you run regressions on many samples of data generated from the same population process, the coefficient estimates from all these samples
would be centered on the true population values. To demonstrate this empirically, we can define a population process in 2D space with a known
slope and intercept coefficient, and then proceed to generate many samples from this process while adding Gaussian noise to the dependent variable
in order to simulate the error term. The code below defines a data generating function that returns a `DataFrame` of X and Y values initialized from the population coefficients, while adding white noise scaled according to the standard deviation specified by the `sigma` parameter.

```java
/**
 * Returns a 2D sample dataset based on a population process using the coefficients provided
 * @param alpha   the intercept term for the population process
 * @param beta    the slope term for the population process
 * @param startX  the start value for the independent variable
 * @param stepX   the step size for the independent variable
 * @param sigma   the standard deviation of the noise added to the dependent variable
 * @param n       the size of the sample to generate
 * @return        the frame of XY values
 */
DataFrame<Integer,String> sample(double alpha, double beta, double startX, double stepX, double sigma, int n) {
    final Array<Double> xValues = Array.of(Double.class, n).applyDoubles(v -> startX + v.index() * stepX);
    final Array<Double> yValues = Array.of(Double.class, n).applyDoubles(v -> {
        final double yfit = alpha + beta * xValues.getDouble(v.index());
        return new NormalDistribution(yfit, sigma).sample();
    });
    final Array<Integer> rowKeys = Range.of(0, n).toArray();
    return DataFrame.of(rowKeys, String.class, columns -> {
        columns.add("X", xValues);
        columns.add("Y", yValues);
    });
}
```

To get a sense of the nature of the dataset generated by this function for some chosen set of parameters, we can plot a number of samples. The code below plots 4 random samples of this population process with beta = 1.45, alpha = 4.15 and a sigma value of 20.

```java
final double beta = 1.45d;
final double alpha = 4.15d;
final double sigma = 20d;
Chart.show(2, IntStream.range(0, 4).mapToObj(i -> {
    DataFrame<Integer,String> frame = sample(alpha, beta, 0, 1, sigma, 100);
    String title = "Sample %s Dataset, Beta: %.2f Alpha: %.2f";
    String subtitle = "Parameter estimates, Beta^: %.3f, Alpha^: %.3f";
    DataFrameLeastSquares<Integer,String> ols = frame.regress().ols("Y", "X", true, Optional::of).get();
    double betaHat = ols.getBetaValue("X", DataFrameLeastSquares.Field.PARAMETER);
    double alphaHat = ols.getInterceptValue(DataFrameLeastSquares.Field.PARAMETER);
    return Chart.create().withScatterPlot(frame, false, "X", chart -> {
        chart.title().withText(String.format(title, i, beta, alpha));
        chart.title().withFont(new Font("Arial", Font.BOLD, 14));
        chart.subtitle().withText(String.format(subtitle, betaHat, alphaHat));
        chart.plot().style("Y").withColor(Color.RED).withPointsVisible(true);
        chart.plot().trend("Y");
    });
}));
```

Given this data generating function, we can produce many samples from a known population process and then proceed to run OLS regressions on these samples. For each run we capture the coefficient estimates, and then plot a histogram of all the recorded estimates to confirm that the coefficients are indeed centered on the known population values. The following code performs this procedure for 100,000 regressions, and is followed by the resulting plots.

```java
final int n = 100;
final double actAlpha = 4.15d;
final double actBeta = 1.45d;
final double sigma = 20d;
final int regressionCount = 100000;
final Range<Integer> rows = Range.of(0, regressionCount);
final Array<String> columns = Array.of("Beta", "Alpha");
final DataFrame<Integer,String> results = DataFrame.ofDoubles(rows, columns);
//Run 100K regressions in parallel
results.rows().parallel().forEach(row -> {
    final DataFrame<Integer,String> frame = sample(actAlpha, actBeta, 0, 1, sigma, n);
    frame.regress().ols("Y", "X", true, model -> {
        final double alpha = model.getInterceptValue(DataFrameLeastSquares.Field.PARAMETER);
        final double beta = model.getBetaValue("X", DataFrameLeastSquares.Field.PARAMETER);
        row.setDouble("Alpha", alpha);
        row.setDouble("Beta", beta);
        return Optional.empty();
    });
});
Array.of("Beta", "Alpha").forEach(coefficient -> {
    Chart.create().withHistPlot(results, 250, coefficient, chart -> {
        final double mean = results.colAt(coefficient).stats().mean();
        final double stdDev = results.colAt(coefficient).stats().stdDev();
        final double actual = coefficient.equals("Beta") ? actBeta : actAlpha;
        final String title = "%s Histogram from %s Regressions (n=%s)";
        final String subtitle = "Actual: %.4f, Mean: %.4f, StdDev: %.4f";
        chart.title().withText(String.format(title, coefficient, regressionCount, n));
        chart.subtitle().withText(String.format(subtitle, actual, mean, stdDev));
        chart.show(700, 400);
    });
});
```

The alpha and beta histogram plots above clearly show that the distributions of the 100,000 estimates of each coefficient are centered on the known population values. In the case of the slope coefficient, the known population value is 1.45 and the mean value over the 100,000 estimates is a good match. Similarly, the intercept estimate mean of 4.1266 is very close to the known population value of 4.15.
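The same centering can be verified without Morpheus or charting. The dependency-free sketch below (using `java.util.Random` rather than the `NormalDistribution` class above, with population parameters mirroring the example) averages the slope estimate over many simulated samples.

```java
import java.util.Random;

// Monte Carlo check of OLS unbiasedness: draw many samples from a known
// population line y = alpha + beta * x + noise, fit each by OLS, and
// confirm the mean slope estimate sits on the true beta.
public class UnbiasednessDemo {

    /** OLS slope estimate for a simple regression with an intercept. */
    static double slope(double[] x, double[] y) {
        final int n = x.length;
        double sx = 0, sxx = 0, sy = 0, sxy = 0;
        for (int i = 0; i < n; i++) { sx += x[i]; sxx += x[i]*x[i]; sy += y[i]; sxy += x[i]*y[i]; }
        return (n * sxy - sx * sy) / (n * sxx - sx * sx);
    }

    /** Mean slope estimate over many simulated samples of size n. */
    static double meanSlopeEstimate(double alpha, double beta, double sigma, int n, int runs, long seed) {
        final Random rnd = new Random(seed);
        double sum = 0;
        for (int r = 0; r < runs; r++) {
            final double[] x = new double[n], y = new double[n];
            for (int i = 0; i < n; i++) {
                x[i] = i;
                y[i] = alpha + beta * x[i] + sigma * rnd.nextGaussian();
            }
            sum += slope(x, y);
        }
        return sum / runs;
    }

    public static void main(String[] args) {
        final double mean = meanSlopeEstimate(4.15, 1.45, 20d, 100, 5000, 42L);
        System.out.println("mean slope over 5000 fits: " + mean + " (true value 1.45)");
    }
}
```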

### Consistency

An OLS estimator is said to be consistent in the sense that as the sample size increases, the variance of the coefficient estimates decreases and the estimates converge on the true population values. This can be demonstrated empirically once again using the data generation function introduced earlier: we run a fixed number of regressions on samples generated from a known population process, and repeat the exercise with increasing sample sizes. The code below implements this by running 100,000 regressions for sample sizes ranging from 100 to 500 in steps of 100, and captures the coefficient estimates for each run. It then plots histograms of the beta and alpha estimates to illustrate the narrowing variance as sample size increases.
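Before turning to the full charting example, the shrinking variance can be sketched with a dependency-free simulation (plain Java, `java.util.Random`, illustrative parameters matching the ones used above):

```java
import java.util.Random;

// Monte Carlo sketch of consistency: fit many OLS slopes at two sample
// sizes and compare the spread of the estimates. With x on a fixed
// unit-step grid, Var(beta-hat) = sigma^2 / sum((x - xbar)^2) shrinks
// rapidly as n grows, so the larger samples give much tighter estimates.
public class ConsistencyDemo {

    /** Sample variance of the OLS slope estimate over many simulated fits. */
    static double slopeVariance(double beta, double sigma, int n, int runs, long seed) {
        final Random rnd = new Random(seed);
        final double[] slopes = new double[runs];
        double mean = 0;
        for (int r = 0; r < runs; r++) {
            double sx = 0, sxx = 0, sy = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                final double x = i;
                final double y = 4.15 + beta * x + sigma * rnd.nextGaussian();
                sx += x; sxx += x * x; sy += y; sxy += x * y;
            }
            slopes[r] = (n * sxy - sx * sy) / (n * sxx - sx * sx);
            mean += slopes[r];
        }
        mean /= runs;
        double var = 0;
        for (double s : slopes) var += (s - mean) * (s - mean);
        return var / (runs - 1);
    }

    public static void main(String[] args) {
        final double varSmall = slopeVariance(1.45, 20d, 100, 2000, 7L);
        final double varLarge = slopeVariance(1.45, 20d, 400, 2000, 7L);
        System.out.println("slope variance: n=100 -> " + varSmall + ", n=400 -> " + varLarge);
    }
}
```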

```java
final double actAlpha = 4.15d;
final double actBeta = 1.45d;
final double sigma = 20d;
final int regressionCount = 100000;
final Range<Integer> sampleSizes = Range.of(100, 600, 100);
final Range<Integer> rows = Range.of(0, regressionCount);
final DataFrame<Integer,String> results = DataFrame.of(rows, String.class, columns -> {
    sampleSizes.forEach(n -> {
        columns.add(String.format("Beta(n=%s)", n), Double.class);
        columns.add(String.format("Alpha(n=%s)", n), Double.class);
    });
});
sampleSizes.forEach(n -> {
    System.out.println("Running " + regressionCount + " regressions for n=" + n);
    final String betaKey = String.format("Beta(n=%s)", n);
    final String alphaKey = String.format("Alpha(n=%s)", n);
    results.rows().parallel().forEach(row -> {
        final DataFrame<Integer,String> frame = sample(actAlpha, actBeta, 0, 1, sigma, n);
        frame.regress().ols("Y", "X", true, model -> {
            final double alpha = model.getInterceptValue(DataFrameLeastSquares.Field.PARAMETER);
            final double beta = model.getBetaValue("X", DataFrameLeastSquares.Field.PARAMETER);
            row.setDouble(alphaKey, alpha);
            row.setDouble(betaKey, beta);
            return Optional.empty();
        });
    });
});
Array.of("Beta", "Alpha").forEach(coeff -> {
    final DataFrame<Integer,String> coeffResults = results.cols().select(col -> col.key().startsWith(coeff));
    Chart.create().withHistPlot(coeffResults, 250, true, chart -> {
        chart.plot().axes().domain().label().withText("Coefficient Estimate");
        chart.title().withText(coeff + " Histograms of " + regressionCount + " Regressions");
        chart.subtitle().withText(coeff + " Variance decreases as sample size increases");
        chart.legend().on().bottom();
        chart.show(700, 400);
    });
});
```

It is clear from the above plots that as sample size increases, the variance in the estimates decreases, which is what we expect if the estimator is consistent. The bar charts below summarize the change in variance for each of the coefficients, and are followed by the code that generates these plots.

```java
Array<DataFrame<String,StatType>> variances = Array.of("Beta", "Alpha").map(value -> {
    final String coefficient = value.getValue();
    final Matcher matcher = Pattern.compile(coefficient + "\\(n=(\\d+)\\)").matcher("");
    return results.cols().select(column -> {
        final String name = column.key();
        return matcher.reset(name).matches();
    }).cols().mapKeys(column -> {
        final String name = column.key();
        if (matcher.reset(name).matches()) return matcher.group(1);
        throw new IllegalArgumentException("Unexpected column name: " + column.key());
    }).cols().stats().variance();
});
Chart.show(2, Collect.asList(
    Chart.create().withBarPlot(variances.getValue(0), false, chart -> {
        chart.title().withText("Beta variance with sample size");
        chart.plot().style(StatType.VARIANCE).withColor(new Color(255, 100, 100));
        chart.plot().axes().range(0).label().withText("Beta Variance");
        chart.plot().axes().domain().label().withText("Sample Size");
    }),
    Chart.create().withBarPlot(variances.getValue(1), false, chart -> {
        chart.title().withText("Alpha variance with sample size");
        chart.plot().style(StatType.VARIANCE).withColor(new Color(102, 204, 255));
        chart.plot().axes().range(0).label().withText("Alpha Variance");
        chart.plot().axes().domain().label().withText("Sample Size");
    })
));
```