## Principal Component Analysis

Principal Component Analysis (PCA) is a statistical tool used in data analysis and also for building predictive models. The technique involves transforming a dataset into a new basis whereby the transformed data is uncorrelated. The transformed basis, which can be represented by an orthogonal matrix, defines the Principal Components of the original dataset. These basis vectors are usually ordered so that the first principal component is the one that accounts for the largest variance in the data, and the last component accounts for the least variance.

PCA is also often referred to as a dimensional reduction technique in the sense that by dropping components that account for a negligible amount of variance in the original data, we can linearly map the data into a lower dimensional space without loosing material information. The assumption here is that the variability in the data represents the essential dynamics we are trying to understand, so dropping dimensions with negligible variance results in minimal loss of information.

The following sections introduces some PCA theory, and then proceeds to illustrate an example of how to use the Morpheus API to perform PCA on a dataset. Only a superficial overview of the theory is covered below, so for a more detailed treatment of the topic I would suggest a Google search, or perhaps this tutorial as a primer.

## Theory

As a data analysis technique, PCA begins with the definition of our data which in general can be described in two dimensions,
namely the number of observations and the number of measurements. Such data can be represented by an `nxp`

matrix where `n`

represents the number of observations for each measurement, and `p`

represents the number of measurements being recorded.

The dimensions in which we record the observations that constitute our data are assumed to be a naive basis, since we do not actually understand the true dynamics of the system we are investigating (hence the analysis). Having said that, our hope is that while our measurements may be recorded in a naive basis, they are informative enough so that we can compute a new basis that maximises the signal to noise ratio and removes any redundancy in the data, enabling us to better understand its true dynamics.

With this in mind, let us assume that there exists an orthogonal matrix
V of dimensions `pxp`

that can transform our data X into Y such that the covariance matrix of Y (denoted by \(\Sigma_{y}\))
is diagonal.

$$ X V = Y $$

The question then is how to find V? We can tackle this by working backwards based on our desire that the transformed data Y
has a diagonal covariance matrix, given our motivation to remove noise and redundancy from our dataset. Assuming that Y is
**centered** or **demeaned**, we can define the covariance of Y as follows:

$$ \Sigma_{Y} = \frac{1}{n-1} Y^T Y $$

Given our transform \(X V = Y \) we can express the covariance of Y in terms V and X as follows:

$$ \begin{align} \Sigma_{Y} &= \frac{1}{n-1} Y^T Y \\ &= \frac{1}{n-1}(XV)^T(XV) \\ &= \frac{1}{n-1}V^T X^T X V \\ \end{align} $$

Let us define a new matrix A, which by definition is a `pxp`

symmetric matrix as:

$$ A = \frac{1}{n-1} X^T X $$

We can therefore re-write our earlier expression for \(\Sigma_{Y}\) in terms of A as:

$$ \Sigma_{Y} = V^T A V $$

In the next section, we illustrate how we can choose \(V\) such that we diagonalize \(\Sigma_{Y}\).

### Eigen Decomposition

Based on our earlier discussion, we know that we need to choose a transform matrix V such that \(V^TAV = D\) where \(D\) is a diagonal matrix. Given that \(A\) is symmetric, we know that we can factorize it using an Eigendecomposition into an orthogonal matrix of its eigenvectors and a diagonal matrix of its eigenvalues:

$$ A = Q \Lambda Q^{-1} $$

In this factorization the column vectors of \(Q\) are the eigenvectors of \(A\) and the diagonal elements of \(\Lambda\)
are the eigenvalues of \(A\). Recall that an eigenvector of a matrix is one which is only scaled and not rotated when operated
on by that matrix. This is otherwise stated as \(Av = \lambda v \) where \(v\) is an `px1`

eigenvector of A and \(\lambda\) is
the corresponding eigenvalue, which is a scalar. We can therefore expand this to say that \(A Q = Q \Lambda \) which then
implies that \(A = Q \Lambda Q^{-1}\).

If we choose our matrix \(V = Q\) and noting that Q is an orthogonal matrix such that \(Q^{-1} = Q^T \) we can plug these back into the expression for the covariance matrix of Y to yield the following:

$$ \begin{align} \Sigma_{Y} &= V^T A V \\ &= V^T (V \Lambda V^{-1}) V \\ &= (V^TV)\Lambda(V^{-1}V) \\ &= (V^{-1}V)\Lambda(V^{-1}V) \\ &= \Lambda \\ \end{align} $$

By choosing the transform matrix V to be the eigenvectors of \(A \) (which in essence is the covariance matrix of our data on the assumption that \(X\) is centered), we end up diagonalizing the covariance matrix of the transformed dataset, which is the ultimate objective of our PCA. These eigenvectors essentially define the new basis along which variance in the data is maximised, and the corresponding eigenvalues define the magnitude of this variance. As noted earlier, the eigenvectors or principal components are usually ordered so that the first component accounts for the largest variance (ie the largest eigenvalue), and the last component accounts for the smallest variance (ie the smallest eigenvalue).

### Singular Value Decomposition

The previous section demonstrated that an eigen decomposition of the covariance matrix of \(X\) yields the set of eigenvectors \(V\) that define the principal axes of our data. When we transform our data using \(V\) the resulting dataset has a diagonal covariance matrix which reflects that our new basis maximises the signal to noise ratio and/or removes any redundancy.

The Morpheus library supports solving PCA in this way, but by default, it performs a Singular Value Decomposition
(SVD) of the **centered** or **demeaned** data in \(X\), as this generally offers better numerical stability and also tends to be faster
then an eigendecomposition of the covariance matrix of \(X\). An SVD of \(X\) on the assumption that \(X\) is centered yields

$$ X = U S V^T $$

where S is a diagonal matrix of singular values, and the columns of \(U\) and \(V\) are called the left-singular vectors and right-singular vectors of \(X\) , respectively. If we substitute the SVD decomposition into the expression for the covariance of \(X\) as below, we can see that the solution is essentially equivalent to above, however in this case the eigenvalues are equivalent to the square of the singular values divided by \(n-1\).

$$ \begin{align} \Sigma_{X} &= \frac{1}{n-1} X^T X \\ &= \frac{1}{n-1} (U S V^T)^T(U S V^T) \\ &= \frac{1}{n-1} VSU^TUSV^T \\ &= \frac{1}{n-1} VS^2V^T \\ \end{align} $$

## Example

### Data Model

In this example, we demonstrate the use of Principal Component Analysis as a dimensional reduction technique, and in particular
we apply it to the problem of image compression. Consider the photo below of my dog who is called Poppet,
which is 504 pixels wide and 360 pixels high. The pixels that make up this image can be thought of as a `360x504`

matrix, where the elements
represent the color of each pixel. It is most common in computer graphics to represent such an image using the RGBA Color Space
where each pixel is defined by a 32-bit integer which has encoded within it four 8-bit
values representing its red, green, blue and alpha intensity. Since each component within the RGBA value is represented by an
8-bit sequence, they have a range between 0 and 255 in base-10.

We can load the target image into a Morpheus `DataFrame`

of RGBA values using the code below. Here we initialize a frame of
integer values with the row and column count based on the image dimensions.

```
import java.net.URL;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
URL url = getClass().getResource("/poppet.jpg");
BufferedImage image = ImageIO.read(url);
int rowCount = image.getHeight()
int colCount = image.getWidth();
DataFrame<Integer,Integer> rgbFrame = DataFrame.ofInts(rowCount, colCount, v -> {
return image.getRGB(v.colOrdinal(), v.rowOrdinal());
});
```

Given that each pixel is represented by a 32-bit RGBA value, we can decompose the `360x504`

matrix into a `360x504x4`

cube of data,
or a `360x504x3`

cube if we ignore the alpha channel on the assumption that each pixel is 100% opaque (which is a reasonable assumption
in this case). In order to decompose the matrix into the red, green and blue components we need to perform some bitwise operations.
In this example, we load the target image using `java.awt.image.BufferedImage`

which exposes the 32-bit RGB values in the form
illustrated below, where the first 8 most significant bits represent the alpha channel, the next 8 represent the red value, followed
by green and then blue.

To extract the 8-bit value representing the red intensity, we first need to shift our string of bits 16 places to the right so that
the 8-bits representing the value of red appear in bit positions 0-7 (which before the shift represented the blue intensity). With our
bit string now in this form, we can bitwise AND it with a base-10 value of 255 or `0xFF`

in hexadecimal so that all bits in positions
8-31 become zero, leaving only the value of our red intensity. Similarly, to extract the value of green, we right shift our RGBA bit
string by 8, and then bitwise AND with `0xFF`

. In the case of extracting blue, no bit shifting is required and we simply bitwise AND
with `0xFF`

. The code below generates 3 separate frames to capture the red, green and blue intensities by performing the bitwise
operations just described.

```
DataFrame<Integer,Integer> red = rgbFrame.mapToDoubles(v -> (v.getInt() >> 16) & 0xFF);
DataFrame<Integer,Integer> green = rgbFrame.mapToDoubles(v -> (v.getInt() >> 8) & 0xFF);
DataFrame<Integer,Integer> blue = rgbFrame.mapToDoubles(v -> v.getInt() & 0xFF);
```

### Explained Variance

Now that we know how to decompose our image into three `360x504`

frames representing the red, green and blue intensity of our image
pixels, we can perform Principal Component Analysis on each `DataFrame`

, and assess the results. Since PCA is all about transforming data
into a new basis in which the variance is maximised, it is often useful to get a sense of how much of the variance in the data is explained
by the first N components. The code below uses the `rgbFrame`

initialized above, extracts the red, green and blue components, performs
PCA on each of these frames, and then collects the percent of variance explained by the respective principal components. This data is
then trimmed to only include the first 10 components, which is then plotted using a bar chart. Note that in this example we **transpose**
the `DataFrame`

before calling the `pca()`

method as the Morpheus library assumes that the data is an `nxp`

matrix where `n >= p`

.

```
URL url = getClass().getResource("/poppet.jpg");
DataFrame<Integer,Integer> rgbFrame = DataFrame.ofImage(url);
Range<Integer> rowKeys = Range.of(0, rgbFrame.rowCount());
DataFrame<Integer,String> result = DataFrame.ofDoubles(rowKeys, Array.of("Red", "Green", "Blue"));
Collect.<String,DataFrame<Integer,Integer>>asMap(mapping -> {
mapping.put("Red", rgbFrame.mapToDoubles(v -> (v.getInt() >> 16) & 0xFF));
mapping.put("Green", rgbFrame.mapToDoubles(v -> (v.getInt() >> 8) & 0xFF));
mapping.put("Blue", rgbFrame.mapToDoubles(v -> v.getInt() & 0xFF));
}).forEach((name, color) -> {
color.transpose().pca().apply(true, model -> {
DataFrame<Integer,Field> eigenFrame = model.getEigenValues();
DataFrame<Integer,Field> varPercent = eigenFrame.cols().select(Field.VAR_PERCENT);
result.update(varPercent.cols().mapKeys(k -> name), false, false);
return Optional.empty();
});
});
DataFrame<Integer,String> chartData = result.rows().select(c -> c.ordinal() < 10).copy();
Chart.create().withBarPlot(chartData.rows().mapKeys(r -> String.valueOf(r.ordinal())), false, chart -> {
chart.plot().style("Red").withColor(Color.RED);
chart.plot().style("Green").withColor(Color.GREEN);
chart.plot().style("Blue").withColor(Color.BLUE);
chart.plot().axes().range(0).label().withText("Percent of Variance");
chart.plot().axes().domain().label().withText("Principal Component");
chart.title().withText("Eigen Spectrum (Percent of Explained Variance)");
chart.legend().on().bottom();
chart.show();
});
```

We can see from this chart that in the case of the red frame, the first principal component explains around 45% of the variance, for green it is just under 35% and for blue it is just over 25%. The percentage of the variance explained by subsequent components drops off fairly monotonically, and by the time we get to the fifth component, only about 5% of the variance is captured for each of the colors.

### Dimensional Reduction

As per the earlier discussion on PCA theory, we established that the eigenvectors of the covariance matrix of our data are the principal
axes, and when combined as the columns of a matrix, serve as a transformation into the new basis. If we let \(X_i\) represent our `nxp`

input dataset of either red, green or blue intensities, and \(V_i\) our matrix of `pxp`

eigenvectors of the covariance matrix of \(X_i\),
we can write the transform as follows (where \(i\) is either red, green or blue).

$$ X_i V_i = Y_i $$

This projection of the original data onto the new basis are called the **principal component scores**, and are directly accessible from
the Morpheus interface named `DataFramePCA.Model`

via the `getScores()`

method. The following code demonstrates how to access these scores,
and here we assert our expectation of the dimensions of these scores being `nxp`

or `504x360`

in this case (since we take the transpose
of the image).

```
URL url = getClass().getResource("/poppet.jpg");
DataFrame<Integer,Integer> image = DataFrame.ofImage(url).transpose();
DataFrame<Integer,Integer> red = image.mapToDoubles(v -> (v.getInt() >> 16) & 0xFF);
red.pca().apply(true, model -> {
DataFrame<Integer,Integer> scores = model.getScores();
Assert.assertEquals(scores.rowCount(), 504);
Assert.assertEquals(scores.colCount(), 360);
return Optional.empty();
});
```

Since we know that \(V_i\) is an orthogonal matrix by design, once we have transformed to the new basis of \(Y_i\) we can transform back to the original basis by taking the dot product of \(Y_i\) with \(V_i^T\) as follows:

$$ X_i = Y_i V_i^T $$

The eigenvectors that constitute the columns of \(V_i\) are arranged so that the first column is associated with the highest eigenvalue, and the last column with the lowest eigenvalue (recall that the eigenvalue represents the variance in the direction of the corresponding eigenvector). Given that much of the variance in the data will be explained by the leading eigenvectors, we consider truncating \(V_i\) by only retaining the first N columns. In this case, we can re-write the above expression using a tilde over \(V_i\) and \(Y_i\) to indicate that some information has been lost in this new transform due to the truncation of \(V_i\).

$$ X_i \tilde{V_i} = \tilde{Y_i} $$

The Morpheus API provides an over-loaded `getScores()`

method on `DataFramePCA.Model`

where we can generate \(\tilde{Y_i}\) by selecting
only the first `j`

columns of \(V_i\) as shown below. In this case we assert that the expected dimensions of the scores is `nxk`

rather
than `nxp`

, where `k`

is the number of components to include (below we use `k=10`

).

```
URL url = getClass().getResource("/poppet.jpg");
DataFrame<Integer,Integer> image = DataFrame.ofImage(url).transpose();
DataFrame<Integer,Integer> red = image.mapToDoubles(v -> (v.getInt() >> 16) & 0xFF);
red.pca().apply(true, model -> {
DataFrame<Integer,Integer> scores = model.getScores(10);
Assert.assertEquals(scores.rowCount(), 504);
Assert.assertEquals(scores.colCount(), 10);
return Optional.empty();
});
```

Given the truncated scores in the form of \(\tilde{Y_i}\), it turns out that we can project these scores back onto the original basis
using \(\tilde{V_i}^T\) much in the same way as described earlier. If we right multiply \(\tilde{Y_i}\) by \(\tilde{V_i}^T\)
we get back `nxp`

data since \(\tilde{Y_i}\) is `nxk`

and \(\tilde{V_i}^T\) is `kxp`

, and yields an estimate of our original data
we now call \(\tilde{X_i}\)

$$ \tilde{X_i} = \tilde{Y_i} \tilde{V_i}^T = X_i \tilde{V_i} \tilde{V_i}^T $$

The Morpheus API provides a convenient API to generate \(\tilde{X_i}\) based on a specified number of components, `k`

. The following
code shows how to access the projection of our original data using only the first `k=10`

components, and we assert that the dimensions
of this data matches our original image, namely `504x360`

(since we transpose the image for reasons discussed earlier).

```
URL url = getClass().getResource("/poppet.jpg");
DataFrame<Integer,Integer> image = DataFrame.ofImage(url).transpose();
DataFrame<Integer,Integer> red = image.mapToDoubles(v -> (v.getInt() >> 16) & 0xFF);
red.pca().apply(true, model -> {
DataFrame<Integer,Integer> projection = model.getProjection(10);
Assert.assertEquals(projection.rowCount(), 504);
Assert.assertEquals(projection.colCount(), 360);
return Optional.empty();
});
```

We have established that it is possible to reconstitute an estimate of our original data \(X_i\), which we called \(\tilde{X_i}\), from the principal component scores and a subset of the principal axes associated with the highest variance. The next question is how many columns of \(V_i\) need to be retained to ensure \(\tilde{X_i}\) is a reasonable representation of the original data? There is no hard rule in this regard, however a common rule of thumb is to select enough components to ensure that 90% of the variance is captured. Having said that, each problem will be unique, and it will often be useful to generate an eigen spectrum plot as shown above to draw some conclusion as to a reasonable initial estimate.

In the case of our image of Poppet, we will demonstrate the effect of retaining an increasing number of principal axes in \(V_i\) to compute principal component scores, and then to project this back onto the original basis. The images below show a range of scenarios where we project the image using only 5 components all the way through to 70 components. Note that this is still a small subset of the total number of components, namely 360 in this case, but it is clear that once we include up to 50 components, the transformed image is almost indistinguishable from the original, at least to the human eye.

5 Principal Components |
10 Principal Components |
15 Principal Components |

20 Principal Components |
25 Principal Components |
30 Principal Components |

35 Principal Components |
40 Principal Components |
45 Principal Components |

50 Principal Components |
55 Principal Components |
60 Principal Components |

65 Principal Components |
70 Principal Components |
360 Principal Components |

The final image in the table above is essentially the same as the original since we retain all components and so \(V_i {V_i}^T = I \)
given that we know \(V_i\) is an orthogonal matrix by design. The code to generate this array of images is shown below. Here we load the
original image into a Morpheus `DataFrame`

, and then proceed to decompose it into red, green and blue components, perform PCA on each color,
project the image as described above using only a subset of the principal axes associated with highest variance, and then record the resulting
projection back out as an image file.

```
//Load image from classpath
URL url = getClass().getResource("/poppet.jpg");
//Re-create PCA reduced image while retaining different number of principal components
Array.of(5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 360).forEach(nComp -> {
//Initialize the **transpose** of image as we need nxp frame where n >= p
DataFrame<Integer,Integer> rgbFrame = DataFrame.ofImage(url).transpose();
//Create 3 frames from RGB data, one for red, green and blue
DataFrame<Integer,Integer> red = rgbFrame.mapToDoubles(v -> (v.getInt() >> 16) & 0xFF);
DataFrame<Integer,Integer> green = rgbFrame.mapToDoubles(v -> (v.getInt() >> 8) & 0xFF);
DataFrame<Integer,Integer> blue = rgbFrame.mapToDoubles(v -> v.getInt() & 0xFF);
//Perform PCA on each color frame, and project using only first N principal components
Stream.of(red, green, blue).parallel().forEach(color -> {
color.pca().apply(true, model -> {
DataFrame<Integer,Integer> projection = model.getProjection(nComp);
projection.cap(true).doubles(0, 255); //cap values between 0 and 255
color.update(projection, false, false);
return null;
});
});
//Apply reduced RBG values onto the original frame so we don't need to allocate memory
rgbFrame.applyInts(v -> {
int i = v.rowOrdinal();
int j = v.colOrdinal();
int r = (int)red.data().getDouble(i,j);
int g = (int)green.data().getDouble(i,j);
int b = (int)blue.data().getDouble(i,j);
return ((0xFF) << 24) | ((r & 0xFF) << 16) | ((g & 0xFF) << 8) | ((b & 0xFF));
});
//Create reduced image from **transpose** of the DataFrame to get back original orientation
int width = rgbFrame.rowCount();
int height = rgbFrame.colCount();
BufferedImage transformed = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
rgbFrame.forEachValue(v -> {
int i = v.colOrdinal();
int j = v.rowOrdinal();
int rgb = v.getInt();
transformed.setRGB(j, i, rgb);
});
try {
File outputfile = new File("/Users/witdxav/temp/poppet-" + nComp + ".jpg");
outputfile.getParentFile().mkdirs();
ImageIO.write(transformed, "jpg", outputfile);
} catch (Exception ex) {
throw new RuntimeException("Failed to record image result", ex);
}
});
```

### Compression Story

The dimensions of our original input data \(X_i\) is `504x360`

which generates a covariance matrix with dimensions `360x360`

.
Performing PCA on this data yields a set of `360`

eigenvectors each of the same length, implying that the non-truncated version of
\(V_i\) is also `360x360`

. If we decide to only keep the first `k`

columns of \(V_i\) to create \(\tilde{V_i}\), then the resulting
dimensions of \(\tilde{Y_i}\) will be `504xk`

. If `k`

can be significantly smaller than `360`

(the height of the original image) then
\(\tilde{Y_i}\) would require much less storage space. We obviously also need to store \(\tilde{V_i}\) so that we can reconstitute
our estimate of the data \(\tilde{X_i}\), but the expectation is that `k`

can be small enough so that the storage required for two
smaller matrices is less than that required for the original image.

In our example, the original image requires `181,440`

32-bit RGBA values given it has dimensions of `504x360`

pixels. The table below
summarizes the total number of elements required to store \(\tilde{Y_i}\) and \(\tilde{V_i}\) for various values of k ranging
from 5 through 60. We need to multiply this by 3 since we need to store a red, green and blue version of these matrices. The final
column indicates the percent reduction on the original number of elements we achieve, and since the image reconstituted by retaining
only the first 45 components is almost indistinguishable from the original, we can achieve 35% compression in that case.

k | \(\tilde{Y_i}\) | \(\tilde{V_i}\) | Total | Total x 3 | Compression |
---|---|---|---|---|---|

5 | 504x5 = 2,520 | 360x5 = 1,800 | 4,320 | 12,960 | 92.86% |

10 | 504x10 = 5,040 | 360x10 = 3,600 | 8,640 | 25,920 | 85.71% |

15 | 504x15 = 7,560 | 360x15 = 5,400 | 12,960 | 38,880 | 78.57% |

20 | 504x20 = 10,080 | 360x20 = 7,200 | 17,280 | 51,840 | 71.43% |

25 | 504x25 = 12,600 | 360x25 = 9,000 | 21,600 | 64,800 | 64.29% |

30 | 504x30 = 15,120 | 360x30 = 10,800 | 25,920 | 77,760 | 57.14% |

35 | 504x35 = 17,640 | 360x35 = 12,600 | 30,240 | 90,720 | 50.00% |

40 | 504x40 = 20,160 | 360x40 = 14,400 | 34,560 | 103,680 | 42.86% |

45 | 504x45 = 22,680 | 360x45 = 16,200 | 38,880 | 116,640 | 35.71% |

50 | 504x50 = 25,200 | 360x50 = 18,000 | 43,200 | 129,600 | 28.57% |

55 | 504x55 = 27,720 | 360x55 = 19,800 | 47,520 | 142,560 | 21.43% |

60 | 504x60 = 30,240 | 360x60 = 21,600 | 51,840 | 155,520 | 7.14% |