Multivariate data - the Data Matrix

The element \(X_{ij}\) contains the value of variable \(j\) for sample \(i\)
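
A minimal sketch of this convention (in Python/NumPy, with made-up numbers, not part of the original material):

```python
import numpy as np

# Data-matrix convention: rows are samples, columns are measured variables,
# so X[i, j] is the value of variable j measured on sample i.
X = np.array([[1.2, 3.4, 0.7],   # sample 0
              [0.9, 2.8, 1.1]])  # sample 1

print(X.shape)   # (2, 3): 2 samples, 3 variables
print(X[1, 2])   # value of variable 2 for sample 1 -> 1.1
```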

Key ideas

  1. Each sample is a point in the n-dimensional space of the measured variables.
  2. The measured variables are often redundant
  3. To “see” what happens in the large space, a clever idea is to project the samples into a lower-dimensional space
  4. A projection will always show us only a part of the reality
  5. There is always an error associated with a projection (see the sketch after this list)
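
A minimal sketch of points 3-5 (Python/NumPy, made-up data): the samples are projected onto a single direction, and the part of the data the projection does not show is measured as a reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 samples in a 3-variable space (made-up numbers).
X = rng.normal(size=(20, 3))
Xc = X - X.mean(axis=0)            # center the point cloud

# Project onto an arbitrary unit direction (one latent variable).
w = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
scores = Xc @ w                    # coordinates of the samples along w

# Reconstruct the samples from the 1-D projection and measure what is lost.
X_hat = np.outer(scores, w)
error = np.sum((Xc - X_hat) ** 2)  # projection error: always >= 0
print(scores.shape, error)
```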

Intrinsic Dimensionality

In the presence of correlation among the variables, the samples actually occupy only a “fraction” of the potential multidimensional space. In this case a projection is highly informative


Latent Variable

A mathematical combination of several variables

By projecting the data along specific latent variables, we highlight some desired property of the data. In a broader sense, the latent variables can also be seen as the mathematical representation of the hidden rules which determine the behavior of the samples. Typical objectives of the projection are:


  1. Separation of sample classes (e.g. LDA)
  2. Prediction of sample properties (e.g. PLS)
  3. Good representation of the multidimensional data structure (e.g. PCA, PCoA)

Food for brains


  • Some relevant “characteristics” of a system cannot be measured or are difficult to define (e.g. health, intelligence)
  • In this sense they are latent
  • We can measure several properties associated with them (e.g. for intelligence: math scores, IQ tests, school grades, etc.)
  • The latent variable “model” is exactly trying to formalize this idea

LVs and Projections

A set of latent variables can be used to reconstruct an informative representation of the dataset which captures some relevant multidimensional aspects of the data.

This representation is constructed by “projecting” the samples onto the LVs


Each projection will result in:

  • Scores: the representation of the samples in the LV space (the new coordinates)
  • Loadings: the “weight” of the original variables on the single LVs (the recipe to construct the new variables from the measured ones; see the sketch after this list)
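
A minimal sketch of the two quantities for a single LV (Python/NumPy, with an illustrative set of weights not taken from the course material):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))       # 30 samples, 4 measured variables (made up)
Xc = X - X.mean(axis=0)

# "Loadings": the recipe, i.e. the weight of each original variable in the LV.
loadings = np.array([0.7, 0.7, 0.1, 0.1])
loadings = loadings / np.linalg.norm(loadings)

# "Scores": the new coordinates of the samples along the LV.
scores = Xc @ loadings
print(loadings)        # one weight per original variable
print(scores[:5])      # one new coordinate per sample
```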

Dummy Dataset

What LV will maximize the separation between the two groups? Can you guess something about the loadings?

LV for class discrimination (LDA)

  • The red line represents the direction of maximal separation between the two classes
  • The crosses are the scores along this direction

Loadings for class discrimination

The loadings represent the weight of the initial variables along the discriminating direction

  • Var a: 0.9987687
  • Var b: 0.0496086
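
A minimal sketch of the same idea with scikit-learn, on a made-up two-variable, two-class dataset (not the dummy dataset of the slides); here the fitted `scalings_`, normalized to unit length, plays the role of the loadings of the discriminating direction.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)

# Two made-up classes that differ mostly along variable "a".
class0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(50, 2))
class1 = rng.normal(loc=[4.0, 0.0], scale=[1.0, 3.0], size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X)       # scores along the discriminating LV

direction = lda.scalings_[:, 0]
direction = direction / np.linalg.norm(direction)  # normalized loadings
print(direction)      # large weight on var a, small weight on var b
print(scores[:3])
```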

Principal Component Analysis (PCA)

The aim of PCA is dimension reduction; it is the most frequently applied method for computing linear latent variables (components).

The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

  • In PCA the “objective” of the projection is to maximize variance (the spread of the point cloud)
  • The PCA “view” will enhance the spread of the data
  • The key idea is that variability means information (see the sketch after this list)
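
A minimal scikit-learn sketch (made-up data, not the course dataset): PC1 is the direction along which the projected scores have the largest possible variance, and its loadings are read from `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Made-up data: most of the spread is along variable "b".
X = rng.normal(scale=[1.0, 5.0], size=(100, 2))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # samples in the PC space

print(pca.components_[0])                # loadings of PC1 (direction of max variance)
print(pca.explained_variance_ratio_)     # PC1 captures most of the variability
print(scores[:, 0].var() >= scores[:, 1].var())   # True by construction
```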

Animation!

Dummy Dataset

What LV will highlight the direction of maximal variance?

PCA of the dummy dataset

  • The red line represents the direction of maximal variance (bad separation!)
  • The crosses are the scores along this direction

Loadings for PC1

The loadings represent the weight of the initial variables along PC1

  • Var a: 0.0899434
  • Var b: 0.9959469

LDA and PCA

  • The latent variables are different!
  • They try to highlight different aspects of the data
  • PCA does not know anything about the groups (technically, it is unsupervised)

PCA uses

  • Visualization of multivariate data by scatter plots
  • Transformation of highly correlated x-variables into a smaller set of uncorrelated latent variables that can be used by other methods
  • Separation of relevant information (described by a few latent variables) from noise (see the sketch after this list)
  • Combination of several variables that characterize a chemical-technological-biological process into a single or a few “characteristic” variables
  • Making the “latent properties” actually measurable
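
A minimal sketch of the second and third uses (Python, made-up data): correlated variables are turned into uncorrelated scores, and keeping only the first component separates most of the information from the noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Made-up data: 5 highly correlated variables driven by one hidden factor plus noise.
factor = rng.normal(size=(200, 1))
X = factor @ np.ones((1, 5)) + 0.1 * rng.normal(size=(200, 5))

pca = PCA()
scores = pca.fit_transform(X)

# The scores are uncorrelated and almost all the information sits in PC1;
# the remaining components mostly describe noise.
print(np.round(np.corrcoef(scores.T), 2))      # ~identity matrix
print(np.round(pca.explained_variance_ratio_, 3))

# Keeping only the first component separates "signal" from "noise".
X_denoised = scores[:, :1] @ pca.components_[:1, :] + pca.mean_
```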

Scaling and centering

  • To point in the direction of maximum variance, the data have to be centered
  • The spread of the projection depends on the scaling (see the sketch after this list)
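
A minimal sketch of the effect of scaling (Python, made-up data); note that scikit-learn's PCA centers the data internally, so only the scaling step is done by hand here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Made-up data: the same hidden factor measured on two very different scales
# (think of two instruments reporting in different units).
factor = rng.normal(size=100)
X = np.column_stack([factor + 0.3 * rng.normal(size=100),            # scale ~1
                     1000 * (factor + 0.3 * rng.normal(size=100))])  # scale ~1000

# PCA centers internally, but it does NOT scale:
pca_raw = PCA().fit(X)
print(pca_raw.components_[0])    # PC1 is dominated by the large-scale variable

# Autoscaling (center + divide by the standard deviation) puts the variables
# on an equal footing before looking for the direction of maximum variance.
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pca_auto = PCA().fit(X_auto)
print(pca_auto.components_[0])   # weights of comparable size for both variables
```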

Outliers

Notes

  • Sensitivity to outliers is useful if PCA is used to spot them ;-) (see the sketch after this list)
  • Robust versions of PCA are available if you want to keep all the data in
  • PCA shows the big “structure” of the data, and this can help in interpretation
  • PCA will change if you add points!
  • The loadings are not always easy to interpret
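
A minimal sketch of the first note (Python, made-up data with one planted outlier): a gross outlier shows up as an extreme point in the PCA score space.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)

# Made-up data plus one sample that clearly does not belong to the cloud.
X = rng.normal(size=(50, 3))
X[0] = [15.0, -12.0, 20.0]                 # planted outlier

scores = PCA(n_components=2).fit_transform(X)

# Distance of each sample from the centre of the score plot:
# the planted outlier sticks out immediately.
dist = np.linalg.norm(scores, axis=1)
print(np.argmax(dist))                     # -> 0
```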

A projection can be non-flat …

  • All the projection methods discussed so far are based on linear algebra
  • The projection subspaces are therefore flat (lines, planes, hyperplanes, …)
  • Flat projections can fail to capture the overall structure of the dataset
  • The challenge is to capture the large-scale and the small-scale structure of the data at the same time.

Possible approaches

  • t-SNE (see the sketch after this list)
  • UMAP
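
A minimal sketch of the comparison illustrated in the next slides (Python; the ten group centres here are generated at random, not the course data): a flat PCA projection versus a non-linear t-SNE embedding of the same 3D data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)

# Made-up example in the spirit of the next slides: 10 tight normal groups in 3D.
centres = rng.uniform(-10, 10, size=(10, 3))
X = np.vstack([c + 0.3 * rng.normal(size=(50, 3)) for c in centres])

# Flat (linear) projection on the first two PCs ...
scores_pca = PCA(n_components=2).fit_transform(X)

# ... versus a non-linear embedding that focuses on local structure.
scores_tsne = TSNE(n_components=2, perplexity=30, random_state=7).fit_transform(X)

print(scores_pca.shape, scores_tsne.shape)   # both (500, 2)
```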

10 normal groups, 3D: PCA (figure)

10 normal groups, 3D: t-SNE (figure)

Distance based approaches