Multivariate data - the Data Matrix

The element \(X_{ij}\) contains the value of variable \(j\) for sample \(i\)
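
A minimal sketch of this convention (in Python/NumPy, with made-up numbers, not part of the original material):

```python
import numpy as np

# Data-matrix convention: rows are samples, columns are measured variables,
# so X[i, j] is the value of variable j measured on sample i.
X = np.array([[1.2, 3.4, 0.7],   # sample 0
              [0.9, 2.8, 1.1]])  # sample 1

print(X.shape)   # (2, 3): 2 samples, 3 variables
print(X[1, 2])   # value of variable 2 for sample 1 -> 1.1
```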

Key ideas

  1. Each sample is a point in the n-dimensional space of the measured variables.
  2. The measured variables are often redundant
  3. To “see” what happens in the large space, a clever idea is to project the samples into a lower-dimensional space
  4. A projection will always show us only a part of the reality
  5. There is always an error associated with a projection (see the sketch after this list)
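
A minimal sketch of points 3-5 (Python/NumPy, made-up data): the samples are projected onto a single direction, and the part of the data the projection does not show is measured as a reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 samples in a 3-variable space (made-up numbers).
X = rng.normal(size=(20, 3))
Xc = X - X.mean(axis=0)            # center the point cloud

# Project onto an arbitrary unit direction (one latent variable).
w = np.array([1.0, 1.0, 0.0]) / np.sqrt(2)
scores = Xc @ w                    # coordinates of the samples along w

# Reconstruct the samples from the 1-D projection and measure what is lost.
X_hat = np.outer(scores, w)
error = np.sum((Xc - X_hat) ** 2)  # projection error: always >= 0
print(scores.shape, error)
```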

Intrinsic Dimensionality

In the presence of correlation among the variables, the samples actually occupy only a “fraction” of the potential multidimensional space. In this case a projection is highly informative


Latent Variable

A mathematical combination of several variables

By projecting the data along specific latent variables, we highlight some desired property of the data. In a broader sense, the latent variables can also be seen as the mathematical representation of the hidden rules which determine the behavior of the samples. Typical objectives of the projection are:


  1. Separation of sample classes (e.g. LDA)
  2. Prediction of sample properties (e.g. PLS)
  3. Good representation of the multidimensional data structure (e.g. PCA, PCoA)

Food for brains


  • Some relevant “characteristics” of a system cannot be measured or are difficult to define (e.g. health, intelligence)
  • In this sense they are latent
  • We can measure several properties associated with them (e.g. for intelligence: math scores, IQ tests, school grades, etc.)
  • The latent variable “model” is exactly trying to formalize this idea

LVs and Projections

A set of latent variables can be used to reconstruct an informative representation of the dataset which captures some relevant multidimensional aspects of the data.

This representation is constructed by “projecting” the samples onto the LVs


Each projection will result in:

  • Scores: the representation of the samples in the LV space (the new coordinates)
  • Loadings: the “weight” of the original variables on the single LVs (the recipe to construct the new variables from the measured ones; see the sketch after this list)
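
A minimal sketch of the two quantities for a single LV (Python/NumPy, with an illustrative set of weights not taken from the course material):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))       # 30 samples, 4 measured variables (made up)
Xc = X - X.mean(axis=0)

# "Loadings": the recipe, i.e. the weight of each original variable in the LV.
loadings = np.array([0.7, 0.7, 0.1, 0.1])
loadings = loadings / np.linalg.norm(loadings)

# "Scores": the new coordinates of the samples along the LV.
scores = Xc @ loadings
print(loadings)        # one weight per original variable
print(scores[:5])      # one new coordinate per sample
```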

Dummy Dataset

What LV will maximize the separation between the two groups? Can you guess something about the loadings?

LV for class discrimination (LDA)

  • The red line represents the direction of maximal separation between the two classes
  • The crosses are the scores along this direction

Loadings for class discrimination

The loadings represent the weight of the initial variables along the discriminating direction

  • Var a: 0.9987687
  • Var b: 0.0496086
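
A minimal sketch of the same idea with scikit-learn, on a made-up two-variable, two-class dataset (not the dummy dataset of the slides); here the fitted `scalings_`, normalized to unit length, plays the role of the loadings of the discriminating direction.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(2)

# Two made-up classes that differ mostly along variable "a".
class0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(50, 2))
class1 = rng.normal(loc=[4.0, 0.0], scale=[1.0, 3.0], size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis()
scores = lda.fit(X, y).transform(X)       # scores along the discriminating LV

direction = lda.scalings_[:, 0]
direction = direction / np.linalg.norm(direction)  # normalized loadings
print(direction)      # large weight on var a, small weight on var b
print(scores[:3])
```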

Principal Component Analysis (PCA)

The aim of PCA is dimension reduction; it is the most frequently applied method for computing linear latent variables (components).

The transformation is defined in such a way that the first principal component has the largest possible variance (that is, it accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

  • In PCA the “objective” of the projection is to maximize variance (the spread of the point cloud)
  • The PCA “view” will enhance the spread of the data
  • The key idea is that variability means information (see the sketch after this list)
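
A minimal scikit-learn sketch (made-up data, not the course dataset): PC1 is the direction along which the projected scores have the largest possible variance, and its loadings are read from `components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Made-up data: most of the spread is along variable "b".
X = rng.normal(scale=[1.0, 5.0], size=(100, 2))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)            # samples in the PC space

print(pca.components_[0])                # loadings of PC1 (direction of max variance)
print(pca.explained_variance_ratio_)     # PC1 captures most of the variability
print(scores[:, 0].var() >= scores[:, 1].var())   # True by construction
```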

Animation!

Dummy Dataset

What LV will highlight the direction of maximal variance?

PCA of the dummy dataset

  • The red line represents the direction of maximal variance (bad separation!)
  • The crosses are the scores along this direction

Loadings for PC1

The loadings represent the weight of the initial variables along PC1

  • Var a: 0.0899434
  • Var b: 0.9959469

LDA and PCA

  • The latent variables are different!
  • They try to highlight different aspects of the data
  • PCA does not know anything about the groups (technically, it is unsupervised)

PCA uses

  • Visualization of multivariate data by scatter plots
  • Transformation of highly correlated x-variables into a smaller set of uncorrelated latent variables that can be used by other methods
  • Separation of relevant information (described by a few latent variables) from noise (see the sketch after this list)
  • Combination of several variables that characterize a chemical-technological-biological process into a single or a few “characteristic” variables
  • Making the “latent properties” actually measurable
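
A minimal sketch of the second and third uses (Python, made-up data): correlated variables are turned into uncorrelated scores, and keeping only the first component separates most of the information from the noise.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)

# Made-up data: 5 highly correlated variables driven by one hidden factor plus noise.
factor = rng.normal(size=(200, 1))
X = factor @ np.ones((1, 5)) + 0.1 * rng.normal(size=(200, 5))

pca = PCA()
scores = pca.fit_transform(X)

# The scores are uncorrelated and almost all the information sits in PC1;
# the remaining components mostly describe noise.
print(np.round(np.corrcoef(scores.T), 2))      # ~identity matrix
print(np.round(pca.explained_variance_ratio_, 3))

# Keeping only the first component separates "signal" from "noise".
X_denoised = scores[:, :1] @ pca.components_[:1, :] + pca.mean_
```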

Scaling and centering

  • To point in the direction of maximum variance, the data have to be centered
  • The spread of the projection depends on the scaling (see the sketch after this list)
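
A minimal sketch of the effect of scaling (Python, made-up data); note that scikit-learn's PCA centers the data internally, so only the scaling step is done by hand here.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)

# Made-up data: the same hidden factor measured on two very different scales
# (think of two instruments reporting in different units).
factor = rng.normal(size=100)
X = np.column_stack([factor + 0.3 * rng.normal(size=100),            # scale ~1
                     1000 * (factor + 0.3 * rng.normal(size=100))])  # scale ~1000

# PCA centers internally, but it does NOT scale:
pca_raw = PCA().fit(X)
print(pca_raw.components_[0])    # PC1 is dominated by the large-scale variable

# Autoscaling (center + divide by the standard deviation) puts the variables
# on an equal footing before looking for the direction of maximum variance.
X_auto = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
pca_auto = PCA().fit(X_auto)
print(pca_auto.components_[0])   # weights of comparable size for both variables
```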

Outliers

Notes

  • Sensitivity to outliers is useful if PCA is used to spot them ;-) (see the sketch after this list)
  • Robust versions of PCA are available if you want to keep all the data in
  • PCA shows the big “structure” of the data, and this can help in interpretation
  • PCA will change if you add points!
  • The loadings are not always easy to interpret
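
A minimal sketch of the first note (Python, made-up data with one planted outlier): a gross outlier shows up as an extreme point in the PCA score space.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)

# Made-up data plus one sample that clearly does not belong to the cloud.
X = rng.normal(size=(50, 3))
X[0] = [15.0, -12.0, 20.0]                 # planted outlier

scores = PCA(n_components=2).fit_transform(X)

# Distance of each sample from the centre of the score plot:
# the planted outlier sticks out immediately.
dist = np.linalg.norm(scores, axis=1)
print(np.argmax(dist))                     # -> 0
```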

A projection can be non-flat …

  • All the projection methods discussed so far are based on linear algebra
  • The projection subspaces are therefore flat (lines, planes, hyperplanes, …)
  • Flat projections can fail to capture the overall structure of the dataset
  • The challenge is to capture the large-scale and the small-scale structure of the data at the same time.

Possible approaches

  • t-SNE (see the sketch after this list)
  • UMAP
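
A minimal sketch of the comparison illustrated in the next slides (Python; the ten group centres here are generated at random, not the course data): a flat PCA projection versus a non-linear t-SNE embedding of the same 3D data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(7)

# Made-up example in the spirit of the next slides: 10 tight normal groups in 3D.
centres = rng.uniform(-10, 10, size=(10, 3))
X = np.vstack([c + 0.3 * rng.normal(size=(50, 3)) for c in centres])

# Flat (linear) projection on the first two PCs ...
scores_pca = PCA(n_components=2).fit_transform(X)

# ... versus a non-linear embedding that focuses on local structure.
scores_tsne = TSNE(n_components=2, perplexity=30, random_state=7).fit_transform(X)

print(scores_pca.shape, scores_tsne.shape)   # both (500, 2)
```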

10 normal groups, 3D: PCA (figure)

10 normal groups, 3D: t-SNE (figure)

Distance based approaches