Many variables in many samples

As in all omics experiments, a metabolomic assay measures a large number of variables (properties) in a (hopefully) reasonably large number of samples.




… what are the variables …

  1. In a targeted metabolomics investigation?
  2. In an untargeted MS-based metabolomics analysis?
  3. In an untargeted NMR-based metabolomics assay?

Number of variables

place your guess

  • Targeted metabolomics investigation: from 10 to 300
  • Untargeted metabolomics investigation: from 1000 to 30000

Data Matrix

Variables (columns) are not independent

  • Biological relations
  • Chemical/Analytical reasons


Analytical correlations are always stronger than biological ones!


The correlation structure of your matrix is often not really informative about biology

Rows (samples) are often dependent

The design of the study can result in a multilevel hierarchical structure of the samples, which violates independence. In other words, some samples are associated “by construction” and share something in common.

Examples

  1. Repeated measures of the same individual
  2. Subpopulations
  3. Different sites, different days, …

Data Matrix size

In the typical “happy” statistical context, the number of variables we measure is smaller than the number of samples.

In omics, the number of variables normally exceeds the number of samples.



FAT DATA MATRIX

Univariate approach

The univariate approach considers each variable separately and applies “standard” statistical tools to spot the most interesting variables.

  1. Statistical testing
  2. Linear modeling (lm, glm, …)
  3. ANOVA
  4. Hierarchical Modeling
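
As a minimal sketch of the "one variable at a time" idea, the snippet below computes a Welch t statistic for every column of a small data matrix and ranks the columns by its absolute value. The function names (`welch_t`, `rank_variables`), the two-group design and the numbers in the matrix are all made up for illustration; a real analysis would normally use a statistics package and also report p-values.

```python
import statistics


def welch_t(group_a, group_b):
    """Welch t statistic comparing the means of two groups."""
    mean_a = statistics.fmean(group_a)
    mean_b = statistics.fmean(group_b)
    var_a = statistics.variance(group_a)  # sample variance (n - 1)
    var_b = statistics.variance(group_b)
    standard_error = (var_a / len(group_a) + var_b / len(group_b)) ** 0.5
    return (mean_a - mean_b) / standard_error


def rank_variables(matrix, labels):
    """Test every column separately; return (column, t) sorted by |t|.

    matrix: list of rows, one row per sample, one column per variable.
    labels: 0 or 1 for each sample (hypothetical two-group design).
    """
    results = []
    for j in range(len(matrix[0])):
        group_a = [row[j] for row, lab in zip(matrix, labels) if lab == 0]
        group_b = [row[j] for row, lab in zip(matrix, labels) if lab == 1]
        results.append((j, welch_t(group_a, group_b)))
    return sorted(results, key=lambda item: abs(item[1]), reverse=True)


# Toy data (invented): 6 samples x 3 variables, only column 2 differs.
labels = [0, 0, 0, 1, 1, 1]
matrix = [
    [1.0, 5.1, 2.0],
    [1.2, 4.9, 2.1],
    [0.9, 5.0, 1.9],
    [1.1, 5.2, 4.0],
    [1.0, 4.8, 4.2],
    [1.3, 5.0, 3.9],
]
ranking = rank_variables(matrix, labels)
for column, t_value in ranking:
    print(column, round(t_value, 2))
```

Each column is treated as if the others did not exist, which is exactly what makes the approach simple to interpret and what exposes it to the multiple testing problem.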

Multiple testing

  • We measure many variables (features, metabolites) in the same set of samples
  • We run a battery of statistical tests looking for the significance of what we see in the individual variables
  • We ask ourselves if at least one variable is significant in the overall set


We run individual tests, but we have a question about the full set of variables …
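
A small worked example of why this matters. Under the simplifying assumption that the tests are independent and that no variable is truly different, the probability of seeing at least one "significant" result among m tests at level alpha is 1 - (1 - alpha)^m. The sketch below evaluates this and, as one common remedy, the Bonferroni per-test threshold alpha / m. The significance level of 0.05 and the function names are illustrative assumptions; metabolomic variables are correlated, so the numbers are only indicative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """P(at least one false positive) for n_tests independent tests
    when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance level that keeps the familywise rate <= alpha."""
    return alpha / n_tests


for m in (1, 10, 100, 1000):
    print(m, round(familywise_error_rate(m), 3), bonferroni_threshold(m))
```

With a few hundred variables, at least one false positive is practically guaranteed unless the individual thresholds are corrected.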

Multivariate approach

Each sample is represented as a point in the multidimensional variable space. The dataset is a cloud of points in that space.

The dimensionality of the space equals the number of variables we are measuring. Multivariate methods (PCA, PLS, ASCA, …) exploit the correlation between the variables to highlight and extract the organization of the data.

Why multivariate

Group separation is clearer in the 2d space …

Multivariate

PRO

  1. Potentially more powerful
  2. Explicit use of variable correlation
  3. No issues with multiple testing

CONS

  1. Chance correlations in fat matrices
  2. Empty Space
  3. Difficult to embed hierarchical structure
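
The first point, chance correlations in fat matrices, can be illustrated with a small simulation. The sketch below generates a random response and a growing number of pure-noise variables for only 10 samples, then reports the largest absolute Pearson correlation found. Everything here is an assumption made for illustration: Gaussian noise, 10 samples, the seed, and the function names.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_chance_correlation(n_samples, n_variables, seed=0):
    """Largest |r| between a random response and pure-noise variables."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_variables):
        noise = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(noise, response)))
    return best


for p in (10, 100, 1000, 10000):
    print(p, round(max_chance_correlation(10, p), 2))
```

None of the variables carries any information, yet the best one looks increasingly convincing as the matrix gets fatter.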

Univariate

PRO

  1. Statistical modeling is there!
  2. Interpretable by construction

CONS

  1. Multiple testing
  2. The structure of the data creates redundancy
  3. Assumptions for parametric (and non parametric) approaches are often not fulfilled

The curse of dimensionality

  1. The number of points needed to fill the space does not grow linearly with the number of dimensions: it grows exponentially
  2. Already with 10 samples the 2d plot looks empty
  3. Can you imagine 20 samples in 10000 dimensions? ;-)
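
A back-of-the-envelope sketch of the points above. If each axis is cut into k intervals, the space contains k^d cells, so n samples can occupy at most a fraction n / k^d of them. The code works with base-10 logarithms to avoid astronomically large numbers. The grid resolutions (10 and 2 cells per axis) and the function names are arbitrary choices made for illustration.

```python
import math


def log10_cells(cells_per_axis, n_dimensions):
    """log10 of the number of grid cells in the space (k ** d)."""
    return n_dimensions * math.log10(cells_per_axis)


def log10_max_occupied_fraction(n_samples, cells_per_axis, n_dimensions):
    """log10 of the largest fraction of cells that n_samples can occupy."""
    log_fraction = math.log10(n_samples) - log10_cells(cells_per_axis, n_dimensions)
    return min(0.0, log_fraction)  # a fraction can never exceed 1


# 10 samples on a 10 x 10 grid in 2 dimensions
print(log10_max_occupied_fraction(10, 10, 2))

# 20 samples in 10000 dimensions, with only 2 cells per axis
print(log10_max_occupied_fraction(20, 2, 10000))
```

A value of -1 means that at most 10% of the cells contain a sample; in 10000 dimensions the exponent drops to the thousands, even with the coarsest possible grid.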

Practical!