Metabolomics Data Matrix

Many variables in many samples

As in all omics experiments, a metabolomic assay measures a large number of variables (properties) in a (hopefully) reasonably large number of samples.

~~… what are the variables …~~

In a targeted metabolomics investigation?
In an untargeted MS-based metabolomics analysis?
In an untargeted NMR-based metabolomics assay?

Number of variables

Place your guess

Targeted metabolomics investigation
~~from 10 to 300~~
Untargeted metabolomics investigation
~~from 1000 to 30000~~

Data Matrix

Variables (columns) are not independent

Biological relations
Chemical/Analytical reasons

Analytical correlations are always stronger than biological ones!

The correlation structure of your matrix is often not really informative about biology

Rows (samples) are often dependent

The design of the study can result in a ~~multilevel hierarchical structure~~ of the samples, which violates independence. In other words, some samples are associated “by construction” and share something in common.

~~Examples~~

Repeated measures of the same individual
Subpopulations
Different site, different day, …

Data Matrix size

In the typical “happy” statistical context, the number of variables we measure is smaller than the number of samples.

In omics, the number of variables normally exceeds the number of samples.

~~FAT DATA MATRIX~~

Univariate approach

The univariate approach considers each variable separately and applies “standard” statistical tools to spot the most interesting variables

Statistical testing
Linear modeling (lm, glm, …)
ANOVA
Hierarchical Modeling

Multiple testing

We measure many variables (features, metabolites) in the same set of samples
We run a battery of statistical tests looking for the significance of what we see in the individual variables
We ask ourselves if at least one variable is significant in the overall set

~~We run individual tests, but we have a question about the full set of variables …~~
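To see why the "full set" question matters, here is a minimal sketch of how quickly false positives pile up when many tests are run at the same threshold. It assumes every null hypothesis is true and, loudly, that the tests are independent, which real metabolomic variables are not; the function names and the 0.05 threshold are illustrative choices, not something prescribed by these slides.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Probability of at least one false positive among n_tests tests.

    ASSUMPTION: all null hypotheses are true and the tests are independent.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected number of false positives when every null hypothesis is true."""
    return n_tests * alpha


# Variable counts in the range of the targeted and untargeted guesses above.
for n_tests in (1, 10, 300, 1000, 30000):
    fwer = familywise_error_rate(n_tests)
    n_false = expected_false_positives(n_tests)
    print(f"{n_tests:>6} tests: P(at least one false positive) = {fwer:.4f}, "
          f"expected false positives = {n_false:.1f}")
```

With correlated variables the exact numbers change, but the qualitative message stays the same: a per-variable threshold does not control the error over the whole set.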

Multivariate approach

Each sample is represented as a point in the multidimensional variable space. The dataset is a cloud of points in that space.

The dimension of the space equals the number of variables we are measuring. Multivariate methods (PCA, PLS, ASCA, …) exploit the correlation between the variables to highlight/extract the organization of the data

Why multivariate

Group separation is clearer in the 2d space …
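A toy illustration of this point, as a sketch on invented data: two hypothetical groups whose values overlap almost completely on each variable taken alone, but which separate cleanly once both variables are considered together. The group names, the shift of 0.3 and the choice of the difference `x1 - x2` as the combined direction are all assumptions made for the example.

```python
# Shared "baseline" that makes the two variables strongly correlated.
baseline = [i / 2 for i in range(-4, 5)]  # -2.0, -1.5, ..., 2.0


def make_group(shift):
    """Return (x1, x2) points: both follow the baseline, shifted in opposite directions."""
    return [(t + shift, t - shift) for t in baseline]


def ranges_overlap(a, b):
    """True if the intervals [min(a), max(a)] and [min(b), max(b)] intersect."""
    return min(a) <= max(b) and min(b) <= max(a)


group_a = make_group(+0.3)  # hypothetical group A
group_b = make_group(-0.3)  # hypothetical group B

# Univariate view: each variable on its own.
x1_a, x2_a = [p[0] for p in group_a], [p[1] for p in group_a]
x1_b, x2_b = [p[0] for p in group_b], [p[1] for p in group_b]
print("x1 ranges overlap:", ranges_overlap(x1_a, x1_b))
print("x2 ranges overlap:", ranges_overlap(x2_a, x2_b))

# Bivariate view: a direction in the 2D space that combines both variables.
combined_a = [x1 - x2 for x1, x2 in group_a]
combined_b = [x1 - x2 for x1, x2 in group_b]
print("x1 - x2 ranges overlap:", ranges_overlap(combined_a, combined_b))
```

The separating direction is only visible because the two variables are correlated. A purely univariate analysis never looks along it.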

Multivariate

~~PRO~~

Potentially more powerful
Explicit use of variable correlation
No issues with multiple testing

~~CONS~~

Chance correlations in fat matrices
Empty Space
Difficult to embed hierarchical structure
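The first item, chance correlations in fat matrices, can be made concrete with a small simulation. This is a sketch under stated assumptions: every variable is pure Gaussian noise, the response is also noise, and the sample size (10), number of variables (2000) and seed are arbitrary illustrative values. Any strong correlation it finds is therefore spurious by construction.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def chance_correlations(n_samples, n_variables, seed=0):
    """Absolute correlation of each pure-noise variable with a pure-noise response."""
    rng = random.Random(seed)
    response = [rng.gauss(0, 1) for _ in range(n_samples)]
    abs_r = []
    for _ in range(n_variables):
        noise_variable = [rng.gauss(0, 1) for _ in range(n_samples)]
        abs_r.append(abs(pearson(noise_variable, response)))
    return abs_r


# Few samples, many variables: a "fat" matrix made only of noise.
abs_r = chance_correlations(n_samples=10, n_variables=2000)
for n_vars in (5, 50, 500, 2000):
    best = max(abs_r[:n_vars])
    print(f"best |r| among the first {n_vars:>4} noise variables: {best:.2f}")
```

Because the first variables are a subset of the later ones, the best correlation can only grow as columns are added. This is the mechanism that lets a multivariate model appear to fit a fat matrix well by chance.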

Univariate