Variable Scale

Metabolomics aims at measuring the concentration of:

  1. The largest possible number of metabolites
  2. … over the largest possible range of concentrations



We are typically dealing with huge differences in signal across the different variables:

  • potential issues in the reliability of some measurements
  • no direct effect on a univariate analysis
  • a strong effect on a multivariate analysis

Same data, different scales

Take home

  • In the multivariate space the “shape” of the sample cloud depends on the absolute scale of the variables
  • All multivariate methods based on distances in that space will be affected by the variable scale (e.g. PCA, clustering, PCoA, …), as the sketch below illustrates
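
As a rough illustration in plain R (the numbers are invented), a single high-intensity variable can dominate both the distance matrix and the first principal component until the variables are brought onto a comparable scale:

set.seed(1)
# two metabolites measured on 20 samples, with very different absolute scales
x <- data.frame(met_A = rnorm(20, mean = 1e5, sd = 1e4),
                met_B = rnorm(20, mean = 10,  sd = 2))

dist(x)[1:3]                        # Euclidean distances: driven almost entirely by met_A
summary(prcomp(x, scale. = FALSE))  # PC1 is essentially met_A alone
summary(prcomp(x, scale. = TRUE))   # after unit-variance scaling both variables contribute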

How to correct for that

  • Variable transformation (e.g. log or sqrt)
  • Variable scaling (e.g. unit variance); a sketch of both options follows
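
A minimal sketch of the two options in plain R (x is an invented samples-by-metabolites matrix):

set.seed(1)
x <- matrix(rlnorm(20 * 5, meanlog = 5, sdlog = 1), nrow = 20)  # 20 samples, 5 metabolites

x_log  <- log10(x + 1)   # log transformation (the +1 guards against zeros)
x_sqrt <- sqrt(x)        # square-root transformation
x_uv   <- scale(x)       # unit-variance (auto) scaling: each column gets mean 0, sd 1

apply(x_uv, 2, sd)       # check: every column now has standard deviation 1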

Notes

  1. The best choice depends on what you want to see
  2. Increasing the weight of small signals is not always the best option
    • low signals are less reliable
    • missing values pop up there
  3. Log transformation (and, to a lesser extent, sqrt) often also corrects for the non-normal distribution of the data

Sample Normalization

The overall metabolite concentration in one or more samples can be different from the others. Such a sample will show up as an outlier.

  1. There is a real difference in that sample. Good!
  2. The samples show different levels of dilution (e.g. urine). Bad!
  3. The signal is lower for “analytical” reasons. Bad!
    • lower extraction of metabolites
    • reduced response of the instrument (in particular for MS-based techniques)



Normalization is used to compensate for unwanted differences in sample response.

Normalization and Scaling

In terms of the data matrix:

  1. Normalization is performed acting on the rows (samples)
  2. Scaling is performed acting on the columns (variables); see the sketch below
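
In plain R this could look like the sketch below (assuming, as above, samples on the rows and metabolites on the columns):

set.seed(1)
x <- matrix(rlnorm(4 * 6, meanlog = 5), nrow = 4)   # 4 samples, 6 metabolites

# normalization: one factor per sample, applied along the rows
norm_factors <- rowSums(x) / mean(rowSums(x))       # e.g. total-signal factors
x_norm <- sweep(x, MARGIN = 1, STATS = norm_factors, FUN = "/")

# scaling: one operation per variable, applied along the columns
x_scaled <- scale(x_norm)                           # unit variance, column by column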

Normalization strategies

  1. Use chemical standards to compensate for analytical issues
  2. Quality controls!
  3. Wisely plan (randomize!) your analytical sequence to avoid biases



  1. Normalize to the overall signal … mmm … )-:
  2. Probabilistic Quotient Normalization (PQN)

Compositional Data

Sum normalization creates a new (…fake…) biomarker!
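
A toy example (invented numbers) of the artefact: only m1 really changes between the two groups, but after dividing each sample by its total signal, m2 and m3 appear to change as well, in the opposite direction:

set.seed(42)
# 10 control and 10 treated samples, 3 metabolites; only m1 truly differs
ctrl  <- cbind(m1 = rnorm(10, 100, 5), m2 = rnorm(10, 100, 5), m3 = rnorm(10, 100, 5))
treat <- cbind(m1 = rnorm(10, 300, 5), m2 = rnorm(10, 100, 5), m3 = rnorm(10, 100, 5))
x <- rbind(ctrl, treat)

x_sum <- x / rowSums(x)      # sum (total signal) normalization

colMeans(x_sum[1:10, ])      # controls: m1, m2, m3 around 0.33 each
colMeans(x_sum[11:20, ])     # treated: m2 and m3 drop to ~0.2 -> fake biomarkers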

PQN

The idea behind Probabilistic Quotient Normalization is to identify, for each sample, a consensus normalization factor by taking into consideration the distribution of the variable-specific factors.

The Recipe

  1. Identify a reference sample
  2. For each of the other samples and each variable, calculate the ratio to the reference
  3. For each sample, use the median of the distribution of the ratios as the consensus normalization factor (see the code sketch after this list)
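
A minimal sketch of the whole recipe as an R function (x is a samples-by-metabolites matrix; by default the first row plays the role of the reference):

pqn <- function(x, ref = x[1, ]) {
  # 1. ratio of every variable to the reference, sample by sample
  ratios  <- sweep(x, MARGIN = 2, STATS = ref, FUN = "/")
  # 2. consensus factor for each sample = median of its ratios
  factors <- apply(ratios, 1, median, na.rm = TRUE)
  # 3. divide each sample (row) by its consensus factor
  sweep(x, MARGIN = 1, STATS = factors, FUN = "/")
}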

In action

  • Consider 4 samples where we measure 100 metabolites
  • For three of the four samples, the value of each metabolite is drawn from a Gaussian distribution with mean 200 and variance 10
  • For the fourth sample, the mean is 100 and the variance is 10

This design simulates a drop by a factor of 2 in the response of my pipeline while measuring the fourth sample.

Let’s take the first sample as reference …
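
The PQNratios object used in the next chunk is built in code not shown here; a possible construction, consistent with the design above (the object and column names are taken from the output below, everything else is an assumption), could be:

library(tidyverse)

set.seed(123)
# 4 samples x 100 metabolites: rows s1-s3 drawn from N(200, var = 10),
# row s4 from N(100, var = 10) to mimic the 2-fold drop in response
mets <- matrix(rnorm(4 * 100, mean = c(200, 200, 200, 100), sd = sqrt(10)), nrow = 4)
rownames(mets) <- paste0("s", 1:4)
colnames(mets) <- paste0("met", 1:100)

# ratio of each metabolite to the reference sample s1, reshaped to long format
PQNratios <- sweep(mets[-1, ], MARGIN = 2, STATS = mets[1, ], FUN = "/") %>%
  as_tibble(rownames = "name") %>%
  pivot_longer(-name, names_to = "metabolite", values_to = "value")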

Distribution of ratios

Code …

# median of the metabolite-to-reference ratios for each sample:
# this is the PQN consensus normalization factor
PQNratios %>% 
  group_by(name) %>% 
  summarise(median = median(value))
## # A tibble: 3 × 2
##   name  median
##   <chr>  <dbl>
## 1 s2     1.01 
## 2 s3     1.00 
## 3 s4     0.504
  • So, to make s4 comparable to the others, I should divide its signal by roughly 0.5, as in the sketch below
  • The use of the median ensures that the approach also works in the presence of a few genuine biomarkers
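
Applying the correction is then just a division by the consensus factors; a sketch that reuses the objects defined above:

factors <- PQNratios %>%
  group_by(name) %>%
  summarise(f = median(value))   # one consensus factor per sample (s2, s3, s4)

mets_pqn <- mets
mets_pqn[factors$name, ] <- sweep(mets[factors$name, ], MARGIN = 1,
                                  STATS = factors$f, FUN = "/")

rowMeans(mets_pqn)               # all four samples now sit around 200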