Variable Scale

Metabolomics aims at measuring the concentration of:

  1. The largest possible number of metabolites
  2. … over the largest possible range of concentrations



We are typically dealing with huge differences in signal across the different variables:

  • potential issues in the reliability of some measurements
  • no direct effect on a univariate analysis
  • a strong effect on a multivariate analysis

Same data, different scales

Take home

  • In the multivariate space the “shape” of the sample cloud depends on the absolute scale of the variables
  • All multivariate methods based on distances in that space will be affected by the variable scale (e.g. PCA, clustering, PCoA, …), as the sketch below illustrates
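
As a rough illustration in plain R (the numbers are invented), a single high-intensity variable can dominate both the distance matrix and the first principal component until the variables are brought onto a comparable scale:

set.seed(1)
# two metabolites measured on 20 samples, with very different absolute scales
x <- data.frame(met_A = rnorm(20, mean = 1e5, sd = 1e4),
                met_B = rnorm(20, mean = 10,  sd = 2))

dist(x)[1:3]                        # Euclidean distances: driven almost entirely by met_A
summary(prcomp(x, scale. = FALSE))  # PC1 is essentially met_A alone
summary(prcomp(x, scale. = TRUE))   # after unit-variance scaling both variables contribute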

How to correct for that

  • Variable transformation (e.g. log or sqrt)
  • Variable scaling (e.g. unit variance); a sketch of both options follows
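
A minimal sketch of the two options in plain R (x is an invented samples-by-metabolites matrix):

set.seed(1)
x <- matrix(rlnorm(20 * 5, meanlog = 5, sdlog = 1), nrow = 20)  # 20 samples, 5 metabolites

x_log  <- log10(x + 1)   # log transformation (the +1 guards against zeros)
x_sqrt <- sqrt(x)        # square-root transformation
x_uv   <- scale(x)       # unit-variance (auto) scaling: each column gets mean 0, sd 1

apply(x_uv, 2, sd)       # check: every column now has standard deviation 1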

Notes

  1. The best choice depends on what you want to see
  2. Increasing the weight of small signals is not always the best option
    • low signals are less reliable
    • missing values pop up there
  3. Log transformation (and, to a lesser extent, sqrt) often also corrects for the non-normal distribution of the data

Sample Normalization

The overall metabolite concentration in one or more samples can be different from the others. Such a sample will show up as an outlier.

  1. There is a real difference in that sample. Good!
  2. The samples show different levels of dilution (e.g. urine). Bad!
  3. The signal is lower for “analytical” reasons. Bad!
    • lower extraction of metabolites
    • reduced response of the instrument (in particular for MS-based techniques)



Normalization is used to compensate for unwanted differences in sample response.

Normalization and Scaling

In terms of the data matrix:

  1. Normalization is performed acting on the rows (samples)
  2. Scaling is performed acting on the columns (variables); see the sketch below
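
In plain R this could look like the sketch below (assuming, as above, samples on the rows and metabolites on the columns):

set.seed(1)
x <- matrix(rlnorm(4 * 6, meanlog = 5), nrow = 4)   # 4 samples, 6 metabolites

# normalization: one factor per sample, applied along the rows
norm_factors <- rowSums(x) / mean(rowSums(x))       # e.g. total-signal factors
x_norm <- sweep(x, MARGIN = 1, STATS = norm_factors, FUN = "/")

# scaling: one operation per variable, applied along the columns
x_scaled <- scale(x_norm)                           # unit variance, column by column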

Normalization strategies

  1. Use chemical standards to compensate for analytical issues
  2. Quality controls!
  3. Wisely plan (randomize!) your analytical sequence to avoid biases



  1. Normalize to the overall signal … mmm … )-:
  2. Probabilistic Quotient Normalization (PQN)

Compositional Data

Sum normalization creates a new (…fake…) biomarker!
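
A toy example (invented numbers) of the artefact: only m1 really changes between the two groups, but after dividing each sample by its total signal, m2 and m3 appear to change as well, in the opposite direction:

set.seed(42)
# 10 control and 10 treated samples, 3 metabolites; only m1 truly differs
ctrl  <- cbind(m1 = rnorm(10, 100, 5), m2 = rnorm(10, 100, 5), m3 = rnorm(10, 100, 5))
treat <- cbind(m1 = rnorm(10, 300, 5), m2 = rnorm(10, 100, 5), m3 = rnorm(10, 100, 5))
x <- rbind(ctrl, treat)

x_sum <- x / rowSums(x)      # sum (total signal) normalization

colMeans(x_sum[1:10, ])      # controls: m1, m2, m3 around 0.33 each
colMeans(x_sum[11:20, ])     # treated: m2 and m3 drop to ~0.2 -> fake biomarkers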

PQN

The idea behind Probabilistic Quotient Normalization is to identify, for each sample, a consensus normalization factor by taking into consideration the distribution of the variable-specific factors.

The Recipe

  1. Identify a reference sample
  2. For each of the other samples and each variable, calculate the ratio to the reference
  3. For each sample, use the median of the distribution of the ratios as the consensus normalization factor (see the code sketch after this list)
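
A minimal sketch of the whole recipe as an R function (x is a samples-by-metabolites matrix; by default the first row plays the role of the reference):

pqn <- function(x, ref = x[1, ]) {
  # 1. ratio of every variable to the reference, sample by sample
  ratios  <- sweep(x, MARGIN = 2, STATS = ref, FUN = "/")
  # 2. consensus factor for each sample = median of its ratios
  factors <- apply(ratios, 1, median, na.rm = TRUE)
  # 3. divide each sample (row) by its consensus factor
  sweep(x, MARGIN = 1, STATS = factors, FUN = "/")
}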

In action

  • Consider 4 samples where we measure 100 metabolites
  • For three of the four samples, the value of each metabolite is drawn from a Gaussian distribution with mean 200 and variance 10
  • For the fourth sample, the mean is 100 and the variance is 10

This design simulates a drop by a factor of 2 in the response of my pipeline while measuring the fourth sample.

Let’s take the first sample as reference …
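
The PQNratios object used in the next chunk is built in code not shown here; a possible construction, consistent with the design above (the object and column names are taken from the output below, everything else is an assumption), could be:

library(tidyverse)

set.seed(123)
# 4 samples x 100 metabolites: rows s1-s3 drawn from N(200, var = 10),
# row s4 from N(100, var = 10) to mimic the 2-fold drop in response
mets <- matrix(rnorm(4 * 100, mean = c(200, 200, 200, 100), sd = sqrt(10)), nrow = 4)
rownames(mets) <- paste0("s", 1:4)
colnames(mets) <- paste0("met", 1:100)

# ratio of each metabolite to the reference sample s1, reshaped to long format
PQNratios <- sweep(mets[-1, ], MARGIN = 2, STATS = mets[1, ], FUN = "/") %>%
  as_tibble(rownames = "name") %>%
  pivot_longer(-name, names_to = "metabolite", values_to = "value")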

Distribution of ratios

Code …

# median of the metabolite-to-reference ratios for each sample:
# this is the PQN consensus normalization factor
PQNratios %>% 
  group_by(name) %>% 
  summarise(median = median(value))
## # A tibble: 3 × 2
##   name  median
##   <chr>  <dbl>
## 1 s2     1.01 
## 2 s3     1.00 
## 3 s4     0.504
  • So, to make s4 comparable to the others, I should divide its signal by roughly 0.5, as in the sketch below
  • The use of the median ensures that the approach also works in the presence of a few genuine biomarkers
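
Applying the correction is then just a division by the consensus factors; a sketch that reuses the objects defined above:

factors <- PQNratios %>%
  group_by(name) %>%
  summarise(f = median(value))   # one consensus factor per sample (s2, s3, s4)

mets_pqn <- mets
mets_pqn[factors$name, ] <- sweep(mets[factors$name, ], MARGIN = 1,
                                  STATS = factors$f, FUN = "/")

rowMeans(mets_pqn)               # all four samples now sit around 200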