Missing Values

In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Data can be missing at random or not at random.
In metabolomics, missing values arise when, for a given sample, it was not possible to obtain an intensity for a specific feature/metabolite.

Notes

In metabolomics, missing values pop up for different reasons:

1. There was an error during preprocessing (targeted or untargeted)
2. For a sample, the concentration of a metabolite was really low
3. A metabolite was “absent” in a sample, so I get a missing value (e.g. red pigment in an extract of white grape)

Are 1, 2 and 3 random or not random?

Analytical Absence

Having a missing value is not equivalent to having zero concentration!

Each analytical method is characterized by a well-defined detection limit
Every concentration between zero and the detection limit will not be detected
We can never be 100% sure that something is absent from a sample …
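
The effect of a detection limit can be sketched in a few lines of Python. This is only an illustration: the `measure` function, the detection limit of 0.5 and the concentrations are hypothetical, and missing values are represented as `None`.

```python
def measure(true_concentrations, detection_limit):
    """Return what the instrument reports: None for anything below the detection limit."""
    return [c if c >= detection_limit else None for c in true_concentrations]

LOD = 0.5  # hypothetical detection limit, arbitrary units
true_conc = [0.0, 0.1, 0.49, 0.5, 2.0]  # made-up "true" concentrations

observed = measure(true_conc, LOD)
print(observed)  # [None, None, None, 0.5, 2.0]
```

A true zero and a concentration of 0.49 both end up as a missing value, which is why a missing value cannot be read as "zero concentration".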

Where can we get the detection limit?

Scenario 1: 1 class

Signal close to the detection limit

Missing values are “randomly” popping up

Scenario 2: 2 class biomarker

Missing values are ~~not~~ “randomly” popping up

Scenario 3: 2 class ??

What would you propose to do in the previous three cases?

Dealing with missing values

Use statistical methods able to handle missing data
~~Impute~~ them: put in a reasonable number with variability
Remove features with too many NAs (scenario 3)

How many NAs are acceptable depends on the allocation of the samples to the factors of the study.
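
One possible way to take the sample allocation into account is to count NAs within each class rather than over the whole dataset. The sketch below is an assumption, not a prescribed rule: the function names, the 50% threshold and the "keep the feature if at least one class is well observed" criterion are all hypothetical choices, and the data are made up.

```python
def na_fraction_by_class(values, classes):
    """Fraction of missing (None) values of one feature within each class."""
    counts = {}
    for v, c in zip(values, classes):
        total, missing = counts.get(c, (0, 0))
        counts[c] = (total + 1, missing + (v is None))
    return {c: missing / total for c, (total, missing) in counts.items()}

def keep_feature(values, classes, max_na=0.5):
    """Keep the feature if at least one class has an NA fraction <= max_na (assumed rule)."""
    return min(na_fraction_by_class(values, classes).values()) <= max_na

classes = ["A", "A", "A", "B", "B", "B"]
biomarker = [None, None, None, 5.1, 4.8, 5.3]  # missing in class A only
sparse = [None, 1.2, None, None, None, 0.9]    # mostly missing in both classes

print(na_fraction_by_class(biomarker, classes))  # {'A': 1.0, 'B': 0.0}
print(keep_feature(biomarker, classes))          # True
print(keep_feature(sparse, classes))             # False
```

With a global threshold both features would look equally bad (half or more of the values missing), yet the first one is exactly the kind of feature a two-class biomarker study is looking for.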

Imputation

Imputation is the process of substituting a missing value with a reasonable number:

multivariate imputation uses the values of samples which are close in the multivariate space to select a good number (missMDA package, KNN imputation, …). This works well if data are missing at random …
a reasonable number can be chosen on the basis of analytical considerations (e.g. a random number between zero and the detection limit)
the imputation strategy should not be “aware” of the design of the study
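
The analytical option (a random number between zero and the detection limit) can be sketched as follows. The function name `impute_below_lod`, the detection limit and the data are hypothetical. Note that the function never sees class labels, so it cannot be "aware" of the study design.

```python
import random

def impute_below_lod(values, detection_limit, rng):
    """Replace each None with a uniform random draw between 0 and the detection limit."""
    return [rng.uniform(0.0, detection_limit) if v is None else v for v in values]

rng = random.Random(42)  # fixed seed so the imputation is reproducible
LOD = 0.5                # hypothetical detection limit, arbitrary units
feature = [None, 1.2, None, 0.9, None]

imputed = impute_below_lod(feature, LOD, rng)
print(imputed)
```

Drawing a different random value for each missing entry, instead of one constant, keeps some variability in the imputed feature.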

Further Observations

It is “easier” to handle missing values randomly distributed
Use domain-specific knowledge (e.g. analytical knowledge such as the LOD): this injects new knowledge into the data analysis pipeline!
Quality check: try different forms of imputation. Are the outcomes sensitive to the choice?
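
A minimal sketch of such a quality check is shown below. It is an illustration under simplifying assumptions: the data and the detection limit are made up, the "outcome" is just the difference between two class means, and constant fill values (zero, half the LOD, the LOD) are used instead of random draws to keep the comparison deterministic.

```python
from statistics import mean

def impute_constant(values, fill):
    """Replace each None with the same fill value."""
    return [fill if v is None else v for v in values]

LOD = 0.5  # hypothetical detection limit, arbitrary units
class_a = [None, None, 0.6, None]
class_b = [2.0, 1.0, None, 3.0]

strategies = {"zero": 0.0, "half LOD": LOD / 2, "LOD": LOD}

# Outcome of interest here: difference between the two class means.
differences = {
    name: mean(impute_constant(class_b, fill)) - mean(impute_constant(class_a, fill))
    for name, fill in strategies.items()
}

for name, diff in differences.items():
    print(f"{name}: difference in class means = {diff:.3f}")
```

If the outcome changes substantially from one strategy to the next, the conclusion is being driven by the imputation rather than by the measured data.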