Normal Distribution

  • The Normal (Gaussian) distribution has a prominent role in statistics
  • Some sort of “normality” is often the prerequisite of many statistical tools
  • With normal data, the mean is the value with the highest likelihood
  • Reasoning on the mean is equivalent to reasoning on the most probable value …

Sampling Distribution

The distribution of the mean of a sample drawn from almost any distribution (with finite variance) is approximately normal, provided the sample is large enough (central limit theorem)
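The statement above can be checked with a small simulation. The sketch below is illustrative only: the exponential population, the sample size of 50, the 5000 repetitions and the helper names `sample_means` and `skewness` are assumptions, not part of the original material. It uses skewness as a rough indicator of how far a distribution is from the symmetric normal shape.

```python
import random
import statistics


def sample_means(draw, sample_size, n_repeats, seed=0):
    """Return the means of n_repeats samples, each made of sample_size draws."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_repeats)
    ]


def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean(((v - m) / s) ** 3 for v in values)


def draw_exponential(rng):
    """One draw from an exponential population (strongly right-skewed)."""
    return rng.expovariate(1.0)


# Single observations keep the skewed shape of the population ...
raw = sample_means(draw_exponential, sample_size=1, n_repeats=5000)
# ... while means of 50 observations are much closer to symmetric.
means = sample_means(draw_exponential, sample_size=50, n_repeats=5000)

print(f"skewness of single draws:      {skewness(raw):.2f}")
print(f"skewness of means of 50 draws: {skewness(means):.2f}")
```

With small samples or very heavy-tailed populations the approximation can still be poor, so the sample size matters.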

Sampling Distribution

Noteworthy

  1. The sampling distribution gets narrower as the sample size increases
  2. This is the ultimate reason why, when a real effect is present, measuring more samples leads to smaller p-values

Formally

  • Mean of the sampling distribution: \(\mu_{pop}\)
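  • Standard deviation of the sampling distribution (standard error): \(\frac{\sigma_{pop}}{\sqrt{N}}\)
  • Variance of the sampling distribution: \(\frac{\sigma_{pop}^2}{N}\)

Where \(N\) is the size of the sample and \(\sigma_{pop}\) is the standard deviation of the population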
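As a rough numerical check of how the spread of the sampling distribution shrinks with the square root of the sample size, one can simulate many sample means and compare their standard deviation with the theoretical value. The population parameters `mu_pop` and `sigma_pop`, the sample sizes and the number of repetitions below are arbitrary assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)
mu_pop = 10.0     # assumed population mean
sigma_pop = 2.0   # assumed population standard deviation
n_repeats = 4000  # number of simulated samples per sample size

results = {}
for n in (5, 20, 80):
    # Mean of each simulated sample of size n.
    means = [
        statistics.fmean(rng.gauss(mu_pop, sigma_pop) for _ in range(n))
        for _ in range(n_repeats)
    ]
    observed = statistics.stdev(means)   # spread of the simulated means
    expected = sigma_pop / math.sqrt(n)  # theoretical standard error
    results[n] = (observed, expected)
    print(f"N={n:3d}  observed={observed:.3f}  expected={expected:.3f}")
```

Quadrupling the sample size should roughly halve the spread of the mean.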

… but

We often deal with data that are not normally distributed

  1. Count-based technologies never return negative numbers … (MS!)
  2. Presence of sub-populations
  3. Outliers

Subpopulations are typically present in complex experiments

Subpopulations

Lognormal data

Lognormal data are extremely common in metabolomics …

  • The mean is not the most probable value!
  • Statistical machinery focusing on the mean is not the right tool for the job
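To make the gap between mean and most probable value concrete, the sketch below compares the closed-form mode, median and mean of a lognormal distribution with values estimated from simulated data. The parameters `mu = 0` and `sigma = 1` (on the log scale) and the sample size are assumptions picked for illustration.

```python
import math
import random
import statistics

mu, sigma = 0.0, 1.0  # assumed parameters of the underlying normal (log scale)

# Closed-form summaries of a lognormal distribution.
mode = math.exp(mu - sigma ** 2)      # most probable value
median = math.exp(mu)
mean = math.exp(mu + sigma ** 2 / 2)

rng = random.Random(7)
sample = [rng.lognormvariate(mu, sigma) for _ in range(20000)]
sample_mean = statistics.fmean(sample)
sample_median = statistics.median(sample)

# On the log scale the data are normal again, so mean and median agree.
log_values = [math.log(v) for v in sample]
log_mean = statistics.fmean(log_values)
log_median = statistics.median(log_values)

print(f"theory: mode={mode:.3f}  median={median:.3f}  mean={mean:.3f}")
print(f"sample: median={sample_median:.3f}  mean={sample_mean:.3f}")
print(f"log scale: median={log_median:.3f}  mean={log_mean:.3f}")
```

The mean sits well to the right of the mode, pulled by the long right tail, while after a log transformation mean and median nearly coincide.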

Solutions

  1. Non-parametric approaches (Kruskal-Wallis, quantile regression, permutations, bootstrap, …)
  2. Variable transformations

Remember that non-parametric tests suffer from a lack of power and are often of little use in investigations with only a few samples (5-10)
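As one possible illustration of the two strategies, the sketch below applies a simple two-sided permutation test for a difference in means to log-transformed data. The function name `permutation_p_value`, the simulated groups and all parameter values are hypothetical and only meant to show the mechanics.

```python
import math
import random
import statistics


def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for the difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign the group labels.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction: the estimated p-value can never be exactly 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical lognormal measurements for two small groups.
rng = random.Random(3)
control = [rng.lognormvariate(0.0, 0.5) for _ in range(8)]
treated = [rng.lognormvariate(1.0, 0.5) for _ in range(8)]

# Variable transformation: work on the log scale.
log_control = [math.log(v) for v in control]
log_treated = [math.log(v) for v in treated]

p_value = permutation_p_value(log_control, log_treated)
print(f"permutation p-value on log scale: {p_value:.4f}")
```

With very small groups the number of distinct label permutations is limited, which puts a floor on the smallest p-value that can be reached; this is one face of the low power mentioned above.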

Mean and median are ok!

Checking Normality

  1. Normality tests … again statistics ;-)
  2. Graphical methods: q-q plots

Quantile-quantile plots

Quantile-quantile plots are used to visually compare the theoretical quantiles of a distribution with the sample quantiles
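The points of a normal q-q plot can be computed by hand, which may help to see what the plot compares. The sketch below is a minimal version under stated assumptions: the plotting positions `(i + 0.5) / n`, the helper names `qq_points` and `qq_correlation`, and the simulated data are illustrative choices, and the correlation is used only as a rough numerical summary of how straight the plot is.

```python
import math
import random
import statistics


def qq_points(sample):
    """Pair each sorted observation with the matching standard normal quantile."""
    ordered = sorted(sample)
    n = len(ordered)
    standard_normal = statistics.NormalDist()
    theoretical = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


def qq_correlation(sample):
    """Pearson correlation of the q-q points; close to 1 when they lie on a line."""
    xs, ys = zip(*qq_points(sample))
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
lognormal_sample = [rng.lognormvariate(0.0, 1.0) for _ in range(200)]
log_transformed = [math.log(v) for v in lognormal_sample]

print(f"raw lognormal data: r = {qq_correlation(lognormal_sample):.3f}")
print(f"log-transformed:    r = {qq_correlation(log_transformed):.3f}")
```

Plotting the pairs returned by `qq_points` gives the q-q plot itself: normal data fall close to a straight line, while lognormal data bend upwards in the right tail.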

q-q plot for lognormal data

Take home message

Before running your statistical machinery, take a look at your data to check whether what you are doing makes sense …