The BIG question

Is what I’m observing true beyond my sample? Can I draw general conclusions from a limited set of samples?


In the presence of variability, there is always the possibility that what I observe in my data cannot be generalized to the population level. To deal with this we can:

  • measure more samples
  • validate
  • give a measure of my confidence in the results

On Variability

The overall variability comes from the sum of the different sources of variability
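
As a sketch, assuming the different sources are independent, it is the variances that add up (the labels below are only illustrative examples of possible sources):

\[ \sigma^{2}_{total} = \sigma^{2}_{biological} + \sigma^{2}_{technical} + \sigma^{2}_{measurement} + \dots \]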


Sampling Distribution

  • We are always dealing with a group of “individuals” extracted from an unknown population: a Sample
  • From each sample we calculate summary statistics, typically the mean
  • And we ask ourselves whether something significant happens to the mean (e.g. are the two means different?)


  • The distribution of the sample means is approximately normal, regardless of the shape of the population distribution (Central Limit Theorem)
  • The width of the distribution of the sample mean decreases as the sample size increases

More samples …

More samples, narrower distribution.
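
A minimal simulation sketch of this point (hypothetical normal population with mean 200 and standard deviation 50, matching the cholesterol example below; numbers and seed are illustrative):

    import numpy as np

    rng = np.random.default_rng(42)
    mu, sigma = 200, 50                      # hypothetical population

    for n in (5, 50, 500):
        # draw 10,000 samples of size n and keep the mean of each one
        means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
        # the spread of the sample means shrinks like sigma / sqrt(n)
        print(f"n = {n:3d}  sd of sample means = {means.std():5.2f}"
              f"  sigma/sqrt(n) = {sigma / np.sqrt(n):5.2f}")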

Statistical testing

Due to variability, it is impossible to get certain answers from an experiment. The best one can do is to try to quantify the level of confidence.

Statistical testing is a procedure which allows us to quantify this level of confidence

The concept

  1. Suppose that what we observe is the result of chance alone (Null Hypothesis - H0)
  2. Use statistics to calculate the probability of getting, under H0 (by chance!), a result at least as extreme as what we observe (p-value)
  3. Set a threshold of reasonable confidence (0.05, 0.01, …)
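
In symbols, a sketch of the definition, writing the observed result as a test statistic \(t_{obs}\) and assuming that large values count as “extreme”:

\[ p = P\left( T \geq t_{obs} \mid H_{0} \right) \]

(with \(\leq\) instead of \(\geq\) when the interesting deviation is in the other direction, as in the cholesterol example below)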

Example: lowering cholesterol

  • Suppose that the level of cholesterol in the population is normally distributed with mean 200 and standard deviation 50
  • We claim that a new secret drug significantly reduces the cholesterol level in the population
  • To prove that, we take a sample of 50 people, treat them with the drug and measure their average cholesterol level. This mean turns out to be 193.
  • Does this pilot study support my claim? i.e. is the difference between 193 and 200 significant?

Let’s test that!

  1. Suppose that the drug has no effect (H0) … but then my 50 people are a random draw from the population
  2. Calculate the distribution of the mean level of cholesterol in groups of 50 people coming from the population (mind the gap! … this is not the distribution of cholesterol in the population!)
  3. Calculate the probability of obtaining a mean of 193 or lower from this distribution (p-value)
  4. Reject the H0 if the p-value is lower than the selected threshold (typically 0.05)
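
A minimal simulation sketch of steps 1–4, using the population parameters stated above (the seed and the number of simulated groups are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    mu0, sigma, n = 200, 50, 50          # population under H0 and group size
    observed_mean = 193

    # step 2: sampling distribution of the mean of groups of 50 people under H0
    null_means = rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1)

    # step 3: empirical p-value = fraction of null means at or below the observed one
    p_value = np.mean(null_means <= observed_mean)
    print(f"p-value ~ {p_value:.3f}")    # roughly 0.16 with these numbers

    # step 4: compare with the chosen threshold
    print("reject H0" if p_value < 0.05 else "cannot reject H0")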

Let’s plot it!

What we see

  • The means of the different groups are different
  • The distribution of the means is nicely centered around the population mean!
  • The blue line represents the mean of my sample of 50 people (193!)
  • The blue dots are “random” draws from the population which show an average level of cholesterol lower than the one we observed
  • For the blue dots we observe 193 or less by chance
  • Apparently getting a value at least that extreme only by chance is not extremely unlikely … 14 blue dots out of 100 (p = 0.14!)
  • I cannot reject H0 at the 0.05 level …

More on this …

So you see … we need a probability and we also need a threshold (0.05)


The correct wording in my paper should be

“… the results are not statistically significant at the 0.05 level of confidence”

but they are significant at the 0.15 level!

More samples!

I’m stubborn, and I’m still convinced that the drug is really working. We redo the same study with a test group of 500 people and once again we observe an average level of 193

Magic! Now it is significant at the 0.05 level! No more blue dots! It is extremely unlikely to get an average cholesterol level of 193 (or lower) in a group of 500 people if the drug has no lowering effect
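
A sketch of why the same observed mean becomes significant with more people: the standard error of the mean shrinks like \(\sigma/\sqrt{n}\). Here a one-sided normal test with the same numbers as above (scipy is assumed to be available):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, xbar = 200, 50, 193          # H0 parameters and observed mean

    for n in (50, 500):
        se = sigma / np.sqrt(n)                   # standard error of the mean under H0
        p = norm.cdf(xbar, loc=mu0, scale=se)     # P(mean <= 193 | H0), one-sided
        print(f"n = {n:3d}  SE = {se:5.2f}  p = {p:.4f}")

    # n =  50  ->  SE ~ 7.07, p ~ 0.16   (not significant at 0.05)
    # n = 500  ->  SE ~ 2.24, p ~ 0.0009 (significant at 0.05)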

Take home messages


  • The choice of a threshold (here 0.05) for statistical significance is arbitrary
  • With more samples we can “see” smaller differences
  • We are never sure!
  • Is statistical significance the only thing we are looking for?

Back to our magic drug …

Unfortunately it turns out that our drug is not so good … apparently it reduces cholesterol by only 0.01%

Is a low p-value the only thing we need?

  • Is a reduction of 0.01% really useful/relevant?
  • A big number of samples will make tiny differences statistically significant! (see the sketch after this list)
  • Statistical significance does not mean biological/agronomic/medical relevance
  • The p-value alone cannot be used to judge the relevance of a research result …
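
A rough back-of-the-envelope sketch of the second bullet. Reading “0.01%” as an absolute reduction of 0.02 on a mean of 200 is my assumption, and the one-sided normal test is the same as before:

    import numpy as np
    from scipy.stats import norm

    sigma, alpha = 50, 0.05
    z_crit = norm.ppf(1 - alpha)          # one-sided critical value, ~1.645

    for diff in (7, 0.02):                # 7 = the pilot result; 0.02 = a 0.01% reduction of 200
        # smallest n at which this difference is significant at 0.05,
        # assuming the observed mean sits exactly at 200 - diff
        n_needed = int(np.ceil((z_crit * sigma / diff) ** 2))
        print(f"difference = {diff:5.2f}  ->  n needed ~ {n_needed:,}")

    # difference =  7.00  ->  n needed ~ 139
    # difference =  0.02  ->  n needed ~ 17 million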

Erroneous …

  • p-values deal with the probability of obtaining a result by chance, not with the strength of an effect
  • strong effects with low variability will result in low p-values
  • the reverse is not necessarily true!
  • “look, I have a low p-value!” is not the only thing to look for

Measuring the effect size

Notes

  1. The difference in means is not sufficient
  2. The measure should take the variability into account
  3. The variability of the population, not that of the sampling distribution ;-)

The fold change is not a good measure of the effect size …

Cohen’s d

\[ d = \frac{\bar{x}_{1} - \bar{x}_{2}}{s} \]

Where

  • \(\bar{x}_{1}\) and \(\bar{x}_{2}\) are the estimates of the two population means
  • \(s\) is the estimate of the population standard deviation (pooled)
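
A minimal sketch of the computation; the pooled standard deviation used here is the usual weighted combination of the two sample variances (function name and data are illustrative):

    import numpy as np

    def cohens_d(x1, x2):
        """Cohen's d: difference in means divided by the pooled standard deviation."""
        n1, n2 = len(x1), len(x2)
        v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
        s_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
        return (np.mean(x1) - np.mean(x2)) / s_pooled

    rng = np.random.default_rng(1)
    x1 = rng.normal(5, 5, size=100)       # same populations as in the example below
    x2 = rng.normal(10, 5, size=100)
    print(f"d ~ {cohens_d(x1, x2):.2f}")  # around -1 for these populations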

Let’s see

  • 2 populations: \(\mu_{1}=5\), \(\mu_{2}=10\), \(\sigma=5\)
  • t-test to test the difference
  • different sample sizes
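
A simulation sketch along these lines (the exact sample sizes and the number of repetitions are my choice; scipy is assumed to be available):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2)
    mu1, mu2, sigma = 5, 10, 5            # the two populations described above
    n_sim = 1000

    for n in (3, 10, 30, 100):
        pvals = []
        for _ in range(n_sim):
            x1 = rng.normal(mu1, sigma, size=n)
            x2 = rng.normal(mu2, sigma, size=n)
            pvals.append(ttest_ind(x1, x2).pvalue)
        frac = np.mean(np.array(pvals) < 0.05)
        print(f"n = {n:3d}  fraction of significant t-tests = {frac:.2f}")

In this sketch, with three observations per group most of the runs are not significant, even though the true difference is one full standard deviation.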

Notes

  • With only three samples per group the variability of the estimates is large
  • The chance of calling the difference non-significant is also large
  • The effect size alone does not tell me whether something is relevant