The BIG question

Is what I’m observing true beyond my sample? Can I draw general conclusions from a limited set of samples?


In the presence of variability, there is always the possibility that what I observe in my data cannot be generalized to the population level. To deal with this we can:

  • measure more samples
  • validate
  • give a measure of my confidence in the results

On Variability

The overall variability comes from the sum of the different sources of variability
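
As a sketch, assuming the different sources are independent, it is the variances that add up (the labels below are only illustrative examples of possible sources):

\[ \sigma^{2}_{total} = \sigma^{2}_{biological} + \sigma^{2}_{technical} + \sigma^{2}_{measurement} + \dots \]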


Sampling Distribution

  • We are always dealing with a group of “individuals” extracted from an unknown population: a Sample
  • From each sample we calculate summary statistics, typically the mean
  • And we ask ourselves whether something significant happens to the mean (e.g. are the two means different?)


  • The distribution of the sample means is approximately normal, regardless of the shape of the population distribution (Central Limit Theorem)
  • The width of the distribution of the sample mean decreases as the sample size increases

More samples …

More samples, narrower distribution.
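
A minimal simulation sketch of this point (hypothetical normal population with mean 200 and standard deviation 50, matching the cholesterol example below; numbers and seed are illustrative):

    import numpy as np

    rng = np.random.default_rng(42)
    mu, sigma = 200, 50                      # hypothetical population

    for n in (5, 50, 500):
        # draw 10,000 samples of size n and keep the mean of each one
        means = rng.normal(mu, sigma, size=(10_000, n)).mean(axis=1)
        # the spread of the sample means shrinks like sigma / sqrt(n)
        print(f"n = {n:3d}  sd of sample means = {means.std():5.2f}"
              f"  sigma/sqrt(n) = {sigma / np.sqrt(n):5.2f}")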

Statistical testing

Due to variability, it is impossible to get certain answers from an experiment. The best one can do is to try to quantify the level of confidence.

Statistical testing is a procedure which allows us to quantify this level of confidence

The concept

  1. Suppose that what we observe is the result of chance alone (Null Hypothesis - H0)
  2. Use statistics to calculate the probability of getting, under H0 (by chance!), a result at least as extreme as what we observe (p-value)
  3. Set a threshold of reasonable confidence (0.05, 0.01, …)
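
In symbols, a sketch of the definition, writing the observed result as a test statistic \(t_{obs}\) and assuming that large values count as “extreme”:

\[ p = P\left( T \geq t_{obs} \mid H_{0} \right) \]

(with \(\leq\) instead of \(\geq\) when the interesting deviation is in the other direction, as in the cholesterol example below)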

Example: lowering cholesterol

  • Suppose that the level of cholesterol in the population is normally distributed with mean 200 and standard deviation 50
  • We claim that a new secret drug significantly reduces the cholesterol level in the population
  • To prove that, we take a sample of 50 people, treat them with the drug and measure their average cholesterol level. This mean turns out to be 193.
  • Does this pilot study support my claim? i.e. is the difference between 193 and 200 significant?

Let’s test that!

  1. Suppose that the drug has no effect (H0) … but then my 50 people are a random draw from the population
  2. Calculate the distribution of the mean level of cholesterol in groups of 50 people coming from the population (mind the gap! … this is not the distribution of cholesterol in the population!)
  3. Calculate the probability of obtaining a mean of 193 or lower from this distribution (p-value)
  4. Reject the H0 if the p-value is lower than the selected threshold (typically 0.05)
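
A minimal simulation sketch of steps 1–4, using the population parameters stated above (the seed and the number of simulated groups are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    mu0, sigma, n = 200, 50, 50          # population under H0 and group size
    observed_mean = 193

    # step 2: sampling distribution of the mean of groups of 50 people under H0
    null_means = rng.normal(mu0, sigma, size=(100_000, n)).mean(axis=1)

    # step 3: empirical p-value = fraction of null means at or below the observed one
    p_value = np.mean(null_means <= observed_mean)
    print(f"p-value ~ {p_value:.3f}")    # roughly 0.16 with these numbers

    # step 4: compare with the chosen threshold
    print("reject H0" if p_value < 0.05 else "cannot reject H0")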

Let’s plot it!

What we see

  • The means of the different groups are different
  • The distribution of the means is nicely centered around the population mean!
  • The blue line represents the mean of my sample of 50 people (193!)
  • The blue dots are “random” draws from the population which show an average level of cholesterol lower than the one we observed
  • For the blue dots we observe 193 or less by chance
  • Apparently getting a value at least that extreme only by chance is not extremely unlikely … 14 blue dots out of 100 (p = 0.14!)
  • I cannot reject H0 at the 0.05 level …

More on this …

So you see … we need a probability and we also need a threshold (0.05)


The correct wording in my paper should be

“… the results are not statistically significant at the 0.05 level of confidence”

but they are significant at the 0.15 level!

More samples!

I’m stubborn, and I’m still convinced that the drug is really working. We redo the same study with a test group of 500 people and once again we observe an average level of 193

Magic! Now it is significant at the 0.05 level! No more blue dots! It is extremely unlikely to get an average cholesterol level of 193 (or lower) in a group of 500 people if the drug has no lowering effect
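
A sketch of why the same observed mean becomes significant with more people: the standard error of the mean shrinks like \(\sigma/\sqrt{n}\). Here a one-sided normal test with the same numbers as above (scipy is assumed to be available):

    import numpy as np
    from scipy.stats import norm

    mu0, sigma, xbar = 200, 50, 193          # H0 parameters and observed mean

    for n in (50, 500):
        se = sigma / np.sqrt(n)                   # standard error of the mean under H0
        p = norm.cdf(xbar, loc=mu0, scale=se)     # P(mean <= 193 | H0), one-sided
        print(f"n = {n:3d}  SE = {se:5.2f}  p = {p:.4f}")

    # n =  50  ->  SE ~ 7.07, p ~ 0.16   (not significant at 0.05)
    # n = 500  ->  SE ~ 2.24, p ~ 0.0009 (significant at 0.05)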

Take home messages


  • The choice of a threshold (here 0.05) for statistical significance is arbitrary
  • With more samples we can “see” smaller differences
  • We are never sure!
  • Is statistical significance the only thing we are looking for?

Back to our magic drug …

Unfortunately it turns out that our drug is not so good … apparently it reduces cholesterol by only 0.01%

Is a low p-value the only thing we need?

  • Is a reduction of 0.01% really useful/relevant?
  • A big number of samples will make tiny differences statistically significant! (see the sketch after this list)
  • Statistical significance does not mean biological/agronomic/medical relevance
  • The p-value alone cannot be used to judge the relevance of a research result …
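
A rough back-of-the-envelope sketch of the second bullet. Reading “0.01%” as an absolute reduction of 0.02 on a mean of 200 is my assumption, and the one-sided normal test is the same as before:

    import numpy as np
    from scipy.stats import norm

    sigma, alpha = 50, 0.05
    z_crit = norm.ppf(1 - alpha)          # one-sided critical value, ~1.645

    for diff in (7, 0.02):                # 7 = the pilot result; 0.02 = a 0.01% reduction of 200
        # smallest n at which this difference is significant at 0.05,
        # assuming the observed mean sits exactly at 200 - diff
        n_needed = int(np.ceil((z_crit * sigma / diff) ** 2))
        print(f"difference = {diff:5.2f}  ->  n needed ~ {n_needed:,}")

    # difference =  7.00  ->  n needed ~ 139
    # difference =  0.02  ->  n needed ~ 17 million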

Erroneous …

  • p-values deal with the probability of obtaining a result by chance, not with the strength of an effect
  • strong effects with low variability will result in low p-values
  • the reverse is not necessarily true!
  • “look, I have a low p-value!” is not the only thing to look for

Measuring the effect size

Notes

  1. The difference in means is not sufficient
  2. The measure should take the variability into account
  3. The variability of the population, not that of the sampling distribution ;-)

The fold change is not a good measure of the effect size …

Cohen’s d

\[ d = \frac{\bar{x}_{1} - \bar{x}_{2}}{s} \]

Where

  • \(\bar{x}_{1}\) and \(\bar{x}_{2}\) are the estimates of the two population means
  • \(s\) is the estimate of the population standard deviation (pooled)
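
A minimal sketch of the computation; the pooled standard deviation used here is the usual weighted combination of the two sample variances (function name and data are illustrative):

    import numpy as np

    def cohens_d(x1, x2):
        """Cohen's d: difference in means divided by the pooled standard deviation."""
        n1, n2 = len(x1), len(x2)
        v1, v2 = np.var(x1, ddof=1), np.var(x2, ddof=1)
        s_pooled = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
        return (np.mean(x1) - np.mean(x2)) / s_pooled

    rng = np.random.default_rng(1)
    x1 = rng.normal(5, 5, size=100)       # same populations as in the example below
    x2 = rng.normal(10, 5, size=100)
    print(f"d ~ {cohens_d(x1, x2):.2f}")  # around -1 for these populations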

Let’s see

  • 2 populations: \(\mu_{1}=5\), \(\mu_{2}=10\), \(\sigma=5\)
  • t-test to test the difference
  • different sample sizes
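
A simulation sketch along these lines (the exact sample sizes and the number of repetitions are my choice; scipy is assumed to be available):

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2)
    mu1, mu2, sigma = 5, 10, 5            # the two populations described above
    n_sim = 1000

    for n in (3, 10, 30, 100):
        pvals = []
        for _ in range(n_sim):
            x1 = rng.normal(mu1, sigma, size=n)
            x2 = rng.normal(mu2, sigma, size=n)
            pvals.append(ttest_ind(x1, x2).pvalue)
        frac = np.mean(np.array(pvals) < 0.05)
        print(f"n = {n:3d}  fraction of significant t-tests = {frac:.2f}")

In this sketch, with three observations per group most of the runs are not significant, even though the true difference is one full standard deviation.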

Notes

  • With only three samples per group the variability of the estimates is large
  • The chance of calling the difference non-significant is also large
  • The effect size alone does not tell me whether something is relevant