Intro

During the presentation I claimed that, in the presence of known subpopulations, stratified random sampling gives a more accurate estimate of the properties of the reference population. Let’s check my statement …

The question

Suppose that I’m interested in estimating the average height in a population of 10000 people, and that I know that 8000 of them are males and 2000 are females. To create my population I use the following properties: male heights are drawn from a gaussian distribution with mean 180 cm and sd 10 cm, female heights from a gaussian with mean 150 cm and sd 10 cm.

Running the simulations

First of all I create my population

## load the tidyverse (tibble, dplyr, purrr and ggplot2 are used below)
library(tidyverse)

## separate tibbles, one per subpopulation
males <- tibble(gender = "M", height = rnorm(8000,180,10))
females <- tibble(gender = "F", height = rnorm(2000,150,10))

## merge everything in a df
mypop <- males %>% 
  bind_rows(females)

Just for fun, let’s take a look

mypop %>% 
  ggplot() + 
  geom_histogram(aes(x = height, fill = gender), col = "white",  alpha = 0.4, position = "identity")  +
  theme_light()

OK, we see what we expect: the two groups are different, and there is an imbalance.
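
We can double-check the imbalance with a quick count:

table(mypop$gender)
## 
##    F    M 
## 2000 8000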

The problem

OK, suppose now that I’m interested in a good estimate of the population mean. In my simulated case this parameter is known …

pop_mu <- mean(mypop$height)
pop_mu
## [1] 174.1062

This is where it sits in the plot …

mypop %>% 
  ggplot() + 
  geom_histogram(aes(x = height, fill = gender), col = "white",  alpha = 0.6, position = "identity") + 
  geom_vline(xintercept = pop_mu, col="red", lty = 2) + 
  theme_light()

Suppose also that I do not have the money to measure the full population, so the best I can do is to interview 30 people. I’ll then use their mean height to estimate the mean of the population.

What I can do in my simulation (and cannot do in real life) is to ask how variable my estimate of the population mean is when I repeat the sampling several times.

Simple Random Sampling

Suppose that I have no clue about the presence of the two sub-populations: in this case, the best I can do to avoid biases is simple random sampling. Remember, randomness is my best tool against introducing biases.

So let’s now repeatedly simulate the extraction of 30 people from my population, and to see the variability of the process let’s do it 500 times.

## start creating the tibble container
x <- tibble(sampid = paste0("s",1:500))
head(x)
## # A tibble: 6 × 1
##   sampid
##   <chr> 
## 1 s1    
## 2 s2    
## 3 s3    
## 4 s4    
## 5 s5    
## 6 s6
## here we do the simple random sampling ...
x <- x %>% 
  mutate(random_samples = map(sampid, ~sample(x = mypop$height,size = 30)))
head(x)
## # A tibble: 6 × 2
##   sampid random_samples
##   <chr>  <list>        
## 1 s1     <dbl [30]>    
## 2 s2     <dbl [30]>    
## 3 s3     <dbl [30]>    
## 4 s4     <dbl [30]>    
## 5 s5     <dbl [30]>    
## 6 s6     <dbl [30]>
## now we calculate the averages
x <- x %>% 
  mutate(random_averages = map_dbl(random_samples, ~mean(.x)))
head(x)
## # A tibble: 6 × 3
##   sampid random_samples random_averages
##   <chr>  <list>                   <dbl>
## 1 s1     <dbl [30]>                171.
## 2 s2     <dbl [30]>                174.
## 3 s3     <dbl [30]>                173.
## 4 s4     <dbl [30]>                175.
## 5 s5     <dbl [30]>                172.
## 6 s6     <dbl [30]>                174.

And we plot the distribution of the estimates, with the true population mean as a reference

x %>% 
  ggplot() + 
  geom_point(aes(x = sampid, y = random_averages)) + 
  geom_hline(yintercept = pop_mu, col = "red", lty = 2) + 
  theme_light() + 
  theme(axis.text.x = element_blank())

The same plot as a histogram …

x %>% 
  ggplot() + 
  geom_histogram(aes(x = random_averages), alpha = 0.5, col = "white") + 
  geom_vline(xintercept = pop_mu, col = "red", lty = 2) + 
  theme_light()

As we can see, the estimate has variability (:-P), and it is centered around the correct value of the population mean. It is also approximately normally distributed, but this you already know from the previous practical …
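
If you want to double-check the approximate normality, a quick Q-Q plot of the estimates will do:

x %>% 
  ggplot(aes(sample = random_averages)) + 
  geom_qq() + 
  geom_qq_line(col = "red") + 
  theme_light()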

Stratified sampling

Suppose now that I’m aware of the fact that my population is composed of different subgroups. To take that into account, I switch to stratified random sampling.

If I’m sticking with 30 people, I have to allocate them to the two groups in proportions that mirror those of the individuals in the population

nsamp <- 30

## here I calculate how many units to take from each stratum
subfractions <- ceiling((table(mypop$gender)/nrow(mypop))*nsamp)
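## (note: with less convenient proportions, ceiling() could make the two
## strata sum to slightly more than nsamp)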

## just to show how easy it is to read code written with pipes
library(magrittr)

another_subfraction <- mypop$gender %>% 
  table(.) %>% 
  divide_by(nrow(mypop)) %>% 
  multiply_by(nsamp) %>% 
  ceiling(.)

subfractions
## 
##  F  M 
##  6 24
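
And the two routes give the same counts:

all(subfractions == another_subfraction)
## [1] TRUE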

Just to check, let’s look at the proportions

subfractions[1]/subfractions[2]
##    F 
## 0.25
2000/8000
## [1] 0.25

Let’s now code this new way of sampling

## here we sample at random within each of the two subpopulations
## and return a single vector with the units extracted in a stratified way
x <- x %>% 
  mutate(stratified_random_samples = map(sampid, function(n) {
    out_males <- mypop %>% filter(gender == "M") %>% pull(height) %>% sample(.,subfractions["M"])
    out_females <- mypop %>% filter(gender == "F") %>% pull(height) %>% sample(.,subfractions["F"])
    return(c(out_males,out_females))
  }))
head(x)
## # A tibble: 6 × 4
##   sampid random_samples random_averages stratified_random_samples
##   <chr>  <list>                   <dbl> <list>                   
## 1 s1     <dbl [30]>                171. <dbl [30]>               
## 2 s2     <dbl [30]>                174. <dbl [30]>               
## 3 s3     <dbl [30]>                173. <dbl [30]>               
## 4 s4     <dbl [30]>                175. <dbl [30]>               
## 5 s5     <dbl [30]>                172. <dbl [30]>               
## 6 s6     <dbl [30]>                174. <dbl [30]>

As before, the average of each sample will be our estimate of the population mean

## now we calculate the averages
x <- x %>% 
  mutate(stratified_averages = map_dbl(stratified_random_samples, ~mean(.x)))
head(x)
## # A tibble: 6 × 5
##   sampid random_samples random_averages stratified_random_samples
##   <chr>  <list>                   <dbl> <list>                   
## 1 s1     <dbl [30]>                171. <dbl [30]>               
## 2 s2     <dbl [30]>                174. <dbl [30]>               
## 3 s3     <dbl [30]>                173. <dbl [30]>               
## 4 s4     <dbl [30]>                175. <dbl [30]>               
## 5 s5     <dbl [30]>                172. <dbl [30]>               
## 6 s6     <dbl [30]>                174. <dbl [30]>               
## # ℹ 1 more variable: stratified_averages <dbl>

And now we plot!

x %>% 
  ggplot() + 
  geom_point(aes(x = sampid, y = random_averages), col = "steelblue", alpha = 0.5) + 
  geom_point(aes(x = sampid, y = stratified_averages), col = "orange", alpha = 0.5) + 
  geom_hline(yintercept = pop_mu, col = "red", lty = 2) + 
  theme_light() + 
  theme(axis.text.x = element_blank())

As before, in terms of a histogram …

x %>% 
  ggplot() + 
  geom_histogram(aes(x = random_averages), fill = "steelblue", alpha = 0.5, col = "white") + 
  geom_histogram(aes(x = stratified_averages), fill = "orange", alpha = 0.5, col = "white") + 
  geom_vline(xintercept = pop_mu, col = "red", lty = 2) + 
  theme_light()

As you can see the average is still good, but the variability (variance) of my estimate is much smaller. This is true for every statistic you would like to estimate from your sample. If you think about it, the reason is that stratification ensures that our sample is representative of both subpopulations.
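
To put a number on it, we can compare the spread of the two sets of estimates (the exact values will depend on the random draws):

x %>% 
  summarize(sd_random = sd(random_averages),
            sd_stratified = sd(stratified_averages))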

For our metabolomics business, knowing this is important when you are dealing with observational studies: in designed experiments you are, in general, free to allocate the experimental units as you want.

Something for You

Two things for you to check:

* the effect of the sample size
* the role of the unbalancing between the two subpopulations

Both tasks can be dealt with either by changing my code above and looking at the results, or everything can be organized in a more “scientific” way: calculate the variance of the estimates and wrap the code in some sort of loop, which will allow you to conveniently show your results in a plot.

Effects of the sample size

Let’s try to organize everything in a table. We make a plan for that: first of all we re-create the population (mind that this time I use 1000 females instead of 2000, so the imbalance is stronger)

## we split data in males and females

mypop_m <- rnorm(8000,180,10)  ## Extracting 8000 numbers from a gaussian population with mean 180 and sd 10
mypop_f <- rnorm(1000,150,10)  ## and 1000 numbers from a gaussian with mean 150 and sd 10

Let’s join these two subpopulations into one population

mypop <- c(mypop_m,mypop_f)  ## Concatenating the two vectors


hist(mypop)

This will be my reference population. From this pop I can calculate the mean of the heights. Mind that this is the true population mean

mean(mypop)
## [1] 176.6231

I can also plot it

## this plot is made by using "base" R plotting utilities
hist(mypop)
abline(v = mean(mypop), 
       col = "red", 
       lty = 4, ## the type of the line
       lwd = 2)       

Now that I have my population, I want to simulate the sampling from it: basically, I want to estimate the mean of my population from samples extracted from mypop.

If you think about it, every sample will give you a slightly different estimate of the population mean.

## the true population mean, stored for later reference
pop_mean <- mean(mypop)

pop_mean
## [1] 176.6231
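
For example, two independent samples of 30 give two slightly different estimates (your values will differ, since the draws are random):

mean(sample(mypop, 30))
mean(sample(mypop, 30))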

Trust me, the variability of the estimate of the mean decreases as the size of the sample increases. The distribution of the means (called the sampling distribution of the mean) is approximately normal.
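
If you don’t want to take my word for it, here is a minimal check (again, the exact values change from run to run):

## spread of the estimate for two sample sizes: bigger sample, smaller spread
sd(replicate(500, mean(sample(mypop, 30))))
sd(replicate(500, mean(sample(mypop, 300))))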

What we do now is to compare the spread of the estimates of the population mean when sampling is performed in a “random” fashion or in a stratified fashion.

In particular, I want to test both the effect of the stratification and that of the size of my sample.

I now create a vector containing the sample sizes I’ll check

## Set of sizes we will consider
sizes <- c(30,50,100,500)
sample_size_effect <- tibble(s_size = rep(c(sizes), each = 500)) %>% 
  mutate(m_fraction = length(mypop_m)/(length(mypop))) %>% 
  mutate(f_fraction = 1 - m_fraction) %>% 
  mutate(across(contains("fraction"), function(s) ceiling(s*s_size),.names = "sample_{.col}")) %>% 
  #mutate(across(contains("fraction"),~ ceiling(.x*s_size),.names = "sample_{.col}")) %>% 
  mutate(random_samplings = map(s_size, ~ sample(mypop, size = .x))) %>% 
  mutate(stratified_samplings = map2(sample_m_fraction,sample_f_fraction, function(a,b) c(sample(mypop_m,a), sample(mypop_f,b)))) %>%
  # mutate(stratified_samplings = map2(sample_m_fraction,sample_f_fraction,~ c(sample(mypop_m,.x), sample(mypop_f,.y)))) %>%
  mutate(mean_random = map_dbl(random_samplings,~mean(.x))) %>% 
  mutate(mean_strat = map_dbl(stratified_samplings,~mean(.x)))
sample_size_effect %>% 
  ggplot() + 
  geom_histogram(aes(x = mean_random), 
                 col = "white",    ## color of the border
                 fill = "steelblue",  ## color of the bars
                 alpha = 0.5) +  ## transparency
  geom_histogram(aes(x = mean_strat), col = "white", fill = "orange", alpha = 0.5) + 
  geom_vline(xintercept = pop_mean, lty = 2, col = "red") + 
  facet_wrap(~factor(s_size), scales = "free")

So in all cases stratifying is better …

sample_size_effect %>% 
  group_by(s_size) %>% 
  summarize(sd_random = sd(mean_random),
            sd_strata = sd(mean_strat)) %>% 
  mutate(sdratio = sd_random/sd_strata)
## # A tibble: 4 × 4
##   s_size sd_random sd_strata sdratio
##    <dbl>     <dbl>     <dbl>   <dbl>
## 1     30     2.52      1.77     1.43
## 2     50     2.01      1.41     1.42
## 3    100     1.35      0.982    1.38
## 4    500     0.631     0.446    1.41
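
By the way, the value of the ratio is no mystery: with proportional allocation the stratified mean is affected (approximately) only by the within-strata variability, while simple random sampling also pays for the between-strata variability. A back-of-the-envelope check with the nominal parameters of our population (means 180 and 150, sd 10 in both strata) gives a value close to the observed ratios:

## expected sd ratio: sqrt(total variance / within-strata variance)
w  <- c(length(mypop_m), length(mypop_f))/length(mypop) ## strata weights
mu <- c(180, 150)                                       ## nominal strata means
v_within  <- 10^2                                       ## both strata have sd 10
v_between <- sum(w*(mu - sum(w*mu))^2)
sqrt((v_within + v_between)/v_within)                   ## roughly 1.37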

So the ratio is roughly constant. This means that it is always advantageous to use stratified sampling:

* our estimate of the “true” mean is less variable
* it is less likely to get an estimate completely off

Can you check the role of the unbalancing between the two subpopulations?
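
As a starting point for that last question, here is a sketch of a hypothetical helper (make_pop is just my suggestion, organize it as you prefer) to rebuild the population with a tunable imbalance and rerun the comparison:

## rebuild the two subpopulations with a given female fraction
make_pop <- function(n = 10000, f_frac = 0.2) {
  list(m = rnorm(round(n*(1 - f_frac)), 180, 10),
       f = rnorm(round(n*f_frac), 150, 10))
}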