False Positives

library(tidyverse)

This demo deals with “random organization”. As we discussed in the lecture, an experiment is always performed on a subset of the population, so a measured “organization”

can mirror a population level property
can be the result of a “unlucky” sampling

To understand the second point, just consider an hypothetical sampling of the human population made by an extraterrestrial entity. If he/she (it?) random sample a couple of humans, it will likely conclude that we are all Chinese!

Let’s make a more scientific example. Let’s create a fake metabolomics experiment (but it can be everything you like …) where I’m measuring 1000 independent variables on a group of samples. And let’s suppose that I want to study the correlation between the different variables to infer new regulatory networks …

To simulate my dummy experiment I will fill my data matrices with random numbers, to show the role of the sample/variable ratio in this business I’ll also simulate my dummy data with different number of samples

## Random expression data matrix
dummy_data <- tibble(nsamples = seq(10,500,10)) %>% 
  mutate(DM = map(nsamples,function(x) matrix(rnorm(1000*x), ncol = 1000)))

head(dummy_data)

## # A tibble: 6 × 2
##   nsamples DM                
##      <dbl> <list>            
## 1       10 <dbl [10 × 1,000]>
## 2       20 <dbl [20 × 1,000]>
## 3       30 <dbl [30 × 1,000]>
## 4       40 <dbl [40 × 1,000]>
## 5       50 <dbl [50 × 1,000]>
## 6       60 <dbl [60 × 1,000]>

That’s the battery of data matrices of growing size. Do I have some “true” results inside my data?

NO!

OK, let’s calculate the correlation among the different variables in all the 40 scenarios

## I'm using purrr

dummy_data <- dummy_data %>% 
  mutate(varCorr = map(DM,function(f) cor(f)))

Just to see what’s going on let’s give a look to one of the correlation matrices, actually only to the first 100 variables

dim(dummy_data$varCorr[[3]])

## [1] 1000 1000

jet.colors <- colorRampPalette(c("#00007F", "blue", "#007FFF", "cyan", "#7FFF7F", "yellow", "#FF7F00", "red", "#7F0000"))

par(pty="s")
image(t(dummy_data$varCorr[[3]][1:100,1:100]), xaxt = "n", yaxt = "n", col = jet.colors(20))

Right, in the matrix the color of each pixel is proportional to the correlation between two variables (here only 100 are shown). The diagonal is made up of 1s, because the correlation between each variable and itself is 1. The matrix is symmetric, since the measure of Pearson correlation is symmetric.

The plot is rotated … this is an idiosyncrasy of the image basic plot … however

Now we get only the upper triangle of the matrix in order to look to the distribution of the correlation coefficients

dummy_data <- dummy_data %>% 
  mutate(uptri = map(varCorr,~.x[upper.tri(.x)]))

Let’s now start from the most “unfavourable” situation: 10 samples and 1000 variables

hist(dummy_data$uptri[[1]], 
     col = "steelblue", 
     breaks = 100,
     main = "Random Correlations")
abline(v = c(-0.8,0.8), col = "red", lty = 2)

Unexpectedly ( ;-) ) … we also have variables showing high correlation! The vertical lines show the 0.8 limit …

In absolute terms

range(dummy_data$uptri[[1]])

## [1] -0.9725758  0.9887498

!!!!

The code below gives you the indexes of the variables showing high correlation (excluding the 1s)

highcorrid <- which(dummy_data$varCorr[[1]] > 0.95 & dummy_data$varCorr[[1]] != 1, arr.ind = TRUE)
highcorrid

##       row col
##  [1,]  61  10
##  [2,]  10  61
##  [3,] 252  98
##  [4,] 464 179
##  [5,]  98 252
##  [6,] 813 399
##  [7,] 574 408
##  [8,] 179 464
##  [9,] 408 574
## [10,] 966 610
## [11,] 973 785
## [12,] 399 813
## [13,] 610 966
## [14,] 785 973

par(pty="s")
plot(dummy_data$DM[[1]][,highcorrid[1,1]],dummy_data$DM[[1]][,highcorrid[1,2]], 
     xlab = paste0("Var ",highcorrid[1,1]), 
     ylab = paste0("Var ",highcorrid[1,2]),
     col = "darkred", pch = 19, cex = 2)

Here the presence of high correlation is clear!!!

But this correlation does not have a “biological” origin … so it does not represent a truly scientific result …

Before giving you the chance to play with some random science, let’s look oh what happens to the histogram of the correlation coefficients as the number of samples increases …

dummy_data %>% 
  filter(nsamples %in% c(10,100,300, 500)) %>% 
  dplyr::select(nsamples, uptri) %>%                  ## this is needed because a function called `select` is abiguous
  unnest(uptri) %>% 
  ggplot() + 
  geom_histogram(aes(x = uptri, fill = factor(nsamples)), alpha = 0.5, col = "white", binwidth = 0.05) + 
  geom_vline(xintercept = c(-0.8,0.8), col  ="red", lty = 2) + 
  facet_wrap(vars(nsamples), scales = "free_y") + 
  theme_light()

What we see here is interesting, the probability of getting high correlation coefficients by chance decreases with the number of samples. As expected, measuring more gives to me a more realistic view of reality …

Something for you!

Play around and try to understand what I did
Simulate an experiment where I have two classes of samples (10 treated, 10 controls) and I’m characterizing them with an array of new machines able to measure 5, 10, 100, 1000, 10000 variables. Fill each data matrix with random numbers. In each case calculate the p-value of an univariate t-test for each variable (you will have 5, 10, 1000, 10000 p-values). Do you find p-values lower than 0.05? Can you plot the histograms of the p-values?

Many of the function you need to do that are present in my demo. In addition:

ex2 <- tibble(nvar = c(5,10,100,1000,10000))
ex2

## # A tibble: 5 × 1
##    nvar
##   <dbl>
## 1     5
## 2    10
## 3   100
## 4  1000
## 5 10000

ex2 <- ex2 %>% 
  mutate(A = map(nvar,function(x) matrix(rnorm(x*10), nrow = 10)))

ex2

## # A tibble: 5 × 2
##    nvar A                  
##   <dbl> <list>             
## 1     5 <dbl [10 × 5]>     
## 2    10 <dbl [10 × 10]>    
## 3   100 <dbl [10 × 100]>   
## 4  1000 <dbl [10 × 1,000]> 
## 5 10000 <dbl [10 × 10,000]>

ex2 <- ex2 %>% 
  mutate(B = map(nvar,function(x) matrix(rnorm(x*10), nrow = 10)))

ex2

## # A tibble: 5 × 3
##    nvar A                   B                  
##   <dbl> <list>              <list>             
## 1     5 <dbl [10 × 5]>      <dbl [10 × 5]>     
## 2    10 <dbl [10 × 10]>     <dbl [10 × 10]>    
## 3   100 <dbl [10 × 100]>    <dbl [10 × 100]>   
## 4  1000 <dbl [10 × 1,000]>  <dbl [10 × 1,000]> 
## 5 10000 <dbl [10 × 10,000]> <dbl [10 × 10,000]>

myt <- function(m,n) {
  ps <- rep(0,ncol(m))
  for (i in 1:length(ps)){
    myt <- t.test(m[,i],n[,i])
    ps[i] <- myt$p.value
  }
  return(ps)
}

## map2 is the extension of map the case in which the new column is created combining 
## the content of other two columns ...
ex2 <- ex2 %>% 
  mutate(ps = map2(A,B,function(x,y) myt(x,y)))

ex2

## # A tibble: 5 × 4
##    nvar A                   B                   ps            
##   <dbl> <list>              <list>              <list>        
## 1     5 <dbl [10 × 5]>      <dbl [10 × 5]>      <dbl [5]>     
## 2    10 <dbl [10 × 10]>     <dbl [10 × 10]>     <dbl [10]>    
## 3   100 <dbl [10 × 100]>    <dbl [10 × 100]>    <dbl [100]>   
## 4  1000 <dbl [10 × 1,000]>  <dbl [10 × 1,000]>  <dbl [1,000]> 
## 5 10000 <dbl [10 × 10,000]> <dbl [10 × 10,000]> <dbl [10,000]>

False Positives

Pietro Franceschi