library(tidyverse)
library(readxl)
KOMP Plasma Metabolomics Dataset
Mouse knockouts facilitate the study of gene functions. Often, multiple abnormal phenotypes are induced when a gene is inactivated. The International Mouse Phenotyping Consortium (IMPC) has generated thousands of mouse knockouts and cataloged their phenotype data. We have acquired metabolomics data from 220 plasma samples from 30 unique mouse gene knockouts and corresponding wild-type mice from the IMPC. To acquire comprehensive metabolomics data, we have used liquid chromatography (LC) combined with mass spectrometry (MS) for detecting polar and lipophilic compounds in an untargeted approach. We have also used targeted methods to measure bile acids, steroids and oxylipins. In addition, we have used gas chromatography GC-TOFMS for measuring primary metabolites. The metabolomics dataset reports 832 unique structurally identified metabolites from 124 chemical classes as determined by ChemRICH software
In this demo we will use only the metabolites quantified in the targeted assays. The belong to the chemical classes of bile acids, steroids and oxylipins
Here we get the data
load("../data/KOMP_data_targeted.RData") ## the initial data were saved in RData format
We have now three files
The amount of info in the table is uncommon
colnames(metabolite_meta)
## [1] "CompoundID" "Order" "DataBaseID"
## [4] "Type" "Assay" "AnnotationApproach"
## [7] "CompoundName" "IKFirstBlock" "InChiKey"
## [10] "RetentionTime" "MZ" "SMILES"
## [13] "KEGGID" "PubChemID" "QC_RSD"
## [16] "BlankRatio" "AcylChain"
What is interesting for us is to see how much effort has been made to precisely identify the compounds. The problem of “naming” is typical of analytical chemistry and metabolomics.
metabolite_meta$InChiKey[1]
## [1] "RUDATBOHQWOJDD-BSWAIDMHSA-N"
Some of the identifier refer to publicly available databases. This is the general trend of this scientific discipline.
KEGGID refers to kegg so it can be used to link the metabolite to its position within metabolic networks.
The second tibble contains the description of the sample metadata
head(sample_meta_data)
## # A tibble: 6 × 4
## MouseID Genotype Gender Zygosity
## <dbl> <chr> <chr> <chr>
## 1 413741 Pebp1 Male Homo-
## 2 413742 Pebp1 Male Homo-
## 3 413745 Pebp1 Male Homo-
## 4 428689 Pebp1 Female Homo-
## 5 428693 Pebp1 Female Homo-
## 6 433002 Pebp1 Female Homo-
The field are auto explanatory. The MouseID is the code we need to associate the sample to the class
First of all we give a look to the experimental design, just to understand the numbers we are dealing with
table(sample_meta_data$Gender,sample_meta_data$Genotype)
##
## A2m Ahcy Atp5b Atp6v0d1 C8a Cdk4 Dhfr Dync1li1 G6pd2 Galc Gnpda1 Idh1
## Female 3 3 3 3 3 3 3 3 3 3 3 3
## Male 3 3 3 3 3 3 3 3 3 3 3 3
##
## Iqgap1 Lmbrd1 Mfap4 Mmachc Mvk Nek2 Npc2 Null Pebp1 Phyh Pipox Plk1
## Female 3 3 3 3 3 3 3 20 3 3 3 3
## Male 3 3 3 3 3 3 3 20 3 3 3 3
##
## Pmm2 Ptpn12 Pttg1 Rock1 Sra1 Ulk3 Ywhaz
## Female 3 3 3 3 3 3 3
## Male 3 3 3 3 3 3 3
head(DM[,1:20])
## # A tibble: 6 × 20
## CompoundID KOMP_413741 KOMP_413742 KOMP_413745 KOMP_428689 KOMP_428693
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CPD000001 124. 50.6 67.0 227. 101.
## 2 CPD000002 63.7 88.3 38.9 74.9 16.0
## 3 CPD000003 15.1 64.0 21.8 84.7 69.4
## 4 CPD000004 116. 549. 71.6 195. 526.
## 5 CPD000005 NA NA NA 0.55 0.87
## 6 CPD000006 NA 0.53 1 NA 1.49
## # ℹ 14 more variables: KOMP_433002 <dbl>, KOMP_185262 <dbl>, KOMP_115900 <dbl>,
## # KOMP_115996 <dbl>, KOMP_138110 <dbl>, KOMP_138120 <dbl>, KOMP_190701 <dbl>,
## # KOMP_143866 <dbl>, KOMP_143867 <dbl>, KOMP_143869 <dbl>, KOMP_144064 <dbl>,
## # KOMP_123951 <dbl>, KOMP_123953 <dbl>, KOMP_487464 <dbl>
Since our mouse ID is a number without the leading “KOMP_” we fix this on the fly
DM <- DM %>%
column_to_rownames("CompoundID") %>%
t(.) %>%
as_tibble(rownames = "Sampname") %>%
separate(Sampname, c("One","MouseID"),"_") %>% ## split the column
dplyr::select(-One)
Now the matrix looks like this
head(DM[,1:20])
## # A tibble: 6 × 20
## MouseID CPD000001 CPD000002 CPD000003 CPD000004 CPD000005 CPD000006 CPD000007
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 413741 124. 63.7 15.1 116. NA NA 140.
## 2 413742 50.6 88.3 64.0 549. NA 0.53 325.
## 3 413745 67.0 38.9 21.8 71.6 NA 1 59.0
## 4 428689 227. 74.9 84.7 195. 0.55 NA 110
## 5 428693 101. 16.0 69.4 526. 0.87 1.49 318.
## 6 433002 357. 126. 186. 146. NA 6.93 115.
## # ℹ 12 more variables: CPD000008 <dbl>, CPD000009 <dbl>, CPD000010 <dbl>,
## # CPD000011 <dbl>, CPD000012 <dbl>, CPD000013 <dbl>, CPD000014 <dbl>,
## # CPD000015 <dbl>, CPD000016 <dbl>, CPD000792 <dbl>, CPD000793 <dbl>,
## # CPD000794 <dbl>
Now I fix the sample name in the sample_meta
sample_meta_data <- sample_meta_data %>%
mutate(MouseID = as.character(MouseID))
To check that we organize the matrix in a different way also joining the concentration data with the sample metadata
DM_nest <- DM %>%
left_join(sample_meta_data) %>%
pivot_longer(starts_with("CPD"), names_to = "CompoundID", values_to = "I") %>%
nest(data = -CompoundID)
This data structure highlights the role of each metabolite
head(DM_nest)
## # A tibble: 6 × 2
## CompoundID data
## <chr> <list>
## 1 CPD000001 <tibble [220 × 5]>
## 2 CPD000002 <tibble [220 × 5]>
## 3 CPD000003 <tibble [220 × 5]>
## 4 CPD000004 <tibble [220 × 5]>
## 5 CPD000005 <tibble [220 × 5]>
## 6 CPD000006 <tibble [220 × 5]>
the intensities of each metabolite are stored in the “data” column
head(DM_nest$data[[1]])
## # A tibble: 6 × 5
## MouseID Genotype Gender Zygosity I
## <chr> <chr> <chr> <chr> <dbl>
## 1 413741 Pebp1 Male Homo- 124.
## 2 413742 Pebp1 Male Homo- 50.6
## 3 413745 Pebp1 Male Homo- 67.0
## 4 428689 Pebp1 Female Homo- 227.
## 5 428693 Pebp1 Female Homo- 101.
## 6 433002 Pebp1 Female Homo- 357.
The new tibble can be efficiently used to inspect the data in a univariate perspective …
For example, the number of NAs in relation of the median intensity can be visualized as
DM_nest %>%
mutate(nnas = map_dbl(data,function(p) sum(is.na(p$I))), ## calculate the univariate statistics "on the fly"
medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>%
filter(nnas > 0) %>% ## we keep only features with at least one NA!
ggplot() + ## plot them
geom_point(aes(x = medI, y = nnas)) +
scale_x_log10() +
theme_light()
Which actually shows what we expected. High number of missing values are visible only with low median intensities … note the log scale in x.
This is telling us that there is a distinction of metabolites in two groups showing a marked difference in concentration
Is this effect dependent on the class of metabolite?
library(plotly)
ggplotly(DM_nest %>%
left_join(metabolite_meta) %>%
mutate(nnas = map_dbl(data,~sum(is.na(.x$I))), ## calculate the univariate statistics "on the fly"
medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>%
filter(nnas > 0) %>% ## we keep only features with at leas on NA!
ggplot(aes(x = medI, y = nnas, col = Assay, text = CompoundName)) + ## plot them
geom_point() +
scale_x_log10() +
scale_color_brewer(palette = "Set2") +
theme_light(), tooltip="text")
That’s strange … the largest number of NAs is corresponding to lower intensities for the Bile Acids_Steroids and not for Oxylipins. This is curious
It is worth highlighting the name of the metabolites showing a number of NAs larger than 50 …
DM_nest %>%
left_join(metabolite_meta) %>%
mutate(nnas = map_dbl(data,~sum(is.na(.x$I))), ## calculate the univariate statistics "on the fly"
medI = map_dbl(data,~median(.x$I, na.rm = TRUE))) %>%
filter(nnas > 0) %>% ## we keep only features with at leas on NA!
ggplot(aes(x = medI, y = nnas, col = Assay)) + ## plot them
geom_point(size = 2) +
geom_text(aes(label=ifelse(nnas>50,CompoundName,'')), size = 4, nudge_y = -3) +
scale_x_log10() +
scale_color_brewer(palette = "Set2") +
theme_light()
As we discussed in the lecture, it is difficult to decide which metabolites have too many missing values …
When we have a “limited” number of variables the best approach is to
plot the data. With the DM_nest
we can do that quite
efficiently
To do that I create a custom plotting function which takes as input the tibble of the data and a potential title
myplot <- function(t,l = ""){
p <- t %>%
ggplot() +
geom_jitter(aes(x = Genotype, y = I, color = Gender, shape = Zygosity), width = 0.1) +
scale_shape_manual(values = c(19,3,1)) +
scale_y_log10() +
scale_color_brewer(palette = "Set1") +
theme_light() +
ggtitle(l) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
p
}
myplot(DM_nest$data[[1]], "test")
And I recursively apply it to the nested data tibbles …
DM_nest <- DM_nest %>%
left_join(metabolite_meta %>% dplyr::select(CompoundID,CompoundName)) %>%
mutate(rawplots = map2(data,CompoundName, ~myplot(.x,.y)))
## adding compound names as names oif the list of plots
names(DM_nest$rawplots) <- DM_nest$CompoundName
Now I can visualize (or save) the plot I want …
This one, in particular show the trend of one of the features with a large fraction of NAs
DM_nest$rawplots[[5]]
As it can be expected, progesterone is lower in males, and, most likely close to the detection limit. It is worth remarking that progesterone was the steroid showing the lower intensity. Its behavior is then not odd.
Let’s now look to PGD2
DM_nest$rawplots$PGD2
The trend of this metabolite is really wired. Its large fraction of missing values do not seems to be associated to the intensity of the signal
As far as I can see (but you could check better … ;-) ), the distribution of the missing data does not seems to be highly associated with the study factors.
Considering that we have 220 samples the level of missingness is never extremely severe.
Something For you
The need of displaying the intensity in log scale is already suggesting that the distribution of the intensities is far from being normal. This is not unexpected in metabolomics. In real life what I do is to work almost invariably with log-transformed datasets.
For sure
We can follow the spirit of the previous plot by defining a function and using it to produce a full list of plots …
An alternative could be to unnest
the data and rely on
faceting …
DM_nest %>%
dplyr::select(CompoundName,data) %>%
unnest(data) %>%
ggplot(aes(sample = I)) +
stat_qq(aes(col = Gender, shape = Zygosity)) +
scale_shape_manual(values = c(19,3,1)) +
scale_color_brewer(palette = "Set1") +
stat_qq_line() +
facet_wrap(~CompoundName, scales = "free_y", ncol = 5) +
theme_light() +
theme(aspect.ratio = 1,
legend.position = "bottom")
Ok, the plot is “ugly” (maybe creating a set of plots as we did before could have been more pleasant), but is really informative.
Let’s focus on it
DM_nest %>%
dplyr::select(CompoundName,data) %>%
unnest(data) %>%
filter(CompoundName == "12(13)Ep-9-KODE") %>%
ggplot(aes(sample = I)) +
stat_qq(aes(col = Gender, shape = Zygosity)) +
scale_shape_manual(values = c(19,3,1)) +
scale_color_brewer(palette = "Set1") +
stat_qq_line() +
facet_wrap(~CompoundName, scales = "free_y", ncol = 5) +
theme_light() +
theme(aspect.ratio = 1,
legend.position = "bottom")
DM_nest$rawplots$`12(13)Ep-9-KODE`
The interpretation for that would, most likely require a chat with a good biochemist…
Let’s look to the global effect of a log transformation on the q-q plots
DM_nest %>%
ungroup() %>%
dplyr::select(CompoundName,data) %>%
unnest(data) %>%
ggplot(aes(sample = log10(I))) +
stat_qq(aes(col = Gender, shape = Zygosity)) +
scale_color_brewer(palette = "Set1") +
stat_qq_line() +
facet_wrap(~CompoundName, scales = "free_y", ncol = 5) +
theme_light() +
theme(aspect.ratio = 1,
legend.position = "bottom")
The situation largely improves, even if some compounds are still showing strange behaviors …
Something For you
DM_nest$rawplots$`Taurodeoxycholic acid`
Get out the samples wiuth intensity higher than 2000
names(DM_nest$data) <- DM_nest$CompoundName
critical_1 <- DM_nest$data$`Taurodeoxycholic acid` %>%
filter(I > 1500) %>%
pull(MouseID)
critical_1
## [1] "487464" "356371" "243279" "253789" "40683" "302577" "257807"
The previous univariate journey should have been already pointing out some interesting aspects of the dataset.
The subsequent step is to look to visualize the multivariate structure of the data to spot larger scale patterns.
In order to do that it is necessary to impute the data matrix, substituting NAs with meaningful numbers.
My choice here is to work variable wise and replace the NAs with a random number drawn from an uniform distribution spanning from 0 to the minimum value measured for that variable
The rationale behind this choice is that
In my practical example imputation could be performed both in the
DM
and in its nested version DM_nest
. Since we
are going to do PCA, I’ll work on the unnested data matrix.
First of all I need a function to perform the imputation of a vector
myimputer <- function(v){
if (sum(is.na(v)) == 0) { ## If I do not have NAs please leave the vector unchanged
return(v)
} else {
napos <- which(is.na(v)) ## position of the NAs in the vector
newval <- runif(length(napos),0,min(v, na.rm = TRUE)) ## calculate the random numbers I should put in place of the NAs
out <- v
out[napos] <- newval
return(out)
}
}
Now we apply it to the full set of columns
DM_i <- DM %>%
mutate(across(starts_with("CPD"), ~ myimputer(.x)))
As we have seen, log transformation is helpful …
DM_i <- DM_i %>%
mutate(across(starts_with("CPD"), ~log10(.x)))
And now PCA!
library(FactoMineR)
library(factoextra)
myPCA <- PCA(DM_i %>%
column_to_rownames("MouseID") %>%
dplyr::select(starts_with("CPD")),
graph = FALSE, scale.unit = FALSE ## to avoid factmineR producing a plot by default
)
library(plotly)
fviz_pca_ind(myPCA,
habillage = factor(sample_meta_data$Gender), ## factor which will be used to color the dots
# geom = "point",
pointsize = 2,
axes = c(1,2)) +
scale_color_brewer(palette = "Set1")
Even if the amount of variability captured in this representation is only the 30%, the previous plot shows the presence of a separation between males and females.
It would be nice to have a similar plot highlighting the role of the different genetic backgrounds, but we do not have enough colors to put everything in the same plot.
## we get out the scores
PC_12_scores <- facto_summarize(myPCA, "ind",
axes = c(1,2),
result = "coord"
) %>% ## get the data from the PCA object
mutate(name = as.character(name)) %>%
as_tibble() %>%
## transform the sample name to a character
dplyr::rename("MouseID" = "name") %>% ## align the name with the correct name
left_join(sample_meta_data) ## join with the sample data
PC_12_scores
## # A tibble: 220 × 7
## MouseID Dim.1 Dim.2 coord Genotype Gender Zygosity
## <chr> <dbl> <dbl> <dbl> <chr> <chr> <chr>
## 1 413741 0.536 -1.47 2.46 Pebp1 Male Homo-
## 2 413742 -1.12 0.646 1.67 Pebp1 Male Homo-
## 3 413745 -0.710 -2.37 6.12 Pebp1 Male Homo-
## 4 428689 1.21 -1.54 3.81 Pebp1 Female Homo-
## 5 428693 0.674 -0.586 0.797 Pebp1 Female Homo-
## 6 433002 0.638 -0.922 1.26 Pebp1 Female Homo-
## 7 185262 0.0677 0.346 0.125 Idh1 Male Homo-
## 8 115900 -2.60 0.423 6.92 Idh1 Male Homo-
## 9 115996 0.592 -0.768 0.940 Idh1 Male Homo-
## 10 138110 1.67 0.342 2.92 Idh1 Female Homo-
## # ℹ 210 more rows
PC_12_scores %>%
ggplot() +
geom_point(data = PC_12_scores %>% select(Dim.1,Dim.2),
mapping = aes(x= Dim.1, y = Dim.2), col = "gray80", size = 0.2) +
geom_hline(yintercept = 0, lty = 2)+
geom_vline(xintercept = 0, lty = 2)+
geom_point(aes(x = Dim.1, y = Dim.2, col = Gender)) +
facet_wrap(~Genotype, ncol = 6) +
theme_bw() +
theme(aspect.ratio = 1)
Suppose that I’m reall yinterested in the comparison between Pipox and Null (wt) … Could you make a PCA plot only of these classes of samples?
So, at least in the PCA, no clear separation between the genotypes
This is not unexpected considering the results of our univariate investigation
What about the variables?
fviz_pca_biplot(myPCA,
habillage = factor(sample_meta_data$Gender),
geom = "point",
pointsize = 2,
axes = c(1,2)) +
scale_color_brewer(palette = "Set1")
The biplot here shows that females show an higher level of the large majority of the metabolites.
Since the partial separation is visible along PC1 we could visualize the variables which show the higher contribution to that direction
fviz_contrib(myPCA, "var", axes = 1)
These plots are really informative, let’s give a look inside the PCA object to see how these number could be get out.
This list contains the information about the samples (the scores) so it can be used to reconstruct the score plot. What one needs are the coordinates
str(myPCA$ind)
## List of 4
## $ coord : num [1:220, 1:5] 0.536 -1.121 -0.71 1.206 0.674 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:220] "413741" "413742" "413745" "428689" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ cos2 : num [1:220, 1:5] 0.0483 0.1473 0.0386 0.1559 0.0494 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:220] "413741" "413742" "413745" "428689" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ contrib: num [1:220, 1:5] 0.0266 0.1164 0.0467 0.1347 0.0421 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:220] "413741" "413742" "413745" "428689" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ dist : Named num [1:220] 2.44 2.92 3.61 3.05 3.03 ...
## ..- attr(*, "names")= chr [1:220] "413741" "413742" "413745" "428689" ...
The dist
and cos2
can be used to assess how
far a sample is from the PCA plane (or projection). This is important to
identify potential outliers.
The list
str(myPCA$var)
## List of 4
## $ coord : num [1:57, 1:5] 0.4761 0.5302 -0.0835 0.4521 0.093 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ cor : num [1:57, 1:5] 0.857 0.724 -0.143 0.621 0.162 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ cos2 : num [1:57, 1:5] 0.7345 0.5244 0.0206 0.3857 0.0262 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
## $ contrib: num [1:57, 1:5] 4.623 5.733 0.142 4.17 0.176 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:57] "CPD000001" "CPD000002" "CPD000003" "CPD000004" ...
## .. ..$ : chr [1:5] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
instead contains the info about the variables, their loadings and the correlation with the principal components.
To extract the previous information in a more programmatic way,
FactoMineR
has a specific function
head(facto_summarize(myPCA, "var", axes = 1))
## name Dim.1 coord cos2 contrib
## CPD000001 CPD000001 0.47609423 0.226665712 0.73449192 4.6229967
## CPD000002 CPD000002 0.53016363 0.281073479 0.52436795 5.7326790
## CPD000003 CPD000003 -0.08347961 0.006968846 0.02057624 0.1421342
## CPD000004 CPD000004 0.45214785 0.204437678 0.38572027 4.1696413
## CPD000005 CPD000005 0.09296364 0.008642238 0.02622314 0.1762641
## CPD000006 CPD000006 -0.31901791 0.101772425 0.12997291 2.0757157
which returns as a table the content of the PCA object.
By using the content of the myPCA object it is possible to
“reconstruct” the PCA plots with ggplot
and this can be
handy if specific representation of the data are needed.
Suppose, for example, that you would like to construct a more informative variable importance plot which shows the correct metabolite name and also color the bars according to the chemical/metabolic class of the compound.
facto_summarize(myPCA, "var", axes = 1) %>%
as_tibble() %>%
dplyr::select(name, `Dim.1`) %>%
left_join(metabolite_meta, by = c("name" = "CompoundID")) %>%
arrange(`Dim.1`) %>%
mutate(CompoundName = factor(CompoundName, levels = CompoundName)) %>%
ggplot() +
geom_point(aes(x = CompoundName, y= `Dim.1`, col = Assay), size = 2) +
geom_segment(aes(x = CompoundName, yend = `Dim.1`, col = Assay, xend = CompoundName, y = 0)) +
scale_color_brewer(palette = "Set2") +
coord_flip() +
theme_light()
Which can be of help for the interpretation of the results
Something For you
What we did so far is called exploratory data analysis. To write the paper we should also find the variables which are showing a significant effect of the design factors.
Notes
For illustrative purposes let’s focus on the Female-Male Comparison.
## here i nest the imputed data matrix, most likely I could do the testing also in the "non" imputed one
## we also add the gender of the animal
DM_i_nest <- DM_i %>%
pivot_longer(starts_with("CPD"), names_to = "CompoundID", values_to = "I") %>%
left_join(sample_meta_data) %>%
nest(data = -c(CompoundID))
Now we run the tests on the metabolites
DM_i_nest <- DM_i_nest %>%
mutate(wilcoxon = map(data, ~ wilcox.test(I ~ Gender, data = .x, exact = FALSE)))
If we want to see one of the tests …
DM_i_nest$wilcoxon[[10]]
##
## Wilcoxon rank sum test with continuity correction
##
## data: I by Gender
## W = 7327.5, p-value = 0.006827
## alternative hypothesis: true location shift is not equal to 0
Now we extract the p-values
DM_i_nest <- DM_i_nest %>%
mutate(ps = map_dbl(wilcoxon, ~.x$p.value))
since we have potentially false positive it is always a good idea to look to the distribution of the p-values
hist(DM_i_nest$ps, breaks = 20)
We see a clear enrichment of the “low” p values. One could ask what would be the distribution of the p-values if there would be no difference between males and females.
Let’s try so simulate this case. To do that I scramble the intensities for each metabolite, so I’ll destroy the assiciation between the intensity and the gender
DM_i_nest <- DM_i_nest %>%
mutate(permuted_data = map(data, ~ .x %>% mutate(I = sample(I))))
Let’s now run a wikoxon test on the permuted data
DM_i_nest <- DM_i_nest %>%
mutate(wilcoxon_p = map(permuted_data, ~ wilcox.test(I ~ Gender, data = .x, exact = FALSE))) %>%
mutate(ps_p = map_dbl(wilcoxon_p, ~.x$p.value))
And now we plot the histogram
hist(DM_i_nest$ps_p, breaks = 20)
This is clearly demonstrating that there is a difference between males and females and that this difference is “statistically significant”
Now we would like to rank the variables on the base of their contrast on the two classes … but as we know it is not correct to use the p-value for that.
We need to calculate the effect size! Now I’ll use the Cohen’s d. A word of caution here, this type of effect size is parametric … but I’ve been stressing that non parametric analysis here should be preferred. Right. A non parametric measure of the effect size should be better. A possibility could be to calculate the ratio between the difference between the two medians and divide it by the larger interquartile range. You could be creative there …
Let’s stick to Cohen’s d. In the same time I’ll also calculate the p-values corrected for multiplicity
library(effsize)
DM_i_nest <- DM_i_nest %>%
mutate(pcorr = p.adjust(ps, method = "BH")) %>% ## here I'm calculating the corrected p-values
mutate(cohend = map_dbl(data, ~cohen.d(I ~ Gender, data = .x)$estimate)) ## here I'm calculating the effect size
Now we use the effect size and the p value to create a volcano plot
library(plotly)
ggplotly(DM_i_nest %>%
left_join(metabolite_meta %>% select(CompoundID,CompoundName, Assay)) %>%
ggplot(aes(text=CompoundName)) +
geom_hline(yintercept = 1.3, lty = 2, col = "red") +
geom_point(aes(x = cohend, y = -log10(ps), fill = Assay), size = 3, pch = 21) +
theme_bw(), tooltip="text")
Now we focus on the compound which are significant at the 0.01 level and we sort the results by effect size …
markers <- DM_i_nest %>%
filter(pcorr < 0.01) %>%
arrange(desc(abs(cohend))) %>%
left_join(metabolite_meta)
head(markers)
## # A tibble: 6 × 25
## CompoundID data wilcoxon ps permuted_data wilcoxon_p ps_p pcorr
## <chr> <list> <list> <dbl> <list> <list> <dbl> <dbl>
## 1 CPD000829 <tibble> <htest> 8.22e-30 <tibble> <htest> 0.122 4.69e-28
## 2 CPD000830 <tibble> <htest> 4.71e-21 <tibble> <htest> 0.118 1.34e-19
## 3 CPD000807 <tibble> <htest> 1.84e-19 <tibble> <htest> 0.169 3.49e-18
## 4 CPD000808 <tibble> <htest> 4.90e-17 <tibble> <htest> 0.564 6.99e-16
## 5 CPD000005 <tibble> <htest> 2.05e-14 <tibble> <htest> 0.298 1.95e-13
## 6 CPD000806 <tibble> <htest> 8.43e-12 <tibble> <htest> 0.413 5.34e-11
## # ℹ 17 more variables: cohend <dbl>, Order <dbl>, DataBaseID <dbl>, Type <chr>,
## # Assay <chr>, AnnotationApproach <chr>, CompoundName <chr>,
## # IKFirstBlock <chr>, InChiKey <chr>, RetentionTime <dbl>, MZ <dbl>,
## # SMILES <chr>, KEGGID <chr>, PubChemID <chr>, QC_RSD <dbl>,
## # BlankRatio <dbl>, AcylChain <chr>
This is the list of “biomarkers”, let’s plot one of them …
DM_nest$rawplots[[54]]
Which exactly shows what we expect.
Notes
The association between the intensity of two metabolites in the samples is often considered an interesting evidence in the perspective of the interpretation. Metabolites are indeed linked in metabolic pathways and their concentrations are not independent
As we have seen yesterday, however, correlation can be there only by chance so care have to be taken to jump from the observation of an association to the conclusion of a causal relation.
These caveats are particularly true in the case of untargeted metabolomics assays and can be somehow taken more easily as far as targeted metabolomics is concerned.
The best tool to inspect variable correlation in R is the
pairs
plot. With pairs
one really sees the
experimental data and it is easy to spot subgroups and outlying
samples.
Unfortunately, the pairs visualization is manageable only with a small number of variables (something like 15). For this reason it is a resource which can be exploited after the identification of the most relevant biomarkers
## here I get the names of the first 5
focuscp <- markers %>%
slice(1:5) %>%
pull(CompoundID)
## This is a littel of magic ...
mypalette <- c("#F24D2970","#1C366B70")
names(mypalette) <- unique(sample_meta_data$Gender)
par(pty="s")
DM_i %>%
dplyr::select(all_of(focuscp)) %>%
as.matrix(.) %>%
pairs(., col = mypalette[sample_meta_data$Gender], pch = 19)
Notes
Something For you