library(xcms)
library(MsExperiment)
library(Spectra)
library(tidyverse)
library(plotly)
A fast recap about what we know about metabolites in LC-MS …
So every metabolite …
So …
The process of finding chromatographic peaks in the data is called peak picking. It can be done in many different ways, and actually every software will do it slightly differently. The first step in the analysis of our dataset will be to pick the peaks in the full set of samples. Here I’ll show you the process on one file.
There are some important points to remember.
Working on a representative file will help you in fine tuning and benchmarking your parameters.
Let’s start reading in a raw file:
raw_one <- readMsExperiment(
spectraFiles = "../data/wines/x016_X_QC_X_4_NEG_DDA.mzML",
msLevel. = 1, ## we read only MS1
mode = "onDisk") ## with this parameter the data are not loaded into RAM
I’ll now show to you how to perform peak picking with two algorithms
available in xcms. In the 99% of the case you will use only
one of them (CentWave), but it is nice - once in the life - to really
put your hands in the machine.
The “older” and most sounding way of finding peaks implemented in
xcms is the matched filter algorithm.
A full description of the parameters of the algorithm can be found in
the xcms
manual, here we focus on:
In xcms the parameters of the algorithm are stored into
a specific object:
mf <- MatchedFilterParam(binSize = 0.1,
fwhm = 6,
snthresh = 5)
mf
## Object of class: MatchedFilterParam
## Parameters:
## - binSize: [1] 0.1
## - impute: [1] "none"
## - baseValue: numeric(0)
## - distance: numeric(0)
## - fwhm: [1] 6
## - sigma: [1] 2.547987
## - max: [1] 5
## - snthresh: [1] 5
## - steps: [1] 2
## - mzdiff: [1] 0.6
## - index: [1] FALSE
Now I can use the previous parameters to find the peaks in one sample:
## this function is used to find the chromatographic peaks with the previous parameters
raw_one_mf_picked <- findChromPeaks(raw_one, param = mf)
raw_one_mf_picked
## Object of class XcmsExperiment
## Spectra: MS1 (546) MS2 (546) MS3 (546)
## Experiment data: 1 sample(s)
## Sample data links:
## - spectra: 1 sample(s) to 1638 element(s).
## xcms results:
## - chromatographic peaks: 638 in MS level(s): 1
Ok, the software did his job. As you can see it was able to fine 638
peaks in this sample. As you can see the raw_one_mf_picked
still holds peaks and raw data. The peak table can be extracted with a
specific method
mf_peaks <- chromPeaks(raw_one_mf_picked)
dim(mf_peaks)
## [1] 638 13
head(mf_peaks, 5)
## mz mzmin mzmax rt rtmin rtmax into intf
## CP001 50.01399 50.01241 50.04396 337.7074 334.5620 340.1007 358728.6 387501.5
## CP002 51.67812 51.67799 51.67819 306.2680 303.3541 311.2829 529555.9 428820.5
## CP003 53.00739 53.00735 53.00742 327.9940 326.5501 330.6442 1460325.7 1670480.7
## CP004 53.23135 53.23127 53.23137 303.3541 301.9032 306.2680 419943.3 519133.2
## CP005 54.56881 54.56869 54.56895 276.0245 273.0395 277.6028 183758.4 203009.4
## maxo maxf i sn sample
## CP001 99313.48 102323.41 1 6.905211 1
## CP002 115270.37 94863.14 1 7.721503 1
## CP003 478205.50 461500.98 1 16.710695 1
## CP004 166208.92 141195.95 1 6.904875 1
## CP005 37604.99 45538.38 1 5.020740 1
Let’s walk to the most relevant columns:
Let’s now give a look to the position of the peaks in the mz/rt plane. The size of the point will be proportional to the intensity
mf_peaks %>%
as.data.frame() %>%
mutate(into = sqrt(into)) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, alpha = into, col = into)) +
scale_color_viridis_c() +
theme_bw()
If you go back to the previous demo, we were focusing on a specific area of the raw signal which was particularly promising
rt <- rtime(raw_one)
mz <- mz(spectra(raw_one))
I <- intensity(spectra(raw_one))
spectra_tibble <- tibble(rt = rt,
mz = as.list(mz),
I = as.list(I))
sub_peaks_mf <- mf_peaks %>%
as.data.frame() %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300)
ggplotly(spectra_tibble %>%
unnest(c(mz,I)) %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) +
geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) +
scale_color_viridis_c() +
theme_light())
What we see:
This view gives an idea of the boundaries of the peaks:
spectra_tibble %>%
unnest(c(mz,I)) %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) +
geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) +
geom_segment(data = sub_peaks_mf, aes(x = rtmin, xend = rtmax, y = mz, yend = mz), col = "red") +
scale_color_viridis_c() +
theme_light()
So some lines are superimposed, some others not. Tricky business!
Peak picking can also be performed with another algorithm:
CentWave. The algorithm applied here is more clever and
better suited for high resolution data. As we have seen in the lecture,
it relies on the fact that the mass trace get stable in presence of a
strong ionic signal.
Also here many parameters (and others are not mentioned). I highlight here some of them:
cwp <- CentWaveParam(peakwidth = c(5, 30), ## expected range of chromatographic peak width
ppm = 15, ## tolerance to identify ROIs in the mz/rt plane
prefilter = c(5, 50000),## number of consecutive scans showing a signal higher than 50,000
noise = 5000) ## minimum signal to be considered
cwp
## Object of class: CentWaveParam
## Parameters:
## - ppm: [1] 15
## - peakwidth: [1] 5 30
## - snthresh: [1] 10
## - prefilter: [1] 5 50000
## - mzCenterFun: [1] "wMean"
## - integrate: [1] 1
## - mzdiff: [1] -0.001
## - fitgauss: [1] FALSE
## - noise: [1] 5000
## - verboseColumns: [1] FALSE
## - roiList: list()
## - firstBaselineCheck: [1] TRUE
## - roiScales: numeric(0)
## - extendLengthMSW: [1] FALSE
## - verboseBetaColumns: [1] FALSE
If we run the peak picking with this new algorithm…
raw_one_cw_picked <- findChromPeaks(raw_one, param = cwp)
raw_one_cw_picked
## Object of class XcmsExperiment
## Spectra: MS1 (546) MS2 (546) MS3 (546)
## Experiment data: 1 sample(s)
## Sample data links:
## - spectra: 1 sample(s) to 1638 element(s).
## xcms results:
## - chromatographic peaks: 706 in MS level(s): 1
Ok here we see that the number of spotted peaks is different. If you think to it it is not strange: different method different results.
The first natural question you can ask is: where are you getting the parameters? Well, as I already told you at some point the “expert knowledge” enters the process. Reasonable guess for them is always coming from a good knowledge of the analytical pipeline. Optimal value could be optimized in an automatic way (like IPO does), but reasonable guesses are fundamental to restrict the quest for the optimal solution in a multidimensional space. So either you understand the analytics, or you should go and speak with you “analytical” colleague.
The second question. Are we getting comparable results?
cw_peaks <- chromPeaks(raw_one_cw_picked)
dim(cw_peaks)
## [1] 706 11
head(cw_peaks, 5)
## mz mzmin mzmax rt rtmin rtmax into intb
## CP001 218.9822 218.9817 218.9826 34.5632 19.4760 46.1335 1236829.6 1236802.9
## CP002 218.9821 218.9820 218.9822 3.1473 0.6727 6.9058 479904.3 479896.8
## CP003 218.9819 218.9817 218.9823 11.9442 6.9058 16.9582 661522.6 661511.3
## CP004 218.9821 218.9818 218.9823 20.7363 16.9582 24.4925 462471.5 462462.7
## CP005 218.9817 218.9805 218.9826 43.3485 40.8326 46.1335 136005.4 136000.1
## maxo sn sample
## CP001 65368.86 65368 1
## CP002 79109.52 79109 1
## CP003 71700.71 71700 1
## CP004 60961.76 60961 1
## CP005 32284.30 32283 1
ggplotly(mf_peaks %>%
as.data.frame() %>%
ggplot() +
geom_point(aes(x = rt, y = mz), col = "red", pch = 3, alpha = 0.5) +
geom_point(cw_peaks %>% as_tibble(), mapping = aes(x = rt, y = mz), col = "blue", pch = 1, alpha = 0.5) +
theme_bw())
Different, isn’t it? In some cases the two algorithms are coherent, in others the results are markedly different. Centwave it is also giving this strange horizontal line at large rt
Let’s give a look to our subregion:
sub_peaks_cw <- cw_peaks %>%
as.data.frame() %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300)
ggplotly( spectra_tibble %>%
unnest(c(mz,I)) %>%
filter(mz > 284 & mz < 300) %>%
filter(rt > 200 & rt < 300) %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) +
geom_point(data = sub_peaks_cw, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) +
scale_color_viridis_c() +
theme_light())
In real life situations, when you are happy with your peak picking
parameters you run the process on all your samples. xcms is
designed for that, the only thing you have to do is to load more than on
e file.
In view of the step in which the peak list will be merged together it is important also to add to the object containing the raw data any type of meta information which is associated to the samples. Typical meta information includes:
In xcms all these infos are stored in an
AnnotatedDataFrame object.
Typically all these infos should be included in the filename, or in a
csv which links the filename with the metadata. In the dataset we are
using everything is in the file name, so let’s process them. In the
following we will assume that the data are stored as mzML
inside the data subfolder.
library(tools) ## just to use file_path_sans_ext
phenodata <- tibble(filepath = list.files("../data/wines/", pattern = ".mzML", full.names = TRUE)) %>%
filter(!grepl("mix",filepath)) %>% ## remove the injections of the standard mixes
mutate(fname = file_path_sans_ext(basename(filepath))) %>%
separate(fname, into = c("inj_ord", "color", "variety", "bottle", "rep", "polarity", "mode"),
sep = "_", remove = FALSE)
## to see the content
phenodata
## # A tibble: 29 × 9
## filepath fname inj_ord color variety bottle rep polarity mode
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ../data/wines//x001_… x001… x001 X blank X 1 NEG DDA
## 2 ../data/wines//x002_… x002… x002 X QC X 1 NEG DDA
## 3 ../data/wines//x003_… x003… x003 X QC X 2 NEG DDA
## 4 ../data/wines//x004_… x004… x004 red sangio B 1 NEG DDA
## 5 ../data/wines//x005_… x005… x005 red ptnoir A 1 NEG DDA
## 6 ../data/wines//x006_… x006… x006 red merlot A 1 NEG DDA
## 7 ../data/wines//x007_… x007… x007 wht vermen A 1 NEG DDA
## 8 ../data/wines//x008_… x008… x008 red cannon A 1 NEG DDA
## 9 ../data/wines//x009_… x009… x009 wht chardo B 1 NEG DDA
## 10 ../data/wines//x010_… x010… x010 wht chardo A 1 NEG DDA
## # ℹ 19 more rows
Here we read the data in:
raw_data <- readMsExperiment(
spectraFiles = phenodata$filepath,
msLevel. = 1,
sampleData = phenodata, ## this is the structure of xcms holding phenotypic data
mode = "onDisk") ## with this parameter the data are not loaded into RAM
And perform the peak picking with centwave:
register(SerialParam()) ## Setting xcms in serial mode
raw_data <- findChromPeaks(raw_data, param = cwp)
To spare time we save the picked object …
save(raw_data, file = "wines.RData")
Let’s give a look to the content of the full dataset:
raw_data
## Object of class XcmsExperiment
## Spectra: MS1 (15688) MS2 (15688) MS3 (15688)
## Experiment data: 29 sample(s)
## Sample data links:
## - spectra: 29 sample(s) to 47064 element(s).
## xcms results:
## - chromatographic peaks: 18777 in MS level(s): 1
Here we are dealing with 29 files with different characteristics. After the peak picking optimization step we are confident that the parameters we have been choosing should give reasonably good results. Before going on, it is however useful to make some additional checks on the data quality.
The first thing to do is to monitor the TIC of the 29 experiments to spot potential drops of the signal during the analysis:
tics <- chromatogram(raw_data,
aggregationFun = "sum", ## "sum" for "TIC" / "max" for "BPC"
include = "none") ## this argument is required to avoid
## Warning in .local(object, ...): Parameter 'include' is deprecated, please use
## 'chromPeaks' instead
## including all the chromatograms of the picked peaks
We now plot them, with colors matching the sample class:
mypalette <- c("steelblue", "coral", "darkgreen")
names(mypalette) <- c("red","X","wht")
plot(tics, col = mypalette[sampleData(raw_data)$color]) ## raw_data$color is getting out the phenodata column called color
legend("topright", legend = c("red","Blank & QC","white"), col = mypalette, lty = 1)
The plot shows already nice things:
The trend with the injection can be visualized as follows:
## Visualize the overall trend in TIC along the injections
tc <- split(tic(spectra(raw_data)), f = fromFile(raw_data))
full_tics <- sapply(tc, sum)
sampleData(raw_data) %>%
data.frame(.) %>%
tibble() %>%
add_column(full_tics = full_tics) %>%
ggplot() +
geom_point(aes(x = inj_ord, y = full_tics, col = color), size = 2) +
theme_bw() +
theme(aspect.ratio = 0.3)
So no clear trend is visible in the data. X here represents QC, blanks and StdMixes
Another type of visualization that I find useful with reasonably small datasets is the following:
chromPeaks(raw_data) %>%
as_tibble() %>%
ggplot() +
geom_point(aes(x = rt, y = mz, col = log10(maxo)), size = 1, alpha = 0.5) +
scale_color_viridis_c() +
facet_wrap(~factor(sample)) +
theme_bw() +
theme(aspect.ratio = 0.5)
Here we see the peak maps of the different files. It is clear that each map is expected to be slightly different, but rally outlying samples will show up clearly. Look to the blanks (i.e., samples 1, 14, 27), for example, or to the horizontal lines which are present in many files.
Something for you now:
filterFile function to get rid of the
blanks!)? Are they the same?
Comments
centWaveis non to be more reliable results