Picking Peaks

library(xcms)
library(tidyverse)
library(plotly)

Introduction

A fast recap about what we know about metabolites in LC-MS …

Every metabolite is more-or-less separated by the chromatographic column in time.
In the ideal world, every metabolite would “elute” in a narrow time window: we do not live in an ideal world
Inside the ionization source every metabolite is transformed in a cloud of ions (fragments, isotopes, adducts, …)

So every metabolite …

produces a peak in the ion “current” of all the ions produced during its ionization
every metabolite produces peaks in more than one ionic trace
in many cases the same ion could be produced in the ionization of different metabolites

So …

to find metabolites we have to automatically look for peaks in the mz/rt plane
we will have much more peaks than metabolites… and this will make the analysis of our data matrix extremely challenging

Peak Picking

The process of finding chromatographic peaks in the data is called peak picking. It can be done in many different ways, and actually every software will do it slightly differently. The first step in the analysis of our dataset will be to pick the peaks in the full set of samples. Here I’ll show you the process on one file.

There are some important points to remember.

there is no perfect solution: every algorithm will miss something you would pick
anyway an automatic solution is better because it is reproducible
every algorithm will have parameters to tune
expert knowledge will be useful at some point

Working on a representative file will help you in fine tuning and benchmarking your parameters.

Let’s start reading in a raw file:

raw_one <- readMSData(
  files = "../data/wines/x016_X_QC_X_4_NEG_DDA.mzML",
  msLevel. = 1, ## we read only MS1
  mode = "onDisk")  ## with this parameter the data are not loaded into RAM

I’ll now show to you how to perform peak picking with two algorithms available in xcms. In the 99% of the case you will use only one of them (CentWave), but it is nice - once in the life - to really put your hands in the machine.

Peak picking: matched filter

The “older” and most sounding way of finding peaks implemented in xcms is the matched filter algorithm.

A full description of the parameters of the algorithm can be found in the xcms manual, here we focus on:

binSize: the “width” of the bins used to extract the ionic traces where we look for peaks
fwhm: the “expected” width of the peak in the chromatographic direction
snthresh: the signal-to-noise threshold used to say: “yeah, this is a peak and not a bump in the noise”

In xcms the parameters of the algorithm are stored into a specific object:

mf <- MatchedFilterParam(binSize = 0.1, 
                         fwhm = 6, 
                         snthresh = 5) 
mf

## Object of class:  MatchedFilterParam 
##  Parameters:
##  - binSize: [1] 0.1
##  - impute: [1] "none"
##  - baseValue: numeric(0)
##  - distance: numeric(0)
##  - fwhm: [1] 6
##  - sigma: [1] 2.547987
##  - max: [1] 5
##  - snthresh: [1] 5
##  - steps: [1] 2
##  - mzdiff: [1] 0.6
##  - index: [1] FALSE

Now I can use the previous parameters to find the peaks in one sample:

## this function is used to find the chromatographic peaks with the previous parameters
raw_one_mf_picked <- findChromPeaks(raw_one, param = mf)

## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## Attaching package: 'S4Vectors'
## 
## The following object is masked from 'package:utils':
## 
##     findMatches
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## 
## Attaching package: 'ProtGenerics'
## 
## The following object is masked from 'package:stats':
## 
##     smooth
## 
## 
## This is MSnbase version 2.28.1 
##   Visit https://lgatto.github.io/MSnbase/ to get started.
## 
## 
## Attaching package: 'MSnbase'
## 
## The following object is masked from 'package:base':
## 
##     trimws
## 
## 
## This is xcms version 4.0.0 
## 
## 
## Attaching package: 'xcms'
## 
## The following object is masked from 'package:stats':
## 
##     sigma

raw_one_mf_picked

## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.24 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 546 
##  MSn retention times: 0:01 - 11:60 minutes
## - - - Processing information - - -
## Data loaded [Tue Nov 14 11:26:32 2023] 
## Filter: select MS level(s) 1. [Tue Nov 14 11:26:32 2023] 
##  MSnbase version: 2.28.1 
## - - - Meta data  - - -
## phenoData
##   rowNames: x016_X_QC_X_4_NEG_DDA.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   x016_X_QC_X_4_NEG_DDA.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S0001 F1.S0004 ... F1.S1636 (546 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: matchedFilter 
##  638 peaks identified in 1 samples.
##  On average 638 chromatographic peaks per sample.

Ok, the software did his job. As you can see it was able to fine 638 peaks in this sample. As you can see the raw_one_mf_picked still holds peaks and raw data. The peak table can be extracted with a specific method

mf_peaks <- chromPeaks(raw_one_mf_picked) 
dim(mf_peaks)

## [1] 638  13

head(mf_peaks, 5)

##             mz    mzmin    mzmax       rt    rtmin    rtmax      into      intf
## CP001 50.01399 50.01241 50.04396 337.7074 334.5620 340.1007  358728.6  387501.5
## CP002 51.67812 51.67799 51.67819 306.2680 303.3541 311.2829  529555.9  428820.5
## CP003 53.00739 53.00735 53.00742 327.9940 326.5501 330.6442 1460325.7 1670480.7
## CP004 53.23135 53.23127 53.23137 303.3541 301.9032 306.2680  419943.3  519133.2
## CP005 54.56881 54.56869 54.56895 276.0245 273.0395 277.6028  183758.4  203009.4
##            maxo      maxf i        sn sample
## CP001  99313.48 102323.41 1  6.905211      1
## CP002 115270.37  94863.14 1  7.721503      1
## CP003 478205.50 461500.98 1 16.710695      1
## CP004 166208.92 141195.95 1  6.904875      1
## CP005  37604.99  45538.38 1  5.020740      1

Let’s walk to the most relevant columns:

rownames gives the id of each chromatographic peak. CPxxx stands for Chromatographic Peak xxx
mz* columns identify the mz slice (remember that mz is measured over time so it’s slightly changing from scan to scan)
rt* columns identify the boundaries and the apex of each chromatographic peak
into, intf, maxo and maxf are measures of intensity of the peak. What is normally used are maxo and into which are the signal at the apex of the peak and the integral of the signal across the peak area, respectively.
sample: this column is telling us where each peak was found

Let’s now give a look to the position of the peaks in the mz/rt plane. The size of the point will be proportional to the intensity

mf_peaks %>% 
  as.data.frame() %>% 
  mutate(into = sqrt(into)) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, alpha = into, col = into)) + 
  scale_color_viridis_c() + 
  theme_bw()

If you go back to the previous demo, we were focusing on a specific area of the raw signal which was particularly promising

rt <- rtime(raw_one)
mz <- mz(raw_one)
I <- intensity(raw_one)

sub_peaks_mf <- mf_peaks %>% 
  as.data.frame() %>% 
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) 

ggplotly(tibble(rt = rt, mz = mz, I = I)  %>% 
  unnest(c("mz","I")) %>%
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) + 
  geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) + 
  scale_color_viridis_c() + 
  theme_light())

What we see:

the algorithm did a reasonably good job, even if there are several areas with potentially high signal which were not picked
as expected many horizontal traces are showing more than one peak! This is the signature that these ions are produced in the ionization of several metabolites
something interesting is happening around 250 s: it seems that there many ions are showing a peak, but the position of the maxima is not exactly the same. They could be different metabolites, or maybe the algorithm was not finding the peak maxima at the same rt.

This view gives an idea of the boundaries of the peaks:

tibble(rt = rt, mz = mz, I = I)  %>% 
  unnest(c("mz","I")) %>%
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) + 
  geom_point(data = sub_peaks_mf, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) + 
  geom_segment(data = sub_peaks_mf, aes(x = rtmin, xend = rtmax, y = mz, yend = mz), col = "red") + 
  scale_color_viridis_c() + 
  theme_light()

So some lines are superimposed, some others not. Tricky business!

Peak Picking Cent Wave

Peak picking can also be performed with another algorithm: CentWave. The algorithm applied here is more clever and better suited for high resolution data. As we have seen in the lecture, it relies on the fact that the mass trace get stable in presence of a strong ionic signal.

Also here many parameters (and others are not mentioned). I highlight here some of them:

peakwidth: this is the expected range of the width of the chromatographic peaks. In this case from 5 to 30 seconds.
ppm: this is the expected mass shift of the signal of a “true” ion due to electric noise
prefilter: this is an initial filter which will consider valid only ion traces which are preserving a signal of more than I (50,000) counts for at least k (5) scans.
noise: this is the minimum signal which will be considered

cwp <- CentWaveParam(peakwidth = c(5, 30),   ## expected range of chromatographic peak width
                     ppm = 15,               ## tolerance to identify ROIs in the mz/rt plane
                     prefilter = c(5, 50000),## number of consecutive scans showing a signal higher than 50,000
                     noise = 5000)           ## minimum signal to be considered
cwp

## Object of class:  CentWaveParam 
##  Parameters:
##  - ppm: [1] 15
##  - peakwidth: [1]  5 30
##  - snthresh: [1] 10
##  - prefilter: [1]     5 50000
##  - mzCenterFun: [1] "wMean"
##  - integrate: [1] 1
##  - mzdiff: [1] -0.001
##  - fitgauss: [1] FALSE
##  - noise: [1] 5000
##  - verboseColumns: [1] FALSE
##  - roiList: list()
##  - firstBaselineCheck: [1] TRUE
##  - roiScales: numeric(0)
##  - extendLengthMSW: [1] FALSE

If we run the peak picking with this new algorithm…

raw_one_cw_picked <- findChromPeaks(raw_one, param = cwp)

## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which.max, which.min
## 
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## 
## Attaching package: 'S4Vectors'
## 
## The following object is masked from 'package:utils':
## 
##     findMatches
## 
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## 
## 
## Attaching package: 'ProtGenerics'
## 
## The following object is masked from 'package:stats':
## 
##     smooth
## 
## 
## This is MSnbase version 2.28.1 
##   Visit https://lgatto.github.io/MSnbase/ to get started.
## 
## 
## Attaching package: 'MSnbase'
## 
## The following object is masked from 'package:base':
## 
##     trimws
## 
## 
## This is xcms version 4.0.0 
## 
## 
## Attaching package: 'xcms'
## 
## The following object is masked from 'package:stats':
## 
##     sigma
## 
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 753 regions of interest ... OK: 706 found.

raw_one_cw_picked

## MSn experiment data ("XCMSnExp")
## Object size in memory: 0.24 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 546 
##  MSn retention times: 0:01 - 11:60 minutes
## - - - Processing information - - -
## Data loaded [Tue Nov 14 11:26:32 2023] 
## Filter: select MS level(s) 1. [Tue Nov 14 11:26:32 2023] 
##  MSnbase version: 2.28.1 
## - - - Meta data  - - -
## phenoData
##   rowNames: x016_X_QC_X_4_NEG_DDA.mzML
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   x016_X_QC_X_4_NEG_DDA.mzML 
## protocolData: none
## featureData
##   featureNames: F1.S0001 F1.S0004 ... F1.S1636 (546 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
##  706 peaks identified in 1 samples.
##  On average 706 chromatographic peaks per sample.

Ok here we see that the number of spotted peaks is different. If you think to it it is not strange: different method different results.

The first natural question you can ask is: where are you getting the parameters? Well, as I already told you at some point the “expert knowledge” enters the process. Reasonable guess for them is always coming from a good knowledge of the analytical pipeline. Optimal value could be optimized in an automatic way (like IPO does), but reasonable guesses are fundamental to restrict the quest for the optimal solution in a multidimensional space. So either you understand the analytics, or you should go and speak with you “analytical” colleague.

The second question. Are we getting comparable results?

cw_peaks <- chromPeaks(raw_one_cw_picked) 
dim(cw_peaks)

## [1] 706  11

head(cw_peaks, 5)

##             mz    mzmin    mzmax      rt   rtmin   rtmax      into      intb
## CP001 218.9822 218.9817 218.9826 34.5632 19.4760 46.1335 1236829.6 1236802.9
## CP002 218.9821 218.9820 218.9822  3.1473  0.6727  6.9058  479904.3  479896.8
## CP003 218.9819 218.9817 218.9823 11.9442  6.9058 16.9582  661522.6  661511.3
## CP004 218.9821 218.9818 218.9823 20.7363 16.9582 24.4925  462471.5  462462.7
## CP005 218.9817 218.9805 218.9826 43.3485 40.8326 46.1335  136005.4  136000.1
##           maxo    sn sample
## CP001 65368.86 65368      1
## CP002 79109.52 79109      1
## CP003 71700.71 71700      1
## CP004 60961.76 60961      1
## CP005 32284.30 32283      1

ggplotly(mf_peaks %>% 
  as.data.frame() %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz), col = "red", pch = 3, alpha = 0.5) + 
  geom_point(cw_peaks %>% as_tibble(), mapping = aes(x = rt, y = mz), col = "blue", pch = 1, alpha = 0.5) +
  theme_bw())

Different, isn’t it? In some cases the two algorithms are coherent, in others the results are markedly different. Centwave it is also giving this strange horizontal line at large rt

Let’s give a look to our subregion:

sub_peaks_cw <- cw_peaks %>% 
  as.data.frame() %>% 
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) 

ggplotly(tibble(rt = rt, mz = mz, I = I)  %>% 
  unnest(c("mz","I")) %>%
  filter(mz > 284 & mz < 300) %>% 
  filter(rt > 200 & rt < 300) %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(I), size = I)) + 
  geom_point(data = sub_peaks_cw, aes(x = rt, y = mz), col = "red", pch = 4, size = 3) + 
  scale_color_viridis_c() + 
  theme_light())

Comments

For high resolution spectra as the one we have centWave is non to be more reliable results
In the first steps of your analysis you will tune-check-tune the parameters to find a reasonable compromise
Your experience will drive this optimization: do you see molecules that should be there (standards, known compounds) in the peaklist?

Peak Picking all Dataset

In real life situations, when you are happy with your peak picking parameters you run the process on all your samples. xcms is designed for that, the only thing you have to do is to load more than on e file.

In view of the step in which the peak list will be merged together it is important also to add to the object containing the raw data any type of meta information which is associated to the samples. Typical meta information includes:

the filename
the class of the sample (QC, blank, sample, standard)
the study factors (treated, control, time, color, …)
the batch and/or injection order (to spot analytical drifts)

In xcms all these infos are stored in an AnnotatedDataFrame object.

Typically all these infos should be included in the filename, or in a csv which links the filename with the metadata. In the dataset we are using everything is in the file name, so let’s process them. In the following we will assume that the data are stored as mzML inside the data subfolder.

library(tools)  ## just to use file_path_sans_ext
phenodata <- tibble(filepath = list.files("../data/wines/", pattern = ".mzML", full.names = TRUE)) %>% 
  filter(!grepl("mix",filepath)) %>%     ## remove the injections of the standard mixes 
  mutate(fname = file_path_sans_ext(basename(filepath))) %>% 
  separate(fname, into = c("inj_ord", "color", "variety", "bottle", "rep", "polarity", "mode"), 
           sep = "_", remove = FALSE) %>% 
  as(.,"AnnotatedDataFrame")

## to see the content 
pData(phenodata)

## # A tibble: 29 × 9
##    filepath              fname inj_ord color variety bottle rep   polarity mode 
##    <chr>                 <chr> <chr>   <chr> <chr>   <chr>  <chr> <chr>    <chr>
##  1 ../data/wines/x001_X… x001… x001    X     blank   X      1     NEG      DDA  
##  2 ../data/wines/x002_X… x002… x002    X     QC      X      1     NEG      DDA  
##  3 ../data/wines/x003_X… x003… x003    X     QC      X      2     NEG      DDA  
##  4 ../data/wines/x004_r… x004… x004    red   sangio  B      1     NEG      DDA  
##  5 ../data/wines/x005_r… x005… x005    red   ptnoir  A      1     NEG      DDA  
##  6 ../data/wines/x006_r… x006… x006    red   merlot  A      1     NEG      DDA  
##  7 ../data/wines/x007_w… x007… x007    wht   vermen  A      1     NEG      DDA  
##  8 ../data/wines/x008_r… x008… x008    red   cannon  A      1     NEG      DDA  
##  9 ../data/wines/x009_w… x009… x009    wht   chardo  B      1     NEG      DDA  
## 10 ../data/wines/x010_w… x010… x010    wht   chardo  A      1     NEG      DDA  
## # ℹ 19 more rows

Here we read the data in:

raw_data <- readMSData(
  files = phenodata$filepath, 
  msLevel. = 1,
  pdata = phenodata, ## this is the structure of xcms holding phenotypic data
  mode = "onDisk")  ## with this parameter the data are not loaded into RAM

And perform the peak picking with centwave:

register(SerialParam()) ## Setting xcms in serial mode
raw_data <- findChromPeaks(raw_data, param = cwp)

## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 41 regions of interest ... OK: 23 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 732 regions of interest ... OK: 705 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 733 regions of interest ... OK: 731 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 760 regions of interest ... OK: 721 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 700 regions of interest ... OK: 649 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 860 regions of interest ... OK: 792 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 556 regions of interest ... OK: 712 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 794 regions of interest ... OK: 715 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 508 regions of interest ... OK: 585 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 754 regions of interest ... OK: 839 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 737 regions of interest ... OK: 615 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 754 regions of interest ... OK: 802 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 684 regions of interest ... OK: 768 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 44 regions of interest ... OK: 27 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 728 regions of interest ... OK: 693 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 753 regions of interest ... OK: 706 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 546 regions of interest ... OK: 605 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 909 regions of interest ... OK: 857 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 506 regions of interest ... OK: 588 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 634 regions of interest ... OK: 790 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 817 regions of interest ... OK: 697 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 839 regions of interest ... OK: 799 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 744 regions of interest ... OK: 819 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 723 regions of interest ... OK: 664 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 751 regions of interest ... OK: 661 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 688 regions of interest ... OK: 775 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 38 regions of interest ... OK: 13 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 742 regions of interest ... OK: 732 found.
## Detecting mass traces at 15 ppm ... OK
## Detecting chromatographic peaks in 764 regions of interest ... OK: 694 found.

To spare time we save the picked object …

save(raw_data, file = "wines.RData")

Let’s give a look to the content of the full dataset:

raw_data

## MSn experiment data ("XCMSnExp")
## Object size in memory: 4.41 Mb
## - - - Spectra data - - -
##  MS level(s): 1 
##  Number of spectra: 15688 
##  MSn retention times: 0:01 - 12:00 minutes
## - - - Processing information - - -
## Data loaded [Tue Nov 14 11:28:12 2023] 
## Filter: select MS level(s) 1. [Tue Nov 14 11:28:12 2023] 
##  MSnbase version: 2.28.1 
## - - - Meta data  - - -
## phenoData
##   rowNames: 1 2 ... 3 (29 total)
##   varLabels: filepath fname ... mode (9 total)
##   varMetadata: labelDescription
## Loaded from:
##   [1] x001_X_blank_X_1_NEG_DDA.mzML...  [29] x029_X_QC_X_6_NEG_DDA.mzML
##   Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
##   featureNames: F01.S0001 F01.S0004 ... F29.S1627 (15688 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
## - - - xcms preprocessing - - -
## Chromatographic peak detection:
##  method: centWave 
##  18777 peaks identified in 29 samples.
##  On average 647 chromatographic peaks per sample.

Here we are dealing with 29 files with different characteristics. After the peak picking optimization step we are confident that the parameters we have been choosing should give reasonably good results. Before going on, it is however useful to make some additional checks on the data quality.

The first thing to do is to monitor the TIC of the 29 experiments to spot potential drops of the signal during the analysis:

tics <- chromatogram(raw_data, 
                     aggregationFun = "sum", ## "sum" for "TIC" / "max" for "BPC"
                     include = "none")     ## this argument is required to avoid 
                                           ## including all the chromatograms of the picked peaks

We now plot them, with colors matching the sample class:

mypalette <- c("steelblue", "coral", "darkgreen")
names(mypalette) <- c("red","X","wht")

plot(tics, col = mypalette[raw_data$color])  ## raw_data$color is getting out the phenodata column called color
legend("topright", legend = c("red","Blank & QC","white"), col = mypalette, lty = 1)

The plot shows already nice things:

QCs are nicely superimposed
The two colors are well separated

The trend with the injection can be visualized as follows:

## Visualize the overall trend in TIC along the injections
tc <- split(tic(raw_data), f = fromFile(raw_data))
full_tics <- sapply(tc, sum)

pData(raw_data) %>% 
  tibble() %>% 
  add_column(full_tics = full_tics) %>% 
  ggplot() + 
  geom_point(aes(x = inj_ord, y = full_tics, col = color), size = 2) + 
  theme_bw() + 
  theme(aspect.ratio = 0.3)

So no clear trend is visible in the data. X here represents QC, blanks and StdMixes

Another type of visualization that I find useful with reasonably small datasets is the following:

chromPeaks(raw_data) %>% 
  as_tibble() %>% 
  ggplot() + 
  geom_point(aes(x = rt, y = mz, col = log10(maxo)), size = 1, alpha = 0.5) + 
  scale_color_viridis_c() + 
  facet_wrap(~factor(sample)) + 
  theme_bw() + 
  theme(aspect.ratio = 0.5)

Here we see the peak maps of the different files. It is clear that each map is expected to be slightly different, but rally outlying samples will show up clearly. Look to the blanks (i.e., samples 1, 14, 27), for example, or to the horizontal lines which are present in many files.

DIY

Something for you now:

Play around with the pick picking parameters and discuss what is the meaning of each one of them.
Try to use what we have done so far to understand what happens in correspondence of the horizontal lines visible in many samples. To do that you should find the “troublesome” mz and extract its chromatogram across the full set of samples.
Can you try to check the mz and rt of the same peak across the 29 samples (use the filterFile function to get rid of the blanks!)? Are they the same?