Annotation

library(MetaboCoreUtils)
library(xcms)
library(MetaboAnnotation)
library(RColorBrewer)
library(plotly)

Introduction

The objective of this demo is to perform compound annotation based on an in-house database created with the information gathered from the injection of a series of authentic standards under the same analytical protocol.
For that we will use the file from “mix 2”, which contain the following compounds:

Malvidine-3,5-diglucoside, C29H35O17+
Quercitrin-3-rhamnoside, C21H20O11
Laricitrin, C16H12O8
Quercetin-3-galactoside, C21H20O12
o-Coumaric acid, C9H8O3
Gallic acid, C7H6O5
Procyanidin B3, C30H26O12
Procyanidin A2, C30H24O12

Create the in-house library

The first step is to create our library with the information of the standards.
For that, we’re going to create a reference database with the m/z values corresponding to [M-H]- ion, which is the most common adduct usually we see with our analytical setup.

stds_tb <- data.frame(
  rbind(
    c("Malvidine-3,5-diglucoside", "C29H34O17"),  
    c("Quercitrin-3-rhamnoside", "C21H20O11"),
    c("Laricitrin", "C16H12O8"),
    c("Quercetin-3-galactoside", "C21H20O12"),
    c("o-Coumaric acid", "C9H8O3"),
    c("Gallic acid", "C7H6O5"),
    c("Procyanidin B3", "C30H26O12"),
    c("Procyanidin A2", "C30H24O12"))
)
colnames(stds_tb) <- c("name", "formula")

To calculate masses and m/z values we are going to use different functions included in the package MetaboCoreUtils:

stds_tb$mass <- calculateMass(stds_tb$formula)
stds_tb$mz <- as.numeric(mass2mz(stds_tb$mass, "[M-H]-"))
stds_tb

##                        name   formula     mass       mz
## 1 Malvidine-3,5-diglucoside C29H34O17 654.1796 653.1723
## 2   Quercitrin-3-rhamnoside C21H20O11 448.1006 447.0933
## 3                Laricitrin  C16H12O8 332.0532 331.0459
## 4   Quercetin-3-galactoside C21H20O12 464.0955 463.0882
## 5           o-Coumaric acid    C9H8O3 164.0473 163.0401
## 6               Gallic acid    C7H6O5 170.0215 169.0142
## 7            Procyanidin B3 C30H26O12 578.1424 577.1352
## 8            Procyanidin A2 C30H24O12 576.1268 575.1195

Ok, here we have the table with the theoretical m/z values that should be found in our sample considering the compounds included in it.

Next step is to determine their retention times. For that, now we are going to do chromatographic peak detection in the file from the injection that contains the mentioned analytical standards.

# Import the file
std_files <- list.files("../data/wines/", pattern = "mix02", full.names = TRUE)
std_raw <- readMSData(std_files, 
                      msLevel. = 1, ## we read only MS1 
                      mode = "onDisk")
# Perform peak picking:
cwp <- CentWaveParam(peakwidth = c(5, 30),    ## expected range of chromatographic peak width
                     ppm = 5,                 ## tolerance to identify ROIs in the mz/rt plane
                     prefilter = c(5, 50000),## number of consecutive scans showing a signal higher than 50000
                     noise = 5000)   
register(SerialParam()) ## Setting xcms in serial mode
std_cw <- findChromPeaks(std_raw, param = cwp)

## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 138 regions of interest ... OK: 104 found.

In total 104 chromatographic peaks have been identified. This is even more than would be expected since the sample represents a mixture of 8 pure standards…. Do you have any idea that could explain the reason(s) why a number of peaks much greater than the number of compounds contained in this sample has been detected?

Below we display some of the detected chromatographic peaks:

std_pks <- chromPeaks(std_cw)
head(std_pks)

##             mz    mzmin    mzmax       rt    rtmin    rtmax      into      intb
## CP001 149.0093 149.0092 149.0093  53.8847  48.8753  63.7131  697822.7  697809.5
## CP002 125.0246 125.0246 125.0247  53.8847  50.5242  62.5046 1318548.2 1289315.0
## CP003 169.0143 169.0142 169.0143  53.8847  50.5242  62.5046 2680096.3 2555866.1
## CP004 125.0246 125.0246 125.0247  92.5009  87.5114  98.7116 1693969.3 1654649.3
## CP005 169.0143 169.0143 169.0143  92.5009  87.5114  98.7116 3974028.5 3928392.4
## CP006 671.1821 671.1816 671.1826 239.6843 230.2282 246.7447 1232615.2 1225886.0
##            maxo    sn sample
## CP001  86023.87 86023      1
## CP002 203597.83    31      1
## CP003 390782.22    20      1
## CP004 214933.00    32      1
## CP005 559039.25    36      1
## CP006 116713.79    49      1

We next annotate these chromatographic peaks using only their m/z values. For doing this we are going to use a series of functions included in the MetaboAnnotation. To annotate the detected peaks we use the function matchMz against the reference database we’ve just created:

pks_match <- matchMz(
  std_pks, stds_tb, param = MzParam(ppm = 10)
)
pks_match

## Object of class Matched 
## Total number of matches: 8 
## Number of query objects: 104 (8 matched)
## Number of target objects: 8 (7 matched)

The result object handles the potential many-to-many matching between chromatographic peaks (query) and reference (target) and contains all information from both the query and the target object along with the score for the match (in this case the difference between the m/z values).
For example, here we can see that from the 8 reference compounds, 7 were found among the detected chromatographic peaks.
We can extract the full matching table with the function matchedData. This returns a table with all queries (i.e., chromatographic peaks) in the rows and the corresponding matches indicated in the columns which names start with “target_”. Note that if a row in query (i.e., chromatographic peak) matches multiple elements in target (i.e., reference database), this row will be duplicated in the returned table. For rows that can not be matched NA values are reported.

tb_match <- matchedData(pks_match)
tb_match

## DataFrame with 104 rows and 17 columns
##              mz     mzmin     mzmax        rt     rtmin     rtmax      into
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## CP001   149.009   149.009   149.009   53.8847   48.8753   63.7131    697823
## CP002   125.025   125.025   125.025   53.8847   50.5242   62.5046   1318548
## CP003   169.014   169.014   169.014   53.8847   50.5242   62.5046   2680096
## CP004   125.025   125.025   125.025   92.5009   87.5114   98.7116   1693969
## CP005   169.014   169.014   169.014   92.5009   87.5114   98.7116   3974028
## ...         ...       ...       ...       ...       ...       ...       ...
## CP100   288.936   288.936   288.936   718.178   714.144   719.685   2064367
## CP101   294.902   294.902   294.902   718.178   712.868   719.685    964329
## CP102   304.914   304.914   304.914   718.178   714.144   719.685   1962710
## CP103   311.168   311.168   311.168   699.506   685.236   718.178  11637225
## CP104   312.172   312.171   312.172   699.506   681.960   718.178   2287416
##            intb      maxo        sn    sample   target_name target_formula
##       <numeric> <numeric> <numeric> <numeric>   <character>    <character>
## CP001    697809   86023.9     86023         1            NA             NA
## CP002   1289315  203597.8        31         1            NA             NA
## CP003   2555866  390782.2        20         1 Gallic aci...         C7H6O5
## CP004   1654649  214933.0        32         1            NA             NA
## CP005   3928392  559039.2        36         1 Gallic aci...         C7H6O5
## ...         ...       ...       ...       ...           ...            ...
## CP100   2051862    651686       134         1            NA             NA
## CP101    934061    200990        32         1            NA             NA
## CP102   1943342    681089       113         1            NA             NA
## CP103  11454235    597291        74         1            NA             NA
## CP104   2240545    122710        32         1            NA             NA
##       target_mass target_mz       score ppm_error
##         <numeric> <numeric>   <numeric> <numeric>
## CP001          NA        NA          NA        NA
## CP002          NA        NA          NA        NA
## CP003     170.022   169.014 1.12462e-05 0.0665399
## CP004          NA        NA          NA        NA
## CP005     170.022   169.014 4.79129e-05 0.2834846
## ...           ...       ...         ...       ...
## CP100          NA        NA          NA        NA
## CP101          NA        NA          NA        NA
## CP102          NA        NA          NA        NA
## CP103          NA        NA          NA        NA
## CP104          NA        NA          NA        NA

As we can see, not all chromatographic peaks were annotated, whereas in other cases multiple chromatographic peaks were assigned to the same compound.
Let’s subset this data only with annotated chromatographic peaks:

tb_match <- tb_match[!is.na(tb_match$target_name),]
tb_match

## DataFrame with 8 rows and 17 columns
##              mz     mzmin     mzmax        rt     rtmin     rtmax      into
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## CP003   169.014   169.014   169.014   53.8847   50.5242   62.5046   2680096
## CP005   169.014   169.014   169.014   92.5009   87.5114   98.7116   3974028
## CP007   577.134   577.134   577.135  242.1512  236.1621  249.6089   2084342
## CP010   463.088   463.087   463.088  327.1548  322.4942  332.4844  12252035
## CP011   575.119   575.119   575.119  333.6979  327.1548  343.9343   5355539
## CP012   447.093   447.093   447.093  353.9643  349.3436  358.9049  12208440
## CP013   163.040   163.040   163.040  367.2148  362.6492  372.9214   5967047
## CP022   331.046   331.046   331.046  425.6146  421.1173  432.9044  19778108
##            intb      maxo        sn    sample   target_name target_formula
##       <numeric> <numeric> <numeric> <numeric>   <character>    <character>
## CP003   2555866    390782        20         1 Gallic aci...         C7H6O5
## CP005   3928392    559039        36         1 Gallic aci...         C7H6O5
## CP007   2084327    331704    331703         1 Procyanidi...      C30H26O12
## CP010  12242769   3713275      1014         1 Quercetin-...      C21H20O12
## CP011   5355522    845167    845166         1 Procyanidi...      C30H24O12
## CP012  12184820   3936981       360         1 Quercitrin...      C21H20O11
## CP013   5960745   1597530       699         1 o-Coumaric...         C9H8O3
## CP022  19699822   3990305       376         1    Laricitrin       C16H12O8
##       target_mass target_mz        score ppm_error
##         <numeric> <numeric>    <numeric> <numeric>
## CP003     170.022   169.014  1.12462e-05 0.0665399
## CP005     170.022   169.014  4.79129e-05 0.2834846
## CP007     578.142   577.135 -6.91385e-04 1.1979599
## CP010     464.095   463.088 -4.35508e-04 0.9404437
## CP011     576.127   575.120 -4.71652e-04 0.8200931
## CP012     448.101   447.093 -4.23341e-04 0.9468739
## CP013     164.047   163.040  1.12839e-04 0.6920949
## CP022     332.053   331.046 -3.20979e-04 0.9695903

Column “score” provides the difference between the query and target m/z values (in Da) and “ppm_error” the ppm error.
As you can see, gallic acid was annotated twice, whereas malvidine-3,5-diglucoside was not detected. The second case was something already expected since usually anthocyanins do not ionizate so well in negative ion mode. Regarding gallic acid, we are going to consider the one with the highest intensity.

dupl <- which(duplicated(tb_match$target_name))
idx <- which(tb_match$target_name == tb_match$target_name[dupl])
tb_match <- rbind(tb_match[-idx,],
                  tb_match[idx[which.max(tb_match$maxo[idx])],])

At this point we can expand our reference database with the retention time information for each analytical standard:

stds_tb$RT <- NA
for(i in seq(nrow(stds_tb))){
  idx <- which(tb_match$target_name == stds_tb$name[i])
  if(length(idx) > 0){
    stds_tb$RT[i] <- tb_match$rt[idx]
  }
}
stds_tb

##                        name   formula     mass       mz       RT
## 1 Malvidine-3,5-diglucoside C29H34O17 654.1796 653.1723       NA
## 2   Quercitrin-3-rhamnoside C21H20O11 448.1006 447.0933 353.9643
## 3                Laricitrin  C16H12O8 332.0532 331.0459 425.6146
## 4   Quercetin-3-galactoside C21H20O12 464.0955 463.0882 327.1548
## 5           o-Coumaric acid    C9H8O3 164.0473 163.0401 367.2148
## 6               Gallic acid    C7H6O5 170.0215 169.0142  92.5009
## 7            Procyanidin B3 C30H26O12 578.1424 577.1352 242.1512
## 8            Procyanidin A2 C30H24O12 576.1268 575.1195 333.6979

Finally, below we are going to check if in this injection any MSMS spectra for the included standards was recorded:

# load MSMS data:
std_MSMS <- readMSData(std_files, 
                       msLevel. = 2, ## we read only MS1 
                       mode = "onDisk")
# create a query data.frame of MSMS data:
std_MSMS_tb <- data.frame(
  mz = precursorMz(std_MSMS),
  rtime = rtime(std_MSMS)
)
# perform annoation based on precursorMz values:
MSMS_match <- matchMz(
  std_MSMS_tb, stds_tb, param = MzParam(ppm = 10)
)
MSMS_match

## Object of class Matched 
## Total number of matches: 3 
## Number of query objects: 522 (3 matched)
## Number of target objects: 8 (3 matched)

MSMS_match_tb <- matchedData(MSMS_match)
MSMS_match_tb <- MSMS_match_tb[!is.na(MSMS_match_tb$target_name),]
MSMS_match_tb

## DataFrame with 3 rows and 9 columns
##                 mz     rtime   target_name target_formula target_mass target_mz
##          <numeric> <numeric>   <character>    <character>   <numeric> <numeric>
## F1.S0185   169.014   89.1859 Gallic aci...         C7H6O5     170.022   169.014
## F1.S0530   577.135  243.0483 Procyanidi...      C30H26O12     578.142   577.135
## F1.S0812   163.040  370.8306 o-Coumaric...         C9H8O3     164.047   163.040
##          target_RT        score ppm_error
##          <numeric>    <numeric> <numeric>
## F1.S0185   92.5009  3.49346e-05  0.206696
## F1.S0530  242.1512 -5.67752e-04  0.983743
## F1.S0812  367.2148  1.38793e-04  0.851283

We can see that in this sample it was recorded the MSMS of 3 of the included analytical standards. Let’s keep this information because later it will be useful to compare these MSMS with the ones collected in the wine samples in case these compounds are found in them (in order to confirm their annotation with a higher confidence).

std_MSMS <- std_MSMS[featureNames(std_MSMS) %in% rownames(MSMS_match_tb)]

Annotation of study samples

At this point we are ready to see if any of the compounds included in the “mix 2” are observed in the study samples.

Below we load the processed data:

load("processed_wines_data.RData")
head(feat)

## DataFrame with 6 rows and 12 columns
##           mzmed     mzmin     mzmax     rtmed     rtmin     rtmax    npeaks
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001   56.5699   56.5698   56.5699  274.4321  273.0029  275.6861         8
## FT002   59.0140   59.0139   59.0141   52.5710   51.2949   53.7408        17
## FT003   65.6779   65.6779   65.6780   61.2852   60.2421   68.9642        10
## FT004   65.6779   65.6779   65.6779   79.8001   78.4873   80.2757         9
## FT005   68.4565   68.4564   68.4565  182.2209  179.7875  183.5496        17
## FT006   68.4565   68.4564   68.4565  163.1186  161.3699  171.1034        10
##             red       wht         X            peakidx  ms_level
##       <numeric> <numeric> <numeric>             <list> <integer>
## FT001         8         0         0 1875,2539,3298,...         1
## FT002         6         6         5  405, 730,1460,...         1
## FT003         9         0         0 1546,2883,4377,...         1
## FT004         9         0         0 1545,2882,4376,...         1
## FT005         3         8         6  292,1010,1713,...         1
## FT006         2         2         6  293,1011,1714,...         1

We can now annotate the detected features using our reference table considering this time the information of both RT and m/z by the use of the functions matchValues and MzRtParam.

ft_match <- matchValues(
  feat, stds_tb, param = MzRtParam(ppm = 10, toleranceRt = 5),
  mzColname = c("mzmed", "mz"),
  rtColname  = c("rtmed", "RT")
)
ft_match

## Object of class Matched 
## Total number of matches: 4 
## Number of query objects: 734 (4 matched)
## Number of target objects: 8 (4 matched)

We can see that 4/8 compounds included in the standard mix 2 are observed in wine samples. Let’s see which compounds are they:

tb_match <- matchedData(ft_match)
tb_match <- tb_match[!is.na(tb_match$target_name),]
tb_match

## DataFrame with 4 rows and 20 columns
##           mzmed     mzmin     mzmax     rtmed     rtmin     rtmax    npeaks
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT168   169.014   169.014   169.014   91.7997   90.9623   93.4497        26
## FT477   331.045   331.045   331.045  426.7226  425.3632  427.2577        16
## FT636   463.087   463.087   463.089  327.6402  324.7869  337.8187        16
## FT689   577.134   577.133   577.135  244.1201  242.8927  245.4458        12
##             red       wht         X            peakidx  ms_level   target_name
##       <numeric> <numeric> <numeric>             <list> <integer>   <character>
## FT168        10        10         6  306,1023,1743,...         1 Gallic aci...
## FT477        10         0         6  620,1368,2055,...         1    Laricitrin
## FT636         7         2         6  530,1260,1967,...         1 Quercetin-...
## FT689         5         2         5  480,1209,2619,...         1 Procyanidi...
##       target_formula target_mass target_mz target_RT        score ppm_error
##          <character>   <numeric> <numeric> <numeric>    <numeric> <numeric>
## FT168         C7H6O5     170.022   169.014   92.5009 -6.31429e-05  0.373595
## FT477       C16H12O8     332.053   331.046  425.6146 -8.03120e-04  2.426009
## FT636      C21H20O12     464.095   463.088  327.1548 -9.31859e-04  2.012271
## FT689      C30H26O12     578.142   577.135  242.1512 -1.59914e-03  2.770823
##        score_rt
##       <numeric>
## FT168 -0.701161
## FT477  1.108043
## FT636  0.485431
## FT689  1.968879

We can see that considering both m/z and RT values, gallic acid, laricitrin, quercetin-3-galactoside and procyanidin B3 are observed in the wine samples.
Below we plot the feature of gallic acid in study samples, indicating with a vertical line the RT determined by the injection of the corresponding analytical standard:

ft_chr <- featureChromatograms(raw_data_filled, 
                               features = "FT168", 
                               expandRt = 15, 
                               filled = FALSE)
group_colors <- paste0(brewer.pal(3, "Set1"), "60")
names(group_colors) <- c("red","wht","X")
sample_colors <- group_colors[raw_data_filled$color]
plot(ft_chr, col = group_colors[raw_data_filled$color],
     peakBg = sample_colors[chromPeaks(ft_chr)[, "sample"]])
abline(v = stds_tb$RT[stds_tb$name == "Gallic acid"], lwd = 10, col = "#D2D2D290")

Finally, let’s compare the MSMS pattern of gallic acid observed in the analtyical standard with the MSMS of this feature recorded within the study samples:

# get the MSMS of the standard:
ga_MSMS <- std_MSMS[[which(featureNames(std_MSMS) == rownames(MSMS_match_tb)[MSMS_match_tb$target_name == "Gallic acid"])]]

# get all MSMS in study samples with the m/z and RT of gallic acid:
MSMS_data <- readMSData(
  files = raw_data_filled$filepath, 
  msLevel. = 2,
  mode = "onDisk") 
ga_MSMS_data <- MSMS_data %>%
  filterPrecursorMz(169.0142, ppm = 10) %>%
  filterRt(92.5 + 5 * c(-1, 1))
ga_MSMS_data

## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.04 Mb
## - - - Spectra data - - -
##  MS level(s): 2 
##  Number of spectra: 10 
##  MSn retention times: 1:28 - 1:37 minutes
## - - - Processing information - - -
## Data loaded [Tue Nov 14 17:53:35 2023] 
## Filter: select MS level(s) 2. [Tue Nov 14 17:53:35 2023] 
## Filter: select by precursor m/z: 169.0142. [Tue Nov 14 17:53:35 2023] 
## Filter: select retention time [87.5-97.5] and MS level(s), 2 [Tue Nov 14 17:53:35 2023] 
##  MSnbase version: 2.28.1 
## - - - Meta data  - - -
## phenoData
##   rowNames: x005_red_ptnoir_A_1_NEG_DDA.mzML
##     x009_wht_chardo_B_1_NEG_DDA.mzML ...
##     x026_wht_chardo_C_1_NEG_DDA.mzML (7 total)
##   varLabels: sampleNames
##   varMetadata: labelDescription
## Loaded from:
##   [1] x005_red_ptnoir_A_1_NEG_DDA.mzML...  [7] x026_wht_chardo_C_1_NEG_DDA.mzML
##   Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
##   featureNames: F04.S0230 F08.S0212 ... F24.S0227 (10 total)
##   fvarLabels: fileIdx spIdx ... spectrum (35 total)
##   fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'

Within study samples there are a total of 10 spectras with a precursor m/z value and RT similar to the one observed for gallic acid.
Below we compare through a mirror plot the MSMS observed in study samples (upper plot) versus the MSMS observe in the analytical standard (bottomp lot, red color):

ggplotly(ggplot() +
           geom_segment(aes(x = mz(ga_MSMS), xend = mz(ga_MSMS), 
                            y = 0, yend = (intensity(ga_MSMS)/max(intensity(ga_MSMS)))*(-1)), 
                        color = "red") +
           geom_segment(aes(x = mz(ga_MSMS_data)[[1]], xend = mz(ga_MSMS_data)[[1]], 
                            y = 0, yend = intensity(ga_MSMS_data)[[1]]/max(intensity(ga_MSMS_data)[[1]]))) +
           ylab("relative intensity") + xlab("m/z") +
           annotate(geom = "text", x = 160, y = 0.8, cex = 3, 
                    label = ga_MSMS_data@phenoData@data[ga_MSMS_data[[1]]@fromFile, 1]) +
           annotate(geom = "text", x = 160, y = -0.8, cex = 3, 
                    label = basename(fileNames(std_MSMS)[ga_MSMS@fromFile]), color = "red")
)

DIY

Check the peak shapes of the other annotated features.
Compare the MSMS of the other annotated features for which at least one MSMS spectra from standards was recorded.
Annotate other compounds based on the standards included in mix 03 and/or mix 04:
- MIX 03:
  - Petunidine-3-glucoside, C22H23O12+
  - Malvidine, C17H15O7+
  - Cyanidin-3,5-diglucoside, C27H31O16+
  - Syringetin-3-glucoside, C23H24O13
  - Quercetin-3-o-glucopyranoside, C21H20O12
  - Kaempferol, C15H10O6
  - Myricetin, C15H10O8
  - Cinnamic acid, C9H8O2
  - Benzoic acid, C7H6O2
  - Ellagic acid, C14H6O8
  - (+)Catechin, C15H14O6
  - Procyanidin B2, C30H26O12
  - Malvidine-3,5-diglucoside, C29H35O17+
  - Epigallocatechine, C15H14O7
- MIX 04:
  - Kaempferol-3-glucoside, C21H20O11
  - p-Coumaric acid, C9H8O3
  - 2,4-Dihidroxibenzoic acid, C7H6O4
  - Caffeic acid ethylester, C11H12O4
  - (-)Catechin-3-O-gallate, C22H18O10
  - Epigallocatechine gallate, C22H18O11
  - (-)Gallocatechin, C15H14O7
  - Isorhamnetin-3-rutinoside, C28H32O16
  - Procyanidin B1, C30H26O12

Annotation

Mar Garcia-Aloy, Pietro Franceschi

2023-11-14

Introduction

Create the in-house library

Annotation of study samples

DIY