library(MetaboCoreUtils)
library(xcms)
library(MetaboAnnotation)
library(RColorBrewer)
library(plotly)
The objective of this demo is to perform compound annotation based on
an in-house database created with the information gathered from the
injection of a series of authentic standards under the same analytical
protocol.
For that we will use the file from “mix 2”, which contain the following
compounds:
The first step is to create our library with the information of the
standards.
For that, we’re going to create a reference database with the m/z values
corresponding to [M-H]- ion, which is the most common adduct usually we
see with our analytical setup.
stds_tb <- data.frame(
rbind(
c("Malvidine-3,5-diglucoside", "C29H34O17"),
c("Quercitrin-3-rhamnoside", "C21H20O11"),
c("Laricitrin", "C16H12O8"),
c("Quercetin-3-galactoside", "C21H20O12"),
c("o-Coumaric acid", "C9H8O3"),
c("Gallic acid", "C7H6O5"),
c("Procyanidin B3", "C30H26O12"),
c("Procyanidin A2", "C30H24O12"))
)
colnames(stds_tb) <- c("name", "formula")
To calculate masses and m/z values we are going to use different
functions included in the package MetaboCoreUtils
:
stds_tb$mass <- calculateMass(stds_tb$formula)
stds_tb$mz <- as.numeric(mass2mz(stds_tb$mass, "[M-H]-"))
stds_tb
## name formula mass mz
## 1 Malvidine-3,5-diglucoside C29H34O17 654.1796 653.1723
## 2 Quercitrin-3-rhamnoside C21H20O11 448.1006 447.0933
## 3 Laricitrin C16H12O8 332.0532 331.0459
## 4 Quercetin-3-galactoside C21H20O12 464.0955 463.0882
## 5 o-Coumaric acid C9H8O3 164.0473 163.0401
## 6 Gallic acid C7H6O5 170.0215 169.0142
## 7 Procyanidin B3 C30H26O12 578.1424 577.1352
## 8 Procyanidin A2 C30H24O12 576.1268 575.1195
Ok, here we have the table with the theoretical m/z values that should be found in our sample considering the compounds included in it.
Next step is to determine their retention times. For that, now we are going to do chromatographic peak detection in the file from the injection that contains the mentioned analytical standards.
# Import the file
std_files <- list.files("../data/wines/", pattern = "mix02", full.names = TRUE)
std_raw <- readMSData(std_files,
msLevel. = 1, ## we read only MS1
mode = "onDisk")
# Perform peak picking:
cwp <- CentWaveParam(peakwidth = c(5, 30), ## expected range of chromatographic peak width
ppm = 5, ## tolerance to identify ROIs in the mz/rt plane
prefilter = c(5, 50000),## number of consecutive scans showing a signal higher than 50000
noise = 5000)
register(SerialParam()) ## Setting xcms in serial mode
std_cw <- findChromPeaks(std_raw, param = cwp)
## Detecting mass traces at 5 ppm ... OK
## Detecting chromatographic peaks in 138 regions of interest ... OK: 104 found.
In total 104 chromatographic peaks have been identified. This is even more than would be expected since the sample represents a mixture of 8 pure standards…. Do you have any idea that could explain the reason(s) why a number of peaks much greater than the number of compounds contained in this sample has been detected?
Below we display some of the detected chromatographic peaks:
std_pks <- chromPeaks(std_cw)
head(std_pks)
## mz mzmin mzmax rt rtmin rtmax into intb
## CP001 149.0093 149.0092 149.0093 53.8847 48.8753 63.7131 697822.7 697809.5
## CP002 125.0246 125.0246 125.0247 53.8847 50.5242 62.5046 1318548.2 1289315.0
## CP003 169.0143 169.0142 169.0143 53.8847 50.5242 62.5046 2680096.3 2555866.1
## CP004 125.0246 125.0246 125.0247 92.5009 87.5114 98.7116 1693969.3 1654649.3
## CP005 169.0143 169.0143 169.0143 92.5009 87.5114 98.7116 3974028.5 3928392.4
## CP006 671.1821 671.1816 671.1826 239.6843 230.2282 246.7447 1232615.2 1225886.0
## maxo sn sample
## CP001 86023.87 86023 1
## CP002 203597.83 31 1
## CP003 390782.22 20 1
## CP004 214933.00 32 1
## CP005 559039.25 36 1
## CP006 116713.79 49 1
We next annotate these chromatographic peaks using only their m/z
values. For doing this we are going to use a series of functions
included in the MetaboAnnotation
.
To annotate the detected peaks we use the function matchMz
against the reference database we’ve just created:
pks_match <- matchMz(
std_pks, stds_tb, param = MzParam(ppm = 10)
)
pks_match
## Object of class Matched
## Total number of matches: 8
## Number of query objects: 104 (8 matched)
## Number of target objects: 8 (7 matched)
The result object handles the potential many-to-many matching between
chromatographic peaks (query) and reference (target)
and contains all information from both the query and the target object
along with the score for the match (in this case the difference between
the m/z values).
For example, here we can see that from the 8 reference compounds, 7 were
found among the detected chromatographic peaks.
We can extract the full matching table with the function
matchedData
. This returns a table with all queries (i.e.,
chromatographic peaks) in the rows and the corresponding matches
indicated in the columns which names start with “target_”. Note that if
a row in query (i.e., chromatographic peak) matches multiple elements in
target (i.e., reference database), this row will be duplicated in the
returned table. For rows that can not be matched NA values are
reported.
tb_match <- matchedData(pks_match)
tb_match
## DataFrame with 104 rows and 17 columns
## mz mzmin mzmax rt rtmin rtmax into
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## CP001 149.009 149.009 149.009 53.8847 48.8753 63.7131 697823
## CP002 125.025 125.025 125.025 53.8847 50.5242 62.5046 1318548
## CP003 169.014 169.014 169.014 53.8847 50.5242 62.5046 2680096
## CP004 125.025 125.025 125.025 92.5009 87.5114 98.7116 1693969
## CP005 169.014 169.014 169.014 92.5009 87.5114 98.7116 3974028
## ... ... ... ... ... ... ... ...
## CP100 288.936 288.936 288.936 718.178 714.144 719.685 2064367
## CP101 294.902 294.902 294.902 718.178 712.868 719.685 964329
## CP102 304.914 304.914 304.914 718.178 714.144 719.685 1962710
## CP103 311.168 311.168 311.168 699.506 685.236 718.178 11637225
## CP104 312.172 312.171 312.172 699.506 681.960 718.178 2287416
## intb maxo sn sample target_name target_formula
## <numeric> <numeric> <numeric> <numeric> <character> <character>
## CP001 697809 86023.9 86023 1 NA NA
## CP002 1289315 203597.8 31 1 NA NA
## CP003 2555866 390782.2 20 1 Gallic aci... C7H6O5
## CP004 1654649 214933.0 32 1 NA NA
## CP005 3928392 559039.2 36 1 Gallic aci... C7H6O5
## ... ... ... ... ... ... ...
## CP100 2051862 651686 134 1 NA NA
## CP101 934061 200990 32 1 NA NA
## CP102 1943342 681089 113 1 NA NA
## CP103 11454235 597291 74 1 NA NA
## CP104 2240545 122710 32 1 NA NA
## target_mass target_mz score ppm_error
## <numeric> <numeric> <numeric> <numeric>
## CP001 NA NA NA NA
## CP002 NA NA NA NA
## CP003 170.022 169.014 1.12462e-05 0.0665399
## CP004 NA NA NA NA
## CP005 170.022 169.014 4.79129e-05 0.2834846
## ... ... ... ... ...
## CP100 NA NA NA NA
## CP101 NA NA NA NA
## CP102 NA NA NA NA
## CP103 NA NA NA NA
## CP104 NA NA NA NA
As we can see, not all chromatographic peaks were annotated, whereas
in other cases multiple chromatographic peaks were assigned to the same
compound.
Let’s subset this data only with annotated chromatographic peaks:
tb_match <- tb_match[!is.na(tb_match$target_name),]
tb_match
## DataFrame with 8 rows and 17 columns
## mz mzmin mzmax rt rtmin rtmax into
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## CP003 169.014 169.014 169.014 53.8847 50.5242 62.5046 2680096
## CP005 169.014 169.014 169.014 92.5009 87.5114 98.7116 3974028
## CP007 577.134 577.134 577.135 242.1512 236.1621 249.6089 2084342
## CP010 463.088 463.087 463.088 327.1548 322.4942 332.4844 12252035
## CP011 575.119 575.119 575.119 333.6979 327.1548 343.9343 5355539
## CP012 447.093 447.093 447.093 353.9643 349.3436 358.9049 12208440
## CP013 163.040 163.040 163.040 367.2148 362.6492 372.9214 5967047
## CP022 331.046 331.046 331.046 425.6146 421.1173 432.9044 19778108
## intb maxo sn sample target_name target_formula
## <numeric> <numeric> <numeric> <numeric> <character> <character>
## CP003 2555866 390782 20 1 Gallic aci... C7H6O5
## CP005 3928392 559039 36 1 Gallic aci... C7H6O5
## CP007 2084327 331704 331703 1 Procyanidi... C30H26O12
## CP010 12242769 3713275 1014 1 Quercetin-... C21H20O12
## CP011 5355522 845167 845166 1 Procyanidi... C30H24O12
## CP012 12184820 3936981 360 1 Quercitrin... C21H20O11
## CP013 5960745 1597530 699 1 o-Coumaric... C9H8O3
## CP022 19699822 3990305 376 1 Laricitrin C16H12O8
## target_mass target_mz score ppm_error
## <numeric> <numeric> <numeric> <numeric>
## CP003 170.022 169.014 1.12462e-05 0.0665399
## CP005 170.022 169.014 4.79129e-05 0.2834846
## CP007 578.142 577.135 -6.91385e-04 1.1979599
## CP010 464.095 463.088 -4.35508e-04 0.9404437
## CP011 576.127 575.120 -4.71652e-04 0.8200931
## CP012 448.101 447.093 -4.23341e-04 0.9468739
## CP013 164.047 163.040 1.12839e-04 0.6920949
## CP022 332.053 331.046 -3.20979e-04 0.9695903
Column “score” provides the difference between the query and target
m/z values (in Da) and “ppm_error” the ppm error.
As you can see, gallic acid was annotated twice, whereas
malvidine-3,5-diglucoside was not detected. The second case was
something already expected since usually anthocyanins do not ionizate so
well in negative ion mode. Regarding gallic acid, we are going to
consider the one with the highest intensity.
dupl <- which(duplicated(tb_match$target_name))
idx <- which(tb_match$target_name == tb_match$target_name[dupl])
tb_match <- rbind(tb_match[-idx,],
tb_match[idx[which.max(tb_match$maxo[idx])],])
At this point we can expand our reference database with the retention time information for each analytical standard:
stds_tb$RT <- NA
for(i in seq(nrow(stds_tb))){
idx <- which(tb_match$target_name == stds_tb$name[i])
if(length(idx) > 0){
stds_tb$RT[i] <- tb_match$rt[idx]
}
}
stds_tb
## name formula mass mz RT
## 1 Malvidine-3,5-diglucoside C29H34O17 654.1796 653.1723 NA
## 2 Quercitrin-3-rhamnoside C21H20O11 448.1006 447.0933 353.9643
## 3 Laricitrin C16H12O8 332.0532 331.0459 425.6146
## 4 Quercetin-3-galactoside C21H20O12 464.0955 463.0882 327.1548
## 5 o-Coumaric acid C9H8O3 164.0473 163.0401 367.2148
## 6 Gallic acid C7H6O5 170.0215 169.0142 92.5009
## 7 Procyanidin B3 C30H26O12 578.1424 577.1352 242.1512
## 8 Procyanidin A2 C30H24O12 576.1268 575.1195 333.6979
Finally, below we are going to check if in this injection any MSMS spectra for the included standards was recorded:
# load MSMS data:
std_MSMS <- readMSData(std_files,
msLevel. = 2, ## we read only MS1
mode = "onDisk")
# create a query data.frame of MSMS data:
std_MSMS_tb <- data.frame(
mz = precursorMz(std_MSMS),
rtime = rtime(std_MSMS)
)
# perform annoation based on precursorMz values:
MSMS_match <- matchMz(
std_MSMS_tb, stds_tb, param = MzParam(ppm = 10)
)
MSMS_match
## Object of class Matched
## Total number of matches: 3
## Number of query objects: 522 (3 matched)
## Number of target objects: 8 (3 matched)
MSMS_match_tb <- matchedData(MSMS_match)
MSMS_match_tb <- MSMS_match_tb[!is.na(MSMS_match_tb$target_name),]
MSMS_match_tb
## DataFrame with 3 rows and 9 columns
## mz rtime target_name target_formula target_mass target_mz
## <numeric> <numeric> <character> <character> <numeric> <numeric>
## F1.S0185 169.014 89.1859 Gallic aci... C7H6O5 170.022 169.014
## F1.S0530 577.135 243.0483 Procyanidi... C30H26O12 578.142 577.135
## F1.S0812 163.040 370.8306 o-Coumaric... C9H8O3 164.047 163.040
## target_RT score ppm_error
## <numeric> <numeric> <numeric>
## F1.S0185 92.5009 3.49346e-05 0.206696
## F1.S0530 242.1512 -5.67752e-04 0.983743
## F1.S0812 367.2148 1.38793e-04 0.851283
We can see that in this sample it was recorded the MSMS of 3 of the included analytical standards. Let’s keep this information because later it will be useful to compare these MSMS with the ones collected in the wine samples in case these compounds are found in them (in order to confirm their annotation with a higher confidence).
std_MSMS <- std_MSMS[featureNames(std_MSMS) %in% rownames(MSMS_match_tb)]
At this point we are ready to see if any of the compounds included in the “mix 2” are observed in the study samples.
Below we load the processed data:
load("processed_wines_data.RData")
head(feat)
## DataFrame with 6 rows and 12 columns
## mzmed mzmin mzmax rtmed rtmin rtmax npeaks
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT001 56.5699 56.5698 56.5699 274.4321 273.0029 275.6861 8
## FT002 59.0140 59.0139 59.0141 52.5710 51.2949 53.7408 17
## FT003 65.6779 65.6779 65.6780 61.2852 60.2421 68.9642 10
## FT004 65.6779 65.6779 65.6779 79.8001 78.4873 80.2757 9
## FT005 68.4565 68.4564 68.4565 182.2209 179.7875 183.5496 17
## FT006 68.4565 68.4564 68.4565 163.1186 161.3699 171.1034 10
## red wht X peakidx ms_level
## <numeric> <numeric> <numeric> <list> <integer>
## FT001 8 0 0 1875,2539,3298,... 1
## FT002 6 6 5 405, 730,1460,... 1
## FT003 9 0 0 1546,2883,4377,... 1
## FT004 9 0 0 1545,2882,4376,... 1
## FT005 3 8 6 292,1010,1713,... 1
## FT006 2 2 6 293,1011,1714,... 1
We can now annotate the detected features using our reference table
considering this time the information of both RT and m/z by the use of
the functions matchValues
and MzRtParam
.
ft_match <- matchValues(
feat, stds_tb, param = MzRtParam(ppm = 10, toleranceRt = 5),
mzColname = c("mzmed", "mz"),
rtColname = c("rtmed", "RT")
)
ft_match
## Object of class Matched
## Total number of matches: 4
## Number of query objects: 734 (4 matched)
## Number of target objects: 8 (4 matched)
We can see that 4/8 compounds included in the standard mix 2 are observed in wine samples. Let’s see which compounds are they:
tb_match <- matchedData(ft_match)
tb_match <- tb_match[!is.na(tb_match$target_name),]
tb_match
## DataFrame with 4 rows and 20 columns
## mzmed mzmin mzmax rtmed rtmin rtmax npeaks
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT168 169.014 169.014 169.014 91.7997 90.9623 93.4497 26
## FT477 331.045 331.045 331.045 426.7226 425.3632 427.2577 16
## FT636 463.087 463.087 463.089 327.6402 324.7869 337.8187 16
## FT689 577.134 577.133 577.135 244.1201 242.8927 245.4458 12
## red wht X peakidx ms_level target_name
## <numeric> <numeric> <numeric> <list> <integer> <character>
## FT168 10 10 6 306,1023,1743,... 1 Gallic aci...
## FT477 10 0 6 620,1368,2055,... 1 Laricitrin
## FT636 7 2 6 530,1260,1967,... 1 Quercetin-...
## FT689 5 2 5 480,1209,2619,... 1 Procyanidi...
## target_formula target_mass target_mz target_RT score ppm_error
## <character> <numeric> <numeric> <numeric> <numeric> <numeric>
## FT168 C7H6O5 170.022 169.014 92.5009 -6.31429e-05 0.373595
## FT477 C16H12O8 332.053 331.046 425.6146 -8.03120e-04 2.426009
## FT636 C21H20O12 464.095 463.088 327.1548 -9.31859e-04 2.012271
## FT689 C30H26O12 578.142 577.135 242.1512 -1.59914e-03 2.770823
## score_rt
## <numeric>
## FT168 -0.701161
## FT477 1.108043
## FT636 0.485431
## FT689 1.968879
We can see that considering both m/z and RT values, gallic acid,
laricitrin, quercetin-3-galactoside and procyanidin B3 are observed in
the wine samples.
Below we plot the feature of gallic acid in study samples, indicating
with a vertical line the RT determined by the injection of the
corresponding analytical standard:
ft_chr <- featureChromatograms(raw_data_filled,
features = "FT168",
expandRt = 15,
filled = FALSE)
group_colors <- paste0(brewer.pal(3, "Set1"), "60")
names(group_colors) <- c("red","wht","X")
sample_colors <- group_colors[raw_data_filled$color]
plot(ft_chr, col = group_colors[raw_data_filled$color],
peakBg = sample_colors[chromPeaks(ft_chr)[, "sample"]])
abline(v = stds_tb$RT[stds_tb$name == "Gallic acid"], lwd = 10, col = "#D2D2D290")
Finally, let’s compare the MSMS pattern of gallic acid observed in the analtyical standard with the MSMS of this feature recorded within the study samples:
# get the MSMS of the standard:
ga_MSMS <- std_MSMS[[which(featureNames(std_MSMS) == rownames(MSMS_match_tb)[MSMS_match_tb$target_name == "Gallic acid"])]]
# get all MSMS in study samples with the m/z and RT of gallic acid:
MSMS_data <- readMSData(
files = raw_data_filled$filepath,
msLevel. = 2,
mode = "onDisk")
ga_MSMS_data <- MSMS_data %>%
filterPrecursorMz(169.0142, ppm = 10) %>%
filterRt(92.5 + 5 * c(-1, 1))
ga_MSMS_data
## MSn experiment data ("OnDiskMSnExp")
## Object size in memory: 0.04 Mb
## - - - Spectra data - - -
## MS level(s): 2
## Number of spectra: 10
## MSn retention times: 1:28 - 1:37 minutes
## - - - Processing information - - -
## Data loaded [Tue Nov 14 17:53:35 2023]
## Filter: select MS level(s) 2. [Tue Nov 14 17:53:35 2023]
## Filter: select by precursor m/z: 169.0142. [Tue Nov 14 17:53:35 2023]
## Filter: select retention time [87.5-97.5] and MS level(s), 2 [Tue Nov 14 17:53:35 2023]
## MSnbase version: 2.28.1
## - - - Meta data - - -
## phenoData
## rowNames: x005_red_ptnoir_A_1_NEG_DDA.mzML
## x009_wht_chardo_B_1_NEG_DDA.mzML ...
## x026_wht_chardo_C_1_NEG_DDA.mzML (7 total)
## varLabels: sampleNames
## varMetadata: labelDescription
## Loaded from:
## [1] x005_red_ptnoir_A_1_NEG_DDA.mzML... [7] x026_wht_chardo_C_1_NEG_DDA.mzML
## Use 'fileNames(.)' to see all files.
## protocolData: none
## featureData
## featureNames: F04.S0230 F08.S0212 ... F24.S0227 (10 total)
## fvarLabels: fileIdx spIdx ... spectrum (35 total)
## fvarMetadata: labelDescription
## experimentData: use 'experimentData(object)'
Within study samples there are a total of 10 spectras with a
precursor m/z value and RT similar to the one observed for gallic
acid.
Below we compare through a mirror plot the MSMS observed in study
samples (upper plot) versus the MSMS observe in the analytical standard
(bottomp lot, red color):
ggplotly(ggplot() +
geom_segment(aes(x = mz(ga_MSMS), xend = mz(ga_MSMS),
y = 0, yend = (intensity(ga_MSMS)/max(intensity(ga_MSMS)))*(-1)),
color = "red") +
geom_segment(aes(x = mz(ga_MSMS_data)[[1]], xend = mz(ga_MSMS_data)[[1]],
y = 0, yend = intensity(ga_MSMS_data)[[1]]/max(intensity(ga_MSMS_data)[[1]]))) +
ylab("relative intensity") + xlab("m/z") +
annotate(geom = "text", x = 160, y = 0.8, cex = 3,
label = ga_MSMS_data@phenoData@data[ga_MSMS_data[[1]]@fromFile, 1]) +
annotate(geom = "text", x = 160, y = -0.8, cex = 3,
label = basename(fileNames(std_MSMS)[ga_MSMS@fromFile]), color = "red")
)