Synthetic_Dataset

Introduction

The objective of this document is to show how the synthetic dataset included in the gASCA package was created. In this simulation, 15 count variables are obtained as a function of 3 different treatments at 5 points in time. These variables could represent, for instance, count data from 10 different invertebrate species found on plants at Time (A) 0, 1, 2, 3, 4 after the plants have been treated with one of three treatments (B1, B2, B3). For each combination of time and treatment, K = 10 replicates (k = 1…K) were simulated considering the model:

Y = g⁻¹(H₀+H_A+H_B+H_AB) + EPois

This is further decomposed in term-specific latent variables. A set of loadings is then used to associate each latent variable to a group of the measured variables:

Y = g⁻¹(H₀+T_AP_A^T+T_BP_B^T+T_ABPAB^T) + E_*Poi**s*

In particular.

Two principal components are used to describe the variation in matrix H_A (time). The variation described by the first component is only present in the first two variables, and the variation of the second component is only present in variables 3 and 4. PC1 has a continuous upward trend in time while PC2 increases in the first three time steps and then decreases again.
One principal component is used to describe the treatment effect in matrix H_B (treatment) and this variation is only present in variables 5 and 6.
Two components are used for the interaction effect in matrix H_*A**B*: PC1 is only active in variables 7 and 8 and shows a disagreement between treatment B3 and the other two treatments, while PC2 variation, only assigned to variables 9 and 10, represents a deviation from treatment B2 compared to the other two treatments.
Five additional variables not responding to the design are added to the dataset.

ntime <- 5   # no of timepoints
ntreat <- 3  # no of treatments
nrepl <- 10   # no of replicates

## number of samples
nSamp = ntime * ntreat * nrepl

Factor A: time

## The design a matrix for the first factor is the following
contrA <- contr.sum(ntime)

contrA

##   [,1] [,2] [,3] [,4]
## 1    1    0    0    0
## 2    0    1    0    0
## 3    0    0    1    0
## 4    0    0    0    1
## 5   -1   -1   -1   -1

For each time point we have to allocate three treatments and 10 replicates

## since we have to allocate for each time point three treatment levels and 
## three replicates
XA <- contrA %x% rep(1,ntreat)
XA <- XA %x% rep(1,nrepl)

Let’s now plot the matrix

XA %>% 
  melt() %>% 
  ggplot() + 
  geom_raster(aes(x = Var2, y = -Var1, fill = factor(value))) + 
  scale_fill_brewer(palette = "Set1", name = "Value") + 
  theme_void()

The previous design matrix is used on the two time eigentrends, which will be now organized in a matrix. Remember that since we use sum coding for the contrasts we need only four level for the 5 timepoints:

A_LV1 <- c(-2,-1,0,1)
A_LV2 <- c(-1,0.5,1,0.5)

WA <- cbind(A_LV1,A_LV2)

The final score matrices are obtained combining WA and XA

TA <- XA %*% WA

Let’s visualize the profile of the two eigentrends

TA %>% 
  data.frame() %>% 
  add_column(time = rep(c("t0","t1","t2","t3","t4"), each = 30)) %>% 
  add_column(treatment = rep(rep(c("B1","B2","B3"), each = 10), times = 5)) %>%  
  pivot_longer(contains("LV")) %>% 
  ggplot() + 
  geom_point(aes(x = time, y = value, col = treatment), alpha = 0.5, size = 1) + 
  geom_line(aes(x = time, y = value, col = treatment, lty = treatment, group = treatment)) + 
  facet_grid(treatment~name) + 
  theme_bw() +
  theme(aspect.ratio = 0.5)

Factor B: treatment

As before we start from the design matrix with sum contrasts and 3 levels (B1,B2,B3)

contrB <- contr.sum(ntreat)
## we then have n time points and the replicates
XB <- rep(1,ntime) %x% contrB
XB = XB %x% rep(1,nrepl)

XB %>% 
  melt() %>% 
  ggplot() + 
  geom_raster(aes(x = Var2, y = -Var1, fill = factor(value))) + 
  scale_fill_brewer(palette = "Set1", name = "Value") + 
  theme_void()

Now we define the effect of the treatment which has only one LV

## this is the effect of the treatment, only one component
WB <- c(-2, 1.5)

As before we “populate” the design matrix with the treatment levels

TB <- XB %*% WB

And we plot the results

TB %>% 
  data.frame() %>% 
  add_column(time = rep(c("t0","t1","t2","t3","t4"), each = 30)) %>% 
  add_column(treatment = rep(rep(c("B1","B2","B3"), each = 10), times = 5)) %>%  
  ggplot() + 
  geom_point(aes(x = time, y = ., col = treatment), alpha = 0.5, size = 3) + 
  geom_line(aes(x = time, y = ., col = treatment, group = treatment)) + 
  scale_color_brewer(palette = "Set1",  name = "Eigentrends") + 
  theme_bw()

Interaction Term

The design matrix of the interaction term is obtained by a combination of XA and XB, each sample will be assigned to a combination of the levels of A and of B

## now we have the interaction term
XAB <-  cbind(XA[,1]*XB[,1], 
            XA[,1]*XB[,2], 
            XA[,2]*XB[,1], 
            XA[,2]*XB[,2], 
            XA[,3]*XB[,1], 
            XA[,3]*XB[,2], 
            XA[,4]*XB[,1], 
            XA[,4]*XB[,2])

and this is the plot

XAB %>% 
  melt() %>% 
  ggplot() + 
  geom_raster(aes(x = Var2, y = -Var1, fill = factor(value))) + 
  scale_fill_brewer(palette = "Set1", name = "Value") + 
  theme_void()

## these are the impact of the interaction on the two  components
AB_LV1 <-  3*c(0.4, 0.4, 0.2, 0.2, -0.2, -0.2, -0.4, -0.4)
AB_LV2 <-  c(-0.2,0.4, 0.2, -0.4, -0.2, 0.4, 0.2, -0.4)
WAB <- cbind(AB_LV1, AB_LV2)

TAB <- XAB%*%WAB

TAB %>% 
  data.frame() %>% 
  add_column(time = rep(c("t0","t1","t2","t3","t4"), each = 30)) %>% 
  add_column(treatment = rep(rep(c("B1","B2","B3"), each = 10), times = 5)) %>%  
  pivot_longer(contains("LV")) %>% 
  ggplot() + 
  #geom_point(aes(x = time, y = value, col = treatment), alpha = 0.5, size = 3) +
  geom_point(aes(x = time, y = value, col = treatment), alpha = 0.5, size = 3, 
             position=position_jitterdodge(dodge.width = 0.1, jitter.width = 0.1)) +
  
  geom_line(aes(x = time, y = value, col = treatment, lty = treatment, group = treatment),
            position=position_jitterdodge(dodge.width = 0.1, jitter.width = 0.1) ) + 
  facet_wrap(~name) + 
  scale_fill_brewer(palette = "Set1",  name = "Eigentrends") + 
  xlab("Sample Index") + 
  theme_bw() + 
  theme(aspect.ratio = 0.5)

According to the initial plan, LV1 reflects a temporal change in the difference between treatment B3 and the other treatments; LV2, instead, reflects a difference between B2 and the other two treatments which changes with time.

Project on the species

Now we will use a set of loadings to associate the LV of each term to the 10 measured species. The loadings will be also normalized to 1. As discussed before, we also add 5 species which are not responding to the design.

## here we use "dummy loadings" to move from the LVs to the 15 "measured" variables

## LV1 of the time factor goes on the first two species
PA1 <- c(1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
PA1 <- PA1/sqrt(sum(PA1%*%PA1)) ## normalize  

## LV2 of the time factor goes on s3 and s4
PA2 <- c(0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0)
PA2 <- PA2/sqrt(sum(PA2^2))

## here the loadings matrix for time
PA <- cbind(PA1, PA2)

## LV1 for the treatment factor goes to 
PB <- c(0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
PB <- PB/sqrt(sum(PB^2))

## LV1 for the interaction goes to variables s7, s8
PAB1 <- c(0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)
PAB1 <- PAB1/sqrt(sum(PAB1^2))
## LV2 for the interaction goes to variables s9, s10
PAB2 <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0)
PAB2 <- PAB2/sqrt(sum(PAB2^2))
PAB <- cbind(PAB1, PAB2)

Let’s plot the loadings

cbind.data.frame(PA,PB,PAB) %>% 
  add_column(species = paste0("s",1:15)) %>% 
  mutate(species = factor(species, levels = unique(species))) %>% 
  pivot_longer(starts_with("p")) %>% 
  mutate(name = factor(name, levels = c("PA1","PA2","PB","PAB1","PAB2"))) %>% 
  ggplot() + 
  geom_col(aes(x = species, y = value, fill = name)) +
  facet_wrap(~name, ncol = 1) + 
  theme_bw() +
  theme(aspect.ratio = 0.3)

Now we create the matrices containing the expected values (remember that we are working on the linear predictor space)

A <- TA%*%t(PA)
B <- TB%*%t(PB)
AB <- TAB%*%t(PAB)

To reduce the number of zeroes we create a matrix with the grand mean for the different species.

## Finally we need the grand mean
mu <- rep(1, ntreat*ntime*nrepl) %*% t(c(2, 3, 4, 5, 6, 6, 5, 4, 3, 2, 6, 5, 8, 5, 3))

Now we combine everything, back transform values in the “count” space and add poisson noise.

## we add everything up and make the inverse link
logy <- mu + A + B + AB 
y <- exp(logy)

ycount <- apply(y,2, function(c) rpois(n = length(c), lambda = c))

To finish with the synthetic dataset we need to prepare the dataframe with the design.

d <- data.frame(time = rep(c("t0","t1","t2","t3","t4"), each = 30),
                treatment = rep(rep(c("B1","B2","B3"), each = 10), times = 5))
colnames(ycount) <- paste0("s",1:15)


synth_count_data <- list(counts = ycount, design = d)

Save the data in the package folder.

save(synth_count_data, file = "../../data/synth_count_data.RData", version = 2)

Figures

We finally use patchwork to produce the figures for the paper.

figure_1a <- cbind.data.frame(PA,PB,PAB) %>% 
  add_column(species = paste0("s",1:15)) %>% 
  mutate(species = factor(species, levels = unique(species))) %>% 
  pivot_longer(starts_with("p")) %>% 
  mutate(name = factor(name, levels = c("PA1","PA2","PB","PAB1","PAB2"),
                       labels = c("time_LV1", "time_LV2", "treatment_LV1", 
                                  "time:treat_LV1","time:treat_LV2" ))) %>% 
  ggplot() + 
  geom_col(aes(x = species, y = value), fill = "steelblue") +
  facet_wrap(~name, ncol = 1) + 
  ggtitle("Loadings") + 
  theme_bw() +
  theme(aspect.ratio = 0.3, 
        legend.position = "none",
        axis.text.x = element_text(angle = 45, vjust = 1, hjust=1))

figure_1b <- cbind.data.frame(TA,B_LV1 = TB,TAB) %>% 
  add_column(time = rep(c("t0","t1","t2","t3","t4"), each = 30)) %>% 
  add_column(treatment = rep(rep(c("B1","B2","B3"), each = 10), times = 5)) %>% 
  pivot_longer(contains("LV")) %>% 
  mutate(name = factor(name, levels = c("A_LV1","A_LV2","B_LV1","AB_LV1","AB_LV2"),
                       labels = c("time_LV1", "time_LV2", "treatment_LV1", 
                                  "time:treat_LV1","time:treat_LV2" ))) %>%
  ggplot() + 
  geom_point(aes(x = time, y = value, col = treatment), 
             position=position_jitterdodge(dodge.width = 0.1, jitter.width = 0.1), alpha = 0.5, size = 2) + 
  geom_line(aes(x = time, y = value, col = treatment, group = treatment, lty = treatment),
            position=position_jitterdodge(dodge.width = 0.1, jitter.width = 0.1)) + 
  facet_wrap(~name, ncol = 1) + 
  scale_color_brewer(palette = "Set1") + 
  ggtitle("LVs") + 
  theme_bw() + 
  theme(aspect.ratio = 0.3,
        legend.position = "left")

Join the figure with patchwork