library(tidyverse)
library(randomForest)
The objective of this demo, and of the subsequent DIY, is to make you familiar RandomForest, by applying it to two classification tasks of metabolomics data.
These are the steps we will follow
A couple of preliminary messages
rfImpute
function in the package!). It would be for sure interesting to compare rf imputation with our approach.This we already know …
load("../Day2/KOMP_data_targeted.RData")
We have now three files
DM <- DM %>%
as.data.frame() %>%
column_to_rownames("CompoundID") %>%
t(.) %>%
as_tibble(rownames = "Sampname") %>%
separate(Sampname, c("One","MouseID"),"_") %>% ## split the column
select(-One) %>% ## remov the first piece
mutate(MouseID = as.numeric(MouseID))
myimputer <- function(v){
if (sum(is.na(v)) == 0) {
return(v)
} else {
napos <- which(is.na(v))
newval <- runif(length(napos),0,min(v, na.rm = TRUE))
out <- v
out[napos] <- newval
return(out)
}
}
Now we apply it to the full set of columns
DM_i <- DM %>%
mutate(across(starts_with("CPD"), ~myimputer(.x)))
I’m formatting the data to be used for a RandomForest analysis, so I should give a look to the help of the package to be sure my data are OK.
The absence of a coherent way to perform machine learning in R is a typical situation one have to face. If you are interested on a specific method this is not a huge problem, but when you would like to try different methods and check which is the best one the situation starts to be tricky. The community has (and is) developing several solutions to cope with that.
We need a df with the column with the factor we want to predict…
rf_df <- DM_i %>%
left_join(sample_meta_data) %>%
select(c(Gender,starts_with("CPD"))) %>%
mutate(Gender = factor(Gender))
The situation with the samples is the following
table(rf_df$Gender)
##
## Female Male
## 110 110
We need to split the data in training and test sets, doing it completely at random way could result in the unfortunate situation of having a test set composed mainly (if not exclusively!) of samples belonging to a single class.
To avoid that we need to use stratified sampling, a concept we already introduced in the first day of the course.
Here the situation is easy, since we have two perfectly balanced sample groups.
In general is preferred to split the data in training an test keeping more samples in the training set. This is done in order to be able to construct a model with less bias. More training samples will result in a model constructed with more data !
In our example we will partition the data in leaving the 10% of the samples in the test set.
set.seed(123)
## let's fo it with brute force
work_df <- rf_df %>%
nest(data = starts_with("CPD")) %>%
mutate(testid = map(data, ~sample(1:nrow(.x),11))) %>%
mutate(training = map2(data,testid, ~.x %>% slice(-.y))) %>%
mutate(test = map2(data,testid, ~.x %>% slice(.y)))
## now we build the training and test dfs
training_df <- work_df %>%
select(Gender,training) %>%
unnest(training)
test_df <- work_df %>%
select(Gender,test) %>%
unnest(test)
Now we are ready to train the RF
rfmodel <- randomForest(Gender~., data = training_df)
rfmodel
##
## Call:
## randomForest(formula = Gender ~ ., data = training_df)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 5.05%
## Confusion matrix:
## Female Male class.error
## Female 94 5 0.05050505
## Male 5 94 0.05050505
Let’s now dissect the output:
Type of random forest: classification
: tells that we are doing a classificationNumber of trees: 500
: this is the number of trees of our forest. 500 is the defaultNo. of variables tried at each split: 7
: this is the number of variables which is randomly taken into consideration at each split of each nodeOOB estimate of error rate
: since we do bootstrap on the samples, some of them are already left out (out-of-bag) and are used as an independent measure of the error rate on the training setConfusion Matrix
: this split up the error rate in the two classes showing the class-specific misclassification rateApparently our RF is working well! This should be not unexpected considering the results of the univariate analysis we performed on Day2.
Let’s look if the model parameters were correctly set.
For ntres
the training process itself can be used to check if it was necessary to increase the number of trees. The evolution of the OOB errors as the “forest” grows from 1 to 500 trees are stores inside the model
head(rfmodel$err.rate)
## OOB Female Male
## [1,] 0.14102564 0.20000000 0.07894737
## [2,] 0.09836066 0.08620690 0.10937500
## [3,] 0.12751678 0.11111111 0.14285714
## [4,] 0.12865497 0.09302326 0.16470588
## [5,] 0.16022099 0.14285714 0.17777778
## [6,] 0.12169312 0.10526316 0.13829787
Already from this visualization we can see that the error rate decreases as the forest grows. A plot can be helpful
matplot(rfmodel$err.rate, type = "l", col = c("black","red","steelblue"),
xlab = "Number of trees",
ylab = "Error Rate",
lwd = 2, lty = 1)
legend("topright", legend = colnames(rfmodel$err.rate), col = c("black","red","steelblue"), lty = 1)
As you can see the error rate stabilizes already with a forest of 150 trees. There is no point, then, to change the default value of this parameter in our model.
The testing of mtry
have to be done by hand
## create a testing tibble with the values
mtrytest <-tibble(mtry = 1:25) %>%
mutate(rfs = map(mtry, ~randomForest(Gender~., data = training_df, mtry = .x))) ## run the battery of models
now we have to get out the error rates and plot them
mtrytest %>%
mutate(err = map(rfs, ~.x$err.rate[500,])) %>%
unnest_wider(err) %>%
select(-rfs) %>%
pivot_longer(-mtry) %>%
ggplot(aes(x = mtry, y = value, col = name)) +
geom_line() +
geom_point() +
geom_vline(xintercept = 7, col = "red", lty = 2) +
theme_light()
Apparently the minimum of the error rate is obtained with mtry
set to 6
bestrf <- randomForest(Gender~., data = training_df, mtry = 6)
bestrf
##
## Call:
## randomForest(formula = Gender ~ ., data = training_df, mtry = 6)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 6.06%
## Confusion matrix:
## Female Male class.error
## Female 92 7 0.07070707
## Male 5 94 0.05050505
Now is time to look to the prediction on the test set
pred <- predict(bestrf,test_df)
And now we can calculate the confusion matrix
cm <- table(test_df$Gender, pred)
cm
## pred
## Female Male
## Female 11 0
## Male 0 11
If you look to these predictions the error rates are higher compared to what we observed for the training set. This is not strange, but remember that this is the fair way of evaluating the goodness of your model.
As we discusses at the beginning, we would also need an estimate of the variability in the prediction error, and to do that all our approach should be included in a repeated validation scheme.
The last step of our analysis is to identify the variables which are more important to differentiate males and females. To do that, we first construct a model on all the data (to have the minimum possible bias), and then we look to the variable importance in that model
full_model <- randomForest(Gender~., data = rf_df, mtry = 8)
full_model
##
## Call:
## randomForest(formula = Gender ~ ., data = rf_df, mtry = 8)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 5.45%
## Confusion matrix:
## Female Male class.error
## Female 105 5 0.04545455
## Male 7 103 0.06363636
full_model$importance %>%
as_tibble(rownames = "CompoundID") %>%
left_join(metabolite_meta) %>%
arrange(`MeanDecreaseGini`) %>%
mutate(CompoundName = factor(CompoundName, levels = CompoundName)) %>%
ggplot() +
geom_point(aes(x = CompoundName, y= `MeanDecreaseGini`, col = Assay), size = 2) +
geom_segment(aes(x = CompoundName, yend = `MeanDecreaseGini`, col = Assay, xend = CompoundName, y = 0)) +
scale_color_brewer(palette = "Set2") +
coord_flip() +
theme_light()
Which gives a picture slightly different from our old univariate analysis. Remember, however, that RF is a supervised multivariate classifier, not a univariate approach …