Decision Trees

Titanic Data

A tibble containing the passengers of the Titanic along with their fate in the disaster

## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
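
A minimal sketch of how the data could be loaded with readr (the file name "titanic.csv" is an assumption; the message above is the column specification that read_csv prints by default):

    # load the passenger data (hypothetical file name)
    library(readr)

    titanic <- read_csv("titanic.csv")
    # read_csv("titanic.csv", show_col_types = FALSE) would silence the message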

Decision Trees

Titanic survival decision tree

Each node shows:

  • the predicted class (died or survived),
  • the predicted probability of survival,
  • the percentage of observations in the node.
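
A possible way to grow and draw such a tree with rpart and rpart.plot (the formula, the relabelled Fate column and the plotting options are choices made for this sketch, not necessarily those behind the figure):

    library(rpart)
    library(rpart.plot)

    # label the outcome as in the figure and treat Sex as categorical
    titanic$Fate <- factor(ifelse(titanic$Survived == 1, "survived", "died"))
    titanic$Sex  <- factor(titanic$Sex)

    # grow a classification tree for the fate of the passengers
    tree <- rpart(Fate ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                  data = titanic, method = "class")

    # extra = 106: show the probability of the second class ("survived")
    # plus the percentage of observations falling in each node
    rpart.plot(tree, extra = 106)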

Pros and Cons

Pros

  • Trees are versatile (they can combine different types of variables)
  • No problems with variable scaling
  • Interpretable by definition

Cons

  • High variance, i.e. they do not always generalize well to the test set …

Selecting the split point: impurity

In a decision tree, the order of the variables and the split points are selected to maximize the “purity” of the resulting nodes.

A perfect split would separate the classes without any misclassification, leaving us with absolutely “pure” groups.

Among the different criteria used to define the impurity of a node, two stand out:

  • the Gini index: \(\sum_{i\neq{j}}p_{i}p_{j}\)
  • the entropy: \(-\sum_{j}p_{j}\log(p_{j})\)

where \(p_{j}\) is the fraction of samples of class \(j\) in the node
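
As a small illustration, both measures can be written as short R functions of the vector of class fractions in a node (a sketch, not the implementation used by rpart):

    # p: vector of class fractions in a node (summing to 1)
    gini    <- function(p) 1 - sum(p^2)       # equals sum_{i != j} p_i p_j when sum(p) = 1
    entropy <- function(p) -sum(p * log(p))

    # a node with 70% "died" and 30% "survived"
    gini(c(0.7, 0.3))     # 0.42
    entropy(c(0.7, 0.3))  # about 0.61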

Best Split on Age
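
A sketch of how the best split on Age could be found by hand, scanning all observed ages and keeping the threshold with the lowest weighted Gini impurity (the helper split_gini is purely illustrative and reuses the titanic tibble loaded above):

    # weighted Gini impurity of the two groups obtained by splitting Age at t
    split_gini <- function(t, age, y) {
      node_gini <- function(g) 1 - sum(prop.table(table(g))^2)
      left  <- y[age <  t]
      right <- y[age >= t]
      (length(left) * node_gini(left) + length(right) * node_gini(right)) / length(y)
    }

    ok   <- !is.na(titanic$Age)
    ages <- sort(unique(titanic$Age[ok]))
    imp  <- sapply(ages, split_gini, age = titanic$Age[ok], y = titanic$Survived[ok])
    ages[which.min(imp)]   # Age threshold giving the lowest impurity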

Random Forest

As we discussed, decision trees have many good properties, but they show high variance.

A good idea to circumvent this limitation could be to construct a large set of trees (build a Forest!) and to average their predictions to reduce variability.

To do that, we need to somehow perturb our dataset, because building different trees on the same data would not change the results:

  • perturb the dataset by bootstrapping
  • use only a subset of the variables at each split
  • build a large number of simple trees and “average” their outcomes

The majority takes all

(figure from Wikipedia)

Bagging

This mixture of bootstrapping and averaging is technically called bagging (bootstrap aggregating).
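
To make the idea concrete, here is a hand-made bagging sketch built on rpart (the number of trees, the formula and the removal of missing ages are arbitrary choices; randomForest does all of this, plus the per-split variable subsetting, internally):

    library(rpart)

    set.seed(1)
    dat <- titanic[!is.na(titanic$Age), ]   # keep complete ages for simplicity
    dat$Sex <- factor(dat$Sex)              # treat Sex as categorical
    n_trees <- 100

    # grow each tree on a bootstrap resample of the data
    forest <- lapply(seq_len(n_trees), function(i) {
      boot <- dat[sample(nrow(dat), replace = TRUE), ]
      rpart(factor(Survived) ~ Pclass + Sex + Age + Fare,
            data = boot, method = "class")
    })

    # average the predicted survival probabilities and take the majority vote
    probs  <- sapply(forest, function(tr) predict(tr, dat)[, "1"])
    bagged <- as.integer(rowMeans(probs) > 0.5)
    mean(bagged == dat$Survived)            # (optimistic) training-set accuracy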

Most Relevant Model Parameters

  • ntree: Number of trees to grow. Larger number of trees produce more stable models and covariate importance estimates, but require more memory and a longer run time. For small datasets, 50 trees may be sufficient. For larger datasets, 500 or more may be required.
  • mtry: Number of variables available for splitting at each tree node. For classification models, the default is the square root of the number of predictor variables (rounded down). For regression models, it is the number of predictor variables divided by 3 (rounded down).
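
A sketch of a fit with the randomForest package, with the two parameters spelled out (the chosen values and the formula are only examples):

    library(randomForest)

    set.seed(1)
    dat <- titanic[!is.na(titanic$Age), ]   # randomForest does not accept missing values
    dat$Sex <- factor(dat$Sex)              # treat Sex as a factor (categorical)

    rf <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                       data  = dat,
                       ntree = 500,   # number of trees to grow
                       mtry  = 2)     # variables tried at each split; default floor(sqrt(6)) = 2
    rf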

Advantages

  • Error estimate: due to bootstrapping, some of the samples are left out of each tree (technically, they are out-of-bag), so an error estimate on an independent set of samples comes for free
  • Interpretability: even if in a less simple way than for a single decision tree, the importance of each variable can still be assessed
  • Low sensitivity to model parameters: in practice RF appears to be robust to changes in its settings
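
Both points can be checked directly on a fitted randomForest object such as the rf sketched above:

    # out-of-bag error estimate, computed on the samples left out by each bootstrap
    rf$err.rate[rf$ntree, "OOB"]

    # variable importance (mean decrease in Gini impurity) and its plot
    importance(rf)
    varImpPlot(rf)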