Decision Trees

Titanic Data

A tibble containing the passengers of the Titanic along with their fate in the disaster

## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
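
A minimal sketch of how the data could be loaded with readr (the file name "titanic.csv" is an assumption; the message above is the column specification that read_csv prints by default):

    # load the passenger data (hypothetical file name)
    library(readr)

    titanic <- read_csv("titanic.csv")
    # read_csv("titanic.csv", show_col_types = FALSE) would silence the message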

Decision Trees

Titanic survival decision tree

Each node shows:

  • the predicted class (died or survived),
  • the predicted probability of survival,
  • the percentage of observations in the node.
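
A possible way to grow and draw such a tree with rpart and rpart.plot (the formula, the relabelled Fate column and the plotting options are choices made for this sketch, not necessarily those behind the figure):

    library(rpart)
    library(rpart.plot)

    # label the outcome as in the figure and treat Sex as categorical
    titanic$Fate <- factor(ifelse(titanic$Survived == 1, "survived", "died"))
    titanic$Sex  <- factor(titanic$Sex)

    # grow a classification tree for the fate of the passengers
    tree <- rpart(Fate ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                  data = titanic, method = "class")

    # extra = 106: show the probability of the second class ("survived")
    # plus the percentage of observations falling in each node
    rpart.plot(tree, extra = 106)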

Pros and Cons

Pros

  • Trees are versatile (they can combine different types of variables)
  • No problems with variable scaling
  • Interpretable by definition

Cons

  • High variance, i.e. they do not always generalize well to the test set …

Selecting the split point: impurity

In a decision tree, the order of the variables and the split points are selected to maximize the “purity” of the resulting nodes.

A perfect split would separate the classes without any misclassification, leaving us with absolutely “pure” groups.

Among the different criteria used to define the impurity of a node, two stand out:

  • the Gini index: \(\sum_{i\neq{j}}p_{i}p_{j}\)
  • the entropy: \(-\sum_{j}p_{j}\log(p_{j})\)

where \(p_{j}\) is the fraction of samples of class \(j\) in the node
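
As a small illustration, both measures can be written as short R functions of the vector of class fractions in a node (a sketch, not the implementation used by rpart):

    # p: vector of class fractions in a node (summing to 1)
    gini    <- function(p) 1 - sum(p^2)       # equals sum_{i != j} p_i p_j when sum(p) = 1
    entropy <- function(p) -sum(p * log(p))

    # a node with 70% "died" and 30% "survived"
    gini(c(0.7, 0.3))     # 0.42
    entropy(c(0.7, 0.3))  # about 0.61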

Best Split on Age
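
A sketch of how the best split on Age could be found by hand, scanning all observed ages and keeping the threshold with the lowest weighted Gini impurity (the helper split_gini is purely illustrative and reuses the titanic tibble loaded above):

    # weighted Gini impurity of the two groups obtained by splitting Age at t
    split_gini <- function(t, age, y) {
      node_gini <- function(g) 1 - sum(prop.table(table(g))^2)
      left  <- y[age <  t]
      right <- y[age >= t]
      (length(left) * node_gini(left) + length(right) * node_gini(right)) / length(y)
    }

    ok   <- !is.na(titanic$Age)
    ages <- sort(unique(titanic$Age[ok]))
    imp  <- sapply(ages, split_gini, age = titanic$Age[ok], y = titanic$Survived[ok])
    ages[which.min(imp)]   # Age threshold giving the lowest impurity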

Random Forest

As we discussed, decision trees have many good properties, but they show high variance.

A good idea to circumvent this limitation could be to construct a large set of trees (build a Forest!) and to average their predictions to reduce variability.

To do that, we need to somehow perturb our dataset, because building different trees on the same data would not change the results:

  • perturb the dataset by bootstrapping
  • use only a subset of the variables at each split
  • build a large number of simple trees and “average” their outcomes

The majority takes all

(figure from Wikipedia)

Bagging

This mixture of bootstrapping and averaging is technically called bagging (bootstrap aggregating).
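
To make the idea concrete, here is a hand-made bagging sketch built on rpart (the number of trees, the formula and the removal of missing ages are arbitrary choices; randomForest does all of this, plus the per-split variable subsetting, internally):

    library(rpart)

    set.seed(1)
    dat <- titanic[!is.na(titanic$Age), ]   # keep complete ages for simplicity
    dat$Sex <- factor(dat$Sex)              # treat Sex as categorical
    n_trees <- 100

    # grow each tree on a bootstrap resample of the data
    forest <- lapply(seq_len(n_trees), function(i) {
      boot <- dat[sample(nrow(dat), replace = TRUE), ]
      rpart(factor(Survived) ~ Pclass + Sex + Age + Fare,
            data = boot, method = "class")
    })

    # average the predicted survival probabilities and take the majority vote
    probs  <- sapply(forest, function(tr) predict(tr, dat)[, "1"])
    bagged <- as.integer(rowMeans(probs) > 0.5)
    mean(bagged == dat$Survived)            # (optimistic) training-set accuracy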

Most Relevant Model Parameters

  • ntree: Number of trees to grow. Larger number of trees produce more stable models and covariate importance estimates, but require more memory and a longer run time. For small datasets, 50 trees may be sufficient. For larger datasets, 500 or more may be required.
  • mtry: Number of variables available for splitting at each tree node. For classification models, the default is the square root of the number of predictor variables (rounded down). For regression models, it is the number of predictor variables divided by 3 (rounded down).
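
A sketch of a fit with the randomForest package, with the two parameters spelled out (the chosen values and the formula are only examples):

    library(randomForest)

    set.seed(1)
    dat <- titanic[!is.na(titanic$Age), ]   # randomForest does not accept missing values
    dat$Sex <- factor(dat$Sex)              # treat Sex as a factor (categorical)

    rf <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare,
                       data  = dat,
                       ntree = 500,   # number of trees to grow
                       mtry  = 2)     # variables tried at each split; default floor(sqrt(6)) = 2
    rf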

Advantages

  • Error estimate: due to bootstrapping, some of the samples are left out of each tree (technically, they are out-of-bag), so an error estimate on an independent set of samples comes for free
  • Interpretability: even if in a less simple way than for a single decision tree, the importance of each variable can still be assessed
  • Low sensitivity to model parameters: in practice RF appears to be robust to changes in its settings
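
Both points can be checked directly on a fitted randomForest object such as the rf sketched above:

    # out-of-bag error estimate, computed on the samples left out by each bootstrap
    rf$err.rate[rf$ntree, "OOB"]

    # variable importance (mean decrease in Gini impurity) and its plot
    importance(rf)
    varImpPlot(rf)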