A tibble containing the passengers of the Titanic along with their fate during the sinking.
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
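For reference, the message above is the kind readr prints from a call along these lines; this is a minimal sketch, assuming the data sits in a local `train.csv` (the usual Kaggle Titanic training file name, not given in the text):

```r
library(readr)

# Assumed file name: "train.csv" is the Kaggle Titanic training file.
titanic <- read_csv("train.csv")

# Declaring show_col_types = FALSE quiets the message shown above.
titanic <- read_csv("train.csv", show_col_types = FALSE)
```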
Figure: Titanic survival decision tree. Each node shows the predicted class (died or survived), the predicted probability of survival, and the percentage of observations in the node.
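A tree like this can be grown and drawn with rpart and rpart.plot; the call below is a minimal sketch, and the choice of predictors is an assumption for illustration, not necessarily the exact model behind the figure:

```r
library(rpart)
library(rpart.plot)

# Sketch: fit a classification tree for survival. The predictor set is
# an assumption for illustration.
fit <- rpart(factor(Survived) ~ Sex + Age + Pclass + SibSp + Parch + Fare,
             data = titanic, method = "class")

rpart.plot(fit)  # draw the tree with per-node class information
```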
Pros
Cons
In a decision tree, the splitting variables and the split points are chosen so as to maximize the “purity” of the resulting nodes.
A perfect split separates the classes without any misclassification, leaving absolutely “pure” groups.
Among the different criteria used to define the impurity of a node, two stand out: the Gini index

\[G = \sum_{k=1}^{K} p_{k}\,(1 - p_{k})\]

and the entropy

\[D = -\sum_{k=1}^{K} p_{k} \log p_{k}\]

where \(p_{k}\) is the fraction of samples of class \(k\) in the node.
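Both criteria take only a couple of lines of R; `gini` and `entropy` below are hypothetical helper names used here for illustration:

```r
# Hypothetical helpers (names not from the text): impurity of a node,
# given the vector p of class proportions; natural log, as in the formula.
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))

gini(c(1, 0))         # 0: a perfectly pure node has zero impurity
gini(c(0.5, 0.5))     # 0.5: a 50/50 node is maximally impure
entropy(c(0.5, 0.5))  # log(2) ~ 0.693, the two-class maximum
```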
As we discussed, decision trees have many good properties, but they show high variance.
A good way to circumvent this limitation is to construct a large set of trees (build a Forest!) and average their predictions to reduce the variability.
To do that we need to somehow perturb our dataset, because trees built repeatedly on the same data would all be identical.
The bootstrap, that is, resampling the observations with replacement (see the definition from Wikipedia), gives us exactly this kind of perturbation.
This mixture of bootstrap and averaging is technically called bagging (short for bootstrap aggregating).
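A hand-rolled version of bagging fits in a few lines; this is a sketch built on rpart, with the seed, the number of trees `B`, and the predictor set all chosen arbitrarily for illustration (packages such as randomForest automate this, and more):

```r
library(rpart)

set.seed(42)  # arbitrary seed, only to make the resamples reproducible
B <- 100      # number of bootstrap trees, chosen arbitrarily
n <- nrow(titanic)

# Grow B trees, each on a bootstrap resample of the rows, and collect
# their predicted survival probabilities on the original data.
probs <- sapply(seq_len(B), function(b) {
  boot <- titanic[sample(n, n, replace = TRUE), ]
  tree <- rpart(factor(Survived) ~ Sex + Age + Pclass + Fare,
                data = boot, method = "class")
  predict(tree, newdata = titanic, type = "prob")[, "1"]
})

bagged_prob <- rowMeans(probs)               # average over the "forest"
bagged_pred <- as.integer(bagged_prob > 0.5) # majority-style vote
```

Averaging the per-tree probabilities, as done here, is one common aggregation rule; taking a majority vote over the per-tree class predictions is an equivalent-in-spirit alternative.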