While working on the Kaggle “House Prices” challenge, I came across a neat technique called “out-of-bag” (OOB) evaluation. It’s a way to estimate a Random Forest’s accuracy without setting aside extra data, similar in spirit to cross-validation.
The idea: a Random Forest is a collection of decision trees, and when each tree in the forest is built, only a portion of the training data is used. This leaves the rest of the training data as a mini test set for that specific tree. The table below illustrates how this works:

| Tree | Trained on (in-bag) | Held out (out-of-bag) |
| ------ | ---------------------------- | ---------------------- |
| Tree 1 | all houses except #3 | #3 |
| Tree 2 | all houses except #2, #4, #6 | #2, #4, #6 |
| Tree 3 | all houses except #1, #5 | #1, #5 |

In this example, the training data has 6 houses, from which a Random Forest of 3 trees is built. Each tree is trained on 6 examples sampled from the training data with replacement, so some houses appear multiple times while others are left out entirely (a short code sketch after the list below shows this sampling in action):
- Tree 1: Uses all houses except house #3, so we can test the tree on house #3.
- Tree 2: Uses all houses except house #2, #4, and #6, so we can test the tree on house #2, #4, and #6.
- Tree 3: Uses all houses except house #1 and #5, so we can test the tree on house #1 and #5.
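To make the sampling concrete, here’s a minimal sketch of how drawing with replacement leaves some rows out-of-bag. This is an illustrative simulation, not Random Forest internals: the row indices are 0-based and the seed is arbitrary, so the exact in-bag/out-of-bag split won’t match the table above.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for reproducibility
n_houses = 6

for tree_id in (1, 2, 3):
    # Draw 6 row indices with replacement, as when building each tree.
    in_bag = rng.choice(n_houses, size=n_houses, replace=True)
    # Rows that were never drawn become that tree's out-of-bag test set.
    oob = sorted(set(range(n_houses)) - set(in_bag))
    print(f"Tree {tree_id}: in-bag rows {sorted(in_bag)}, out-of-bag rows {oob}")
```

On average, each bootstrap sample misses about a third of the rows (the familiar 1/e ≈ 36.8% figure), so every tree ends up with a decent-sized held-out set.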
This way, every house gets predicted by trees that never saw it during training; averaging those held-out predictions per house and scoring them gives us a reliable estimate of how well our model will do in the real world.
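In practice you rarely compute this by hand. Here’s a minimal sketch using scikit-learn, where passing `oob_score=True` makes the forest score itself on each row’s out-of-bag predictions after fitting (the synthetic dataset is just a stand-in for the actual House Prices data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data as a stand-in for the House Prices training set.
X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=42)

# oob_score=True tells the forest to evaluate each row using only the
# trees that did not see that row during training.
model = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
model.fit(X, y)

print(f"OOB R^2: {model.oob_score_:.3f}")  # no separate validation split needed
```

Because the OOB estimate comes for free from data each tree already skipped, you get to keep the full training set for fitting instead of carving out a validation fold.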
Hope this article helped! Any feedback is appreciated!