Kindle Notes & Highlights
by Chris Smith
Read between June 24 - June 27, 2021
It's natural for us to fall into set mindsets or paths, and decision trees are a great tool that can help us step outside of these paths and consider new ideas or circumstances.
Decision trees fall within the supervised learning category, which means the algorithm is trained on data that contains labelled examples.
How does a decision tree decide which attribute (node) to split on? The answer involves two major concepts: entropy and information gain.
Entropy can be defined in many ways, but in our context it simply measures the impurity of a group of examples.
"If a particular node is split, does it increase or decrease entropy?" To answer this question, the formula measures the difference in entropy before the split and after the split, and then analyzes the result.
The larger the information gain, the lower the impurity, and that is the split the decision tree algorithm always selects. Remember, the algorithm's goal is to predict either "yes - watch a movie" or "no - do not watch a movie", and so it always selects the route that will help it arrive at its goal the fastest.
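To make these two ideas concrete, here is a minimal Python sketch (my own illustration, not code from the book); the entropy and information_gain helpers and the toy "watch a movie" labels are assumptions for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Impurity of a group of examples: 0 when the group is pure, 1 (for binary labels) when evenly mixed."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent_labels, child_groups):
    """Entropy before the split minus the weighted entropy after the split."""
    total = len(parent_labels)
    after = sum(len(group) / total * entropy(group) for group in child_groups)
    return entropy(parent_labels) - after

# Toy "watch a movie?" labels, split on some attribute into two child nodes
parent = ["yes", "yes", "yes", "no", "no", "no"]
children = [["yes", "yes", "yes"], ["no", "no", "no"]]   # a perfectly clean split
print(information_gain(parent, children))                # 1.0, the maximum possible gain here
```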
A greedy algorithm is a mathematical process that looks for fast solutions to complex problems. A decision tree with hundreds of attributes and subsets is an excellent example of a complex problem that cannot be easily (or quickly) solved.
Practically speaking, this means that a decision tree algorithm selects the best attribute to split on but does not backtrack to test other possibilities or reconsider other scenarios.
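As a sketch of that greedy behaviour (again my own illustration, reusing the hypothetical entropy and information_gain helpers from the sketch above): at each node the algorithm scores every available attribute, takes the single best one, and never revisits the choice.

```python
def best_attribute(rows, labels, attributes):
    """Greedily pick the attribute whose split gives the highest information gain."""
    def gain_for(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)   # group labels by the attribute's value
        return information_gain(labels, list(groups.values()))
    return max(attributes, key=gain_for)                      # chosen once, never backtracked

# Toy movie data: "category" separates the labels perfectly, so it is chosen and "year" is never reconsidered
rows = [{"category": "action", "year": "new"},
        {"category": "action", "year": "old"},
        {"category": "drama",  "year": "new"},
        {"category": "drama",  "year": "old"}]
labels = ["yes", "yes", "no", "no"]
print(best_attribute(rows, labels, ["category", "year"]))     # -> "category"
```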
Overfitting occurs when an algorithm is trained so tightly on specific examples that it cannot correctly understand and work with new examples that are introduced to it.
The ability to work with new test examples and correctly classify them is technically called "generalizing", and trees that overfit cannot generalize well.
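A hedged illustration of overfitting versus generalizing, using scikit-learn and a synthetic dataset (both are my assumptions, not material from the book): an unrestricted tree typically fits its training examples perfectly while scoring lower on held-out test examples than a shallower tree.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a labelled training set
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth in (None, 3):  # None lets the tree grow until it memorizes the training examples
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={tree.score(X_train, y_train):.2f}, "
          f"test={tree.score(X_test, y_test):.2f}")
```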
A random forest is a machine learning algorithm that makes use of multiple decision trees to predict a result, and this collection of trees is often called an ensemble.
The trees are trained by selecting random examples from the original training dataset and recreating the training dataset for each tree. Technically, this is called bootstrapping.
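A minimal sketch of bootstrapping with NumPy (the array names and toy data are hypothetical): each tree's training set is drawn from the original data with replacement, so some rows appear more than once and others are left out.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_sample(X, y):
    """Resample rows with replacement to build one tree's training set."""
    idx = rng.integers(0, len(X), size=len(X))   # indices may repeat; unused rows are "out of bag"
    return X[idx], y[idx]

X = np.arange(10).reshape(5, 2)                  # five toy examples with two attributes
y = np.array([0, 1, 0, 1, 1])
X_boot, y_boot = bootstrap_sample(X, y)
```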
Random forests are unique in this respect because they have a built-in dataset that can be used to test accuracy while the model is being built. This is called the out-of-bag error estimate (also called OOB), and it is a method for testing the accuracy of random forests, boosted decision trees, and other machine learning algorithms.
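As a hedged example (the use of scikit-learn and the synthetic data are my assumptions): the out-of-bag estimate can be read directly from a random forest, since each tree is scored on the examples its bootstrap sample left out.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0).fit(X, y)
print(forest.oob_score_)   # accuracy estimated from out-of-bag examples, no separate test set required
```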
If the values of a single attribute are permuted (randomly shuffled) and the accuracy of the tree does not change much, this tells us that the attribute is not very important.
In contrast, if the accuracy does fluctuate, this tells us that the attribute has a certain level of importance.
Knowing this, you could choose to do the following: remove the "weak" performers from your dataset. In this case, you could start by removing "Year", then "Country", etc…, and see how it affects your accuracy and algorithm speed. You could also consider adding more attributes that are similar to the "strong performers". For example, what about the co-star? Or a secondary category?
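A sketch of this permutation check and the follow-up pruning of weak attributes (the dataset, library choice, and column indices are all my assumptions): shuffle one attribute's values, re-score the forest, and treat a small accuracy drop as a sign that the attribute could be removed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

rng = np.random.default_rng(0)
baseline = forest.score(X_test, y_test)
drops = []
for col in range(X_test.shape[1]):
    X_permuted = X_test.copy()
    rng.shuffle(X_permuted[:, col])                           # permute a single attribute's values
    drops.append(baseline - forest.score(X_permuted, y_test))
    print(f"attribute {col}: accuracy drop {drops[-1]:.3f}")  # large drop -> important attribute

# Drop the weakest attribute and retrain to see how accuracy (and speed) are affected
weak = int(np.argmin(drops))
X_train_small = np.delete(X_train, weak, axis=1)
X_test_small = np.delete(X_test, weak, axis=1)
smaller = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train_small, y_train)
print(f"without attribute {weak}: accuracy {smaller.score(X_test_small, y_test):.3f} vs baseline {baseline:.3f}")
```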