Random Forests are a powerful algorithm used for both classification (what group does it belong to) and regression (what is its value). It is a supervised algorithm meaning that it must be trained and tested with datasets where both the features (the parameters used to form the predictive model) and the targets (the actual values) are known. ‘Forest’ indicates a collection of trees. Initially a definition of trees will be provided, followed by why the trees are considered random.

The Trees in the Forest

Each tree represents a method for answering a question (what class does it belong to or what is its value). The trees can be short and squat or they can be tall and skinny. The height (or depth) of a tree is determined by how many questions are asked regarding the features before an answer is obtained about the target. The tip of the tree will be the first question asked in a series of questions that lead to our answer. Each question is binary (only two possible responses) and causes a fork in the tree. While driving towards a conclusion, binary questions are answered from the training data. There are several methods to decide which questions to ask (how to split the data). A basic principle should be to ask questions that bring us closer to a conclusion. Each time a question is answered, the training data is split into subsets. Within each subset it is possible to ask which feature is best able to divide the subset by its associated target values. If the original question was ‘is this a motorcycle’ and the subset ‘isVehicleTF’ == True, and ‘hasEngineTF’ == True has been reached, then the feature ‘paintColor’ won’t help divide the data further. The feature ‘numWheels’ on the other hand will. It can be observed that splitting the data at ‘numWheels’ <= 2 nicely divides the subset into groups where the targets of each sub-subset have low dispersion (meaning most the members of ‘numWheels’ <= 2 have the target ‘isMotorcycle’ == True and most the members of ‘numWheels’ > 2 have the target value ‘isMotorcycle’ == False).

A few of the methods for determining the questions used to split the data are discussed in this presentation of decision tree classifiers in R. The Python package Scikit-learn uses the Gini method by default (which attempts to minimize dispersion within each fork) and also has an entropy method available. A minimum leaf size (leaf is the lowest level of the tree or the last question) and maximum tree depth (number of questions) can be set in most implementations to avoid overfitting the data. These parameters are also used in most Random Forest implementations. If they are set at the forest level they apply to every tree in the forest.

What make a Forest Random?

There are a lot of nice things about decision trees — but they are prone to over-fitting the training data. They’ll ask whatever questions of the training data that help them get to conclusions — but those questions may or may not be relevant to the test data or to future predictions. To compensate for this, decision trees are often used in groups (ensembles). Each tree in the forest will get a vote (classification) or its prediction will be included in an average (regression).

The Random Subspace Method randomizes which features are included. This means that each tree in the forest goes through a training process but that it has access to different features so it will reach different conclusions. This process is important if one or more features are highly correlated with the target in the training data since it gives a voice to other features which may be important in a future dataset.

Bootstrap aggregating, or bagging, is similar except that instead of replacing some of the features, some (~37%) of the records are replaced by duplicates. (Statistical basis explained the linked articles). If one or more anomalous records are causing overfitting of the data those records won’t be in all of the trees so the aggregate is less likely to be affected. Both of these bagging techniques are done with replacement and the Random Forest algorithm uses both techniques.

How Does the Algorithm Actually Work?

To write a function to train a random forest a method can be written for the steps described above (trainDecisionTree, bootStrapAggregate, randomSubspace).

  • Decide on a number record sets and call bootStrapAggregate in a loop based on numRecordSets. The sizes of the modified datasets for each loop are unchanged but in each set about 37% of the records have been replaced randomly with duplicates.
  • Decide on a number of feature groupings (typically √p where p is the number of features). Within the bootStrapAggregate method we call randomSubspace (passing the bootStrapAggregate dataset) numFeatureGroups times. The sizes of the modified datasets are unchanged but in each ~37% of the features have been randomly replaced with duplicates.
  • For each of the (numRecordSets * numFeatureGroups) combinations we have a dataset that has been modified twice and we will call trainDecisionTree with that dataset.
  • At the end we will take the mode (classification) or the mean (regression) vote for all the trees as our prediction

What inputs should be passed to a Random Forest method?

It should be noted that insufficient data and/or poor data preparation are the most likely culprits when a model isn’t performing as desired. However, the parameters chosen can make a difference. Some of the hyperparameters that can be optimized are:

· Feature set groupings

· The number of record set groupings

· The minimum leaf size

· The maximum tree depth

· The methods for choosing how to split the data

If we code our own implementation we may even gain a greater diversity of trees by randomizing between multiple acceptable split models (gini, entropy, etc). However, Python or R both provide several Random Forest implementations that might be sufficient for most use cases.

If you have any questions about Random Forest algorithms or any other Data Science problem feel free to reach out on LinkedIn. I love hearing from fellow data scientists.