Introduction
I was a little bothered by the manual parameter tuning in my random forest classification project, so I decided to look into automated techniques. Returning to that old project, I set out to find better methods of improving the model. Grid search turned out to be an effective one. K-fold cross validation was also used, but only to see how wildly the accuracy can vary from one training run to the next.
Code
The code from my old project was kept exactly the same. The only difference is the addition of the following code blocks:
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
mean = accuracies.mean()  # average accuracy across the 10 folds
std = accuracies.std()    # fold-to-fold spread in accuracy
Here, k-fold cross validation ran the model over 10 folds of the training set, and both the mean and standard deviation of the fold accuracies were gathered.
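To see how wild that fold-to-fold variation actually is, the individual fold scores can be printed next to the summary statistics. A minimal sketch, assuming the accuracies array from the block above:

# Inspect the spread across the 10 folds
for i, acc in enumerate(accuracies, start=1):
    print(f"Fold {i:2d}: accuracy = {acc:.4f}")
print(f"Mean accuracy: {mean:.4f} (+/- {std:.4f})")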
The results were good, but I wanted to see if they could be better. This is where grid search came into play.
# Applying Grid Search to find the best model and the best parameters
from sklearn.model_selection import GridSearchCV
parameters = [{'n_estimators': [1, 5, 10, 50],
               'criterion': ['entropy', 'gini'],
               'max_depth': [1, 5, 10, 50],
               'bootstrap': [True, False],
               'warm_start': [True, False]}]
grid_search = GridSearchCV(estimator = classifier,
param_grid = parameters,
scoring = 'accuracy',
cv = 10,
n_jobs = -1)
grid_search = grid_search.fit(X_train, y_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_
In addition to the parameters I normally change, such as the number of trees, I decided to explore the arguments that I usually leave out. My reasoning for each argument was as follows (see the sketch after the list for what this grid adds up to):
- n_estimators - the number of trees in the forest, and the parameter I normally tune by hand
- criterion - the function used to measure the quality of a split, either entropy (information gain) or Gini impurity
- max_depth - limits how deep each tree can grow, controlling how much detail each tree captures
- bootstrap - whether each tree is trained on a bootstrap sample of the training data or on the full dataset
- warm_start - when True, reuses the trees from a previous fit and adds new estimators instead of rebuilding the forest
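To get a sense of how much work this grid implies, it can be expanded with scikit-learn's ParameterGrid. A minimal sketch, assuming the parameters list defined above:

# Expand the grid to count the candidate combinations
from sklearn.model_selection import ParameterGrid
candidates = list(ParameterGrid(parameters))
print(len(candidates))       # 4 * 2 * 4 * 2 * 2 = 128 combinations
print(len(candidates) * 10)  # with cv = 10, that is 1280 model fits

Every one of those fits is an independent forest, which is why setting n_jobs = -1 to use all cores is worthwhile.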
Results
Grid search is a marvelous optimization tool. According to the results, the parameters that I normally would've left out actually played a role in finding a more accurate model. Although the improvement was great, the accuracy crept close to values that suggest overfitting, so the results were taken with a grain of salt.
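One way to sanity-check that overfitting concern is to compare the cross-validated score against the held-out test set. A minimal sketch, assuming the X_test and y_test split from the original project:

# Compare cross-validated accuracy with held-out test accuracy
from sklearn.metrics import accuracy_score
best_model = grid_search.best_estimator_  # refit on the full training set by default
test_accuracy = accuracy_score(y_test, best_model.predict(X_test))
print(f"CV accuracy:   {best_accuracy:.4f}")
print(f"Test accuracy: {test_accuracy:.4f}")

A large gap between the two numbers would confirm that the grid has tuned itself to the cross-validation folds rather than to the underlying data.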