Image from https://en.wikipedia.org/wiki/Random_forest
In this post we will walk through using a random forest for binary classification in Python with scikit-learn.
We use a very simple CSV as input. In real life you will have many more columns and more complex data.
height,weight,person
80,40,child
70,30,child
50,10,child
180,80,adult
170,80,adult
185,80,adult
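The rest of the post reads from a file named input.csv. If you want a fully self-contained script, you can create that file from the sample above first, for example:

csv_text = """height,weight,person
80,40,child
70,30,child
50,10,child
180,80,adult
170,80,adult
185,80,adult
"""
with open("input.csv", "w") as f:
    f.write(csv_text)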
First we load the CSV into a data frame and print its head.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
# widen pandas printing so frames are not truncated
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
df = pd.read_csv("input.csv")
print(df.head(5))
The random forest we use below works with floats, both for the features and for the labels, so we convert the person column to an integer label:
def convert_to_int(row):
    if row['person'] == 'adult':
        return 1
    return 0

df['is_adult'] = df.apply(convert_to_int, axis=1)
df.drop(labels=['person'], axis=1, inplace=True)
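As an aside, the apply above can be replaced with a single vectorized expression (run it before the drop, since it needs the person column). For a two-class column like ours the two are equivalent:

df['is_adult'] = (df['person'] == 'adult').astype(int)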
Next we split the data into training and testing sets:
labels = np.array(df['is_adult'])
features = df.drop('is_adult', axis=1)
feature_list = list(features.columns)
features = np.array(features)
train_features, test_features, train_labels, test_labels = \
    train_test_split(features,
                     labels,
                     test_size=0.25,
                     random_state=42,
                     )
print('train features shape {} labels shape {}'.format(
    train_features.shape, train_labels.shape))
print('test features shape {} labels shape {}'.format(
    test_features.shape, test_labels.shape))
with np.printoptions(threshold=np.inf):
    print(train_features)
    print(train_labels)
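With only six rows, a purely random split can easily put all the adults (or all the children) on one side. Passing stratify keeps the class proportions the same in both parts; a minimal sketch, assuming the same variables as above:

train_features, test_features, train_labels, test_labels = \
    train_test_split(features, labels,
                     test_size=0.25,
                     random_state=42,
                     stratify=labels)  # preserve the adult/child ratio in both splits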
Let's examine a dummy model as a baseline. This model always guesses that we have a child, never an adult.
baseline_predictions = np.full(test_labels.shape, 0)
baseline_errors = abs(baseline_predictions - test_labels)
with np.printoptions(threshold=np.inf):
    print("baseline predictions", baseline_predictions)
    print("baseline errors", baseline_errors)
print('error baseline {}'.format(
    round(np.mean(baseline_errors), 3)))
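scikit-learn also ships a ready-made baseline estimator that does the same thing; a sketch, assuming the split from above:

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='constant', constant=0)  # always predict child
dummy.fit(train_features, train_labels)
print('baseline accuracy {}'.format(dummy.score(test_features, test_labels)))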
Now let's create the random forest itself and check its error rate. Note that we use RandomForestRegressor: it predicts a float between 0 and 1, which we threshold at 0.5 to turn into a class.
forest = RandomForestRegressor(n_estimators=1000, random_state=42)
forest.fit(train_features, train_labels)
predictions = forest.predict(test_features)
prediction_threshold = 0.5
predictions[predictions < prediction_threshold] = 0
predictions[predictions >= prediction_threshold] = 1
with np.printoptions(threshold=np.inf):
    print(predictions)
prediction_errors = predictions - test_labels
print('error for test {}'.format(
    round(np.mean(abs(prediction_errors)), 3)))
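Since this is really a classification problem, the more direct route is RandomForestClassifier, which returns class labels (and class probabilities) without any manual thresholding. A sketch under the same variables:

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, random_state=42)
clf.fit(train_features, train_labels)
class_predictions = clf.predict(test_features)          # 0 or 1 directly
class_probabilities = clf.predict_proba(test_features)  # per-class probabilities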
We can check the importance of each feature in the model:
importances = list(forest.feature_importances_)
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
for pair in feature_importances:
    print('variable: {} importance: {}'.format(*pair))
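The same ranking can be expressed more compactly with a pandas Series, which pairs names and values for you:

importance_series = pd.Series(forest.feature_importances_, index=feature_list)
print(importance_series.sort_values(ascending=False))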
Lastly, we can count the true/false positives and negatives:
# pair up (prediction, true label) per test row
joined = np.stack((predictions, test_labels), axis=1)
tp = joined[np.where(
    (joined[:, 0] == 1) &
    (joined[:, 1] == 1)
)]
tn = joined[np.where(
    (joined[:, 0] == 0) &
    (joined[:, 1] == 0)
)]
fp = joined[np.where(
    (joined[:, 0] == 1) &
    (joined[:, 1] == 0)
)]
fn = joined[np.where(
    (joined[:, 0] == 0) &
    (joined[:, 1] == 1)
)]
print('true positive {}'.format(np.shape(tp)[0]))
print('true negative {}'.format(np.shape(tn)[0]))
print('false positive {}'.format(np.shape(fp)[0]))
print('false negative {}'.format(np.shape(fn)[0]))
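scikit-learn can compute the same counts in one call; a sketch, assuming the predictions and labels from above:

from sklearn.metrics import confusion_matrix

# rows are true classes, columns are predicted classes: [[tn, fp], [fn, tp]]
print(confusion_matrix(test_labels, predictions))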