Wednesday, December 22, 2021

Using Random Forest in Python

 

image from https://en.wikipedia.org/wiki/Random_forest



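In this post we will review how to use a random forest for classification in Python. The code uses scikit-learn's RandomForestRegressor and thresholds its output at 0.5 to obtain a class.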


We use a very simple CSV as input. Real-world data sets usually have many more columns and messier data.



height,weight,person
80,40,child
70,30,child
50,10,child
180,80,adult
170,80,adult
185,80,adult
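
To try the steps without creating a file, the same table can be read from an in-memory string. This is only a sketch: the CSV_TEXT name is an illustrative assumption and does not appear in the rest of the post, which reads input.csv from disk.

```python
import io

import pandas as pd

# Hypothetical in-memory copy of the sample CSV shown above.
CSV_TEXT = """height,weight,person
80,40,child
70,30,child
50,10,child
180,80,adult
170,80,adult
185,80,adult
"""

# read_csv accepts any file-like object, so StringIO stands in for a file.
df = pd.read_csv(io.StringIO(CSV_TEXT))
print(df.head())
```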



First we load the CSV into a data frame and print its head.



import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

df = pd.read_csv("input.csv")
print(df.head(5))



The scikit-learn random forest expects numeric values for both the features and the labels. Hence we convert the person column to an integer label:



def convert_to_int(row):
    # 1 for adults, 0 for everyone else (children in this data set)
    if row['person'] == 'adult':
        return 1
    return 0


df['is_adult'] = df.apply(convert_to_int, axis=1)
df.drop(labels=['person'], axis=1, inplace=True)
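
The same conversion can also be written without a row-wise function. The following is a minimal sketch that rebuilds the sample data inline so it runs on its own; it is an alternative, not the method used in the rest of the post.

```python
import pandas as pd

# Sample data from the CSV above, rebuilt inline so the sketch is self-contained.
df = pd.DataFrame({
    "height": [80, 70, 50, 180, 170, 185],
    "weight": [40, 30, 10, 80, 80, 80],
    "person": ["child", "child", "child", "adult", "adult", "adult"],
})

# Comparing the whole column at once gives a boolean Series;
# astype(int) maps True/False to 1/0.
df["is_adult"] = (df["person"] == "adult").astype(int)
df = df.drop(columns=["person"])
print(df)
```

On large data frames the vectorized comparison is usually faster than apply with axis=1, since it avoids calling a Python function once per row.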



Next we split the data into training and testing sets. With six rows and test_size=0.25, this leaves four rows for training and two for testing:



labels = np.array(df['is_adult'])
features = df.drop('is_adult', axis=1)
feature_list = list(features.columns)
features = np.array(features)
train_features, test_features, train_labels, test_labels = \
    train_test_split(features,
                     labels,
                     test_size=0.25,
                     random_state=42,
                     )
print('features shape {} labels shape {}'.format(
    train_features.shape, train_labels.shape))
print('features shape {} labels shape {}'.format(
    test_features.shape, test_labels.shape))

with np.printoptions(threshold=np.inf):
    print(train_features)
    print(train_labels)
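
With so few rows, a random split can easily put only one class in the test set. One possible remedy, sketched below on an inline copy of the sample data, is the stratify parameter of train_test_split, which keeps the class proportions similar in both parts. This is an optional variation, not what the post's code does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Inline copy of the sample data: [height, weight] and the is_adult label.
features = np.array([[80, 40], [70, 30], [50, 10],
                     [180, 80], [170, 80], [185, 80]])
labels = np.array([0, 0, 0, 1, 1, 1])

# stratify=labels asks for the same child/adult ratio in train and test.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42, stratify=labels)

print('train labels', train_labels)
print('test labels', test_labels)
```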



Let's examine a dummy model as a baseline. This model always guesses that the person is a child, never an adult.



baseline_predictions = np.full(test_labels.shape, 0)
baseline_errors = abs(baseline_predictions - test_labels)

with np.printoptions(threshold=np.inf):
    print("baseline predictions", baseline_predictions)
    print("baseline errors", baseline_errors)

print('error baseline {}'.format(
    round(np.mean(baseline_errors), 3)))
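
scikit-learn also ships a DummyClassifier that can play the same role. The sketch below assumes a hand-picked train/test split of the sample data (the actual split produced by train_test_split may differ) and configures the dummy model to always answer 0, that is, child.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical split of the sample data, chosen by hand for illustration.
train_features = np.array([[80, 40], [50, 10], [180, 80], [185, 80]])
train_labels = np.array([0, 0, 1, 1])
test_features = np.array([[70, 30], [170, 80]])
test_labels = np.array([0, 1])

# strategy="constant" makes the model ignore the features and always
# predict the given constant label.
baseline = DummyClassifier(strategy="constant", constant=0)
baseline.fit(train_features, train_labels)

baseline_predictions = baseline.predict(test_features)
baseline_error = np.mean(baseline_predictions != test_labels)
print('baseline predictions', baseline_predictions)
print('error baseline', round(float(baseline_error), 3))
```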



Now let's create the random forest model and check its error rate. The regressor returns values between 0 and 1, so we threshold them at 0.5 to get a class.



forest = RandomForestRegressor(n_estimators=1000, random_state=42)
forest.fit(train_features, train_labels)

predictions = forest.predict(test_features)

# Turn the continuous predictions into 0/1 classes
prediction_threshold = 0.5
predictions[predictions < prediction_threshold] = 0
predictions[predictions >= prediction_threshold] = 1
with np.printoptions(threshold=np.inf):
    print(predictions)

prediction_errors = predictions - test_labels
print('error for test {}'.format(
    round(np.mean(abs(prediction_errors)), 3)))
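
For comparison, scikit-learn's RandomForestClassifier returns class labels directly, so no manual threshold is needed. This is a minimal sketch on an inline copy of the sample data. The new_people array is a made-up input for illustration, and the model is fitted on all six rows only to keep the example short.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Inline copy of the sample data: [height, weight] and the is_adult label.
features = np.array([[80, 40], [70, 30], [50, 10],
                     [180, 80], [170, 80], [185, 80]])
labels = np.array([0, 0, 0, 1, 1, 1])

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(features, labels)

# Hypothetical unseen people, not part of the original data.
new_people = np.array([[60, 20], [175, 78]])

predictions = forest.predict(new_people)          # class labels (0 or 1)
probabilities = forest.predict_proba(new_people)  # one column per class
print(predictions)
print(probabilities)
```

The predict_proba output plays the role of the regressor's raw prediction, so a custom threshold can still be applied to its second column if 0.5 is not the desired cut-off.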



We can check the importance of each feature in the model:



importances = list(forest.feature_importances_)
feature_importances = [(feature, round(importance, 2)) for feature, importance in
                       zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
for pair in feature_importances:
    print('variable: {} Importance: {}'.format(*pair))
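
The same listing can be produced with a pandas Series, which handles the pairing and sorting. The sketch below refits a smaller forest on an inline copy of the sample data so that it runs on its own. The exact importance values depend on the random seed and are not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Inline copy of the sample data so the sketch is self-contained.
feature_list = ["height", "weight"]
features = np.array([[80, 40], [70, 30], [50, 10],
                     [180, 80], [170, 80], [185, 80]])
labels = np.array([0, 0, 0, 1, 1, 1])

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(features, labels)

# Label each importance with its feature name and sort from high to low.
importances = pd.Series(forest.feature_importances_, index=feature_list)
importances = importances.sort_values(ascending=False)
print(importances.round(2))
```

The importances are normalized, so they sum to 1 across all features.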



Lastly, we can count the true/false positives and negatives:



# Column 0 holds the prediction, column 1 holds the true label
joined = np.stack((predictions, test_labels), axis=1)
tp = joined[np.where(
    (joined[:, 0] == 1) &
    (joined[:, 1] == 1)
)]
tn = joined[np.where(
    (joined[:, 0] == 0) &
    (joined[:, 1] == 0)
)]
fp = joined[np.where(
    (joined[:, 0] == 1) &
    (joined[:, 1] == 0)
)]
fn = joined[np.where(
    (joined[:, 0] == 0) &
    (joined[:, 1] == 1)
)]
print('true positive {}'.format(np.shape(tp)[0]))
print('true negative {}'.format(np.shape(tn)[0]))
print('false positive {}'.format(np.shape(fp)[0]))
print('false negative {}'.format(np.shape(fn)[0]))
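
The four counts can also be obtained from scikit-learn's confusion_matrix. The labels and predictions below are made-up values for illustration, not results from the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
test_labels = np.array([0, 0, 1, 1, 1, 0])
predictions = np.array([0, 1, 1, 0, 1, 0])

# With labels=[0, 1] the matrix is [[tn, fp], [fn, tp]];
# ravel() flattens it row by row.
tn, fp, fn, tp = confusion_matrix(
    test_labels, predictions, labels=[0, 1]).ravel()

print('true positive {}'.format(tp))
print('true negative {}'.format(tn))
print('false positive {}'.format(fp))
print('false negative {}'.format(fn))
```

Passing labels=[0, 1] explicitly keeps the matrix 2x2 even when a small test set happens to contain only one class.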




