(Image from https://en.wikipedia.org/wiki/Random_forest)
In this post we review how to use a random forest classifier in Python.
We use a very simple CSV as input. In real life you will have many more columns and more complex data.
height,weight,person
80,40,child
70,30,child
50,10,child
180,80,adult
170,80,adult
185,80,adult
First we load the CSV into a pandas DataFrame and print its head.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# widen the pandas display limits so our prints are not truncated
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

df = pd.read_csv("input.csv")
print(df.head(5))
The random forest works with numeric values, both for features and for labels. Hence we convert the person column to an integer label:
def convert_to_int(row):
    # 1 for adult, 0 for child
    if row['person'] == 'adult':
        return 1
    return 0

df['is_adult'] = df.apply(convert_to_int, axis=1)
df.drop(labels=['person'], axis=1, inplace=True)
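As a side note, the same conversion can be written without apply; a minimal sketch using a vectorized comparison:

df['is_adult'] = (df['person'] == 'adult').astype(int)

This avoids a Python-level loop over the rows, which matters on larger frames.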
Next we split the data into training and testing sets:
labels = np.array(df['is_adult'])
features = df.drop('is_adult', axis=1)
feature_list = list(features.columns)
features = np.array(features)

train_features, test_features, train_labels, test_labels = \
    train_test_split(features,
                     labels,
                     test_size=0.25,
                     random_state=42,
                     )

print('train: features shape {} labels shape {}'.format(
    train_features.shape, train_labels.shape))
print('test: features shape {} labels shape {}'.format(
    test_features.shape, test_labels.shape))

with np.printoptions(threshold=np.inf):
    print(train_features)
    print(train_labels)
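With a dataset this small, a purely random split can leave one class underrepresented in the test set. If that is a concern, train_test_split accepts a stratify parameter that preserves the class ratio; a minimal sketch, assuming the same arrays:

train_features, test_features, train_labels, test_labels = \
    train_test_split(features,
                     labels,
                     test_size=0.25,
                     random_state=42,
                     stratify=labels,
                     )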
Let's examine a dummy model as a baseline. This model always guesses that we have a child, never an adult.
baseline_predictions = np.full(test_labels.shape, 0)
baseline_errors = abs(baseline_predictions - test_labels)

with np.printoptions(threshold=np.inf):
    print("baseline predictions", baseline_predictions)
    print("baseline errors", baseline_errors)

print('error baseline {}'.format(
    round(np.mean(baseline_errors), 3)))
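scikit-learn also ships a ready-made baseline estimator; a minimal sketch using DummyClassifier with a constant strategy, which mirrors the always-child guess above:

from sklearn.dummy import DummyClassifier

dummy = DummyClassifier(strategy='constant', constant=0)
dummy.fit(train_features, train_labels)
print('baseline accuracy {}'.format(
    dummy.score(test_features, test_labels)))

Note that accuracy here is simply 1 minus the error rate computed above.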
Now let's create the random forest, and check its error rate. Note that we use a regressor and threshold its continuous output at 0.5 to obtain a binary class.
forest = RandomForestRegressor(n_estimators=1000, random_state=42)
forest.fit(train_features, train_labels)

predictions = forest.predict(test_features)

# convert the regression output to a binary class
prediction_threshold = 0.5
predictions[predictions < prediction_threshold] = 0
predictions[predictions >= prediction_threshold] = 1

with np.printoptions(threshold=np.inf):
    print(predictions)

prediction_errors = predictions - test_labels
print('error for test {}'.format(
    round(np.mean(abs(prediction_errors)), 3)))
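An alternative that avoids the manual thresholding is RandomForestClassifier, which predicts class labels directly; a minimal sketch, assuming the same train/test arrays (results may differ slightly from the thresholded regressor):

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=1000, random_state=42)
clf.fit(train_features, train_labels)
class_predictions = clf.predict(test_features)
print('error for test {}'.format(
    round(np.mean(abs(class_predictions - test_labels)), 3)))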
We can check the importance of each feature in the model:
importances = list(forest.feature_importances_)
feature_importances = [(feature, round(importance, 2))
                       for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)

for pair in feature_importances:
    print('variable: {} importance: {}'.format(*pair))
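Impurity-based importances can be biased toward features with many distinct values. If that is a concern, scikit-learn (0.22+) offers permutation importance as an alternative; a minimal sketch, assuming the fitted forest from above:

from sklearn.inspection import permutation_importance

result = permutation_importance(forest, test_features, test_labels,
                                n_repeats=10, random_state=42)
for name, mean in zip(feature_list, result.importances_mean):
    print('variable: {} permutation importance: {}'.format(name, round(mean, 2)))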
Lastly, we can count the true/false positives and negatives:
joined = np.stack((predictions, test_labels), axis=1)

# column 0 is the prediction, column 1 is the true label
tp = joined[np.where((joined[:, 0] == 1) & (joined[:, 1] == 1))]
tn = joined[np.where((joined[:, 0] == 0) & (joined[:, 1] == 0))]
fp = joined[np.where((joined[:, 0] == 1) & (joined[:, 1] == 0))]
fn = joined[np.where((joined[:, 0] == 0) & (joined[:, 1] == 1))]

print('true positive {}'.format(np.shape(tp)[0]))
print('true negative {}'.format(np.shape(tn)[0]))
print('false positive {}'.format(np.shape(fp)[0]))
print('false negative {}'.format(np.shape(fn)[0]))
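The same counts are available from scikit-learn directly; a minimal sketch using confusion_matrix, whose ravel order for binary labels [0, 1] is tn, fp, fn, tp:

from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(test_labels, predictions, labels=[0, 1]).ravel()
print('true positive {}'.format(tp))
print('true negative {}'.format(tn))
print('false positive {}'.format(fp))
print('false negative {}'.format(fn))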