Full Blog TOC

Full Blog Table Of Content with Keywords Available HERE

Wednesday, October 28, 2020

Isolation Forest GO Implementation

isolation forest (image taken from the wikipedia site)

 


Lately I have implemented and checked an isolation tree algorithm to detect anomalies in the data. The implementation is available in github: https://github.com/alonana/isolationforest.


The isolation forest algorithm is presented in this article, and you can find examples and illustrations in this post.

There is even an old implementation in GO, but I found some issues about it (for example, it randoms a split attribute even it has only one value), so I do not recommend using it.


An example of usage is available as a unit test in the github itself.


package isolationforest

import (
"fmt"
"github.com/alonana/isolationforest/point"
"github.com/alonana/isolationforest/points"
"math/rand"
"testing"
)

func Test(t *testing.T) {
baseline := points.Create()

for value := 0; value < 1000; value++ {
x := float32(rand.NormFloat64())
y := float32(rand.NormFloat64())
baseline.Add(point.Create(x, y))
}

f := Create(100, baseline)

for radix := 0; radix < 10; radix++ {
fmt.Printf("radix %v score: %v\n", radix, f.Score(point.Create(float32(radix), float32(radix))))
}
}


The test adds 1000 points with a normal distribution: the mean is zero, and the standard deviation is 1.

Then, it checks the score for points (0,0), and (1,1), and (2,2), and so on.

The output is:


radix 0 score: 0.4144628
radix 1 score: 0.45397913
radix 2 score: 0.6438788
radix 3 score: 0.7528539
radix 4 score: 0.7821442
radix 5 score: 0.7821442
radix 6 score: 0.7821442
radix 7 score: 0.7821442
radix 8 score: 0.7821442
radix 9 score: 0.7821442


So a point with ~1 standard deviation, gets a score < 0.5, as expected, while points more far from the mean, get score > 0.5.


In the tests I've performed, I've found the isolation forest algorithm functioning well for a normal distributed data, with many data records, but for discrete values, and for small amount of data, it did not performed well. The reason is that the path length to anomalies data was almost similar to non-anomalies data points, due to random selection of the segmentation values.


No comments:

Post a Comment