I recently implemented and tested an isolation forest algorithm for detecting anomalies in data. The implementation is available on GitHub: https://github.com/alonana/isolationforest.
The isolation forest algorithm is presented in this article, and you can find examples and illustrations in this post.
There is also an older implementation in Go, but I found some issues with it (for example, it randomly picks a split attribute even when that attribute has only one value), so I do not recommend using it.
A usage example is available as a unit test in the GitHub repository itself:
package isolationforest

import (
	"fmt"
	"math/rand"
	"testing"

	"github.com/alonana/isolationforest/point"
	"github.com/alonana/isolationforest/points"
)

func Test(t *testing.T) {
	// Baseline: 1000 points, each coordinate drawn from a standard normal distribution.
	baseline := points.Create()
	for value := 0; value < 1000; value++ {
		x := float32(rand.NormFloat64())
		y := float32(rand.NormFloat64())
		baseline.Add(point.Create(x, y))
	}

	// Build the forest from the baseline points.
	f := Create(100, baseline)

	// Score the points (0,0), (1,1), ..., (9,9), moving away from the mean.
	for radix := 0; radix < 10; radix++ {
		fmt.Printf("radix %v score: %v\n", radix, f.Score(point.Create(float32(radix), float32(radix))))
	}
}
The test adds 1000 points whose coordinates are normally distributed, with a mean of zero and a standard deviation of 1.
It then checks the score for the points (0,0), (1,1), (2,2), and so on, up to (9,9).
The output is:
radix 0 score: 0.4144628
radix 1 score: 0.45397913
radix 2 score: 0.6438788
radix 3 score: 0.7528539
radix 4 score: 0.7821442
radix 5 score: 0.7821442
radix 6 score: 0.7821442
radix 7 score: 0.7821442
radix 8 score: 0.7821442
radix 9 score: 0.7821442
So a point within about one standard deviation of the mean gets a score below 0.5, as expected, while points farther from the mean get a score above 0.5.
In the tests I performed, the isolation forest algorithm worked well for normally distributed data with many records, but it did not perform well for discrete values or for small amounts of data. The reason is that the path length to anomalous points was almost the same as the path length to normal points, due to the random selection of split values.