Wednesday, December 30, 2020

Support Internet Explorer using Babel Transpilation

 



In this post we will review the steps required to support IE (Internet Explorer) in a webpack based JavaScript project.


First, we create a babel configuration file. It should be in the same folder as the package.json file. Notice the file includes the list of browsers that you want to support.


babel.config.json:

{
    "sourceType": "unambiguous",
    "presets": [
        [
            "@babel/preset-env",
            {
                "debug": false,
                "targets": {
                    "edge": "17",
                    "firefox": "60",
                    "chrome": "67",
                    "safari": "11.1",
                    "ie": "11"
                },
                "useBuiltIns": "usage",
                "corejs": {
                    "version": 3
                }
            }
        ]
    ],
    "plugins": [
        [
            "@babel/plugin-proposal-decorators",
            {
                "decoratorsBeforeExport": true
            }
        ],
        [
            "@babel/plugin-proposal-class-properties"
        ],
        [
            "@babel/transform-runtime"
        ]
    ]
}




Next, install the babel dependencies:


NPM commands:

npm install --save-dev @babel/core
npm install --save-dev @babel/plugin-proposal-class-properties
npm install --save-dev @babel/plugin-proposal-decorators
npm install --save-dev @babel/plugin-transform-runtime
npm install --save-dev @babel/preset-env
npm install --save-dev @babel/preset-typescript
npm install --save-dev @babel/runtime
npm install --save-dev babel-loader
npm install --save core-js



Now we need to update the webpack configuration to run Babel, and to include the core-js library to polyfill the missing ES6+ APIs (only the relevant part of the file is shown here). In case you have node_modules that are not ES5 compatible, replace the placeholder names in the exclude pattern with them.


webpack.config.js:

module.exports = {
    entry: ['core-js/stable', './main/index.js'],
    target: 'es5',
    module: {
        rules: [
            {
                test: /\.js$/,
                exclude: /node_modules\/(?!(exclusemelib1|exclusemelib2))/,
                loader: 'babel-loader',
            },
        ],
    },
}



To find the node_modules libraries that are not ES5 compatible, you can use the following command:


are-you-es5 CLI:

npx are-you-es5 check -a PATH_TO_YOUR_PROJECT



However, not all of the node_modules reported by this command are actually used at runtime, so you will have to judge which ones are really relevant. You can also run your project in IE without minification, and use the errors in the browser console to find out which modules are problematic.
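
For example, a minimal sketch of such a debug build configuration (assuming webpack 4 or 5, where minification is controlled by the optimization section; merge it with your existing configuration):

module.exports = {
    // development mode keeps readable module names in the browser console
    mode: 'development',
    optimization: {
        // skip minification so the IE console errors point to readable code
        minimize: false,
    },
}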



Final Notes



I've written this post after scanning many sites, and it seems that the API for handling IE support keeps changing. I hope the description in this post will remain stable enough to still be valid by the time you need it...


This post is based on some good resources that I've found while researching the subject.

Wednesday, December 23, 2020

Persistent Volume Allocation on Bare Metal Kubernetes



 

When we have a Kubernetes based deployment, we usually need to use persistent volumes. This is usually required for databases such as Redis and Elasticsearch, but might be required for many other services as well.

A cloud based Kubernetes, such as GKE or EKS, automatically provisions the persistent volumes according to the persistent volume claims that our application deployments and statefulsets create.

But in a bare metal environment, which many of us use as a debugging environment, persistent volumes are not provisioned automatically.

Until recently I have been using the hostpath-provisioner to handle persistent volume allocation, but once I upgraded my Kubernetes to version 1.20, the hostpath-provisioner broke with the error: "selflink was empty". I could not find the reason for the failure.

However, I did find a simple workaround using Helm, and in retrospect I should have used it in the first place instead of the hostpath-provisioner.

The idea is to use Helm (which is widely used for Kubernetes deployments) to create the persistent volumes only when hostpath is specified as the storage class.

For example, a statefulset creates a persistent volume claim:



apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-statefulset
spec:


  volumeClaimTemplates:
    - metadata:
        name: pvc
      spec:
        accessModes: [ "ReadWriteOnce" ]
        {{- if ne .Values.storageClass "" }}
        storageClassName: "{{ .Values.storageClass }}"
        {{- if eq .Values.storageClass "hostpath" }}
        volumeName: my-pv
        {{- end }}
        {{- end }}
        resources:
          requests:
            storage: {{ .Values.storageSize }}



Notice that in case the storage class is hostpath, we ask for a specific volume name.

Next, we create the persistent volume, only if the storage class is hostpath:


{{- if eq .Values.storageClass "hostpath" }}
apiVersion: v1
kind: PersistentVolume
metadata:
  name: my-pv
  labels:
    type: local
spec:
  storageClassName: hostpath
  capacity:
    storage: {{ .Values.storageSize }}
  accessModes:
    - ReadWriteOnce
  hostPath:
    path: "/mnt/my-folder"
{{- end }}


That's it.

No need for any special tricks.

In addition, our Helm chart supports both deployment on a bare metal Kubernetes and on a cloud based Kubernetes, with a change of a single Helm value, as shown below.
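
For example, assuming the chart exposes the storageClass and storageSize values used in the templates above (the release and chart names below are placeholders), the same chart serves both environments:

# bare metal development cluster - use the hostpath based persistent volume
helm install my-release ./my-chart --set storageClass=hostpath

# cloud based cluster - use the provider's storage class, e.g. standard on GKE
helm install my-release ./my-chart --set storageClass=standard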






Monday, December 14, 2020

Use SSL for NGINX

 



In this post we will review the steps required to configure SSL for NGINX.

Unlike many other articles, this post includes BOTH the NGINX configuration and the key creation steps.

This is intended for a development environment, hence we will use a self-signed certificate.



Create the Keys


We will create a server key and certificate.


sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout nginx.key -out nginx.crt


The NGINX Configuration


The following runs both HTTP and HTTPS servers.
If required, the related HTTP listener can be removed.


user  nginx;
worker_processes 10;

error_log /dev/stdout debug;
pid /var/run/nginx.pid;

load_module modules/ngx_http_js_module.so;
load_module modules/ngx_stream_js_module.so;

events {
    worker_connections 10240;
}

http {
    include /etc/nginx/mime.types;
    default_type application/octet-stream;
    include /etc/nginx/resolvers.conf;

    log_format main '$remote_addr - $remote_user [$time_local] "$request" '
                    '$status $body_bytes_sent "$http_referer" '
                    '"$http_user_agent" "$http_x_forwarded_for"';

    access_log on;
    sendfile on;
    port_in_redirect off;
    proxy_max_temp_file_size 0;
    keepalive_requests 1000000;
    keepalive_timeout 300s;

    server {
        listen 8080;
        listen 8443 ssl;
        server_name localhost;

        ssl_certificate /etc/ssl/nginx.crt;
        ssl_certificate_key /etc/ssl/nginx.key;

        location / {
            # your configuration here
        }
    }
}
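
To verify the configuration (a quick check, assuming NGINX runs locally and the certificate and key were copied to /etc/ssl), use curl with -k to accept the self-signed certificate:

curl http://localhost:8080/
curl -k https://localhost:8443/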






Use SSL for NodeJS




In this post we will review the steps required to configure SSL for an Express server on NodeJS.

Unlike many other articles, this post includes BOTH the code changes and the key creation steps.

This is intended for a development environment, hence we will use a self-signed certificate.



Create the Keys


We will create a Certificate Authority (CA) key and certificate, and then we will create a server key and certificate.



rm -rf ./ssl
mkdir ./ssl
sudo openssl req -newkey rsa:2048 -new -nodes -x509 -days 3650 -subj "/C=US/ST=Denial/L=Springfield/O=Dis/CN=" -keyout ./ssl/key.pem -out ./ssl/cert.pem
sudo openssl req -newkey rsa:2048 -new -nodes -x509 -days 3650 -subj "/C=US/ST=Denial/L=Springfield/O=Dis/CN=" -keyout ./ssl/ca.key.pem -out ./ssl/ca.pem
sudo chmod -R 777 ./ssl


The NodeJS SSL Server


The following runs both HTTP and HTTPS servers.
If required, the related HTTP listener can be removed.


const express = require('express')
const fs = require('fs')
const cors = require('cors')
const https = require('https')

const app = express()

app.use(express.static('public'))
app.use(cors())

app.get('/', (req, res) => {
    res.json({msg: 'OK'})
})

app.listen(8080)

const options = {
    key: fs.readFileSync('/etc/ssl/key.pem'),
    cert: fs.readFileSync('/etc/ssl/cert.pem'),
    ca: fs.readFileSync('/etc/ssl/ca.pem'),
}

https.createServer(options, app).listen(8443)





Wednesday, December 9, 2020

Example for ANTLR usage in Python



 

In this post we will review the usage of the ANTLR parser in a Python application.

The code is based on the ANTLR Python documentation, and on some ANTLR samples.


As an example for this post, we will create a simple calculator that can parse simple expressions using the following operations:

  • plus
  • minus
  • divide
  • multiply



Installation


Install ANTLR:

sudo apt install antlr4



And then install the ANTLR python library:


pip3 install antlr4-python3-runtime



Create ANTLR Grammar


The ANTLR grammar file defines our language.

We create a calc.g4 file for it:


grammar calc;

expression
: plusExpression
| minusExpression
;

plusExpression
: priorityExpression (PLUS priorityExpression)*
;

minusExpression
: priorityExpression (MINUS priorityExpression)*
;

priorityExpression
: mulExpression
| divExpression
;

mulExpression
: atom (TIMES atom)*
;

divExpression
: atom (DIV atom)*
;

atom
: NUMBER
| LPAREN expression RPAREN
;

NUMBER
: ('0' .. '9') +
;

LPAREN
: '('
;


RPAREN
: ')'
;


PLUS
: '+'
;


MINUS
: '-'
;


TIMES
: '*'
;

DIV
: '/'
;



The grammar file is based on a recursive definition of an expression. Notice that this causes ANTLR to build a tree, which we will later use to analyze the expression and calculate the result.

In our case we must ensure that the parse tree is built according to the order of mathematical operations, hence the expression is first split into plus/minus expressions, and only then into multiply/divide expressions.
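
For instance, under this grammar the input 1+2*3 is parsed roughly into the following tree (a hand-drawn sketch, not actual ANTLR output), so the multiplication is grouped under its own rule and is calculated before the plus:

expression
  plusExpression
    priorityExpression -> mulExpression -> atom(1)
    '+'
    priorityExpression -> mulExpression -> atom(2) '*' atom(3)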


Example for tree (from http://csci201.artifice.cc/notes/antlr.html)





The Python Code


First, generate the Python parsing code based on the ANTLR grammar file:

antlr4 -Dlanguage=Python3 calc.g4



This creates several files that we should import in our application:


# math is used by the listener below (math.prod)
import math

from antlr4 import *

from calcLexer import calcLexer
from calcListener import calcListener
from calcParser import calcParser



Next, we create a listener. The listener inherits from the generated listener and can override its methods to run some logic. In our case the logic runs upon exit of each rule, which means it runs after the child nodes of the current expression have already been visited.


class MyListener(calcListener):
    def __init__(self):
        self.verbose = False
        self.stack = []

    def debug(self, *argv):
        if self.verbose:
            print(*argv)

    def debug_items(self, operation, items):
        if len(items) == 1:
            self.debug(operation, items[0])
        else:
            self.debug(operation.join(map(str, items)), "=", self.stack[-1])
        self.debug("stack is {}".format(self.stack))

    def exitAtom(self, ctx: calcParser.AtomContext):
        number = ctx.NUMBER()
        if number is not None:
            value = int(str(ctx.NUMBER()))
            self.stack.append(value)
            self.debug("atom {}".format(value))

    def exitPlusExpression(self, ctx: calcParser.PlusExpressionContext):
        elements = len(ctx.PLUS()) + 1
        items = self.stack[-elements:]
        items_result = sum(items)
        self.stack = self.stack[:-elements]
        self.stack.append(items_result)
        self.debug_items("+", items)

    def exitMinusExpression(self, ctx: calcParser.MinusExpressionContext):
        elements = len(ctx.MINUS()) + 1
        items = self.stack[-elements:]
        items_result = items[0] - sum(items[1:])
        self.stack = self.stack[:-elements]
        self.stack.append(items_result)
        self.debug_items("-", items)

    def exitMulExpression(self, ctx: calcParser.MulExpressionContext):
        elements = len(ctx.TIMES()) + 1
        items = self.stack[-elements:]
        items_result = math.prod(items)
        self.stack = self.stack[:-elements]
        self.stack.append(items_result)
        self.debug_items("*", items)

    def exitDivExpression(self, ctx: calcParser.DivExpressionContext):
        elements = len(ctx.DIV()) + 1
        items = self.stack[-elements:]
        if len(items) > 1:
            items_result = items[0] / math.prod(items[1:])
        else:
            items_result = items[0]
        self.stack = self.stack[:-elements]
        self.stack.append(items_result)
        self.debug_items("/", items)



Now we can parse an expression, and run our listener:


def parse(text, verbose=False):
    input_stream = InputStream(text)
    lexer = calcLexer(input_stream)
    stream = CommonTokenStream(lexer)
    parser = calcParser(stream)
    tree = parser.expression()
    listener = MyListener()
    listener.verbose = verbose
    walker = ParseTreeWalker()
    walker.walk(listener, tree)
    return listener.stack[0]



And finally, let's run some tests:



def test(text, expected):
    print(text)
    actual = parse(text)
    print(text, "=", actual)

    if actual != expected:
        print("=== rerun in verbose ===")
        parse(text, True)
        raise Exception("expected {} but actual is {}".format(expected, actual))


test("1", 1)
test("1+2", 3)
test("3*4", 12)
test("10-8", 2)
test("10/2", 5)
test("4+2+3", 9)
test("90-10-20", 60)
test("(1)", 1)
test("(1)+(2)", 3)
test("(1+2)*(3+4)", 21)
test("(10-8)*(1+2+3)*4", 48)
test("(11-1)-(10-5)", 5)
test("(11-1)/(10-5)", 2)



Final Note


ANTLR is a great tool that can be used in Python, as well as in other languages.
The listener API of ANTLR enables us to interpret the parsed result any way we choose, and run our own logic on it, hence getting the maximum out of the parsed text.

 

Monday, December 7, 2020

Custom Redux Middleware

 

In this post we will create a custom Redux middleware.

A Redux middleware listens for actions, and chooses either to pass each action on and let Redux handle it, or to handle the action by itself.

Let's review the steps for using a custom Redux middleware. First, add our middleware to the store's default middlewares:



import {configureStore, getDefaultMiddleware} from '@reduxjs/toolkit'
import counterReducer from '../features/counter/counterSlice'

export default configureStore({
    reducer: {
        counter: counterReducer,
    },
    middleware: [
        ...getDefaultMiddleware(),
        refreshMiddleware,
    ],
})



Next, activate our middleware only for a specific type of action:



function refreshMiddleware({dispatch, getState}) {
    return next => action => {
        if (action.type === 'refresh') {
            handleAction(action, dispatch, getState)
        } else {
            return next(action)
        }
    }
}



And lastly, we want to handle our action.

In this case we will send an async request, and update redux with its result using a new type of action.



function handleAction(action, dispatch, getState) {
    fetch('https://jsonplaceholder.typicode.com/todos/1')
        .then(response => response.json())
        .then(json => {
            const state = getState()
            dispatch({
                type: 'set-user',
                payload: {
                    counter: state.counter.value,
                    userId: json.userId,
                },
            })
        })
}



Final Note


While easy to implement, custom middleware is poorly documented in the Redux documentation. I hope this short post will assist you in implementing your own middleware.

That said, make sure to use a custom Redux middleware only as a last resort, and prefer using existing mechanisms such as simple reducers and redux-thunk.


Tuesday, December 1, 2020

Web Sockets Using JavaScript Frontend and GO Backend


 

In this post we will review an example of using WebSockets.

We will use a JavaScript based client running in the browser, and sending requests to a GO backend. The JavaScript client and the GO backend will send messages to each other over the web socket.



The JavaScript Side


To start our JavaScript project, the simplest way is to begin with a predefined React based template:


npx create-react-app demo
cd demo
npm start


Then, in App.js, add our web socket handling code:


const ws = new WebSocket('ws://localhost:8080/ws')

ws.onmessage = function (evt) {
    console.log(`got from server: ${evt.data}`)
}

ws.onopen = function () {
    console.log('WS Connected')
    setInterval(() => {
        console.log(`sending to the server`)
        ws.send('I am your client')
    }, 1000)
}


We start by initiating a web socket connection.


    ❗ We use the ws:// prefix; for a secure connection, use the wss:// prefix


Then, we wait for the web socket connection to be established, and send a message to the server every second.
In addition, whenever we get a message from the server, we print it.


The GO Side



The GO backend application uses the Echo web framework as its HTTP server.

First, let's configure the web server to handle the requests:


package main

import (
    "fmt"
    "github.com/gorilla/websocket"
    "github.com/labstack/echo/v4"
    "net/http"
    "time"
)

func main() {
    echoServer := echo.New()

    echoServer.GET("/ws", serveWebSocket)

    err := echoServer.Start(":8080")
    if err != nil {
        panic(err)
    }
}


We use the /ws URL path. This should match the path specified on the JavaScript side.

Next, communicate over the socket:


var upgrader = websocket.Upgrader{
    CheckOrigin: func(r *http.Request) bool {
        return true
    },
}

func serveWebSocket(ctx echo.Context) error {
    ws, err := upgrader.Upgrade(ctx.Response(), ctx.Request(), nil)
    if err != nil {
        return err
    }
    defer ws.Close()

    for {
        err := ws.WriteMessage(websocket.TextMessage, []byte("Hello from the server"))
        if err != nil {
            return err
        }
        _, msg, err := ws.ReadMessage()
        if err != nil {
            return err
        }
        fmt.Printf("got from the client: %v\n", string(msg))

        time.Sleep(time.Second)
    }
}


The socket connection is upgraded to a web socket, and then we start the conversation.

    ❗ Implement CheckOrigin to return true to allow cross origin requests


The web socket library in use is the Gorilla WebSocket.


Debugging in Google Chrome


Google Chrome provides great logging of the entire web socket conversation.





NGINX


In case the JavaScript code is served by NGINX, you will need to enable the web socket upgrade. Note the two upgrade related headers.


location /ws {
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_buffering off;
    proxy_pass http://backend/ws;
}



Final Note


In this post we have reviewed the steps required to create a simple application that uses web sockets.

In a real life scenario, it is possible to connect the web socket messages to the Redux store, and handle them just like any other application messages.


Sunday, November 22, 2020

Nice JavaScript Features



 

In this post I will review some nice JavaScript tricks that I've recently found while reading the Modern JavaScript course. I have been using JavaScript for several years, but still, I have found some JavaScript capabilities that I was not aware of.


In general I'm less interested in advanced JavaScript syntax/internals that are not relevant for well written code. For example:


console.log('2' + 1) // prints 21
console.log('' || false) //prints false

console.log('' || "yes") //prints yes
console.log('' ?? "yes") //prints empty line


It is nice to be able to explain why these statements behave the way they do, but when writing clean code, one should avoid such confusing behavior.


In the next sections, I will review the JavaScript items that are more relevant in my opinion.



BigInt


JavaScript numbers can safely represent integers only in the range -(2^53 - 1) up to (2^53 - 1).

But JavaScript has built-in support for big integers, for example:


const big1 = 1234567890123456789012345678901234567890n
const big2 = 1234567890123456789012345678901234567890n
console.log(big1 + big2)



Debugger


Once in a while, when debugging JavaScript code in Google Chrome, I've found the Google Chrome developer tools getting confused. Even though I've added a breakpoint at a specific location in the source, it does not stop there. This is mostly due to hot-reload during a debugging session, where the sources are reloaded as new files.

A nice trick to force a breakpoint, is to add the debugger statement in the source itself:

console.log('Hello')
debugger // break here
console.log('World')


Note that this breaks in the debugger statement only if the Google Chrome developer tools window is open.



Simple User Interaction


We all use the alert function to print some debug information, but what about getting input from the user? Turns out that we have built-in functions for that:


const name = window.prompt('Your name?', 'John')
const ok = window.confirm(`Hello ${name}, click OK to continue`)
alert(ok ? 'here we go' : 'quitting')



The prompt function opens a text box with OK and Cancel buttons.

The confirm function opens a message with OK and Cancel buttons.

We will not use this in a production environment, but it is great for debugging purposes.



Nullish Coalescing


When we want to fall back to a default only when a value is null or undefined, we can use the nullish coalescing operator (??). Unlike the OR (||) operator, it does not treat other falsy values, such as false, 0, or an empty string, as missing.


console.log(null || undefined || false || 'true') // prints true
console.log(null ?? undefined ?? false ?? 'true') // prints false



Optional Chaining


This is a great solution for nested object properties, when you are not sure whether a property exists. Optional chaining saves the if-conditions that check each level of the chain, and avoids the "cannot read property X of undefined" error.


const user1 = {
    name: {
        first: 'John',
    },
}

const user2 = {}

console.log(user1?.name?.first)
console.log(user2?.name?.first)



Object Conversion


You can control the return value upon conversion of an object to a string and to a number.


function User(name, balance = 20) {
    return {
        name,
        balance,
        toString: function () {
            return `user ${name}`
        },
        valueOf: function () {
            return balance
        },
    }
}

const u1 = User('John')
console.log(u1) // prints { name: "John", balance: 20, ... }
console.log(`hello ${u1}`) // prints hello user John (string conversion uses toString)
console.log(100 - u1) // prints 80 (numeric conversion uses valueOf)



forEach method


The forEach callback can receive not only the array element, but also the element index, and the array itself.


const colors = ['red', 'blue', 'green']
colors.forEach((color, index, array) => {
    console.log(color, index, array)
})



Smart Function Parameters


This is a very nice method for a function with many parameters, where we want to add parameters over time without affecting the existing calling code.


function f({firstName, midName = 'Van', lastName}) {
    console.log(firstName, midName, lastName)
}

f({firstName: 'John', lastName: 'Smith'})



Sealing a Property


JavaScript enables us to protect an object property. This is done by configuring the property descriptor.


const obj = {}

Object.defineProperty(obj, "secret", {
    value: "You cannot delete or change me",
    writable: false,
    configurable: false,
    enumerable: false,
})

console.log(obj.secret)
delete obj['secret'] // ERROR in strict mode: Cannot delete property 'secret'
obj.secret = "new value" // ERROR in strict mode: Cannot assign to read only property 'secret' of object
Object.defineProperty(obj, 'secret', {configurable: true}) // ERROR: cannot redefine property


Some more sealing abilities exist as well, such as Object.seal and Object.freeze.


Final Note


I've written this post mostly for myself, but it might be relevant for other experienced JavaScript engineers who were, like me, unaware of these nice JavaScript capabilities.



Thursday, November 19, 2020

HTML Content Security Policy



 

In this post we will review the usage of a content security policy (aka CSP), which is one of the methods to mitigate XSS attacks. I have already reviewed one of the XSS related attacks in a previous post: the MageCart attack.



What is CSP?


CSP is a method to restrict which resources a web page is allowed to load. Using CSP we can specify the list of domains from which we allow loading style sheets (CSS), scripts, images, videos, and more.



Why Should We Use CSP?


We want to prevent leakage of private information out of our site to an attacker's site. We assume that this could occur in case one of our site's 3rd party integrations was attacked, and a JavaScript snippet was injected into it to leak out our private information. See an example in the post about the MageCart attack.



How Do We Use CSP?


To use CSP, we can add an HTML meta tag, or an HTTP header in the HTML response:

Example of a meta tag:


<meta http-equiv="Content-Security-Policy" content="default-src 'self';">


Example of an HTTP header:


"Content-Security-Policy": "default-src 'self';"


The recommended method to use is the HTTP header, since the HTML meta tag does not support all of the CSP features.

The CSP header value contains the CSP policy, whose syntax is as follows:


(CONFIGURATION_NAME VALUE_NAME+;)+


For example:


default-src 'self'; style-src 'self' 'unsafe-inline';



Which CSP Policy Should We Use?


If we want the most security, the policy should block any access to any external resource, so we should use:


default-src 'self';


But this would cause our site to break. Why?

Because any inline script will be blocked. For example, the following JavaScript:


<script>
alert('Hello World!')
</script>


is blocked.

How can we allow our own inline scripts to run?

We can use a hash or a nonce to allow specific scripts, but this requires many changes to our site source, and to the server side.
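
For illustration only (a sketch, the nonce value here is made up), a nonce based policy pairs a random value sent in the header with the same value on the script tag, and the value must be regenerated for every response:

"Content-Security-Policy": "script-src 'nonce-r4nd0mV4lu3'"

<script nonce="r4nd0mV4lu3">
alert('Hello World!')
</script>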

A less costly method is to allow all inline scripts to run. This is acceptable only if our sole purpose is to prevent data leakage, and we do not intend to prevent malicious actions within the site itself. To do this, we use the 'unsafe-inline' keyword:


default-src 'self' 'unsafe-inline';


The last step is to whitelist the external domains that our site is allowed to access, for example:


default-src 'self' 'unsafe-inline' https://connect.facebook.net;



Using CSP Report



Activating CSP protection is the first step in protecting our site. The next step is to monitor the CSP: which URLs were blocked?
CSP provides a simple method to report the blocked resources. However, the server side that receives the reports should be implemented on our own.

To enable CSP reporting, use the report-uri directive:


default-src 'self' 'unsafe-inline'; report-uri http://myreport.com;

This sends a JSON request for each blocked resource.
An example of such a request is:


{
    "csp-report": {
        "blocked-uri": "https://www.ynet.co.il/images/og/logo/www.ynet.co.il.png",
        "disposition": "enforce",
        "document-uri": "http://localhost:3000/domains",
        "effective-directive": "img-src",
        "line-number": 37,
        "original-policy": "default-src 'self'; style-src 'self' 'unsafe-inline'; script-src 'self' 'unsafe-inline' ;report-uri http://127.0.0.1:2121/report;",
        "referrer": "http://localhost:3000/domains",
        "script-sample": "",
        "source-file": "http://localhost:3000/domains",
        "status-code": 200,
        "violated-directive": "img-src"
    }
}
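
The report endpoint itself can be very simple. Here is a minimal sketch, assuming a NodeJS Express server listening on the port used in the policy above (browsers send the report with the application/csp-report content type):

const express = require('express')

const app = express()

// accept the CSP report JSON body, which is sent as application/csp-report
app.use(express.json({type: ['application/json', 'application/csp-report']}))

app.post('/report', (req, res) => {
    // in a real system, store or aggregate the reports instead of just logging them
    console.log('CSP violation:', JSON.stringify(req.body['csp-report']))
    res.sendStatus(204)
})

app.listen(2121)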


Final Note


We have reviewed the HTML CSP as a method to block data leakage.

However, an attacker might still use a whitelisted domain, and send the leaked data to his own account within that domain. This can be done, for example, using the Google Analytics domain.


Wednesday, November 11, 2020

Redis Pub/Sub using go-redis library

 



In this post we will review the usage of Redis Pub/Sub using GO code based on the go-redis library.


Our main code initiates a connection to Redis, and then starts two subscribers and two publishers. Since we start the subscribers and the publishers as GO routines, we add a sleep of 5 seconds to avoid immediate termination of the process.


package main

import (
    "context"
    "fmt"
    "github.com/go-redis/redis/v8"
    "time"
)

const channel = "my-channel"

func main() {
    address := "127.0.0.1:5555"
    options := redis.Options{
        Addr:     address,
        Password: "",
        DB:       0,
    }
    client := redis.NewClient(&options)

    go subscriber(1, client)
    go subscriber(2, client)
    go publisher(1, client)
    go publisher(2, client)
    time.Sleep(5 * time.Second)
}



The subscriber loops forever on the ReceiveMessage call, and prints the received messages to STDOUT.


func subscriber(subscriberId int, client *redis.Client) {
    ctx := context.Background()
    pubsub := client.Subscribe(ctx, channel)
    for {
        message, err := pubsub.ReceiveMessage(ctx)
        if err != nil {
            panic(err)
        }
        fmt.Printf("subscriber %v got notification: %s\n", subscriberId, message.Payload)
    }
}


And each of the publishers sends 3 messages to the channel.


func publisher(publisherId int, client *redis.Client) {
    ctx := context.Background()
    for i := 0; i < 3; i++ {
        client.Publish(ctx, channel, fmt.Sprintf("Hello #%v from publisher %v", i, publisherId))
    }
}


Once we run this small application, we get the following output:


subscriber 1 got notification: Hello #1 from publisher 1
subscriber 2 got notification: Hello #1 from publisher 1
subscriber 2 got notification: Hello #1 from publisher 2
subscriber 2 got notification: Hello #2 from publisher 2
subscriber 1 got notification: Hello #1 from publisher 2
subscriber 1 got notification: Hello #2 from publisher 2
subscriber 2 got notification: Hello #2 from publisher 1
subscriber 1 got notification: Hello #2 from publisher 1


But, wait.

We have 2 publishers, 2 subscribers, and 3 messages per publisher. That's 2*2*3 = 12 expected lines in STDOUT, but we got only 8.

The reason for that is the Redis Pub/Sub behavior, which does not behave as a queue. Instead, only the currently active subscribers get notified of messages published to the channel. As the subscribers are not subscribed yet when the first messages are sent, these messages are not delivered to any subscriber.

If we wanted all of the messages to be received, we should wait (e.g. sleep) after launching the subscriber GO routines, and before starting the publishers, as shown in the sketch below.
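
For example, a sketch of such a modified main (reusing the subscriber and publisher functions above, with an arbitrary one second delay):

func main() {
    options := redis.Options{
        Addr:     "127.0.0.1:5555",
        Password: "",
        DB:       0,
    }
    client := redis.NewClient(&options)

    go subscriber(1, client)
    go subscriber(2, client)

    // let the subscribers register on the channel before any message is published
    time.Sleep(time.Second)

    go publisher(1, client)
    go publisher(2, client)
    time.Sleep(5 * time.Second)
}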


Final Note


In this post we have reviewed the Redis Pub/Sub usage, and its behavior.

When running Pub/Sub in a Redis cluster, the messages are broadcast to all of the cluster nodes, which might become a performance issue. In case this is indeed a performance issue, it is possible to consider Redis Streams instead.

Wednesday, November 4, 2020

Using Soundex and Levenshtein-Distance in Python



In this post we will review the usage of the Soundex and the Levenshtein distance algorithms to check word similarity. Our goal is to implement a method that decides whether a given word is similar to one of a list of predefined well known words that we have.

For example, we could have the following list of predefined words:


predefined_words = [
    "user",
    "account",
    "address",
    "name",
    "firstname",
    "lastname",
    "surname",
    "credit",
    "card",
    "password",
    "pass",
]


Given a new word such as "uzer", we would like to find whether it matches any of our predefined words.


Soundex

One method is to use the Soundex function.

The Soundex function creates a 4-character string representing the phonetic sound of the word. The following code generates random word pairs, and prints the pairs that are considered similar:


import random
import string

import jellyfish


def match_soundex(str1, str2):
    sound_ex1 = jellyfish.soundex(str1)
    sound_ex2 = jellyfish.soundex(str2)
    return sound_ex1 == sound_ex2


def random_word():
    word_len = random.randint(4, 8)
    word = ""
    for _ in range(word_len):
        word += random.choice(string.ascii_letters)
    return word.lower()


for i in range(10000):
    w1 = random_word()
    w2 = random_word()
    if match_soundex(w1, w2):
        print(w1, w2)


and the output is:


ylqhso yloja
wpppw wbuihu
doyk dhyazgg
vvzbzpam vskpakt
gxtjh gxdzu
pgpeg pspqnug
xahbfhs xvex
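
Back to the motivating example, a quick check (a sketch using the same jellyfish library) shows why "user" and "uzer" are considered similar - both should map to the same Soundex code:

import jellyfish

print(jellyfish.soundex("user"))  # expected: U260
print(jellyfish.soundex("uzer"))  # expected: U260 - same code, hence a match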


Levenshtein Distance

Another method is the Levenshtein distance.

The Levenshtein distance calculates, using dynamic programming, how many single-character changes should be done in order to change one string into a second string.

The following code generates random word pairs, and prints the similar pairs. For our purpose, two words are considered similar if less than 20% of the characters were changed.


import random
import string

import jellyfish


def match_levenshtein(str1, str2):
    distance = jellyfish.levenshtein_distance(str1, str2)
    min_len = min(len(str1), len(str2))
    return distance / min_len < 0.2


def random_word():
    word_len = random.randint(4, 8)
    word = ""
    for _ in range(word_len):
        word += random.choice(string.ascii_letters)
    return word.lower()


for i in range(10000):
    w1 = random_word()
    w2 = random_word()
    if match_levenshtein(w1, w2):
        print(w1, w2)



and the output is:

wyqg wxeo
khuqw kqosz
wvhy weve
yqspuzc ycpg
rgvo rkwgpo
nhgxbag njqvxk
woebbbkf wvkpfyf


The Lookup Implementation

Now we can use a combination of these two functions to look for similar words. If either the Soundex or the Levenshtein distance check returns a match, we declare that the word is found.


for i in range(1000):
    w1 = random_word()
    for predefined in predefined_words:
        if match_soundex(w1, predefined) or match_levenshtein(w1, predefined):
            print(w1, predefined)


Final Note


In this post we have presented a method for finding word similarity.

In my tests, the Soundex function finds similarities even for words that do not look similar. This is due to the default of using a 4-character string to represent the sound of the word. The jellyfish Python library (that we've used) does not allow changing this default length. For production usage, I recommend using a library that does allow changing it.

Wednesday, October 28, 2020

Isolation Forest GO Implementation

isolation forest (image taken from the wikipedia site)

 


Lately I have implemented and evaluated an isolation forest algorithm to detect anomalies in data. The implementation is available on GitHub: https://github.com/alonana/isolationforest.


The isolation forest algorithm is presented in this article, and you can find examples and illustrations in this post.

There is even an old implementation in GO, but I found some issues with it (for example, it randomly selects a split attribute even if it has only one value), so I do not recommend using it.


An example of usage is available as a unit test in the GitHub repository itself.


package isolationforest

import (
    "fmt"
    "github.com/alonana/isolationforest/point"
    "github.com/alonana/isolationforest/points"
    "math/rand"
    "testing"
)

func Test(t *testing.T) {
    baseline := points.Create()

    for value := 0; value < 1000; value++ {
        x := float32(rand.NormFloat64())
        y := float32(rand.NormFloat64())
        baseline.Add(point.Create(x, y))
    }

    f := Create(100, baseline)

    for radix := 0; radix < 10; radix++ {
        fmt.Printf("radix %v score: %v\n", radix, f.Score(point.Create(float32(radix), float32(radix))))
    }
}


The test adds 1000 points with a normal distribution: the mean is zero, and the standard deviation is 1.

Then, it checks the score for points (0,0), and (1,1), and (2,2), and so on.

The output is:


radix 0 score: 0.4144628
radix 1 score: 0.45397913
radix 2 score: 0.6438788
radix 3 score: 0.7528539
radix 4 score: 0.7821442
radix 5 score: 0.7821442
radix 6 score: 0.7821442
radix 7 score: 0.7821442
radix 8 score: 0.7821442
radix 9 score: 0.7821442


So a point within ~1 standard deviation of the mean gets a score below 0.5, as expected, while points farther from the mean get a score above 0.5.


In the tests I've performed, I've found the isolation forest algorithm functioning well for normally distributed data with many data records, but for discrete values and for a small amount of data, it did not perform well. The reason is that the path length to anomalous data points was almost the same as for non-anomalous data points, due to the random selection of the split values.