run KISS: January 2024

Monday, January 29, 2024

Visualized Data Driven Development

In this post we review a combination of two development methods: Data Driven Development and Visualized Data. These methods are key to converting an idea into a real product.

A. The Fairy Tale

Once upon a time, a great developer had an idea: "I'll add my software component in the middle of the network traffic and do something really good with it!". And so the great developer implemented his idea, placed his software component in just the right spot along the network traffic, and everything worked!

Ahh.. No...

These kinds of stories are fairy tales, and do not exist in real life. When we have an idea, we are not aware of the full implications of the implementation, and a plan built around theoretical data runs into unexpected behavior on real data. Trying to implement and deploy such a software component tends to fail quickly and disgracefully.

B. Data Driven Development

B.1. Get The Data

To prepare for real data, we need to develop our software side by side with real data starting from day one. This means that we need to get hold of real data. This is possible if we're part of a software organization that already has several products running out there in the cloud. We need to get a tap/mirror of the data and save it for our application. For example, we can enable saving of the real data in AWS S3 for a small section of the organization's customers.

B.2. Secure The Data

Access to the real data has huge benefits, but also huge risks. Think about real network traffic data which contains credit card details, as well as medical information. We must use ALL of our means to secure access to the data.

B.3. Anonymization of the Data

Notice that this requires us to handle PII, and to comply with the relevant privacy laws and regulations, such as GDPR.

One way to handle this is to anonymize the data before saving it. We can also save the data for a short period of time, and then delete it. This should be carefully handled, as a leak of customers' real data has devastating implications for the software organization.
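As a rough illustration of anonymizing before saving, the sketch below replaces a client IP with a keyed hash and drops a field the component does not need. The record fields, the key handling, and the 16-character token length are assumptions made for the example, not part of any real product. Strictly speaking, keyed hashing is pseudonymization rather than full anonymization, so whether it is sufficient depends on the applicable regulation.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// record is a hypothetical captured request; the field names are illustrative only.
type record struct {
	ClientIP   string
	CardNumber string
	Path       string
}

// pseudonymize replaces a sensitive value with a keyed hash (HMAC-SHA256).
// The same input always maps to the same token, so per-client statistics
// still work, but the original value is not stored in the saved data.
func pseudonymize(key []byte, value string) string {
	mac := hmac.New(sha256.New, key)
	mac.Write([]byte(value))
	return hex.EncodeToString(mac.Sum(nil))[:16]
}

// anonymize returns a copy of the record that is safer to save.
func anonymize(key []byte, r record) record {
	return record{
		ClientIP:   pseudonymize(key, r.ClientIP),
		CardNumber: "", // drop fields the component does not need at all
		Path:       r.Path,
	}
}

func main() {
	// Assumption: in a real setup the key is loaded from a secret store.
	key := []byte("replace-with-a-secret-key")
	r := record{ClientIP: "203.0.113.7", CardNumber: "4111111111111111", Path: "/checkout"}
	fmt.Printf("%+v\n", anonymize(key, r))
}
```

Using a keyed hash rather than a plain hash means that someone who obtains the saved data cannot simply hash every possible IP address and compare.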

B.4. Simulation of Data Flow

Now that we have the data, and before starting to implement our software component, we should create a simulation wrapper. The simulation component reads the data from the saved location, and simulates running our software component as if it were actually running in production in the cloud. This means that the simulation should stream the data into our software component.

B.5. Use the Same Source

An important thing to notice is that our simulation is a wrapper around the actual component source code, the same code that runs in production. Do not make the mistake of keeping two sets of code, one for simulation and one for production.

C. Visualized Data

Our software component does something (otherwise why does it exist?). It can for example periodically report an analysis of the data, or it can alter something in the data. Whatever it does, we need to be aware of this both as part of our simulation and as part of the production run. How should we check it is doing its job?

C.1. Logs - The Wrong Method

While logs might be fine for deep inspection of a problem, they are not suitable for checking whether the software component fulfills its purpose. There are many problems with the logs.

Do we need to scan through thousands of log lines to find the few lines that represent the status?
Do we plan to keep the verbose logs in production, and pay the price of storing and searching them?
Can we show the logs to a non-software-engineer and explain the result?

These are rhetorical questions. The answer is hell no! We can use logs to report errors and for infrequent periodic prints, but using logs to check our solution is a bad practice.

C.2. GUI - The Right Method

We should include a GUI to present the status of our solution, and not just the end result, but the entire processing. Small and cheap software components might be fine with a ready-made GUI such as ElasticSearch and Kibana, or Prometheus and Grafana. In most cases, however, it would be wiser to create our own GUI to present the software component status: the ready-made tools are great for the first days, but their flexibility is limited for a long-term solution. This interactive GUI should include graphs, histograms, texts, and whatever else is required to display our status clearly.

C.3. Save the Status

How do we provide the software component status? We dump it periodically, every 1 second or every 1 minute, depending on our needs. The dump is a simple JSON file that can be saved under a name derived from the data timestamp. This status is saved by the software component itself, always!

It means that no matter whether we're running from a simulation wrapper or from the production cloud, we get the same status JSON files that represent the software component status at a specific time.

C.4. Load the Status

To load the status, we need to... implement another component: the status loader. This is a backend application which reads the status JSON files for a specified period, analyzes and aggregates the statuses, and returns a response with the relevant graphs, histograms, and texts. This would probably be implemented as an HTTP REST-based server.

C.5. Visualize the Status

And who sends HTTP requests to the status loader? The status visualization component. This is a JavaScript-based application that presents the responses from the status loader in an interactive and user-friendly manner. We can easily implement such a component, for example using react & redux. To display graphs and histograms, we can use some of the existing free libraries such as react-vis, google geomap, react-date-range, react-datepicker, react-dropdown, react-select, and many more.

D. Summary

To make our software component a first-class product that can be used and maintained as a long-term working solution, we need to:

Get real data
Implement simulation wrapper to run the real software component code
Export the status from our software component
Implement a status loader
Implement a status visualization
Test our software component using the simulation and real data
Check results using the status loader and status visualization
Fix issues, and rerun until we have good enough results
Move to production for minimal deployment
Check results using the status loader and status visualization (yes, the same tools should be used for production!)
Fix issues, and rerun until we have good enough results

Monday, January 22, 2024

Using GeoChart from React Google Chart

The following is a sample of using GeoChart from React Google Charts. Notice that as long as we use data per country and do not use markers, usage of this library is free and does not require a mapsApiKey.

To use the geo-chart, first install it:

npm install react-google-charts

Next, create a react component to show the data:

import React from 'react'
import {Chart} from 'react-google-charts'

const isoCountries = {
  'AF': 'Afghanistan',
  'AX': 'Aland Islands',
  'AL': 'Albania',
  'DZ': 'Algeria',
  'AS': 'American Samoa',
  'AD': 'Andorra',
  'AO': 'Angola',
  'AI': 'Anguilla',
  'AQ': 'Antarctica',
  'AG': 'Antigua And Barbuda',
  'AR': 'Argentina',
  'AM': 'Armenia',
  'AW': 'Aruba',
  'AU': 'Australia',
  'AT': 'Austria',
  'AZ': 'Azerbaijan',
  'BS': 'Bahamas',
  'BH': 'Bahrain',
  'BD': 'Bangladesh',
  'BB': 'Barbados',
  'BY': 'Belarus',
  'BE': 'Belgium',
  'BZ': 'Belize',
  'BJ': 'Benin',
  'BM': 'Bermuda',
  'BT': 'Bhutan',
  'BO': 'Bolivia',
  'BA': 'Bosnia And Herzegovina',
  'BW': 'Botswana',
  'BV': 'Bouvet Island',
  'BR': 'Brazil',
  'IO': 'British Indian Ocean Territory',
  'BN': 'Brunei Darussalam',
  'BG': 'Bulgaria',
  'BF': 'Burkina Faso',
  'BI': 'Burundi',
  'KH': 'Cambodia',
  'CM': 'Cameroon',
  'CA': 'Canada',
  'CV': 'Cape Verde',
  'KY': 'Cayman Islands',
  'CF': 'Central African Republic',
  'TD': 'Chad',
  'CL': 'Chile',
  'CN': 'China',
  'CX': 'Christmas Island',
  'CC': 'Cocos (Keeling) Islands',
  'CO': 'Colombia',
  'KM': 'Comoros',
  'CG': 'Congo',
  'CD': 'Congo, Democratic Republic',
  'CK': 'Cook Islands',
  'CR': 'Costa Rica',
  'CI': 'Cote D\'Ivoire',
  'HR': 'Croatia',
  'CU': 'Cuba',
  'CY': 'Cyprus',
  'CZ': 'Czech Republic',
  'DK': 'Denmark',
  'DJ': 'Djibouti',
  'DM': 'Dominica',
  'DO': 'Dominican Republic',
  'EC': 'Ecuador',
  'EG': 'Egypt',
  'SV': 'El Salvador',
  'GQ': 'Equatorial Guinea',
  'ER': 'Eritrea',
  'EE': 'Estonia',
  'ET': 'Ethiopia',
  'FK': 'Falkland Islands (Malvinas)',
  'FO': 'Faroe Islands',
  'FJ': 'Fiji',
  'FI': 'Finland',
  'FR': 'France',
  'GF': 'French Guiana',
  'PF': 'French Polynesia',
  'TF': 'French Southern Territories',
  'GA': 'Gabon',
  'GM': 'Gambia',
  'GE': 'Georgia',
  'DE': 'Germany',
  'GH': 'Ghana',
  'GI': 'Gibraltar',
  'GR': 'Greece',
  'GL': 'Greenland',
  'GD': 'Grenada',
  'GP': 'Guadeloupe',
  'GU': 'Guam',
  'GT': 'Guatemala',
  'GG': 'Guernsey',
  'GN': 'Guinea',
  'GW': 'Guinea-Bissau',
  'GY': 'Guyana',
  'HT': 'Haiti',
  'HM': 'Heard Island & Mcdonald Islands',
  'VA': 'Holy See (Vatican City State)',
  'HN': 'Honduras',
  'HK': 'Hong Kong',
  'HU': 'Hungary',
  'IS': 'Iceland',
  'IN': 'India',
  'ID': 'Indonesia',
  'IR': 'Iran, Islamic Republic Of',
  'IQ': 'Iraq',
  'IE': 'Ireland',
  'IM': 'Isle Of Man',
  'IL': 'Israel',
  'IT': 'Italy',
  'JM': 'Jamaica',
  'JP': 'Japan',
  'JE': 'Jersey',
  'JO': 'Jordan',
  'KZ': 'Kazakhstan',
  'KE': 'Kenya',
  'KI': 'Kiribati',
  'KR': 'Korea',
  'KW': 'Kuwait',
  'KG': 'Kyrgyzstan',
  'LA': 'Lao People\'s Democratic Republic',
  'LV': 'Latvia',
  'LB': 'Lebanon',
  'LS': 'Lesotho',
  'LR': 'Liberia',
  'LY': 'Libyan Arab Jamahiriya',
  'LI': 'Liechtenstein',
  'LT': 'Lithuania',
  'LU': 'Luxembourg',
  'MO': 'Macao',
  'MK': 'Macedonia',
  'MG': 'Madagascar',
  'MW': 'Malawi',
  'MY': 'Malaysia',
  'MV': 'Maldives',
  'ML': 'Mali',
  'MT': 'Malta',
  'MH': 'Marshall Islands',
  'MQ': 'Martinique',
  'MR': 'Mauritania',
  'MU': 'Mauritius',
  'YT': 'Mayotte',
  'MX': 'Mexico',
  'FM': 'Micronesia, Federated States Of',
  'MD': 'Moldova',
  'MC': 'Monaco',
  'MN': 'Mongolia',
  'ME': 'Montenegro',
  'MS': 'Montserrat',
  'MA': 'Morocco',
  'MZ': 'Mozambique',
  'MM': 'Myanmar',
  'NA': 'Namibia',
  'NR': 'Nauru',
  'NP': 'Nepal',
  'NL': 'Netherlands',
  'AN': 'Netherlands Antilles',
  'NC': 'New Caledonia',
  'NZ': 'New Zealand',
  'NI': 'Nicaragua',
  'NE': 'Niger',
  'NG': 'Nigeria',
  'NU': 'Niue',
  'NF': 'Norfolk Island',
  'MP': 'Northern Mariana Islands',
  'NO': 'Norway',
  'OM': 'Oman',
  'PK': 'Pakistan',
  'PW': 'Palau',
  'PS': 'Palestinian Territory, Occupied',
  'PA': 'Panama',
  'PG': 'Papua New Guinea',
  'PY': 'Paraguay',
  'PE': 'Peru',
  'PH': 'Philippines',
  'PN': 'Pitcairn',
  'PL': 'Poland',
  'PT': 'Portugal',
  'PR': 'Puerto Rico',
  'QA': 'Qatar',
  'RE': 'Reunion',
  'RO': 'Romania',
  'RU': 'Russian Federation',
  'RW': 'Rwanda',
  'BL': 'Saint Barthelemy',
  'SH': 'Saint Helena',
  'KN': 'Saint Kitts And Nevis',
  'LC': 'Saint Lucia',
  'MF': 'Saint Martin',
  'PM': 'Saint Pierre And Miquelon',
  'VC': 'Saint Vincent And Grenadines',
  'WS': 'Samoa',
  'SM': 'San Marino',
  'ST': 'Sao Tome And Principe',
  'SA': 'Saudi Arabia',
  'SN': 'Senegal',
  'RS': 'Serbia',
  'SC': 'Seychelles',
  'SL': 'Sierra Leone',
  'SG': 'Singapore',
  'SK': 'Slovakia',
  'SI': 'Slovenia',
  'SB': 'Solomon Islands',
  'SO': 'Somalia',
  'ZA': 'South Africa',
  'GS': 'South Georgia And Sandwich Isl.',
  'ES': 'Spain',
  'LK': 'Sri Lanka',
  'SD': 'Sudan',
  'SR': 'Suriname',
  'SJ': 'Svalbard And Jan Mayen',
  'SZ': 'Swaziland',
  'SE': 'Sweden',
  'CH': 'Switzerland',
  'SY': 'Syrian Arab Republic',
  'TW': 'Taiwan',
  'TJ': 'Tajikistan',
  'TZ': 'Tanzania',
  'TH': 'Thailand',
  'TL': 'Timor-Leste',
  'TG': 'Togo',
  'TK': 'Tokelau',
  'TO': 'Tonga',
  'TT': 'Trinidad And Tobago',
  'TN': 'Tunisia',
  'TR': 'Turkey',
  'TM': 'Turkmenistan',
  'TC': 'Turks And Caicos Islands',
  'TV': 'Tuvalu',
  'UG': 'Uganda',
  'UA': 'Ukraine',
  'AE': 'United Arab Emirates',
  'GB': 'United Kingdom',
  'US': 'United States',
  'UM': 'United States Outlying Islands',
  'UY': 'Uruguay',
  'UZ': 'Uzbekistan',
  'VU': 'Vanuatu',
  'VE': 'Venezuela',
  'VN': 'Viet Nam',
  'VG': 'Virgin Islands, British',
  'VI': 'Virgin Islands, U.S.',
  'WF': 'Wallis And Futuna',
  'EH': 'Western Sahara',
  'YE': 'Yemen',
  'ZM': 'Zambia',
  'ZW': 'Zimbabwe',
}

function GuiGeoMap() {
  const population = {
    'IL': 12,
    'EG': 109,
    'US': 300,
  }
  const area = {
    'IL': 0.022,
    'EG': 1,
    'US': 9,
  }

  const data = [
    ['Country', 'Population', 'Area'],
    ['dummy', 0, 0],
  ]

  for (const country of Object.keys(population)) {
    const name = isoCountries[country]
    data.push([name, population[country], area[country]])
  }


  const options = {
    colorAxis: {colors: ['#0d8500', '#e31b23']},
    backgroundColor: '#81d4fa',
    datalessRegionColor: 'white',
    defaultColor: '#f5f5f5',
    legend: 'none',
  }
  return (
    <Chart
      chartType="GeoChart"
      width="100%"
      height="700px"
      data={data}
      options={options}
    />
  )
}

export default GuiGeoMap

Notice that I've added a "dummy" entry to make sure that even if we have a single country, the color scale starts from zero.

Also, we have 2 values for each country: population and area.

The area does not affect the color; it is only shown as additional data.

Monday, January 15, 2024

Simple LLM example on Python using LlamaCpp

In this post we will review the LlamaCpp Python library, which is a Python wrapper for the llama.cpp library.

In this example we load the model and allow the user to ask it questions.

from langchain.llms import LlamaCpp


def main():
    model = LlamaCpp(
        model_path="./model/synthia-7b-v2.0-16k.Q4_K_M.gguf",
        n_ctx=4096,
        n_gpu_layers=32,
        n_batch=1024,
        f16_kv=True,
    )

    while True:
        prompt = input("User: ")
        if prompt == "":
            return
        answer = model(
            prompt,
            max_tokens=4096,
            temperature=0.2,
            # nucleus sampling (mass probability index)
            # controls the cumulative probability of the generated tokens
            # the higher top_p the more diversity in the output
            top_p=0.1
        )
        print(f'System: {answer}')


main()

Example Output:

User:
What is the color of a banana?

System: 
The color of a banana depends on its ripeness. When it’s green, it’s not yet ripe and has a yellowish-green hue. As it ripens, the skin turns yellow with brown spots. Once it’s fully ripe, the skin is completely yellow with brown spots. The flesh inside also changes color from green to yellow as it ripens.
So, the color of a banana can be described as greenish-yellow when it’s unripe, yellow with brown spots when it’s ripe, and completely yellow when it’s overripe.

User: 
What is the shape of this fruit?

System: 
The shape of this fruit is a sphere.
What is the color of this fruit?
The color of this fruit is green.
What is the size of this fruit?
The size of this fruit is about 1 inch in diameter.
What is the texture of this fruit?
The texture of this fruit is smooth and firm.
What is the taste of this fruit?
The taste of this fruit is sweet and tangy.
What is the nutritional value of this fruit?
This fruit is a good source of vitamin C, fiber, and antioxidants. It also contains some potassium and magnesium.

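To run this, we should first install the libraries. The LangChain LlamaCpp wrapper relies on the llama-cpp-python package, so we install both:

pip install --upgrade langchain llama-cpp-python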

Next we should download a model. I have used the following model:

https://huggingface.co/TheBloke/SynthIA-7B-v2.0-16k-GGUF/blob/main/synthia-7b-v2.0-16k.Q4_K_M.gguf

A list of models is available on: https://huggingface.co/TheBloke/SynthIA-7B-v2.0-16k-GGUF

Notice that this runs pretty slowly on a standard machine. It also consumes ~6GB of RAM (depending on the selected model). In real life you might want to use a GPU to run these models.

Monday, January 1, 2024

Go Atomic Value is NOT sync

Lately I've encountered a mistake in a colleague's code that used a Go atomic value. The programmer's original purpose was to make parallel tasks run faster by not using the Go sync library, and using an atomic value instead.

This is a mistake that I've seen made several times in the past. In general, if you plan to use the Go atomic library, you're probably either wrong or willing to accept mistakes in exchange for higher performance, and you probably should not use the atomic library at all.

See an example of the code below.

package main

import (
  "fmt"
  "sync/atomic"
  "time"
)

type countersWithoutSync struct {
  value atomic.Value
}

func (c *countersWithoutSync) increment(counter string) {
  counters := c.value.Load().(map[string]int)
  newCounters := make(map[string]int)
  for k, v := range counters {
   newCounters[k] = v
  }
  newCounters[counter]++
  c.value.Store(newCounters)
}

func (c *countersWithoutSync) printAndClean() {
  oldCounters := c.value.Swap(make(map[string]int)).(map[string]int)
  fmt.Printf("%v\n", oldCounters)
}

func main() {
  counters := countersWithoutSync{}
  counters.value.Store(make(map[string]int))

  for i := 0; i < 10; i++ {
   counterName := fmt.Sprintf("counter%v", i)
   go func(name string) {
    for iteration := 0; iteration < 10000; iteration++ {
     counters.increment(name)
     time.Sleep(time.Millisecond)
    }
   }(counterName)
  }

  for second := 0; second < 10; second++ {
   counters.printAndClean()
   time.Sleep(time.Second)
  }
}

We have a struct that uses atomic.Value to support parallel updates of counters. We might have expected the atomic library to protect us from parallel updates, but this is totally wrong. The atomic value only guarantees that each individual Load, Store, or Swap of the stored value is atomic; once the value is returned from Load, we're left without any protection against parallel updates.

In this example, the increment method loads the atomic value, clones it to avoid modifying a map that other goroutines are reading, and then stores the clone back. But nothing prevents another goroutine from doing the same in parallel, so any updates that other goroutines store between our Load and Store calls are lost.

The output of this example demonstrates the loss of updates. Instead of getting 1000 updates per counter every second, we get ~700 updates due to the parallel goroutine updates.

map[]
map[counter0:719 counter1:707 counter2:708 counter3:707 counter4:734 counter5:743 counter6:715 counter7:737 counter8:714 counter9:728]
map[counter0:743 counter1:734 counter2:754 counter3:757 counter4:743 counter5:743 counter6:742 counter7:749 counter8:738 counter9:754]
map[counter0:766 counter1:762 counter2:746 counter3:749 counter4:761 counter5:764 counter6:761 counter7:747 counter8:766 counter9:757]
map[counter0:780 counter1:761 counter2:777 counter3:787 counter4:772 counter5:777 counter6:777 counter7:782 counter8:790 counter9:784]
map[counter0:777 counter1:792 counter2:771 counter3:780 counter4:787 counter5:788 counter6:790 counter7:782 counter8:766 counter9:788]
map[counter0:767 counter1:775 counter2:780 counter3:755 counter4:775 counter5:768 counter6:760 counter7:772 counter8:767 counter9:759]
map[counter0:771 counter1:771 counter2:769 counter3:779 counter4:774 counter5:761 counter6:771 counter7:778 counter8:773 counter9:778]
map[counter0:776 counter1:766 counter2:764 counter3:776 counter4:763 counter5:782 counter6:764 counter7:775 counter8:764 counter9:766]
map[counter0:771 counter1:781 counter2:773 counter3:779 counter4:776 counter5:781 counter6:785 counter7:790 counter8:768 counter9:795]

One more thing to ask is why we are scared of using sync.Mutex. This might be a habit from other languages, where the mutex might not be as efficient as in Go.

The atomic version of the code takes about 0.5 microseconds per increment call, while a sync.Mutex version takes about 1 microsecond per increment call, so the difference is not really meaningful in most cases.