Monday, January 29, 2024

Visualized Data Driven Development




In this post we review a combination of two development methods: Data Driven Development and Visualized Data. Together, these methods are the key to turning an idea into a real product.


A. The Fairy Tale

Once upon a time, a great developer had an idea: "I'll add my software component in the middle of the network traffic and make something really good with it!". And so, the great developer implemented his idea, placed his software component right in the perfect spot somewhere along the network traffic, and everything worked!


Ahh.. No...


These kinds of stories are fairy tales; they do not happen in real life. When we have an idea, we are not aware of the full implications of implementing it, and a plan built around theoretical data runs into unexpected behavior on real data. Trying to implement and deploy such a software component tends to fail quickly and disgracefully.


B. Data Driven Development


B.1. Get The Data

To prepare for real data, we need to develop our software side by side with real data, starting from day one. This means that we need to get hold of real data. This is possible if we're part of a software organization that already has several products running out there in the cloud. We need to get a tap/mirror of the data and save it for our application. For example, we can enable saving of the real data to AWS S3 for a small subset of the organization's customers.
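
As an illustration, here is a minimal sketch in TypeScript (AWS SDK v3) of such a tap. The bucket name, key layout, the TrafficRecord shape, and the mirrored customer list are assumptions for the example, not part of any specific product.

```typescript
// Hypothetical tap: mirror a small slice of production records into S3.
// Bucket name, key scheme, and TrafficRecord shape are illustrative assumptions.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

interface TrafficRecord {
  timestamp: number;   // epoch millis of the original event
  customerId: string;  // used to select the mirrored subset of customers
  payload: unknown;    // the actual traffic data
}

const s3 = new S3Client({ region: "us-east-1" });
const MIRRORED_CUSTOMERS = new Set(["customer-a", "customer-b"]); // small subset only

export async function mirrorRecord(record: TrafficRecord): Promise<void> {
  if (!MIRRORED_CUSTOMERS.has(record.customerId)) return; // mirror only a small section
  await s3.send(new PutObjectCommand({
    Bucket: "my-org-real-data-mirror",                      // assumed bucket name
    Key: `raw/${record.customerId}/${record.timestamp}.json`,
    Body: JSON.stringify(record),
    ContentType: "application/json",
  }));
}
```

Keying the objects by customer and timestamp keeps the mirror easy to replay later, which is exactly what the simulation wrapper below will need.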

B.2. Secure The Data

Access to the real data has huge benefits, but also huge risks. Think about real network traffic data that contains credit card details, as well as medical information. We must use ALL of our means to secure access to the data.
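
As one concrete example of "all of our means", here is a sketch that hardens the (assumed) mirror bucket with default encryption and a full public access block. A real deployment would add much more: strict IAM policies, VPC endpoints, access logging, and alerts.

```typescript
// Hypothetical hardening of the mirror bucket: default encryption + block all public access.
import {
  S3Client,
  PutBucketEncryptionCommand,
  PutPublicAccessBlockCommand,
} from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });
const BUCKET = "my-org-real-data-mirror"; // assumed bucket name

export async function hardenMirrorBucket(): Promise<void> {
  // Encrypt every object at rest by default.
  await s3.send(new PutBucketEncryptionCommand({
    Bucket: BUCKET,
    ServerSideEncryptionConfiguration: {
      Rules: [{ ApplyServerSideEncryptionByDefault: { SSEAlgorithm: "aws:kms" } }],
    },
  }));
  // Make sure nothing in the bucket can ever be exposed publicly.
  await s3.send(new PutPublicAccessBlockCommand({
    Bucket: BUCKET,
    PublicAccessBlockConfiguration: {
      BlockPublicAcls: true,
      BlockPublicPolicy: true,
      IgnorePublicAcls: true,
      RestrictPublicBuckets: true,
    },
  }));
}
```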

B.3. Anonymization of the Data

Notice that this requires us to handle PII (Personally Identifiable Information) and comply with the relevant privacy regulations, such as GDPR.

One way to handle this is to anonymize the data before saving it. We can also save the data only for a short period of time, and then delete it. This should be handled carefully, as a leak of customers' real data has devastating implications for the software organization.
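
A minimal sketch of field-level anonymization before saving; the field names, the salted-hash scheme, and the dropped fields are illustrative assumptions, and the real rules must come from the organization's data protection requirements.

```typescript
// Hypothetical anonymizer: hash direct identifiers, drop payment details entirely.
import { createHash } from "crypto";

const SALT = process.env.ANON_SALT ?? "rotate-me"; // keep the real salt out of source control

function pseudonymize(value: string): string {
  // Salted hash: stable enough to correlate records, not reversible to the raw value.
  return createHash("sha256").update(SALT + value).digest("hex").slice(0, 16);
}

export function anonymizeRecord(record: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = { ...record };
  for (const field of ["customerId", "email", "sourceIp"]) {    // assumed sensitive fields
    if (typeof out[field] === "string") out[field] = pseudonymize(out[field] as string);
  }
  delete out["creditCardNumber"]; // never persist payment data, even hashed
  return out;
}
```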

B.4. Simulation of Data Flow

Now that we have the data, and before starting to implement our software component, we should create a simulation wrapper. The simulation component reads the data from the saved location and simulates running our software component as if it were actually running in the cloud in production. This means that the simulation should stream the data into our software component.
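
A minimal sketch of such a wrapper, assuming the saved records are local JSON files named by their data timestamp and the component exposes a single processing function (both are assumptions for illustration):

```typescript
// Hypothetical simulation wrapper: replay saved records through the real component,
// in their original timestamp order, exactly as production would stream them.
import { readdirSync, readFileSync } from "fs";
import { join } from "path";

export function runSimulation(dataDir: string, process: (record: unknown) => void): void {
  const files = readdirSync(dataDir)
    .filter((name) => name.endsWith(".json"))
    .sort(); // file names encode the data timestamp, so sorting preserves event order

  for (const name of files) {
    const record = JSON.parse(readFileSync(join(dataDir, name), "utf-8"));
    process(record); // the same processing function the production pipeline calls
  }
}
```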

B.5. Use the Same Source

An important thing to notice is that our simulation is a wrapper around the actual component source code, the same code that runs in production. Do not make the mistake of keeping two sets of code, one for simulation and one for production.
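
To make the point concrete, a sketch of a single entry point where both modes receive the exact same component function; the module paths and the production runner are assumptions:

```typescript
// Hypothetical entry point: both modes use the exact same processRecord implementation,
// so the simulation exercises the code that will actually ship.
import { processRecord } from "./trafficComponent";   // the single production component (assumed path)
import { runSimulation } from "./simulationWrapper";  // the wrapper sketched above
import { runProduction } from "./productionRunner";   // assumed production runner

const mode = process.argv[2] ?? "simulation";
if (mode === "simulation") {
  runSimulation("/data/mirror/2024-01-29", processRecord); // replay saved data
} else {
  runProduction(processRecord);                            // stream live cloud traffic
}
```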


C. Visualized Data

Our software component does something (otherwise, why does it exist?). For example, it can periodically report an analysis of the data, or it can alter something in the data. Whatever it does, we need to be aware of it both as part of our simulation and as part of the production run. How should we check that it is doing its job?


C.1. Logs - The Wrong Method

While logs might be fine for deep inspection of a problem, they are not suitable for checking whether the software component fulfills its purpose. There are many problems with logs:

  • Do we need to scan through thousands of log lines to find the related lines that represent the status?
  • Do we plan to keep the verbose logs in production, and pay the price of storing them and searching through them?
  • Can we show the logs to a non-software-engineer and explain the result?
These are rhetorical questions. The answer is hell no! We can use logs for errors and for periodic, infrequent prints, but using logs to check our solution is a bad practice.

C.2. GUI - The Right Method

We should include a GUI that presents the status of our solution, and not just the end result, but the entire processing. While small and cheap software components might be fine with a ready-made GUI such as Elasticsearch and Kibana, or Prometheus and Grafana, in most cases it is wiser to create our own GUI to present the software component status: the pre-made tools are great for the first days, but their flexibility is limited for a long-term solution. This interactive GUI should include graphs, histograms, texts, and whatever else our status requires to be clearly displayed.

C.3. Save the Status

How do we provide the software component status? We dump it periodically, once every second or every minute, depending on our needs. Each dump is a simple JSON file, named by the data timestamp. This status is saved by the software component itself, always!
It means that no matter whether we're running under the simulation wrapper or in the production cloud, we get the same status JSON files that represent the software component status at a specific time.
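
A minimal sketch of such a periodic dump; the status fields, the dump directory, and the one-minute interval are illustrative assumptions:

```typescript
// Hypothetical periodic status dump: write a JSON snapshot of the component's counters,
// named by the timestamp of the data it describes.
import { writeFileSync, mkdirSync } from "fs";
import { join } from "path";

interface ComponentStatus {
  dataTimestamp: number;     // timestamp of the data window this snapshot describes
  recordsProcessed: number;
  recordsAltered: number;
  errors: number;
}

const STATUS_DIR = "/var/status/traffic-component"; // assumed location, same in sim and prod

export function dumpStatus(status: ComponentStatus): void {
  mkdirSync(STATUS_DIR, { recursive: true });
  const file = join(STATUS_DIR, `${status.dataTimestamp}.json`); // named by data timestamp
  writeFileSync(file, JSON.stringify(status, null, 2));
}

// Example wiring: dump once a minute (the interval depends on your needs).
// setInterval(() => dumpStatus(currentStatus()), 60_000);
```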

C.4. Load the Status

To load the status, we need to... implement another component: the status loader. This is a backend application that reads the status JSON files for a specified period, analyzes and aggregates the statuses, and returns a response with the relevant graphs, histograms, and texts. It would probably be implemented as an HTTP REST based server.
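
A minimal sketch of such a loader, here using Express; the status directory, the query parameters, and the aggregation logic are assumptions that mirror the dump sketch above:

```typescript
// Hypothetical status loader: an HTTP endpoint that aggregates status JSON files
// for a requested time range and returns data ready for the visualization layer.
import express from "express";
import { readdirSync, readFileSync } from "fs";
import { join } from "path";

const STATUS_DIR = "/var/status/traffic-component"; // assumed location (see the dump sketch)
const app = express();

app.get("/api/status", (req, res) => {
  const from = Number(req.query.from ?? 0);
  const to = Number(req.query.to ?? Date.now());

  // Each file is named by its data timestamp, so the range filter is just a name check.
  const snapshots = readdirSync(STATUS_DIR)
    .filter((name) => name.endsWith(".json"))
    .map((name) => ({ ts: Number(name.replace(".json", "")), name }))
    .filter(({ ts }) => ts >= from && ts <= to)
    .map(({ name }) => JSON.parse(readFileSync(join(STATUS_DIR, name), "utf-8")));

  // Aggregate into something the GUI can plot directly.
  const totalProcessed = snapshots.reduce((sum, s) => sum + s.recordsProcessed, 0);
  res.json({ points: snapshots, totalProcessed });
});

app.listen(8080, () => console.log("status loader listening on :8080"));
```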

C.5. Visualize the Status

And who sends HTTP requests to the status loader? The status visualization component. This is a JavaScript based application that presents the responses from the status loader in an interactive and user friendly manner. We can easily implement such a component, for example using React with react-redux. To display graphs and histograms, we can use some of the existing free libraries such as react-vis, google geomap, react-date-range, react-datepicker, react-dropdown, react-select, and many more.
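
A minimal sketch of such a visualization using plain React hooks and react-vis; the loader URL and the response shape match the loader sketch above and are assumptions:

```tsx
// Hypothetical status visualization: fetch a time range from the status loader
// and plot the per-snapshot processed-record counts.
import React, { useEffect, useState } from "react";
import { XYPlot, XAxis, YAxis, LineSeries } from "react-vis";

interface StatusPoint { dataTimestamp: number; recordsProcessed: number; }

export function StatusDashboard({ from, to }: { from: number; to: number }) {
  const [points, setPoints] = useState<StatusPoint[]>([]);

  useEffect(() => {
    fetch(`/api/status?from=${from}&to=${to}`)       // the status loader endpoint (assumed)
      .then((resp) => resp.json())
      .then((body) => setPoints(body.points));
  }, [from, to]);

  const data = points.map((p) => ({ x: p.dataTimestamp, y: p.recordsProcessed }));

  return (
    <XYPlot width={800} height={300}>
      <XAxis title="data timestamp" />
      <YAxis title="records processed" />
      <LineSeries data={data} />
    </XYPlot>
  );
}
```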


D. Summary

To make our software component a first class product that can be used and maintained as a long term working solution, we need to:
  1. Get real data
  2. Implement a simulation wrapper to run the real software component code
  3. Export the status from our software component
  4. Implement a status loader 
  5. Implement a status visualization
  6. Test our software component using the simulation and real data
  7. Check results using the status loader and status visualization
  8. Fix issues, and rerun until we have good enough results
  9. Move to production with a minimal deployment
  10. Check results using the status loader and status visualization (yes, the same tools should be used for production!)
  11. Fix issues, and rerun until we have good enough results
