run KISS: Google SRE - Applied Changes

Lately I have been reading the book Building Secure and Reliable Systems.

This book is very relevant to my current project, which started about a year ago and is now starting to acquire customers, so we are looking for guiding principles for stability and security. These terms are not new to me, but the book takes an interesting point of view, combining security and stability in a single methodology.

Our project runs on a cloud-based Kubernetes platform, and based on the methods presented in the book, I've made the following changes to the project.

Kibana View Only User

We use Kibana and Elasticsearch to view the application status. We previously ran Elasticsearch in a non-secured mode, counting on our authentication service to block unauthorized users and on the Kubernetes Ingress to encrypt the traffic. But we found that some of our users should only view the dashboards, and we do not want them to be able to update them. Hence we started by configuring TLS for Elasticsearch, which later allowed us to add a view-only user in Kibana. This can be automated using the following script:

#!/usr/bin/env bash

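# Admin credentials used for both API calls (placeholder password).
AUTH_ARG="-u elastic:mypassword"
# Elasticsearch base URL; must be set in the environment before running.
: "${ELASTICSEARCH_HOSTS:?ELASTICSEARCH_HOSTS must be set}"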

function createRole(){
cat << EOF > ./input.json
{
  "elasticsearch":{
    "cluster":[],
    "indices":[
      {
        "names":["*"],
        "privileges":["read"],
        "allow_restricted_indices":false
      }
    ]
  },
  "kibana":[
    {
      "base":["read"],
      "spaces":["default"]
    }
  ]
}
EOF
    curl ${AUTH_ARG} -s -X PUT -H 'kbn-xsrf: true' -H 'Content-Type: application/json' 'http://localhost:5601/api/security/role/my_viewer_role' --data-binary "@input.json"
}

function createUser(){
cat << EOF > ./input.json
{
  "password": "myviewpassword",
  "roles": ["my_viewer_role"]
}
EOF
    curl ${AUTH_ARG} -s -k -X POST -H 'Content-Type: application/json' "${ELASTICSEARCH_HOSTS}/_security/user/myview" --data-binary "@input.json"
}

createRole
createUser

Pods Anti Affinity

We want to ensure that if a Kubernetes node crashes, none of our microservices goes down with it. This is done by running at least 2 replicas of each critical microservice, and using an anti-affinity rule to ask Kubernetes not to schedule 2 pods of the same microservice on the same node. Notice that in this case, since the number of Kubernetes nodes is not very high, we set the anti-affinity as a preference, not as a hard requirement.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-deployment
spec:
  replicas: 2
  selector:
    matchLabels:
      configid: my-container
  template:
    metadata:
      labels:
        configid: my-container
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchExpressions:
                    - key: configid
                      operator: In
                      values:
                        - my-container
      containers:
        - name: ...
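
Since the rule is only a preference, the scheduler may still place both replicas on the same node, so it can be worth checking the actual placement. Below is a minimal sketch, assuming kubectl is configured for the cluster. The helper name crowded_nodes is hypothetical and not part of our project; the label is the one used in the Deployment above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reads "<pod> <node>" pairs on stdin and prints
# every node that hosts more than one of the listed pods.
crowded_nodes() {
  awk '{ count[$2]++ } END { for (n in count) if (count[n] > 1) print n }' | sort
}

# Usage against a live cluster (requires kubectl, so it is left commented out):
# kubectl get pods -l configid=my-container --no-headers \
#   -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName | crowded_nodes
```

An empty output means that every replica runs on its own node.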

Jenkins Merge Job

For some time, to update the production environment, we manually merged the Git dev branch into the master branch. As we wanted to reduce manual steps and the risk of mistakes, we added a Jenkins job that automates the merge.

#!/usr/bin/env bash
# Stop at the first failing command, so that a failed pull or merge is never pushed.
set -e

git checkout -B dev
git pull

git checkout -B master
git pull

git merge dev -m "merge by jenkins"
git push --set-upstream ....

Auditing

Auditing enables us to find problems and malicious actions. We have added the following audit logs:

Audit log of successful and failed logins in the authentication service
Audit log of the user name and the change made in the management service
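
To illustrate what such a record could contain, here is a minimal sketch that prints one audit record per line as JSON. The function name audit_log and the field names are assumptions made for this example; they are not the actual format used by our services, and the values are not JSON-escaped.

```shell
#!/usr/bin/env bash

# Hypothetical helper: prints one audit record as a single JSON line.
# Arguments: service, user, action, outcome.
# Assumes plain values; no JSON escaping is performed.
audit_log() {
  local service="$1" user="$2" action="$3" outcome="$4"
  local ts
  ts=$(date -u +%Y-%m-%dT%H:%M:%SZ)
  printf '{"time":"%s","service":"%s","user":"%s","action":"%s","outcome":"%s"}\n' \
    "$ts" "$service" "$user" "$action" "$outcome"
}

# Example records: a failed login, and a change made in the management service.
audit_log authentication alice login failure
audit_log management alice "update settings" success
```

Keeping the user, the action and the outcome in separate fields makes it easy to filter the records later, for example in Kibana.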

Fail-Safe

We've had some cases where our Redis DB was down. While we strive to prevent such cases, we cannot always cover all corner cases. We have decided that if the DB is down, the services will switch to a "pass-through" mode, meaning that they allow any possible operation without accessing the DB. This leaves the system only partly functioning, but this is better than a fully non-operational system.
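
The idea can be sketched in a few lines of shell. This is only an illustration under stated assumptions, not our actual service code: the function names, the REDIS_PING_CMD override and the allowed_users set are all hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical health probe: succeeds only if the DB answers a ping.
# REDIS_PING_CMD is an assumed override that lets the probe be replaced.
db_is_up() {
  ${REDIS_PING_CMD:-redis-cli ping} >/dev/null 2>&1
}

# Hypothetical DB-backed check: succeeds if the user is in the
# (assumed) allowed_users set.
check_in_db() {
  redis-cli sismember allowed_users "$1" | grep -q '^1$'
}

# Decide whether to allow a request for user $1.
# DB up:   the decision comes from the DB.
# DB down: pass-through, the request is allowed without accessing the DB.
allow_request() {
  if db_is_up; then
    check_in_db "$1"
  else
    echo "WARN: DB is down, pass-through mode for user $1" >&2
    return 0
  fi
}
```

The warning is written to stderr, so the time spent in pass-through mode remains visible in the logs.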

Final Note

These are only the first steps toward a more secure and stable product. I will keep posting updates as I progress through the book and apply more changes.

Wednesday, June 16, 2021
