In this post we discuss fine-tuning a SentenceTransformer model. We've already presented a method for fine-tuning a torchvision-based model; in this post we show how to fine-tune a text embedding model.
We start by presenting the relevant code.
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers import (
    SentenceTransformerTrainer,
    losses
)
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.similarity_functions import SimilarityFunction
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
def finetune():
    # Sentence pairs used as fine-tuning examples. The second list intentionally
    # contains typos and paraphrases of the first, to illustrate varying similarity.
    sentence1 = [
        "Here is my horse",
        "You are my light",
        "I will be there in Monday",
        "No way I kan do this",
        "What is going on, who is it?",
    ]
    sentence2 = [
        "Here si my horse",
        "You are my lihgt",
        "I will be there on Monday",
        "No way I can do that",
        "What is went there, what was that?",
    ]
    # Similarity score per pair: 1 means (nearly) identical, 0 means unrelated.
    scores = [
        1,
        1,
        1,
        0.9,
        0.5,
    ]
    finetune_examples = Dataset.from_dict({
        'sentence1': sentence1,
        'sentence2': sentence2,
        'score': scores,
    })
    model = SentenceTransformer(
        "estrogen/ModernBERT-base-sbert-initialized",
        trust_remote_code=True,
        config_kwargs={"reference_compile": False}
    )
    model.gradient_checkpointing_enable()
    model.max_seq_length = 4096
    print('running finetune')
    # Split the examples into train/eval/test subsets.
    first_split = finetune_examples.train_test_split(test_size=0.333, shuffle=False)
    train_dataset = first_split["train"]
    first_split_test = first_split["test"]
    second_split = first_split_test.train_test_split(test_size=0.5, shuffle=False)
    eval_dataset = second_split["train"]
    test_dataset = second_split["test"]
    print(f'train size {train_dataset.shape[0]}')
    print(f'eval size {eval_dataset.shape[0]}')
    print(f'test size {test_dataset.shape[0]}')
    train_loss = losses.CoSENTLoss(model=model)
    dev_evaluator = EmbeddingSimilarityEvaluator(
        sentences1=eval_dataset["sentence1"],
        sentences2=eval_dataset["sentence2"],
        scores=eval_dataset["score"],
        main_similarity=SimilarityFunction.COSINE,
        name="bla_dev_eval",
    )
    test_evaluator = EmbeddingSimilarityEvaluator(
        sentences1=test_dataset["sentence1"],
        sentences2=test_dataset["sentence2"],
        scores=test_dataset["score"],
        main_similarity=SimilarityFunction.COSINE,
        name="bla_test_eval",
    )
    train_batch_size = 16
    num_epochs = 4
    args = SentenceTransformerTrainingArguments(
        output_dir="output/training",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        warmup_ratio=0.1,
        fp16=True,
        bf16=False,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=2,
        logging_steps=100,
        run_name="API BLA AI Sequences COSINE loss",
    )
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=train_loss,
        evaluator=dev_evaluator,
    )
    print('training')
    trainer.train()
    # Evaluate on the dev and test splits, then persist the tuned model.
    dev_results = dev_evaluator(model)
    print('dev evaluation')
    print(dev_results)
    test_results = test_evaluator(model)
    print('test evaluation')
    print(test_results)
    output_folder_path = "output/tuned_model"
    model.save(output_folder_path)
finetune()
The code is pretty straightforward: we create examples for embedding scoring, where each example includes two sentences and a score indicating how related these sentences are. A score of 1 means the sentences are identical, and a score of 0 means they are not related.
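As a quick sanity check, here is a minimal sketch of using the tuned model, assuming sentence-transformers 3.x (where the model exposes encode() and similarity()) and the output path used above. A score close to 1 indicates the model treats the pair as nearly identical.

from sentence_transformers import SentenceTransformer

# Load the model saved by finetune(); the path matches output_folder_path above.
tuned_model = SentenceTransformer("output/tuned_model")

emb1 = tuned_model.encode(["Here is my horse"])
emb2 = tuned_model.encode(["Here si my horse"])

# similarity() returns a 1x1 matrix here; a value near 1 means near-identical sentences.
print(tuned_model.similarity(emb1, emb2))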
The real art is not in the code, but in generating the fine-tuning examples. Here are some rules I've found useful:
- Do not use the same data that the model is expected to embed in production. We want to avoid overfitting the model to the actual data; instead we want it to learn the general notion of the expected similarity.
- The fine-tuning examples should cover the full spectrum of the expected behavior. We need to supply examples where the score is one, examples where the score is zero, and examples across the entire range in between.
- If you generate the text for embedding, keep it succinct. Long text confuses the model and tends to require more fine-tuning.
- Create metrics to visualize what the model has learned. If there are several kinds of ideas we want to teach the model, run the fine-tuned model on multiple examples of each idea, and compare the expected score against the actual score per idea (a sketch of such a check appears after this list). The scores do not have to match exactly; instead we expect the relative ordering between ideas to be preserved. For example, suppose:
The expected score for idea-1 is 0.9.
The expected score for idea-2 is 0.8.
The expected score for idea-3 is 0.5.
Then we might find the following good enough:
The actual score for idea-1 is 0.8.
The actual score for idea-2 is 0.6.
The actual score for idea-3 is 0.3.
- Supply enough examples for the fine-tuning. From my experience, that is somewhere between 10K and 30K examples.
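For the metrics rule above, a minimal sketch could look like the following. The idea names, sentence pairs, and expected scores are hypothetical placeholders, and it assumes sentence-transformers 3.x, which provides similarity_pairwise() for per-pair cosine scores.

from sentence_transformers import SentenceTransformer

tuned_model = SentenceTransformer("output/tuned_model")

# Hypothetical evaluation sets: a few (sentence1, sentence2) pairs per idea,
# together with the score we roughly expect the model to assign.
ideas = {
    "idea-1": {"expected": 0.9, "pairs": [("text a", "text a variant"), ("text b", "text b variant")]},
    "idea-2": {"expected": 0.8, "pairs": [("text c", "text c variant"), ("text d", "text d variant")]},
    "idea-3": {"expected": 0.5, "pairs": [("text e", "text e variant"), ("text f", "text f variant")]},
}

for name, data in ideas.items():
    first = tuned_model.encode([pair[0] for pair in data["pairs"]])
    second = tuned_model.encode([pair[1] for pair in data["pairs"]])
    # similarity_pairwise() returns one cosine score per pair; average them per idea.
    actual = tuned_model.similarity_pairwise(first, second).mean().item()
    print(f'{name}: expected {data["expected"]}, actual {actual:.2f}')

The actual values will not match the expected ones exactly; what matters is that the per-idea averages keep the same relative order as the expected scores.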