In this post we discuss fine-tuning a SentenceTransformer model. We've already presented a method for fine-tuning a torchvision-based model; in this post we show how to fine-tune a text embedding model.
We start by presenting the relevant code.
from datasets import Dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers import (
    SentenceTransformerTrainer,
    losses
)
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator
from sentence_transformers.similarity_functions import SimilarityFunction
from sentence_transformers.training_args import SentenceTransformerTrainingArguments
def finetune():
    # Sentence pairs used as fine-tuning examples. The second list intentionally
    # contains typos and paraphrases of the first, to illustrate varying similarity.
    sentence1 = [
        "Here is my horse",
        "You are my light",
        "I will be there in Monday",
        "No way I kan do this",
        "What is going on, who is it?",
    ]
    sentence2 = [
        "Here si my horse",
        "You are my lihgt",
        "I will be there on Monday",
        "No way I can do that",
        "What is went there, what was that?",
    ]
    # Similarity score per pair: 1 means (nearly) identical, 0 means unrelated.
    scores = [
        1,
        1,
        1,
        0.9,
        0.5,
    ]
    finetune_examples = Dataset.from_dict({
        'sentence1': sentence1,
        'sentence2': sentence2,
        'score': scores,
    })
    model = SentenceTransformer(
        "estrogen/ModernBERT-base-sbert-initialized",
        trust_remote_code=True,
        config_kwargs={"reference_compile": False}
    )
    model.gradient_checkpointing_enable()
    model.max_seq_length = 4096
    print('running finetune')
    # Split the examples into train/eval/test subsets.
    first_split = finetune_examples.train_test_split(test_size=0.333, shuffle=False)
    train_dataset = first_split["train"]
    first_split_test = first_split["test"]
    second_split = first_split_test.train_test_split(test_size=0.5, shuffle=False)
    eval_dataset = second_split["train"]
    test_dataset = second_split["test"]
    print(f'train size {train_dataset.shape[0]}')
    print(f'eval size {eval_dataset.shape[0]}')
    print(f'test size {test_dataset.shape[0]}')
    train_loss = losses.CoSENTLoss(model=model)
    dev_evaluator = EmbeddingSimilarityEvaluator(
        sentences1=eval_dataset["sentence1"],
        sentences2=eval_dataset["sentence2"],
        scores=eval_dataset["score"],
        main_similarity=SimilarityFunction.COSINE,
        name="bla_dev_eval",
    )
    test_evaluator = EmbeddingSimilarityEvaluator(
        sentences1=test_dataset["sentence1"],
        sentences2=test_dataset["sentence2"],
        scores=test_dataset["score"],
        main_similarity=SimilarityFunction.COSINE,
        name="bla_test_eval",
    )
    train_batch_size = 16
    num_epochs = 4
    args = SentenceTransformerTrainingArguments(
        output_dir="output/training",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=train_batch_size,
        per_device_eval_batch_size=train_batch_size,
        warmup_ratio=0.1,
        fp16=True,
        bf16=False,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=2,
        logging_steps=100,
        run_name="API BLA AI Sequences COSINE loss",
    )
    trainer = SentenceTransformerTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        loss=train_loss,
        evaluator=dev_evaluator,
    )
    print('training')
    trainer.train()
    # Evaluate on the dev and test splits, then persist the tuned model.
    dev_results = dev_evaluator(model)
    print('dev evaluation')
    print(dev_results)
    test_results = test_evaluator(model)
    print('test evaluation')
    print(test_results)
    output_folder_path = "output/tuned_model"
    model.save(output_folder_path)
finetune()
The code is pretty straightforward: we create examples for embedding scoring, where each example includes two sentences and a score indicating how related these sentences are. A score of 1 means the sentences are identical, and a score of 0 means they are not related.
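As a quick sanity check, here is a minimal sketch of using the tuned model, assuming sentence-transformers 3.x (where the model exposes encode() and similarity()) and the output path used above. A score close to 1 indicates the model treats the pair as nearly identical.

from sentence_transformers import SentenceTransformer

# Load the model saved by finetune(); the path matches output_folder_path above.
tuned_model = SentenceTransformer("output/tuned_model")

emb1 = tuned_model.encode(["Here is my horse"])
emb2 = tuned_model.encode(["Here si my horse"])

# similarity() returns a 1x1 matrix here; a value near 1 means near-identical sentences.
print(tuned_model.similarity(emb1, emb2))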
The real art is not in the code, but in generating the fine-tuning examples. Here are some rules I've found useful:
- Do not use the same data that the model is expected to embed in production. We want to avoid overfitting the model to the actual data; instead we want it to learn the general notion of the expected similarity.
- The fine-tuning examples should cover the full spectrum of the expected behavior. We need to supply examples where the score is one, examples where the score is zero, and examples across the entire range in between.
- If you generate the text for embedding, keep it succinct. Long text confuses the model and tends to require more fine-tuning.
- Create metrics to visualize what the model has learned. If there are several kinds of ideas we want to teach the model, run the fine-tuned model on multiple examples of each idea, and compare the expected score against the actual score per idea (a sketch of such a check appears after this list). The scores do not have to match exactly; instead we expect the relative ordering between ideas to be preserved. For example, suppose:
The expected score for idea-1 is 0.9.
The expected score for idea-2 is 0.8.
The expected score for idea-3 is 0.5.
Then we might find the following good enough:
The actual score for idea-1 is 0.8.
The actual score for idea-2 is 0.6.
The actual score for idea-3 is 0.3.
- Supply enough examples for the fine-tuning. From my experience, that is somewhere between 10K and 30K examples.
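For the metrics rule above, a minimal sketch could look like the following. The idea names, sentence pairs, and expected scores are hypothetical placeholders, and it assumes sentence-transformers 3.x, which provides similarity_pairwise() for per-pair cosine scores.

from sentence_transformers import SentenceTransformer

tuned_model = SentenceTransformer("output/tuned_model")

# Hypothetical evaluation sets: a few (sentence1, sentence2) pairs per idea,
# together with the score we roughly expect the model to assign.
ideas = {
    "idea-1": {"expected": 0.9, "pairs": [("text a", "text a variant"), ("text b", "text b variant")]},
    "idea-2": {"expected": 0.8, "pairs": [("text c", "text c variant"), ("text d", "text d variant")]},
    "idea-3": {"expected": 0.5, "pairs": [("text e", "text e variant"), ("text f", "text f variant")]},
}

for name, data in ideas.items():
    first = tuned_model.encode([pair[0] for pair in data["pairs"]])
    second = tuned_model.encode([pair[1] for pair in data["pairs"]])
    # similarity_pairwise() returns one cosine score per pair; average them per idea.
    actual = tuned_model.similarity_pairwise(first, second).mean().item()
    print(f'{name}: expected {data["expected"]}, actual {actual:.2f}')

The actual values will not match the expected ones exactly; what matters is that the per-idea averages keep the same relative order as the expected scores.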