Fine-Tune Embeddings for Your Domain

Boost RAG retrieval accuracy by 30-50% with domain-specific fine-tuning. Learn to create custom embeddings for your documents and queries.

Author
Ailog Research Team
Published
November 17, 2025
Reading time
14 min read
Level
advanced
RAG Pipeline Step
Embedding

Why Fine-Tune?

Generic embeddings work well for broad tasks, but domain-specific fine-tuning can yield a 30-50% retrieval accuracy boost:

Before (generic):

  • Medical query: "MI treatment" → ❌ matches "Michigan"

After (fine-tuned):

  • Medical query: "MI treatment" → ✅ matches "Myocardial Infarction protocols"
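To see the failure mode concretely, here is a minimal sketch that checks which document a query embedding lands closest to. The two-document corpus `docs` is a hypothetical toy example, not part of any real dataset:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy corpus illustrating the acronym collision
docs = [
    "Michigan (MI) state health department directory",
    "Myocardial Infarction protocols: initial treatment and triage",
]

model = SentenceTransformer('all-MiniLM-L6-v2')
query_emb = model.encode(["MI treatment"])
doc_embs = model.encode(docs)

# Rank documents by cosine similarity to the query
scores = cosine_similarity(query_emb, doc_embs)[0]
for doc, score in sorted(zip(docs, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {doc}")
```

Whichever document ranks first shows how the base model resolves the ambiguous acronym; fine-tuning on domain pairs is what pushes the medical reading to the top.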

When to Fine-Tune

✅ Fine-tune when:

  • Domain-specific jargon (legal, medical, technical)
  • 1000+ labeled query-document pairs
  • Base model underperforms (< 70% recall — see the measurement sketch below)

❌ Skip fine-tuning when:

  • General domain
  • < 500 training examples
  • Base model already works well
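A quick way to check the recall criterion, as a minimal sketch. `eval_pairs` (labeled query/relevant-document tuples) and `corpus` are hypothetical names you would supply from your own evaluation set:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

def recall_at_k(model, eval_pairs, corpus, k=10):
    """Fraction of queries whose labeled document appears in the top-k."""
    corpus_embs = model.encode(corpus)
    hits = 0
    for query, relevant_doc in eval_pairs:
        query_emb = model.encode([query])
        scores = cosine_similarity(query_emb, corpus_embs)[0]
        top_k = np.argsort(scores)[-k:]
        # Assumes each labeled document appears verbatim in the corpus
        if corpus.index(relevant_doc) in top_k:
            hits += 1
    return hits / len(eval_pairs)

# If this comes back below ~0.7, fine-tuning is worth considering:
# print(recall_at_k(SentenceTransformer('all-MiniLM-L6-v2'), eval_pairs, corpus))
```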

Training Data Format

```python
# Positive pairs (query → relevant document)
train_data = [
    {
        "query": "What causes diabetes?",
        "positive": "Type 2 diabetes is caused by insulin resistance...",
        "negative": "Diabetic retinopathy affects the eyes..."  # Optional
    },
    {
        "query": "How to lower blood pressure?",
        "positive": "Lifestyle changes like diet and exercise reduce BP...",
        "negative": "High blood pressure symptoms include headaches..."
    }
]
```

Method 1: Sentence Transformers

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Prepare training data
train_examples = [
    InputExample(texts=[item['query'], item['positive']])
    for item in train_data
]

# Create dataloader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Define loss (contrastive learning)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=3,
    warmup_steps=100,
    output_path='./fine-tuned-model'
)
```
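A note on the loss choice: MultipleNegativesRankingLoss treats every other positive in the batch as an implicit negative for each query, so larger batches give more negatives and a stronger training signal. If GPU memory allows, raising `batch_size` above 16 is often worth trying.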

Method 2: OpenAI Fine-Tuning

```python
import json
import openai

# Prepare data in JSONL format
with open('training_data.jsonl', 'w') as f:
    for item in train_data:
        f.write(json.dumps({
            "input": item['query'],
            "output": item['positive']
        }) + '\n')

# Upload training file
file = openai.File.create(
    file=open("training_data.jsonl", "rb"),
    purpose='fine-tune'
)

# Create fine-tuning job
# (shown with the legacy pre-1.0 openai-python interface; check OpenAI's
# current docs for which models support fine-tuning)
job = openai.FineTuningJob.create(
    training_file=file.id,
    model="text-embedding-3-small"
)

# Poll until the job reports completion
status = openai.FineTuningJob.retrieve(job.id)
print(status.status)  # 'succeeded'

# Use fine-tuned model (fine_tuned_model already carries the 'ft:' prefix)
embeddings = openai.Embedding.create(
    input="your query",
    model=status.fine_tuned_model
)
```

Method 3: Hard Negative Mining

Improve contrastive learning with hard negatives:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sentence_transformers import InputExample, losses

# Generate hard negatives (similar but irrelevant documents)
def mine_hard_negatives(query, candidates, model, k=5):
    query_emb = model.encode(query)
    cand_embs = model.encode(candidates)

    # The most similar candidates make the hardest negatives
    scores = cosine_similarity([query_emb], cand_embs)[0]
    hard_neg_indices = np.argsort(scores)[-k:]

    return [candidates[i] for i in hard_neg_indices]

# Training with hard negatives
train_examples = []
for item in train_data:
    # Exclude the known positive so it can't be mined as a "negative"
    candidates = [doc for doc in all_documents if doc != item['positive']]
    hard_negs = mine_hard_negatives(item['query'], candidates, base_model)

    for neg in hard_negs:
        train_examples.append(
            InputExample(texts=[
                item['query'],
                item['positive'],
                neg  # Hard negative
            ])
        )

# Use TripletLoss (anchor, positive, negative)
train_loss = losses.TripletLoss(model)
```

Evaluation

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import ndcg_score

def evaluate_model(model, test_queries, test_docs, relevance_labels):
    predictions = []

    # Encode documents once; only the query varies per iteration
    doc_embs = model.encode(test_docs)

    for query in test_queries:
        query_emb = model.encode(query)
        scores = cosine_similarity([query_emb], doc_embs)[0]
        predictions.append(scores)

    # nDCG@10: relevance_labels has shape (n_queries, n_docs)
    ndcg = ndcg_score(relevance_labels, predictions, k=10)

    return ndcg

# Compare base vs fine-tuned
base_model = SentenceTransformer('all-MiniLM-L6-v2')
fine_tuned_model = SentenceTransformer('./fine-tuned-model')

print(f"Base model nDCG@10: {evaluate_model(base_model, ...)}")
print(f"Fine-tuned nDCG@10: {evaluate_model(fine_tuned_model, ...)}")
```

Incremental Fine-Tuning

Update model as new data arrives:

```python
from sentence_transformers import SentenceTransformer, losses
from torch.utils.data import DataLoader

# Load previously fine-tuned model
model = SentenceTransformer('./fine-tuned-model')

# Add new training data
new_train_examples = [...]
new_dataloader = DataLoader(new_train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Continue training (warm start)
model.fit(
    train_objectives=[(new_dataloader, train_loss)],
    epochs=1,
    warmup_steps=50,
    output_path='./fine-tuned-model-v2'
)
```
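One caveat: training on new pairs alone can degrade performance on the original data (catastrophic forgetting). A common mitigation, sketched here with a hypothetical `old_train_examples` list of your earlier pairs, is to replay a sample of old data alongside the new:

```python
import random

# Mix a replay sample of earlier pairs in with the new data
replay = random.sample(old_train_examples, k=min(500, len(old_train_examples)))
combined_examples = new_train_examples + replay
```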

Distillation (Fast Inference)

Fine-tune large model, then distill to small one:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Teacher: large fine-tuned model
teacher = SentenceTransformer('fine-tuned-large-model')

# Student: small base model (its embedding dimension must match the
# teacher's, or a projection layer is needed)
student = SentenceTransformer('all-MiniLM-L6-v2')

# Distillation data: the student learns to reproduce the teacher's
# embedding of each sentence
train_sentences = [item['query'] for item in train_data] + \
                  [item['positive'] for item in train_data]
distill_examples = [
    InputExample(texts=[sent], label=teacher.encode(sent))
    for sent in train_sentences
]
distill_dataloader = DataLoader(distill_examples, shuffle=True, batch_size=32)

# Distillation loss: MSE between student and teacher embeddings
train_loss = losses.MSELoss(model=student)

# Train student to mimic teacher
student.fit(
    train_objectives=[(distill_dataloader, train_loss)],
    epochs=3
)

# Now the student is fast but performs close to the teacher
```

Fine-tuning embeddings is the secret weapon for domain-specific RAG. Invest in it early.

Tags

  • embedding
  • fine-tuning
  • custom
  • domain-specific