NoDuplicatesDataLoader Compatibility with Asymmetric models #3220

Merged: 3 commits merged into UKPLab:master on Feb 14, 2025

Conversation

@OsamaS99 (Contributor) commented Feb 7, 2025

I'm currently working with an asymmetric model and MNRL, providing the InputExamples as follows, as in the documentation:
texts = [{"QRY": item["query"]}, {"PROD": item["titles"][0]}] + [{"PROD": t} for t in item["titles"][1:]]
Using NoDuplicatesDataLoader results in the following error: AttributeError: 'dict' object has no attribute 'strip'
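
For context, a minimal reproduction sketch of this setup (the "QRY"/"PROD" keys follow the snippet above; the item contents are placeholder data):

# Minimal reproduction sketch; the concrete strings are illustrative.
from sentence_transformers import InputExample
from sentence_transformers.datasets import NoDuplicatesDataLoader

item = {"query": "red running shoes", "titles": ["Nike Air Zoom Pegasus", "Adidas Ultraboost 22"]}
texts = [{"QRY": item["query"]}, {"PROD": item["titles"][0]}] + [{"PROD": t} for t in item["titles"][1:]]
train_examples = [InputExample(texts=texts)]

# The loader deduplicates texts within each batch; before this PR, iterating it
# raised AttributeError: 'dict' object has no attribute 'strip' on dict texts.
loader = NoDuplicatesDataLoader(train_examples, batch_size=1)
batch = next(iter(loader))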

@tomaarsen

@tomaarsen (Collaborator) commented:

Hello!

The Asym module is not very commonly used at all, but it should indeed still work. I can reproduce your issue, and your fix seems to work correctly.
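
For anyone else hitting the same traceback: the loader deduplicates texts within a batch by calling .strip() on each one, which breaks on the dict inputs that Asym expects. A minimal sketch of the kind of guard involved (hypothetical helper name, not the exact change merged in this PR):

def _dedup_key(text):
    # Build the key used to detect duplicate texts within a batch.
    if isinstance(text, dict):
        # Asym inputs are {module_key: sentence} dicts; key on the sentence values.
        return " ".join(str(value).strip().lower() for value in text.values())
    return text.strip().lower()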

I also want to share that with the v3 update, we switched to using a new approach for training using a SentenceTransformerTrainer. The old approach (with model.fit and InputExample) still works, but I'd recommend switching to the newer one. It simply gives you a bit more control (docs).

I also think that this is a good moment to update the documentation example for Asym. I'll push that into this PR, and then I think it's ready to be merged. Feel free to let me know what you think!

  • Tom Aarsen

@tomaarsen (Collaborator) commented:

Oh, also: I trained two models (one with Asym and one without), and the Asym model performed much worse, I'm afraid.

Here is the script that I used:

import random
import logging
from datasets import load_dataset, Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import models

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
)

# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
    "nreimers/MiniLM-L6-H384-uncased",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="MPNet base trained on AllNLI triplets",
    ),
)
'''
asym_model = models.Asym(
    {
        "query": [models.Dense(model.get_sentence_embedding_dimension(), model.get_sentence_embedding_dimension())],
        "doc": [models.Dense(model.get_sentence_embedding_dimension(), model.get_sentence_embedding_dimension())],
    }
)
model.add_module("asym", asym_model)
# '''

# 3. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/gooaq", split="train")
dataset = dataset.add_column("id", range(len(dataset)))
dataset_dict = dataset.train_test_split(test_size=10_000, seed=12)
train_dataset: Dataset = dataset_dict["train"].select(range(500_000))
eval_dataset: Dataset = dataset_dict["test"]

'''
def mapper(sample):
    return {
        "question": {"query": sample["question"]},
        "answer": {"doc": sample["answer"]},
    }

train_dataset = train_dataset.map(mapper)
eval_dataset = eval_dataset.map(mapper)
# '''

# 4. Define a loss function
loss = MultipleNegativesRankingLoss(model)

# 5. (Optional) Specify training arguments
# run_name = "MiniLM-L6-H384-uncased-gooaq-asym"
run_name = "MiniLM-L6-H384-uncased-gooaq-no-asym"
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/mpnet-base-gooaq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    logging_steps=50,
    logging_first_step=True,
    run_name=run_name,  # Will be used in W&B if `wandb` is installed
    seed=24,
)

# 6. (Optional) Create an evaluator & evaluate the base model
# The full corpus, but only the evaluation queries
random.seed(12)
queries = dict(zip(eval_dataset["id"], eval_dataset["question"]))
# queries = {
#     qid: {"query": question}
#     for qid, question in zip(eval_dataset["id"], eval_dataset["question"])
# }
# corpus = (
#     {qid: {"doc": dataset[qid]["answer"]} for qid in queries} |
#     {qid: {"doc": dataset[qid]["answer"]} for qid in random.sample(range(len(dataset)), 20_000)}
# )
corpus = (
    {qid: dataset[qid]["answer"] for qid in queries} |
    {qid: dataset[qid]["answer"] for qid in random.sample(range(len(dataset)), 20_000)}
)
relevant_docs = {qid: {qid} for qid in eval_dataset["id"]}
dev_evaluator = InformationRetrievalEvaluator(
    corpus=corpus,
    queries=queries,
    relevant_docs=relevant_docs,
    show_progress_bar=True,
    name="gooaq-dev",
)
dev_evaluator(model)

# 7. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.remove_columns("id"),
    eval_dataset=eval_dataset.remove_columns("id"),
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()

# (Optional) Evaluate the trained model on the evaluator after training
dev_evaluator(model)

# 8. Save the trained model
model.save_pretrained(f"models/{run_name}/final")

# 9. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(run_name)
  • Tom Aarsen

@OsamaS99 (Contributor, Author) commented:

Great, sounds good to me.
I just preferred the older training approach in that case, since I couldn't find an example of how to create a custom MNRL dataset (with multiple negatives) for an asymmetric model. I also think there are a fair number of open issues asking how to train "Two Tower" embedding models, and rarely does anyone point to the asymmetric models, which fit that use case well.
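
For reference, a sketch of what such a dataset could look like with the new trainer (column names and example strings are illustrative; the "query"/"doc" keys follow the Asym setup in the script above):

from datasets import Dataset

# Illustrative sketch: with MultipleNegativesRankingLoss, the first column is
# treated as the anchor, the second as the positive, and any remaining columns
# as additional negatives.
train_dataset = Dataset.from_dict({
    "question":   [{"query": "what is the boiling point of water?"}],
    "answer":     [{"doc": "Water boils at 100 degrees Celsius at sea level."}],
    "negative_1": [{"doc": "The freezing point of water is 0 degrees Celsius."}],
    "negative_2": [{"doc": "Ethanol boils at roughly 78 degrees Celsius."}],
})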

@tomaarsen merged commit d4d198d into UKPLab:master on Feb 14, 2025
9 checks passed
@tomaarsen (Collaborator) commented:

Thanks a bunch!

  • Tom Aarsen

@tomaarsen mentioned this pull request on Feb 21, 2025