NoDuplicatesDataLoader Compatibility with Asymmetric models #3220

Merged: 3 commits merged into UKPLab:master on Feb 14, 2025

Conversation

@OsamaS99 (Contributor) commented Feb 7, 2025

I'm currently working with an asymmetric model and MNRL, providing the InputExamples as follows, as in the documentation:
texts = [{"QRY": item["query"]}, {"PROD": item["titles"][0]}] + [{"PROD": t} for t in item["titles"][1:]]
Using NoDuplicatesDataLoader results in the following error: AttributeError: 'dict' object has no attribute 'strip'
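
For context, a minimal reproduction sketch of this setup (the "QRY"/"PROD" keys follow the snippet above; the item contents are placeholder data):

# Minimal reproduction sketch; the concrete strings are illustrative.
from sentence_transformers import InputExample
from sentence_transformers.datasets import NoDuplicatesDataLoader

item = {"query": "red running shoes", "titles": ["Nike Air Zoom Pegasus", "Adidas Ultraboost 22"]}
texts = [{"QRY": item["query"]}, {"PROD": item["titles"][0]}] + [{"PROD": t} for t in item["titles"][1:]]
train_examples = [InputExample(texts=texts)]

# The loader deduplicates texts within each batch; before this PR, iterating it
# raised AttributeError: 'dict' object has no attribute 'strip' on dict texts.
loader = NoDuplicatesDataLoader(train_examples, batch_size=1)
batch = next(iter(loader))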

@tomaarsen

@tomaarsen (Collaborator) commented:

Hello!

The Asym module is not very commonly used at all, but it should indeed still work. I can reproduce your issue, and your fix seems to work correctly.
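
For anyone else hitting the same traceback: the loader deduplicates texts within a batch by calling .strip() on each one, which breaks on the dict inputs that Asym expects. A minimal sketch of the kind of guard involved (hypothetical helper name, not the exact change merged in this PR):

def _dedup_key(text):
    # Build the key used to detect duplicate texts within a batch.
    if isinstance(text, dict):
        # Asym inputs are {module_key: sentence} dicts; key on the sentence values.
        return " ".join(str(value).strip().lower() for value in text.values())
    return text.strip().lower()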

I also want to share that with the v3 update, we switched to using a new approach for training using a SentenceTransformerTrainer. The old approach (with model.fit and InputExample) still works, but I'd recommend switching to the newer one. It simply gives you a bit more control (docs).

I also think that this is a good moment to update the documentation example for Asym. I'll push that into this PR, and then I think it's ready to be merged. Feel free to let me know what you think!

  • Tom Aarsen

@tomaarsen (Collaborator) commented:

Oh, also: I trained two models (one with Asym and one without), and the Asym model performed much worse, I'm afraid.

Here is the script that I used:

import random
import logging
from datasets import load_dataset, Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import models

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
)

# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
    "nreimers/MiniLM-L6-H384-uncased",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="MPNet base trained on AllNLI triplets",
    ),
)
'''
asym_model = models.Asym(
    {
        "query": [models.Dense(model.get_sentence_embedding_dimension(), model.get_sentence_embedding_dimension())],
        "doc": [models.Dense(model.get_sentence_embedding_dimension(), model.get_sentence_embedding_dimension())],
    }
)
model.add_module("asym", asym_model)
# '''

# 3. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/gooaq", split="train")
dataset = dataset.add_column("id", range(len(dataset)))
dataset_dict = dataset.train_test_split(test_size=10_000, seed=12)
train_dataset: Dataset = dataset_dict["train"].select(range(500_000))
eval_dataset: Dataset = dataset_dict["test"]

'''
def mapper(sample):
    return {
        "question": {"query": sample["question"]},
        "answer": {"doc": sample["answer"]},
    }

train_dataset = train_dataset.map(mapper)
eval_dataset = eval_dataset.map(mapper)
# '''

# 4. Define a loss function
loss = MultipleNegativesRankingLoss(model)

# 5. (Optional) Specify training arguments
# run_name = "MiniLM-L6-H384-uncased-gooaq-asym"
run_name = "MiniLM-L6-H384-uncased-gooaq-no-asym"
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/mpnet-base-gooaq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    logging_steps=50,
    logging_first_step=True,
    run_name=run_name,  # Will be used in W&B if `wandb` is installed
    seed=24,
)

# 6. (Optional) Create an evaluator & evaluate the base model
# The full corpus, but only the evaluation queries
random.seed(12)
queries = dict(zip(eval_dataset["id"], eval_dataset["question"]))
# queries = {
#     qid: {"query": question}
#     for qid, question in zip(eval_dataset["id"], eval_dataset["question"])
# }
# corpus = (
#     {qid: {"doc": dataset[qid]["answer"]} for qid in queries} |
#     {qid: {"doc": dataset[qid]["answer"]} for qid in random.sample(range(len(dataset)), 20_000)}
# )
corpus = (
    {qid: dataset[qid]["answer"] for qid in queries} |
    {qid: dataset[qid]["answer"] for qid in random.sample(range(len(dataset)), 20_000)}
)
relevant_docs = {qid: {qid} for qid in eval_dataset["id"]}
dev_evaluator = InformationRetrievalEvaluator(
    corpus=corpus,
    queries=queries,
    relevant_docs=relevant_docs,
    show_progress_bar=True,
    name="gooaq-dev",
)
dev_evaluator(model)

# 7. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.remove_columns("id"),
    eval_dataset=eval_dataset.remove_columns("id"),
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()

# (Optional) Evaluate the trained model on the evaluator after training
dev_evaluator(model)

# 8. Save the trained model
model.save_pretrained(f"models/{run_name}/final")

# 9. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(run_name)
  • Tom Aarsen

@OsamaS99 (Contributor, Author) commented:

Great, sounds good to me.
I just preferred the older training approach in that case, since I couldn't find an example of how to create a custom MNRL dataset (with multiple negatives) for an asymmetric model. I also think there are a fair number of open issues asking how to train "Two Tower" embedding models, and rarely does anyone point to the asymmetric models, which fit that use case well.
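
For reference, a sketch of what such a dataset could look like with the new trainer (column names and example strings are illustrative; the "query"/"doc" keys follow the Asym setup in the script above):

from datasets import Dataset

# Illustrative sketch: with MultipleNegativesRankingLoss, the first column is
# treated as the anchor, the second as the positive, and any remaining columns
# as additional negatives.
train_dataset = Dataset.from_dict({
    "question":   [{"query": "what is the boiling point of water?"}],
    "answer":     [{"doc": "Water boils at 100 degrees Celsius at sea level."}],
    "negative_1": [{"doc": "The freezing point of water is 0 degrees Celsius."}],
    "negative_2": [{"doc": "Ethanol boils at roughly 78 degrees Celsius."}],
})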

@tomaarsen merged commit d4d198d into UKPLab:master on Feb 14, 2025
9 checks passed
@tomaarsen (Collaborator) commented:

Thanks a bunch!

  • Tom Aarsen

@tomaarsen mentioned this pull request on Feb 21, 2025