NoDuplicatesDataLoader Compatibility with Asymmetric models #3220
Conversation
Hello! The Asym module is not very commonly used at all, but it should indeed still work. I can reproduce your issue, and your fix seems to work correctly. I also want to share that with the v3 update, we switched to a new approach for training using a SentenceTransformerTrainer. I also think that this is a good moment to update the documentation example for Asym.
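For readers landing here: below is a minimal sketch of the kind of change that resolves the error, assuming (based on the AttributeError reported further down) that NoDuplicatesDataLoader deduplicates batch texts by calling .strip() on each one. The helper name text_fingerprint is hypothetical, not the actual library code.

# Hypothetical helper: unwrap Asym-style single-key dict inputs
# (e.g. {"QRY": "some text"}) before computing the deduplication
# fingerprint, so plain strings and Asym dicts are both handled.
def text_fingerprint(text) -> str:
    if isinstance(text, dict):
        text = next(iter(text.values()))  # take the wrapped string
    return text.strip().lower()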
Oh, also, I trained 2 models (with Asym and without Asym), and the Asym model was much worse, I'm afraid.

Here is the script that I used:

import random
import logging
from datasets import load_dataset, Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    SentenceTransformerModelCardData,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import models
logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO
)
# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer(
    "nreimers/MiniLM-L6-H384-uncased",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="MPNet base trained on AllNLI triplets",
    ),
)
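# Note: the ''' below opens a string literal that swallows the Asym block;
# the closing "# '''" ends that string (the # is inside the string). To
# enable the block, change the opening ''' to "# '''" so both markers
# become plain comments and the code in between runs.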
'''
asym_model = models.Asym(
    {
        "query": [models.Dense(model.get_sentence_embedding_dimension(), model.get_sentence_embedding_dimension())],
        "doc": [models.Dense(model.get_sentence_embedding_dimension(), model.get_sentence_embedding_dimension())],
    }
)
model.add_module("asym", asym_model)
# '''
# 3. Load a dataset to finetune on
dataset = load_dataset("sentence-transformers/gooaq", split="train")
dataset = dataset.add_column("id", range(len(dataset)))
dataset_dict = dataset.train_test_split(test_size=10_000, seed=12)
train_dataset: Dataset = dataset_dict["train"].select(range(500_000))
eval_dataset: Dataset = dataset_dict["test"]
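# Same string-literal toggle as above: change the ''' below to "# '''" to
# enable the dict-style inputs that the Asym module expects.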
'''
def mapper(sample):
    return {
        "question": {"query": sample["question"]},
        "answer": {"doc": sample["answer"]},
    }
train_dataset = train_dataset.map(mapper)
eval_dataset = eval_dataset.map(mapper)
# '''
# 4. Define a loss function
loss = MultipleNegativesRankingLoss(model)
# 5. (Optional) Specify training arguments
# run_name = "MiniLM-L6-H384-uncased-gooaq-asym"
run_name = "MiniLM-L6-H384-uncased-gooaq-no-asym"
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/mpnet-base-gooaq",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=128,
    per_device_eval_batch_size=128,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=False,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=True,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    # Optional tracking/debugging parameters:
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    logging_steps=50,
    logging_first_step=True,
    run_name=run_name,  # Will be used in W&B if `wandb` is installed
    seed=24,
)
# 6. (Optional) Create an evaluator & evaluate the base model
# The full corpus, but only the evaluation queries
random.seed(12)
queries = dict(zip(eval_dataset["id"], eval_dataset["question"]))
# queries = {
#     qid: {"query": question}
#     for qid, question in zip(eval_dataset["id"], eval_dataset["question"])
# }
# corpus = (
#     {qid: {"doc": dataset[qid]["answer"]} for qid in queries} |
#     {qid: {"doc": dataset[qid]["answer"]} for qid in random.sample(range(len(dataset)), 20_000)}
# )
corpus = (
    {qid: dataset[qid]["answer"] for qid in queries} |
    {qid: dataset[qid]["answer"] for qid in random.sample(range(len(dataset)), 20_000)}
)
relevant_docs = {qid: {qid} for qid in eval_dataset["id"]}
dev_evaluator = InformationRetrievalEvaluator(
    corpus=corpus,
    queries=queries,
    relevant_docs=relevant_docs,
    show_progress_bar=True,
    name="gooaq-dev",
)
dev_evaluator(model)
# 7. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset.remove_columns("id"),
    eval_dataset=eval_dataset.remove_columns("id"),
    loss=loss,
    evaluator=dev_evaluator,
)
trainer.train()
# (Optional) Evaluate the trained model on the evaluator after training
dev_evaluator(model)
# 8. Save the trained model
model.save_pretrained(f"models/{run_name}/final")
# 9. (Optional) Push it to the Hugging Face Hub
model.push_to_hub(run_name)
Great, sounds good to me.

Thanks a bunch!
Currently, I'm working with an asymmetric model and MNRL (MultipleNegativesRankingLoss), providing the InputExamples as follows, as shown in the documentation.
texts = [{"QRY": item["query"]}, {"PROD": item["titles"][0]}] + [{"PROD": t} for t in item["titles"][1:]]
Using NoDuplicatesDataLoader resulted in the following error: AttributeError: 'dict' object has no attribute 'strip'
@tomaarsen