update tokenizer process. #2138
Conversation
1. Support data tokenization for the Llama 2, Baichuan, BLOOM, and ChatGLM models. 2. Make the code more readable.
Could you double-check the output of these lines? (FastChat/fastchat/train/train.py, lines 140 to 143 at commit 820de33)
You can see more discussions and test cases at #1894 (comment).
@shibing624 Thanks! It is merged.
@@ -117,19 +118,17 @@ def preprocess(
    total_len = int(target.ne(tokenizer.pad_token_id).sum())

    turns = conversation.split(conv.sep2)
    cur_len = 1
    target[:cur_len] = IGNORE_TOKEN_ID
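For context, here is a minimal sketch of the split-and-mask approach this hunk belongs to: the rendered conversation is split on the turn separator, the user/instruction part of each turn is masked with IGNORE_TOKEN_ID so the loss only covers assistant replies, and cur_len starts at 1 to account for the BOS token at position 0. The names follow the surrounding FastChat code, but the loop itself is a simplified illustration, not the exact train.py implementation.

```python
IGNORE_TOKEN_ID = -100  # label value ignored by PyTorch's CrossEntropyLoss

def mask_user_turns(conversation, target, tokenizer, sep, sep2):
    """Mask everything except assistant replies in `target`, a 1-D tensor
    of token ids aligned with the tokenized `conversation`."""
    turns = conversation.split(sep2)        # one entry per dialogue round
    cur_len = 1                             # position 0 holds BOS
    target[:cur_len] = IGNORE_TOKEN_ID      # never compute loss on BOS
    for turn in turns:
        if turn == "":
            break
        # Re-append the separator that split() removed so lengths line up.
        turn_len = len(tokenizer(turn + sep2, add_special_tokens=False).input_ids)
        parts = turn.split(sep)             # "user prompt<sep>assistant reply"
        if len(parts) != 2:
            break
        instruction_len = len(
            tokenizer(parts[0] + sep, add_special_tokens=False).input_ids
        )
        # Mask the instruction part; keep loss on the assistant reply.
        target[cur_len : cur_len + instruction_len] = IGNORE_TOKEN_ID
        cur_len += turn_len
    target[cur_len:] = IGNORE_TOKEN_ID      # mask any trailing padding
```

Note that re-tokenizing substrings like this is exactly what the comments below call error-prone: the ids of a substring are not guaranteed to match the corresponding slice of the full conversation's ids, since special-token insertion and merges at the boundaries can shift the offsets.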
@shibing624 The first token is for BOS. I found that BOS is dropped after your PR. Could you fix this?
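For anyone hitting the same issue, a quick sanity check (hypothetical, not code from this PR; assumes `input_ids` is one row produced by preprocess and the tokenizer defines bos_token_id):

```python
# Verify that preprocessing keeps the BOS token at position 0.
assert input_ids[0].item() == tokenizer.bos_token_id, "BOS token was dropped"
```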
I reverted this PR because of the BOS problem (#2138 (comment)). Please fix it and submit a new one.
Yes, the tokenization process that splits the conversation and searches for the eos_token across multiple rounds is very error-prone, so I rewrote the logic to extract the dialogs in preprocess; then we only need to tokenize each prompt and answer. The code is changed in my repo: https://github.com/shibing624/MedicalGPT/blob/main/supervised_finetuning.py#L653
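A minimal sketch of that per-turn approach (names and signature are illustrative; see the linked supervised_finetuning.py for the real implementation): each (prompt, answer) pair is tokenized separately, so there is no need to split the rendered string or hunt for eos_token positions, and BOS is added explicitly exactly once.

```python
IGNORE_TOKEN_ID = -100

def build_example(dialog, tokenizer, max_len=2048):
    """dialog: list of (prompt, answer) string pairs for one conversation."""
    input_ids = [tokenizer.bos_token_id]    # add BOS explicitly, exactly once
    labels = [IGNORE_TOKEN_ID]              # no loss on BOS
    for prompt, answer in dialog:
        prompt_ids = tokenizer(prompt, add_special_tokens=False).input_ids
        answer_ids = tokenizer(answer, add_special_tokens=False).input_ids
        answer_ids.append(tokenizer.eos_token_id)  # close each assistant turn
        input_ids += prompt_ids + answer_ids
        # Compute loss only on the answer (and its EOS); mask the prompt.
        labels += [IGNORE_TOKEN_ID] * len(prompt_ids) + answer_ids
    return input_ids[:max_len], labels[:max_len]
```

Because the label mask lengths come straight from the per-turn tokenizations, they cannot drift out of alignment the way substring re-tokenization can.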
Why are these changes needed?
Related issue number (if applicable)
Checks
I've run format.sh to lint the changes in this PR.