-
Notifications
You must be signed in to change notification settings - Fork 443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for SQ-AN alternative sequence names #931
Conversation
This adds the alternative names into ref_hash without identifying them as non-canonical names. This means that all queries based on ref_hash will accept either SN or AN names, even those that explicitly ask for SN -- this is considered a feature. As per the existing SN handling in sam_hrecs_update_hashes(), duplicates produce warnings but are otherwise ignored and the existing ref_hash entry is left as is. When removing an AN tag, ref_hash entries are only removed if they do indeed point to the `@SQ` item being altered (otherwise they are left as is). [TODO] sam_hdr_update_line() does not yet handle AN altnames. (View the sam_hrecs_remove_hash_entry() diff with -w as the bulk of the code is unchanged other than its indentation level.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, bar a missing NULL check on string_ndup()
. It looks like SN: always overrides AN: if the latter duplicates the SN: from a different @sq line, which is probably a good thing. Currently it will complain if AN: contains the SN: from the same line - should we instead stay silent as it's not really harmful?
if (aux.p == token) | ||
continue; | ||
|
||
char *name = string_ndup(hrecs->str_pool, token, aux.p - token); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could do with a check for NULL here...
Spot cases where an @sq SN: name is the same as an altname from an earlier @sq AN: tag, and ensure that the reference hash table points to the entry using it for SN. Due to the way they are read, these clashes would be resolved correctly if they occurred in SAM or BAM files, but it was possible to accidentally override the SN name when reading a CRAM file (making it impossible to select the masked reference when creating iterators). Don't complain about duplicates in sam_hrecs_add_ref_altnames() when the name refers to the correct entry. This can happen when SN: is duplicated in AN:, and possibly also if sam_hrecs_add_ref_altnames gets called more than once for the same line. Don't remove the hash table entry in sam_hrecs_remove_ref_altnames if the name is the primary one (SN:) for the reference. The "Seen already?" checks in sam_hrecs_update_hashes() are reordered as the if statements were becoming rather deeply nested.
I've added a commit to adjust how duplicates between
This didn't cause a problem for SAM or BAM, but with CRAM the CHROMOSOME_II altname on the first entry masked the primary name on the second, effectively making it disappear. It also fixes a few other edge cases, like what happens when you delete an |
Thanks for those updates, Rob. Spec says “no duplicates across all SN and AN entries” which is a good formal rule but agree samtools style is to be a bit more lenient. Though it does open it up to problems like the one described in your last paragraph but glad you’ve caught that one! |
Yes, while the specification disallows it, I think it's good to have a way of handling the problem when someone inevitably ignores the rule. |
Support alternative sequence names as per samtools/hts-specs#100.
This is a work-in-progress for testing purposes, as
sam_hdr_update_line()
does not yet handle AN altnames. Everything else should be ready to go though.(View the
sam_hrecs_remove_hash_entry()
diff with -w as the bulk of the code is unchanged other than its indentation level.)