Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for SQ-AN alternative sequence names #931

Merged
merged 2 commits into from
Sep 25, 2019

Conversation

jmarshall
Copy link
Member

Support alternative sequence names as per samtools/hts-specs#100.

This is a work-in-progress for testing purposes, as sam_hdr_update_line() does not yet handle AN altnames. Everything else should be ready to go though.

(View the sam_hrecs_remove_hash_entry() diff with -w as the bulk of the code is unchanged other than its indentation level.)

This adds the alternative names into ref_hash without identifying them
as non-canonical names. This means that all queries based on ref_hash
will accept either SN or AN names, even those that explicitly ask for
SN -- this is considered a feature.

As per the existing SN handling in sam_hrecs_update_hashes(), duplicates
produce warnings but are otherwise ignored and the existing ref_hash entry
is left as is. When removing an AN tag, ref_hash entries are only removed
if they do indeed point to the `@SQ` item being altered (otherwise they
are left as is).

[TODO] sam_hdr_update_line() does not yet handle AN altnames.

(View the sam_hrecs_remove_hash_entry() diff with -w as the bulk
of the code is unchanged other than its indentation level.)
Copy link
Member

@daviesrob daviesrob left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, bar a missing NULL check on string_ndup(). It looks like SN: always overrides AN: if the latter duplicates the SN: from a different @sq line, which is probably a good thing. Currently it will complain if AN: contains the SN: from the same line - should we instead stay silent as it's not really harmful?

if (aux.p == token)
continue;

char *name = string_ndup(hrecs->str_pool, token, aux.p - token);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could do with a check for NULL here...

Spot cases where an @sq SN: name is the same as an altname
from an earlier @sq AN: tag, and ensure that the reference
hash table points to the entry using it for SN.  Due to
the way they are read, these clashes would be resolved
correctly if they occurred in SAM or BAM files, but it
was possible to accidentally override the SN name when
reading a CRAM file (making it impossible to select the
masked reference when creating iterators).

Don't complain about duplicates in sam_hrecs_add_ref_altnames()
when the name refers to the correct entry.  This can happen
when SN: is duplicated in AN:, and possibly also if
sam_hrecs_add_ref_altnames gets called more than once for
the same line.

Don't remove the hash table entry in sam_hrecs_remove_ref_altnames
if the name is the primary one (SN:) for the reference.

The "Seen already?" checks in sam_hrecs_update_hashes() are
reordered as the if statements were becoming rather deeply
nested.
@daviesrob
Copy link
Member

I've added a commit to adjust how duplicates between SN: and AN: tags are handled, in case someone makes a header like this:

@SQ	SN:CHROMOSOME_I	LN:1009800	AN:1,foo,CHROMOSOME_II
@SQ	SN:CHROMOSOME_II	LN:5000
@SQ	SN:CHROMOSOME_III	LN:5000	AN:CHROMOSOME_I
@SQ	SN:CHROMOSOME_IV	LN:5000	AN:CHROMOSOME_IV
@SQ	SN:CHROMOSOME_V	LN:5000	AN:CHROMOSOME_V

This didn't cause a problem for SAM or BAM, but with CRAM the CHROMOSOME_II altname on the first entry masked the primary name on the second, effectively making it disappear.

It also fixes a few other edge cases, like what happens when you delete an AN: tag with the same name as SN: in it.

@jmarshall
Copy link
Member Author

Thanks for those updates, Rob. Spec says “no duplicates across all SN and AN entries” which is a good formal rule but agree samtools style is to be a bit more lenient.

Though it does open it up to problems like the one described in your last paragraph but glad you’ve caught that one!

@daviesrob daviesrob merged commit 886a665 into samtools:develop Sep 25, 2019
@daviesrob
Copy link
Member

Yes, while the specification disallows it, I think it's good to have a way of handling the problem when someone inevitably ignores the rule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants