
Add @SQ-AN alternative reference sequence names
Enables tools to allow users to make queries with e.g. "1" or "chr1"
interchangeably.  Also allows for the possibility of tools using an alias
when displaying sequence names to the user.  Hat tip @lindenb, fixes samtools#100.

However, aliases must not appear elsewhere within the SAM file, in
particular not in RNAME/RNEXT fields.  This ensures that files will
still be parsed correctly by non-@SQ-AN-aware tools.
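
As a rough illustration of the rules described above, the following Python sketch shows how a tool might build a lookup table from `@SQ` header lines so that a query for "1" or "chr1" resolves to the same sequence. The helper names `build_alias_map` and `resolve`, and the `LN` values in the sample header, are assumptions made for this example and are not part of the commit or of any existing tool. The alias pattern is the one given for `AN` in the diff; the uniqueness check follows the requirement that all `SN` tags and individual `AN` names be distinct.

```python
import re

# Pattern for one alternative name, copied from the AN regular expression in the diff.
AN_NAME = re.compile(r"[0-9A-Za-z][0-9A-Za-z*+.@_|-]*")


def build_alias_map(header_lines):
    """Map every SN and AN name to its canonical SN (hypothetical helper)."""
    names = {}
    for line in header_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] != "@SQ":
            continue  # only @SQ lines define reference sequences
        tags = dict(field.split(":", 1) for field in fields[1:])
        sn = tags["SN"]
        aliases = tags["AN"].split(",") if "AN" in tags else []
        for alias in aliases:
            if not AN_NAME.fullmatch(alias):
                raise ValueError(f"invalid AN name: {alias!r}")
        # SN tags and all individual AN names must be distinct across @SQ lines.
        for name in [sn, *aliases]:
            if name in names:
                raise ValueError(f"duplicate reference name: {name!r}")
            names[name] = sn
    return names


def resolve(names, query):
    """Return the canonical SN for a user-supplied name (hypothetical helper)."""
    return names[query]


# Illustrative header; the LN values are placeholders chosen for the example.
header = [
    "@HD\tVN:1.5",
    "@SQ\tSN:MT\tLN:16569\tAN:chrMT,M,chrM",
    "@SQ\tSN:1\tLN:1000\tAN:chr1",
]
names = build_alias_map(header)
print(resolve(names, "chrM"))  # prints "MT"
```

Because aliases never appear in RNAME or RNEXT, a tool using this approach would translate the user's query to the canonical `SN` before touching alignment records, and tools unaware of `AN` continue to work unchanged.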
jmarshall committed Jun 29, 2017
1 parent 84a0c65 commit b0732e0
Showing 1 changed file with 16 additions and 2 deletions.
18 changes: 16 additions & 2 deletions SAMv1.tex
@@ -194,14 +194,28 @@ \subsection{The header section}
grouped by {\sf QNAME}), and {\tt reference} (alignments are grouped by
{\sf RNAME}/{\sf POS}).\\\cline{1-3}
\multicolumn{2}{|l}{\tt @SQ} & Reference sequence dictionary. The order of {\tt @SQ} lines defines the alignment sorting order.\\\cline{2-3}
-& {\tt SN}* & Reference sequence name. Each {\tt @SQ} line must have a unique {\tt SN} tag. The value of this
-field is used in the
+& {\tt SN}* & Reference sequence name.
+The {\tt SN} tags and all individual {\tt AN} names in all {\tt @SQ} lines
+must be distinct.
+The value of this field is used in the
alignment records in {\sf RNAME} and {\sf RNEXT} fields. Regular expression: {\tt [!-)+-\char60\char62-\char126][!-\char126]*}\\\cline{2-3}
& {\tt LN}* & Reference sequence length. \emph{Range}: {\tt [1,2$^{31}$-1]}\\\cline{2-3}
& {\tt AH} & Indicates that this sequence is an alternate locus.%
\footnote{See \url{https://www.ncbi.nlm.nih.gov/grc/help/definitions} for descriptions of \emph{alternate locus} and \emph{primary assembly}.}
The value is the locus in the primary assembly for which this sequence is an alternative, in the format `\emph{chr}{\tt :}\emph{start}{\tt -}\emph{end}', `\emph{chr}' (if known), or `{\tt *}' (if unknown), where `\emph{chr}' is a sequence in the primary assembly.
Must not be present on sequences in the primary assembly.\\\cline{2-3}
+& {\tt AN} & Alternative reference sequence names.
+A comma-separated list of alternative names that tools may use when referring
+to this reference sequence.%
+\footnote{For example, given `{\tt @SQ\quad SN:MT\quad AN:chrMT,M,chrM}',
+tools can ensure that a user's request for any of `MT', `chrMT', `M',
+or~`chrM' succeeds and refers to the same sequence.
+Note the restricted set of characters allowed in an alternative name.}
+These alternative names are not used elsewhere within the SAM file;
+in particular, they must not appear in alignment records' {\sf RNAME}
+or~{\sf RNEXT} fields.
+\emph{Regular expression}: \emph{name}{\tt (,}\emph{name}{\tt )*}
+where \emph{name} is {\tt [0-9A-Za-z][0-9A-Za-z*+.@\_|-]*}\\\cline{2-3}
& {\tt AS} & Genome assembly identifier. \\\cline{2-3}
& {\tt M5} & MD5 checksum of the sequence in the uppercase, excluding spaces but including pads (as `*'s).\\\cline{2-3}
& {\tt SP} & Species.\\\cline{2-3}
