CIGAR (Compact Idiosyncratic Gapped Alignment Report) strings are a compact way of representing how a query sequence aligns to a reference genome, and are used by the SAM/BAM file formats.
Each column of the alignment is denoted by a single symbol (I for insertions, D for deletions, and M for matches). Runs of identical adjacent symbols are then counted and collapsed into a more succinct representation (e.g. 'MMDDDDMMM' becomes 2M4D3M). Here's a basic example:
Inputs
Reference: ACGGCATCAGCGATCATCGGCATATCGACT
Query: ATCAAGCGTCGCCCTAT
Alignment
Reference: ACGGCATCA_GCGATCATCGGCATATCGACT
Query:          ATCAAGCG____TCGCCCTAT
Symbols:        MMMMIMMMDDDDMMMMMMMMM
CIGAR: 4M1I3M4D9M
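The collapsing step described above is a run-length encoding. The following is a minimal sketch of it in Python; the function name `collapse_symbols` is a hypothetical helper chosen for illustration, not part of any SAM library.

```python
from itertools import groupby


def collapse_symbols(symbols: str) -> str:
    """Run-length encode a per-column symbol string into a CIGAR string."""
    # groupby yields one (symbol, run) pair per stretch of identical symbols.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(symbols))


print(collapse_symbols("MMDDDDMMM"))              # 2M4D3M
print(collapse_symbols("MMMMIMMMDDDDMMMMMMMMM"))  # 4M1I3M4D9M
```

The second call reproduces the CIGAR string from the example alignment.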
CIGAR strings have expanded beyond this initial purpose and can contain several other symbols as well, summarized in the table below:
Symbol | Name | Brief Description | Consumes Query | Consumes Ref |
---|---|---|---|---|
M | Match | no insertions or deletions, bases may not agree | ✓ | ✓ |
I | Insertion | additional base in query (not in reference) | ✓ | ✗ |
D | Deletion | query is missing base from reference | ✗ | ✓ |
= | Equal | no insertions or deletions, and bases agree | ✓ | ✓ |
X | Not Equal | no insertions or deletions, bases do not agree | ✓ | ✓ |
N | None | no query bases to align, an expected read gap (spliced read) | ✗ | ✓ |
S | Soft-Clipped | bases on end of read are not aligned but stored in SAM | ✓ | ✗ |
H | Hard-Clipped | bases on end of read are not aligned, not stored in SAM | ✗ | ✗ |
P | Padding | neither read nor reference has a base here | ✗ | ✗ |
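The two "Consumes" columns are what let a program recover lengths from a CIGAR string: summing the counts of query-consuming operations gives the number of bases stored in SEQ, and summing the reference-consuming ones gives the span covered on the reference. Below is a minimal sketch of that bookkeeping; `parse_cigar` and `consumed_lengths` are hypothetical helper names, and the sketch does not handle the `*` placeholder that SAM uses when no CIGAR is available.

```python
import re
from typing import List, Tuple

QUERY_OPS = set("MI=XS")  # operations marked as consuming the query in the table
REF_OPS = set("MD=XN")    # operations marked as consuming the reference


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split a CIGAR string into (count, symbol) pairs."""
    pairs = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    # Rebuilding the string catches stray characters that findall would skip.
    if "".join(count + op for count, op in pairs) != cigar:
        raise ValueError(f"malformed CIGAR string: {cigar!r}")
    return [(int(count), op) for count, op in pairs]


def consumed_lengths(cigar: str) -> Tuple[int, int]:
    """Return (query bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query_len = sum(n for n, op in ops if op in QUERY_OPS)
    ref_len = sum(n for n, op in ops if op in REF_OPS)
    return query_len, ref_len


print(consumed_lengths("4M1I3M4D9M"))  # (17, 20)
```

For the first example this gives 17 query bases (the length of the query) spread over 20 reference bases.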
You may have noticed in the previous example that "match" is somewhat of a misnomer: the CIGAR symbol M specifies that a query base is aligned to a reference base, but says nothing about whether the two bases agree. To resolve this ambiguity, each M symbol can be further classified as either = or X, depending on whether the query base matches the reference base. This format is sometimes called "eXtended CIGAR", or CIGARX.
Alignment
Reference:   ACGGCATCA_GCGATCATCGGCATATCGACT
Query:            ATCAAGCG____TCGCCCTAT
Old Symbols:      MMMMIMMMDDDDMMMMMMMMM
New Symbols:      ====I===DDDD===X=X===
CIGAR: 4=1I3=4D3=1X1=1X3=
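As a rough illustration, the extended symbols can be derived directly from the two gapped alignment rows by comparing them column by column. The sketch below assumes both rows have already been trimmed to the aligned region and that `_` marks a gap, as in the example; `extended_cigar` is a hypothetical name.

```python
from itertools import groupby


def extended_cigar(ref_row: str, query_row: str, gap: str = "_") -> str:
    """Build an extended CIGAR string from two equal-length gapped rows."""
    if len(ref_row) != len(query_row):
        raise ValueError("alignment rows must have the same length")
    symbols = []
    for ref_base, query_base in zip(ref_row, query_row):
        if ref_base == gap and query_base == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if ref_base == gap:
            symbols.append("I")  # base present only in the query
        elif query_base == gap:
            symbols.append("D")  # base present only in the reference
        elif ref_base == query_base:
            symbols.append("=")  # aligned and identical
        else:
            symbols.append("X")  # aligned but different
    # Collapse runs of identical symbols into count + symbol.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(symbols))


print(extended_cigar("ATCA_GCGATCATCGGCATAT",
                     "ATCAAGCG____TCGCCCTAT"))  # 4=1I3=4D3=1X1=1X3=
```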
Because Smith-Waterman is a local alignment algorithm, it is possible that not all query bases are aligned to the reference (bases on the ends of a read may be ignored). This is called "clipping", and no alignment information is available for the clipped bases. With soft-clipping (S), these unaligned bases are still stored in the SAM file's SEQ field. With hard-clipping (H), they are not. Hard clipping is typically used only when a single query read aligns to multiple locations and thus has multiple entries in the SAM file. In this case, there's no reason to store the clipped bases for every entry corresponding to that read.
Inputs
Reference: ACGGCATCAGCGATCATCGGCATATCGACT
Query: AAAATCATCGGCCCCCCC
Alignment
Reference: ACGGCATCAGCGATCATCGGCATATCGACT
Query:              ...ATCATCGGC......
CIGAR: 3S9M6S or 3H9M6H
SAM File
Soft-Clipped: SEQ = AAAATCATCGGCCCCCCC
Hard-Clipped: SEQ = ATCATCGGC
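The difference between the two clipping styles can be sketched as a function that takes the full original read and a CIGAR string and returns what would end up in SEQ. This is only an illustration under the assumption that clipping operations appear at the ends of the CIGAR string; `stored_seq` is a hypothetical name.

```python
import re


def stored_seq(read: str, cigar: str) -> str:
    """Return the part of the original read that is kept in the SEQ field."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    if not ops:
        raise ValueError(f"no CIGAR operations found in {cigar!r}")
    # Hard-clipped bases are dropped; soft-clipped bases stay in SEQ.
    start = ops[0][0] if ops[0][1] == "H" else 0
    end = len(read) - (ops[-1][0] if ops[-1][1] == "H" else 0)
    return read[start:end]


read = "AAAATCATCGGCCCCCCC"
print(stored_seq(read, "3S9M6S"))  # AAAATCATCGGCCCCCCC
print(stored_seq(read, "3H9M6H"))  # ATCATCGGC
```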
In some contexts (such as aligning mRNA or cDNA to a genome), query reads are expected to contain large, biologically meaningful gaps when aligned to the reference, typically where introns have been spliced out. The symbol N is used to distinguish these expected gaps from ordinary deletions (D).
Alignment
Reference: ACGGCATCAGCGATCATCGGCATATCGACT
Query:        GCAT______________ATATC
CIGAR: 4M14N5M
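One practical use of N is to split a spliced read into the separate reference intervals it covers. The sketch below walks the CIGAR string, advancing a reference position for each reference-consuming operation and closing the current block whenever it meets an N. It assumes a 0-based alignment start and half-open intervals; `aligned_blocks` is a hypothetical name.

```python
import re
from typing import List, Tuple


def aligned_blocks(ref_start: int, cigar: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) reference intervals separated by N gaps."""
    blocks = []
    pos = ref_start
    block_start = None
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in "M=XD":           # consumes reference inside a block
            if block_start is None:
                block_start = pos
            pos += n
        elif op == "N":            # skipped region: close the current block
            if block_start is not None:
                blocks.append((block_start, pos))
                block_start = None
            pos += n
        # I, S, H and P do not consume the reference, so pos is unchanged.
    if block_start is not None:
        blocks.append((block_start, pos))
    return blocks


reference = "ACGGCATCAGCGATCATCGGCATATCGACT"
blocks = aligned_blocks(3, "4M14N5M")
print(blocks)                                 # [(3, 7), (21, 26)]
print([reference[s:e] for s, e in blocks])    # ['GCAT', 'ATATC']
```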
When performing de-novo assembly, we simultaneously align multiple queries to one another and to a reference genome. We must now "pad" the reference genome so that we can distinguish between different insertions at the same location in the reference. The following example shows that, prior to padding, Queries 2a and 2b have insertions at the same position relative to the reference but at different positions relative to Query 1. Padding is necessary to make this distinction: a column where both the query and the padded reference hold a padding character (-) is denoted with the symbol P.
Alignment
Reference: CACGATCA--GACCGATACGTCCGA
Query 1:     CGATCAGAGACCGATA
Query 2a:      ATCA-AGACCGATAC
Query 2b:      ATCAG-GACCGATAC
Cigar 1: 6M2I8M
Cigar 2a: 4M1P1I9M
Cigar 2b: 4M1I1P9M
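To see how P restores the distinction, the padded row for each query can be rebuilt from its unpadded bases and its CIGAR string. The sketch below covers only the operations that appear in this example (M, =, X, I, and P) and rejects anything else; `padded_row` is a hypothetical name, and using `-` as the padding character follows the example above.

```python
import re


def padded_row(read: str, cigar: str, pad: str = "-") -> str:
    """Rebuild a query's padded alignment row from its bases and CIGAR."""
    row = []
    i = 0  # index of the next unused base in the read
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in "MI=X":       # these operations emit query bases
            row.append(read[i:i + n])
            i += n
        elif op == "P":        # padding: neither sequence has a base here
            row.append(pad * n)
        else:
            raise ValueError(f"operation {op!r} is not handled in this sketch")
    if i != len(read):
        raise ValueError("CIGAR does not account for every base in the read")
    return "".join(row)


print(padded_row("ATCAAGACCGATAC", "4M1P1I9M"))  # ATCA-AGACCGATAC
print(padded_row("ATCAGGACCGATAC", "4M1I1P9M"))  # ATCAG-GACCGATAC
```

The two queries have the same length and both carry a single inserted base, yet the position of the P operation places that base in a different padded column.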