TimD.one

CIGAR Strings

Overview

CIGAR (Compact Idiosyncratic Gapped Alignment Report) strings are a compact way of representing how a query sequence aligns to a reference genome, and are used by the SAM/BAM file formats.

Each position in the alignment is denoted by a single symbol (I for insertions, D for deletions, and M for matches). Runs of identical adjacent symbols are then counted and collapsed into a more succinct representation (e.g. 'MMDDDDMMM' becomes 2M4D3M). Here's a basic example:

Inputs
Reference: ACGGCATCAGCGATCATCGGCATATCGACT
Query:     ATCAAGCGTCGCCCTAT

Alignment
Reference:   ACGGCATCA_GCGATCATCGGCATATCGACT
Query:            ATCAAGCG____TCGCCCTAT
Symbols:          MMMMIMMMDDDDMMMMMMMMM
CIGAR:       4M1I3M4D9M
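
The collapsing step in the example above is a run-length encoding of the symbol string. A minimal Python sketch follows; the function names collapse_symbols and expand_cigar are illustrative choices, not part of any SAM library.

```python
from itertools import groupby
import re


def collapse_symbols(symbols: str) -> str:
    """Run-length encode a per-column symbol string into a CIGAR string."""
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(symbols))


def expand_cigar(cigar: str) -> str:
    """Inverse operation: expand a CIGAR string into one symbol per column."""
    return "".join(op * int(count) for count, op in re.findall(r"(\d+)(\D)", cigar))


print(collapse_symbols("MMDDDDMMM"))              # 2M4D3M
print(collapse_symbols("MMMMIMMMDDDDMMMMMMMMM"))  # 4M1I3M4D9M
print(expand_cigar("2M4D3M"))                     # MMDDDDMMM
```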

Summary of Symbols

CIGAR strings have expanded beyond this initial purpose and can contain several other symbols as well, summarized in the table below:

Symbol	Name	Brief Description	Consumes Query	Consumes Ref
M	Match	no insertions or deletions, bases may not agree	✓	✓
I	Insertion	additional base in query (not in reference)	✓	✗
D	Deletion	query is missing base from reference	✗	✓
=	Equal	no insertions or deletions, and bases agree	✓	✓
X	Not Equal	no insertions or deletions, bases do not agree	✓	✓
N	None	no query bases to align, an expected read gap (spliced read)	✗	✓
S	Soft-Clipped	bases on end of read are not aligned but stored in SAM	✓	✗
H	Hard-Clipped	bases on end of read are not aligned, not stored in SAM	✗	✗
P	Padding	neither read nor reference has a base here	✗	✗
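
The two "Consumes" columns are what let you compute how many query bases a CIGAR string accounts for and how many reference positions it spans. The sketch below encodes those columns as Python sets; parse_cigar and consumed_lengths are hypothetical helper names chosen for illustration.

```python
import re

# Operations that advance along the query / the reference, per the table above.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REF = set("MDN=X")


def parse_cigar(cigar: str) -> list[tuple[int, str]]:
    """Split a CIGAR string into (count, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def consumed_lengths(cigar: str) -> tuple[int, int]:
    """Return (query bases consumed, reference bases consumed)."""
    ops = parse_cigar(cigar)
    query_len = sum(n for n, op in ops if op in CONSUMES_QUERY)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REF)
    return query_len, ref_len


print(consumed_lengths("4M1I3M4D9M"))  # (17, 20)
```

For the basic example, the 17 query bases match the length of the query sequence, and the alignment covers 20 reference bases.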

Towards More Informative Alignments (CIGARX)

You may have noticed in the previous example that "match" is something of a misnomer: the CIGAR symbol M specifies that a query base is aligned to a reference base, but says nothing about whether the two bases agree (they may differ). To fix this problem, each M symbol can be further categorized as = or X depending on whether the query base matches the reference base. This format is sometimes called "eXtended CIGAR", or CIGARX.

Alignment
Reference:   ACGGCATCA_GCGATCATCGGCATATCGACT
Query:            ATCAAGCG____TCGCCCTAT
Old Symbols:      MMMMIMMMDDDDMMMMMMMMM
New Symbols:      ====I===DDDD===X=X===
CIGAR:       4=1I3=4D3=1X1=1X3=
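
As a rough sketch of how the new symbols can be derived, the following Python function walks the two aligned rows column by column. It assumes both rows have already been trimmed to the aligned region and use "_" for gaps, as in the example above; extended_cigar is an illustrative name.

```python
from itertools import groupby


def extended_cigar(ref_row: str, query_row: str, gap: str = "_") -> str:
    """Build an extended CIGAR string from two equal-length gapped rows."""
    if len(ref_row) != len(query_row):
        raise ValueError("aligned rows must have the same length")
    symbols = []
    for r, q in zip(ref_row.upper(), query_row.upper()):
        if r == gap and q == gap:
            raise ValueError("a column cannot be a gap in both rows")
        if r == gap:
            symbols.append("I")   # base present only in the query
        elif q == gap:
            symbols.append("D")   # base present only in the reference
        elif r == q:
            symbols.append("=")   # aligned and identical
        else:
            symbols.append("X")   # aligned but different
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(symbols))


print(extended_cigar("ATCA_GCGATCATCGGCATAT",
                     "ATCAAGCG____TCGCCCTAT"))  # 4=1I3=4D3=1X1=1X3=
```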

Clipped Alignments

Because Smith-Waterman is a local alignment algorithm, it is possible that not all query bases are aligned to the reference (bases on the ends of a read may be ignored). This is called "clipping", and no alignment information is available for the clipped bases. With soft-clipping (S), these unaligned bases are still stored in the SAM file's SEQ field. With hard-clipping (H), they are not. Hard clipping is typically used only when a single query read aligns to multiple locations and thus has multiple entries in the SAM file. In that case, there's no reason to store the clipped bases in every entry corresponding to that read.

Inputs
Reference:   ACGGCATCAGCGATCATCGGCATATCGACT
Query:                AAAATCATCGGCCCCCCC

Alignment
Reference:   ACGGCATCAGCGATCATCGGCATATCGACT
Query:                ...ATCATCGGC......
CIGAR:       3S9M6S or 3H9M6H

SAM File
Soft-Clipped: SEQ = AAAATCATCGGCCCCCCC
Hard-Clipped: SEQ =    ATCATCGGC
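
The relationship between the two forms can be sketched in Python as a conversion from a soft-clipped record to a hard-clipped one. This is a simplified illustration that assumes S operations appear only as the first and/or last operation and that the input contains no H operations; soft_to_hard is a hypothetical helper name.

```python
import re


def soft_to_hard(cigar: str, seq: str) -> tuple[str, str]:
    """Convert soft clips to hard clips and trim SEQ accordingly.

    Assumes S occurs only at the ends of the CIGAR and no H is present.
    """
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    if any(op == "H" for _, op in ops):
        raise ValueError("input is already hard-clipped")
    left = ops[0][0] if ops and ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    # Hard-clipped bases are dropped from SEQ; the CIGAR still records their count.
    return cigar.replace("S", "H"), seq[left:len(seq) - right]


print(soft_to_hard("3S9M6S", "AAAATCATCGGCCCCCCC"))  # ('3H9M6H', 'ATCATCGGC')
```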

Spliced Alignments

In some contexts (such as aligning mRNA or cDNA), query reads are expected to contain large, biologically significant gaps when aligned to the reference. The symbol N is used to distinguish these gaps from ordinary deletions (D).

Alignment
Reference:   ACGGCATCAGCGATCATCGGCATATCGACT
Query:          GCAT______________ATATC
CIGAR:       4M14N5M
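
One practical use of N is splitting a spliced alignment into its contiguous aligned blocks on the reference. The sketch below assumes a 0-based start offset and half-open intervals (SAM's own POS field is 1-based), and treats D as part of a block; aligned_blocks is an illustrative name.

```python
import re


def aligned_blocks(start: int, cigar: str) -> list[tuple[int, int]]:
    """Return half-open reference intervals for each block separated by N."""
    blocks = []
    pos = block_start = start
    for count, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(count)
        if op in "MD=X":          # consumes reference inside a block
            pos += n
        elif op == "N":           # skipped region: close the current block
            if pos > block_start:
                blocks.append((block_start, pos))
            pos += n
            block_start = pos
        # I, S, H and P do not consume the reference
    if pos > block_start:
        blocks.append((block_start, pos))
    return blocks


# The query above starts at 0-based reference offset 3.
print(aligned_blocks(3, "4M14N5M"))  # [(3, 7), (21, 26)]
```

In the example, reference[3:7] is GCAT and reference[21:26] is ATATC, matching the two halves of the query.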

Padded Alignments

When performing de novo assembly, we simultaneously align multiple queries to one another and to a reference genome. We must then "pad" the reference genome so that we can distinguish between different insertions at the same location in the reference. The following example shows that (prior to padding) Queries 2a and 2b have insertions at the same position relative to the reference but at different positions relative to Query 1. Padding is necessary to make this distinction: a column in which both the query and the reference contain the padding character - is denoted with the symbol P.

Alignment
Reference:  CACGATCA--GACCGATACGTCCGA
Query 1:      CGATCAGAGACCGATA
Query 2a:       ATCA-AGACCGATAC
Query 2b:       ATCAG-GACCGATAC

CIGAR 1:    6M2I8M
CIGAR 2a:   4M1P1I9M
CIGAR 2b:   4M1I1P9M
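
The padded CIGAR strings above can be reproduced with a short Python sketch that compares each query row against the padded reference row, column by column. It assumes both rows are trimmed to the same columns and use "-" as the padding character; padded_cigar is an illustrative name.

```python
from itertools import groupby


def padded_cigar(ref_row: str, query_row: str, pad: str = "-") -> str:
    """Build a padded CIGAR string from a padded reference row and a query row."""
    if len(ref_row) != len(query_row):
        raise ValueError("rows must cover the same columns")
    symbols = []
    for r, q in zip(ref_row, query_row):
        if r == pad and q == pad:
            symbols.append("P")   # neither sequence has a base here
        elif r == pad:
            symbols.append("I")   # query base opposite reference padding
        elif q == pad:
            symbols.append("D")   # reference base with no query base
        else:
            symbols.append("M")   # both have a base (may or may not agree)
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(symbols))


print(padded_cigar("CGATCA--GACCGATA", "CGATCAGAGACCGATA"))  # 6M2I8M
print(padded_cigar("ATCA--GACCGATAC", "ATCA-AGACCGATAC"))    # 4M1P1I9M
print(padded_cigar("ATCA--GACCGATAC", "ATCAG-GACCGATAC"))    # 4M1I1P9M
```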