FAST5 is a data format developed by Oxford Nanopore Technologies (ONT), a specific HDF5 file structure designed to store raw nanopore current data in addition to flow cell metadata and optional basecalling results. The Hierarchical Data Format (HDF) was designed to allow efficient storage and access of arbitrarily structured datasets. ONT has defined two types of FAST5 files: single-FAST5s (a deprecated space-inefficient format which stores a single read per file), and multi-FAST5s (which store many, usually 4000, reads per file). Both are described below in more detail:
Here is an example raw single-FAST5 file:
    PreviousReadInfo/          # useless, legacy
    UniqueGlobalKey/
        context_tags/          # general metadata
        tracking_id/           # device metadata
        channel_id/            # channel metadata
    Raw/
        Reads/
            Read_418/          # read-specific metadata
                Signal         # raw current data
- The `context_tags` group stores general metadata such as the `filename`, `experiment_kit`, and `experiment_type`.
- The `tracking_id` group stores device metadata such as the `asic_version`, `device_type`, `flow_cell_id`, `heatsink_temp` and lots of software versioning information.
- The `channel_id` group stores channel metadata such as the `channel_number`, `sampling_rate`, `digitisation`, `offset`, and `range`.
- The `Read_<#>` group stores read-specific metadata such as the `duration` and `start_time`, and contains the raw `Signal` data. These are the raw current values measured by the nanopore device.

As far as I can tell, `PreviousReadInfo` contains no information (perhaps it's used to identify the complement in 1D2 sequencing?).
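The `digitisation`, `offset`, and `range` attributes are what allow the integer `Signal` samples to be converted into a current. Here is a minimal sketch, assuming the commonly used scaling `pA = (raw + offset) * range / digitisation`; neither this formula nor the attribute values below come from the file layout above, so treat both as assumptions.

```python
def raw_to_picoamps(raw_signal, digitisation, offset, rng):
    """Scale raw integer samples to picoamperes (assumed formula, see text)."""
    scale = rng / digitisation
    return [(sample + offset) * scale for sample in raw_signal]


# Hypothetical channel_id attribute values, for illustration only.
channel = {"digitisation": 8192.0, "offset": 10.0, "range": 1400.0}
raw = [500, 512, 490]

current = raw_to_picoamps(
    raw, channel["digitisation"], channel["offset"], channel["range"]
)
print([round(value, 2) for value in current])
```

In practice the three attributes would be read from the `channel_id` group of the file rather than hard-coded.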
If the file has been basecalled with Guppy (using `--fast5-out`), the single-FAST5 file may include the additional groups:
    Analyses/
        Segmentation_000/
            Summary/
                segmentation/            # read size
        Basecall_1D_000/
            Summary/
                basecall_1d_template/    # read summary
            BaseCalled_template/
                StateData                # basecaller last layer
                Trace                    # base probabilities
                Move                     # base movements
                Fastq                    # output sequence
        Segmentation_001/
        Basecall_1D_001/
        ...
The suffix appended to each group begins at `000` and will count upwards each time additional basecalling data is appended to the FAST5 file.
- The `segmentation` group contains only the `has_template`, `duration_template` and `first_sample_template` attributes.
- The `Basecall_1D_000` group contains basecalling metadata such as the basecaller `version`, `model_type`, and a `time_stamp`.
- The `basecall_1d_template` group includes basecalling result summaries such as the `mean_qscore`, `sequence_length`, and `called_events`, as well as the model's `block_stride`.

If the basecaller was called with `--post-out`, several additional fields will be present.
- `StateData` contains the basecaller's last-layer transition probability predictions, as well as the `scale` and `offset`.
- `Trace` contains the flip-flop predicted base probabilities (`0-255`) for each time step.
- `Move` stores the number of bases the basecaller decided have been sequenced during each time step.
- `Fastq` contains the sequence of predicted bases, output in FASTQ format.

Legacy event-based basecallers such as Albacore 1.0 and event-based tools such as Tombo will append additional fields which include event detection and segmentation data, which I will not cover here.
Multi-FAST5 files are incredibly similar to single-FAST5 files, with a few minor changes. Most importantly, multiple reads (usually 4000) are stored per file, reducing disk storage.
    read_e6b2e0e3-e8ab-4158-8dec-9b594db0b508/
        context_tags/
        tracking_id/
        channel_id/
        Raw/
            Signal
        Analyses/
            ...
    read_6eac07d9-74b5-483f-a5f9-3dcccb606bac/
        ...
The `read_<read_id>` group stores run metadata such as the `pore_type` and `run_id`. Within this group, most information is stored in the same format. The main notable difference that I have found is that the `Raw` group stores the same information previously stored under `Raw/Reads/Read_<#>`.
ONT provides a Python library called `ont_fast5_api` for working with FAST5 files, which includes a `single_to_multi_fast5.py` conversion script (which unfortunately doesn't work at the time of writing). Personally, I have found `ont_fast5_api` to be too inflexible for research purposes and have instead used `h5py`. There exist several useful HDF5 command-line tools such as `h5ls` and `h5dump`, in addition to graphical HDF5 viewers such as hdfview and vitables.
The FASTA format is used to represent amino or nucleic acid sequences. For each sequence:

- The first line begins with `>` and is followed by a sequence identifier and optional description.
- The subsequent lines contain the sequence itself.

Here is a simple FASTA file:
>sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
>sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
NCBI has defined a set of standard identifiers used to map sequences obtained from databases back to the original database record.
A sample NCBI sequence obtained from GenInfo and GenBank might look as follows:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
Note that it is common practice to restrict FASTA files to at most 80 characters per line. Additionally, while the sequence identifier may contain arbitrary text, the initial `>` cannot be followed by a space. Some parsers will also consider any text after a space on this line to be a comment, which is why NCBI uses the `|` delimiter.
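The rules above are enough to write a small reader. This is a minimal sketch rather than a robust parser: the function names are my own, and it treats everything after the first space on the header line as a description.

```python
def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequences may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def _record(header, chunks):
    # Text after the first space is treated as a description/comment.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


example = """>sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
>sequence2 an optional description
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
"""

records = list(parse_fasta(example.splitlines()))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```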
There are several commonly-used FASTA file extensions, each with slightly different meanings.
Extension | Meaning |
---|---|
.fasta | generic FASTA |
.fa | generic FASTA |
.fna | FASTA containing nucleic acids |
.ffn | FASTA containing protein-coding nucleotides |
.faa | FASTA containing amino acids |
.frn | FASTA containing non-coding RNA |
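FASTQ is an extension of the FASTA format (used for storing nucleotide or amino acid sequences), which includes quality scores. For each sequence:

- The first line begins with `@` and is followed by a sequence identifier and optional description.
- The second line contains the sequence.
- The third line begins with `+` and is optionally (usually not) followed by the same sequence identifier.
- The fourth line contains the quality scores, with one character per base in the sequence.

Here is a simple FASTQ file: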
@sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
+
!@#%!ADN#$#!@#%%sDFGN!#$%!GQWEWG
@sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
+
sahjdlfhasdGAEHKWEKNVLAEW?!@#($%#$(%c$(#$*2
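To make the four-line record structure concrete, here is a rough sketch of a reader. The function names are my own, and the quality decoding assumes the Phred+33 (Sanger) encoding, which the text above does not specify; older files may use a different offset.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality_string) from 4-line FASTQ records."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch at line {i + 1}")
        yield header[1:].split(" ")[0], sequence, quality


def phred_scores(quality, offset=33):
    """Decode quality characters into integer scores (assumes Phred+33)."""
    return [ord(char) - offset for char in quality]


example = [
    "@sequence1",
    "ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG",
    "+",
    "!@#%!ADN#$#!@#%%sDFGN!#$%!GQWEWG",
]

for name, sequence, quality in parse_fastq(example):
    scores = phred_scores(quality)
    print(name, scores[:4])
```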
The SAM (Sequence Alignment Map) format is a text-based format for storing the alignments of nucleotide sequences. The SAMv1 specification can be found here, and its compressed binary equivalent is BAM. Every SAM file contains both a header and alignment section; I'll start by explaining the alignment section.
Each line in this section consists of 11 tab-delimited mandatory fields, followed by a variable number of optional tags. An aligned nucleotide sequence is referred to as a "query" or "read", whereas the corresponding section of the reference sequence is known as the "template". There may be multiple lines corresponding to a single read if there are multiple or chimeric mappings. A chimeric mapping occurs when there is a non-linear alignment of a query to the reference. This usually happens when a read spans a structural variant such as an inversion mutation.
Column | Field | Type | Description |
---|---|---|---|
1 | Query NAME | string | query name, non-unique since there may be multiple alignments of the same query. |
2 | bitwise FLAG | integer | 16-bit FLAG, described in more detail below. |
3 | Ref sequence NAME | string | reference sequence name (which must be included in the @SQ SN tag). |
4 | mapping POSition | integer | 1-based leftmost mapping position, 0 for an unmapped query. |
5 | MAPping Quality | integer | Quality score encoding mapping confidence. |
6 | CIGAR string | string | CIGAR string encoding alignment of read to reference. |
7 | Ref name of NEXT read | string | Name of the next read (or the read's mate). |
8 | Position of NEXT read | integer | 1-based leftmost mapping position of the next read (or the read's mate). |
9 | Template LENgth | integer | The (inclusive) length of the reference section (template) the read is aligned to. |
10 | acid code SEQuence | string | A string representing the nucleic or amino acids which compose the read. |
11 | encoded QUALity scores | string | ASCII-encoded quality scores for each called base. |
Additional tab-delimited fields can be included which follow the `TAG:TYPE:VALUE` format. As with header tags, the `TAG` must be two letters, and lowercase letters are reserved for end users. The most common types are integer (`i`), float (`f`), character (`A`), and string (`Z`).
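As a quick illustration of the `TAG:TYPE:VALUE` layout, the sketch below splits one optional field and converts the value for the four types listed above. The function name is my own, and other type codes are simply left as strings.

```python
def parse_tag(field):
    """Split a TAG:TYPE:VALUE field and convert VALUE according to TYPE."""
    # Split at most twice, since string values may themselves contain colons.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    # "A" (character) and "Z" (string) stay as Python strings.
    return tag, type_code, value


print(parse_tag("NM:i:1"))
print(parse_tag("SA:Z:ref,29,-,6H5M,17,0"))
```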
The FLAG field stores 12 boolean flags as a single 12-bit integer in a bitwise OR format. For example, if the 1st, 2nd, and 5th boolean flags are `True` and all other flags are `False`, then the FLAG field will be: 2^0 + 2^1 + 2^4 = 1 + 2 + 16 = 19.
The most important flags (according to me) in the following table are shown in bold.
Flag | Binary | Decimal | Description |
---|---|---|---|
1 | 0000 0000 0001 | 1 | template has multiple segments |
2 | 0000 0000 0010 | 2 | each segment aligns properly |
3 | 0000 0000 0100 | 4 | segment unmapped |
4 | 0000 0000 1000 | 8 | RNEXT segment unmapped |
5 | 0000 0001 0000 | 16 | SEQ is reverse complemented |
6 | 0000 0010 0000 | 32 | SEQ of RNEXT is reverse complemented |
7 | 0000 0100 0000 | 64 | the first segment in the template |
8 | 0000 1000 0000 | 128 | the last segment in the template |
9 | 0001 0000 0000 | 256 | secondary alignment (multiple alignments possible, this one isn't primary) |
10 | 0010 0000 0000 | 512 | failed filtering stage, such as average Qscore |
11 | 0100 0000 0000 | 1024 | PCR or optical duplicate |
12 | 1000 0000 0000 | 2048 | supplementary (chimeric) alignment |
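Decoding a FLAG value is just a matter of testing each bit in turn. The following sketch uses the descriptions from the table above; the function and constant names are my own.

```python
FLAG_NAMES = [
    "template has multiple segments",
    "each segment aligns properly",
    "segment unmapped",
    "RNEXT segment unmapped",
    "SEQ is reverse complemented",
    "SEQ of RNEXT is reverse complemented",
    "the first segment in the template",
    "the last segment in the template",
    "secondary alignment",
    "failed filtering stage",
    "PCR or optical duplicate",
    "supplementary (chimeric) alignment",
]


def decode_flag(flag):
    """Return the descriptions of every flag bit set in a SAM FLAG value."""
    return [name for bit, name in enumerate(FLAG_NAMES) if flag & (1 << bit)]


for flag in (19, 2064):
    print(flag, decode_flag(flag))
```

For instance, 2064 = 2048 + 16, so it decodes to a reverse-complemented supplementary alignment.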
read001 89 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAGGATACTG *
read002 0 ref 9 30 2S6M1P1I4M * 0 0 AAAGATAAGGATA *
read003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0
read004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
read003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5H6M,30,1
read001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
Each header line begins with the character `@` followed by one of the following two-letter codes:
Code | Name | Description |
---|---|---|
HD | HeaDer | File metadata. |
SQ | SeQuence | Reference sequence information. |
RG | Read Group | Read group information. |
PG | ProGram | Program information. |
CO | COmment | Single-line text comment, with no formatting otherwise enforced. |
Following the two-letter code, each header line (excluding comments) consists of two-letter tab-delimited tags in the format `TAG:VALUE`.
Many standard tags (some of them mandatory) are defined in the SAM specification.
You may also define your own (lower case) header tags.
Here are the most commonly used header tags:
Tag | Name | Description |
---|---|---|
@HD VN | VersioN | SAM specification version. Mandatory. |
@HD SO | Sorting Order | Sorting order of alignments; one of: `unknown` (default), `unsorted`, `queryname`, and `coordinate`. |
@SQ SN | Sequence Name | Reference sequence name. Mandatory. |
@SQ LN | sequence LeNgth | Reference sequence length. Mandatory. |
@RG ID | IDentifier | Read group identifier. Mandatory if RG is present. |
@PG ID | IDentifier | Program identifier. Mandatory if PG is present. |
@PG CL | Command Line | Command line arguments of program which generated SAM file. |
@HD VN:1.6 SO:coordinate
@SQ SN:chr20 LN:64444167 my:custom tag data here
@PG ID:aligner VN:1.2.3 CL:/usr/bin/aligner reads.fastq ref.fasta --eqx
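Header lines can be split in much the same way as optional alignment fields. This is a small sketch with a function name of my own choosing; it assumes the fields are separated by real tab characters, and it keeps comment lines as raw text since their contents are unformatted.

```python
def parse_header_line(line):
    """Return (code, tags) for a SAM header line; comments keep their raw text."""
    code, _, rest = line.rstrip("\n").partition("\t")
    code = code.lstrip("@")
    if code == "CO":
        return code, rest
    tags = {}
    fields = rest.split("\t") if rest else []
    for field in fields:
        # Split on the first colon only, since values may contain colons.
        tag, _, value = field.partition(":")
        tags[tag] = value
    return code, tags


print(parse_header_line("@HD\tVN:1.6\tSO:coordinate"))
print(parse_header_line("@SQ\tSN:chr20\tLN:64444167"))
```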
The BAM (Binary Alignment Map) format is the compressed binary equivalent of the SAM format. As such, it is designed for computer (not human) manipulation, and I will not go into detail when the specification can be found here.
For performing standard operations on BAM files, I would recommend using samtools. For custom functionality, consider either pysam for convenience or htslib for speed.
The BED (Browser Extensible Data) format is used to store/annotate regions of a reference FASTA/FASTQ file. It is a tab-separated format consisting of twelve fields, and only the first three are mandatory.
Column | Field | Description |
---|---|---|
1 | chromosome | chromosome, contig, or scaffold name |
2 | start | 0-based start coordinate for region |
3 | end | non-inclusive end coordinate for region |
4 | name | name of the region |
5 | score | score (between 0 and 1000 for older formats) |
6 | strand | DNA strand orientation: positive (`+`), negative (`-`), or unknown (`.`) |
7 | thick_start | start coordinate for thicker GUI display of region |
8 | thick_end | end coordinate for thicker GUI display of region |
9 | RGB | RGB value in the format R,G,B for GUI display of region |
10 | block_count | number of blocks in this region |
11 | block_sizes | comma-separated list of sizes for block_count blocks |
12 | block_starts | comma-separated list of block_count block coordinates, relative to start |
The number of columns used is occasionally part of the file extension: for example, `.bed6`.
A powerful tool for working with BED files is bedtools, which only uses the first six columns.
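The 0-based start and non-inclusive end are a common source of off-by-one errors, so here is a short sketch that parses the first six columns and converts a region to the 1-based inclusive coordinates used by SAM and VCF positions. The function names and the example region are hypothetical.

```python
BED_FIELDS = ["chromosome", "start", "end", "name", "score", "strand"]


def parse_bed_line(line):
    """Parse the first (up to) six tab-separated columns of a BED line."""
    values = line.rstrip("\n").split("\t")
    if len(values) < 3:
        raise ValueError("BED lines need at least chromosome, start, and end")
    record = dict(zip(BED_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def region_length(record):
    # A 0-based start with a non-inclusive end means no "+ 1" is needed.
    return record["end"] - record["start"]


def to_one_based(record):
    """Convert to a 1-based, inclusive (start, end) interval."""
    return record["start"] + 1, record["end"]


region = parse_bed_line("chr20\t999\t1010\tregion1\t0\t+")
print(region_length(region), to_one_based(region))
```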
The VCF (Variant Call Format) is a text-based format used to store arbitrary genome and genotype variants with respect to a reference FASTA.
It consists of meta-information lines (prefixed with `##`), a mandatory fixed header (prefixed with `#`), and tab-delimited data lines. Empty fields use `.` as a placeholder.
Since the contents of the meta-information and data lines are highly interdependent, it makes sense to first show an example VCF file:
##fileformat=VCFv4.3
##fileDate=20210115
##source=VariantCallerScript
##reference=file:///references/HG38.fasta
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens">
##phasing=partial
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6 G A 29 PASS DP=14;DB; GT:DP:HQ 0|0:1:51,51 1|0:8:51,51
20 17330 . T A 3 q10 DP=11;AF=0.017 GT:DP:HQ 0|0:3:58,50 0|1:5:65,3
20 1110696 rs5 A G,T 67 PASS DP=10;AA=T;DB GT:DP:HQ 1|2:6:23,27 2|1:0:18,2
20 1230237 . T . 47 PASS DP=13;AA=T GT:DP:HQ 0|0:7:56,60 0|0:4:51,51
20 1234567 ms1 GTC G,GTCT 50 PASS DP=9;AA=G GT:DP 0/1:4 0/2:2
As you can see, the `INFO`, `FILTER`, and `FORMAT` "structured" lines in the meta-header are used to define corresponding fields in the data section. Each meta-information line must be in the format `##KEY=VALUE`. They can be in any order with the exception of `##fileformat=VCFv4.3`, which must be first.
Structured lines (such as `INFO`, `FILTER`, and `FORMAT`) require the following format:
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",Ploidy="3">
- `ID` is a unique two-character identifier for the new field. Many commonly-used fields have standard abbreviations.
- `Number` determines the number of comma-separated data values stored for each entry.
- The supported data `Type`s are `Integer`, `Float`, `Flag`, `Character`, and `String`. `Flag` is commonly used with `Number=0`.
- `Ploidy` is an example optional argument, which must have a `String` value.
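Since quoted descriptions may themselves contain commas, a structured line cannot simply be split on `,`. The sketch below (function name my own) uses a regular expression that accepts either a double-quoted or a bare value; it does not handle escaped quotes inside a description, so treat it as a starting point only.

```python
import re


def parse_meta_line(line):
    """Parse a ##KEY=VALUE line; structured <...> values become a dict."""
    key, _, value = line.rstrip("\n")[2:].partition("=")
    if not (value.startswith("<") and value.endswith(">")):
        return key, value
    fields = {}
    # Each value is either double-quoted (and may contain commas) or bare.
    for name, raw in re.findall(r'(\w+)=("[^"]*"|[^,]*)', value[1:-1]):
        fields[name] = raw.strip('"')
    return key, fields


print(parse_meta_line("##fileformat=VCFv4.3"))
print(parse_meta_line(
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
))
```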
The header line is tab-delimited and names the 8 fixed mandatory columns:
#CHROM POS ID REF ALT QUAL FILTER INFO
Column | Description |
---|---|
CHROM | The CHROMosome/contig of the variant. |
POS | The 1-based POSition of the variant. |
ID | An optional IDentifier for the variant. |
REF | The REFerence allele (forward strand). |
ALT | The ALTernate allele (forward strand). |
QUAL | The Phred QUALity score for the variant. |
FILTER | The name of any failed FILTERs, or the value `PASS`. |
INFO | Optional additional INFOrmation. |
If genotype data is present, these columns are followed by an additional `FORMAT` column header and an arbitrary number of sample IDs (e.g. `SAMPLE`, `HG001`, `HG002`).
Although most columns are explained sufficiently by the table above, others are more complicated.
REF/ALT
The following table illustrates how different variant types would be represented using the `REF` and `ALT` columns.
REF | ALT | Description |
---|---|---|
A | C | Single-nucleotide polymorphism (SNP), or substitution. |
A | ATTT | Insertion (of TTT), preceding A included. |
AG | A | Deletion (of G), preceding A included. |
AG | A,AGAG | Complex variant (diploid organism). Insertion for one allele, deletion for the other. |
AG | CT | Multi-nucleotide polymorphism (MNP). Most tools now report as two SNPs. |
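The table can be turned into a simple length-based classifier. This is a sketch of one possible heuristic, not how any particular variant caller does it; the function names are my own, and each comma-separated `ALT` allele is classified separately.

```python
def classify_variant(ref, alt):
    """Classify a single REF/ALT pair using only allele lengths and prefixes."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(ref) < len(alt) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "other"


def classify_site(ref, alt_column):
    """Classify every comma-separated ALT allele at a site."""
    return [classify_variant(ref, alt) for alt in alt_column.split(",")]


for ref, alt in [("A", "C"), ("A", "ATTT"), ("AG", "A"), ("AG", "A,AGAG"), ("AG", "CT")]:
    print(ref, alt, classify_site(ref, alt))
```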
INFO
Annotations in the `INFO` field are tag-value pairs, where tags and values are separated by an equals sign (`=`), and each pair is separated by a semicolon (`;`).
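A rough sketch of splitting an `INFO` column follows. The function name is my own; values are kept as lists of strings because the correct type and count depend on the matching `##INFO` line, and tags without an `=` (flags such as `DB`) are mapped to `True`.

```python
def parse_info(info):
    """Parse an INFO column into a dict; flags (no '=') map to True."""
    result = {}
    if info == ".":
        return result
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate a trailing semicolon
        key, separator, value = entry.partition("=")
        result[key] = value.split(",") if separator else True
    return result


print(parse_info("DP=10;AA=T;DB"))
```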
FORMAT/SAMPLE
The `FORMAT` column contains 2-letter identifier values separated by colons (`:`), which correspond to data values in the `SAMPLE` column (also separated by colons). The `Type` and `Number` of each data value (separated by commas) are given by the meta-information header. Here are some commonly used ones:
Tag | Description |
---|---|
AD | Unfiltered Allele Depth, number of reads supporting each allele. |
GT | GenoType, described in more detail below. |
DP | Filtered DePth, number of reads supporting each allele. |
GQ | Phred Genotype Quality, confidence that GT is correct, derived from `PL`. |
PL | Phred Likelihoods of the possible genotypes. |
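Pairing the `FORMAT` keys with a sample's values is a one-liner. In this sketch (function name my own) every value is kept as a list of strings, since converting them properly would require the `Type` and `Number` from the corresponding `##FORMAT` line.

```python
def parse_sample(format_column, sample_column):
    """Pair FORMAT keys with one sample's colon-separated values."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    return {key: value.split(",") for key, value in zip(keys, values)}


print(parse_sample("GT:DP:HQ", "0|0:1:51,51"))
```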
GT
With polyploid genomes, variants do not necessarily occur on every copy of the genome. The `GT` tag contains information regarding this.
GT | Name | Description |
---|---|---|
0/0 | homozygous reference | No mutations (REF/REF). |
0/1 | heterozygous | Mutation on one copy (REF/ALT1). |
1/1 | homozygous alternate | Same mutation on each copy (ALT1/ALT1). |
1/2 | heterozygous alternate | Different mutation on each copy (ALT1/ALT2). |
Additionally, the delimiter typically changes from a slash (`/`) to a pipe (`|`) when the haplotypes have been phased. This means that instead of only knowing that the variant occurs on one of two haplotypes, you also know which haplotype it occurred on. The following table shows how this works.
REF | ALT | GT | Hap 1 | Hap 2 |
---|---|---|---|---|
A | C | 0|0 | A | A |
A | C | 0|1 | A | C |
A | C | 1|0 | C | A |
A | C | 1|1 | C | C |
A | C,T | 1|2 | C | T |
A | C,T | 2|1 | T | C |
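The lookup in the table can be written as a few lines of code: index 0 is `REF`, and indices 1 and up refer to the comma-separated `ALT` alleles in order. This is a minimal sketch with a function name of my own; a missing call (`.`) is returned as `None`.

```python
import re


def genotype_alleles(ref, alt, gt):
    """Return (alleles, phased) for a GT string such as '0/1' or '2|1'."""
    alleles = [ref] + alt.split(",")
    phased = "|" in gt
    indices = re.split(r"[/|]", gt)
    called = [None if index == "." else alleles[int(index)] for index in indices]
    return called, phased


print(genotype_alleles("A", "C", "0|1"))
print(genotype_alleles("A", "C,T", "2|1"))
```

For an unphased genotype the order of the returned alleles carries no meaning, which is why the `phased` flag is returned alongside them.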
Tabix is a binary file format commonly used for indexing compressed VCF files. Its full specification can be found here. Indexing and compression can either be performed using bcftools:
bcftools view <filename>.vcf -Oz -o <filename>.vcf.gz
bcftools index -t <filename>.vcf.gz
or using the stand-alone bgzip and tabix programs:
bgzip <filename>.vcf
tabix -p vcf <filename>.vcf.gz