FAST5 is a data format developed by Oxford Nanopore Technologies (ONT), a specific HDF5 file structure designed to store raw nanopore current data in addition to flow cell metadata and optional basecalling results. The Hierarchical Data Format (HDF) was designed to allow efficient storage and access of arbitrarily structured datasets. ONT has defined two types of FAST5 files: single-FAST5s (a deprecated space-inefficient format which stores a single read per file), and multi-FAST5s (which store many, usually 4000, reads per file). Both are described below in more detail:
Here is an example raw single-FAST5 file:
PreviousReadInfo/           # useless, legacy
UniqueGlobalKey/
    context_tags/           # general metadata
    tracking_id/            # device metadata
    channel_id/             # channel metadata
Raw/
    Reads/
        Read_418/           # read-specific metadata
            Signal          # raw current data
The context_tags group stores general metadata.
The tracking_id group stores device metadata such as the heatsink_temp and lots of software versioning information.
The channel_id group stores channel metadata.
The Read_<#> group stores read-specific metadata such as the start_time, and contains the raw Signal data. These are the raw current values measured by the nanopore device.
As far as I can tell, PreviousReadInfo contains no information (perhaps it's used to identify the complement in 1D2 sequencing?).
If the file has been basecalled with Guppy (using --fast5_out), the single-FAST5 file may include the additional groups:
Analyses/
    Segmentation_000/
        Summary/
            segmentation/           # read size
    Basecall_1D_000/
        Summary/
            basecall_1d_template/   # read summary
        BaseCalled_template/
            StateData               # basecaller last layer
            Trace                   # base probabilities
            Move                    # base movements
            Fastq                   # output sequence
    Segmentation_001/
    Basecall_1D_001/
    ...
The suffix appended to each group begins at 000 and will count upwards each time additional basecalling data is appended to the FAST5 file.
The segmentation group contains only the read size.
The Basecall_1D_000 group contains basecalling metadata such as the basecaller model_type.
The basecall_1d_template group includes basecalling result summaries such as the called_events.
If the basecaller was called with --post_out, several additional fields will be present.
StateData contains the basecaller's last-layer transition probability predictions.
Trace contains the flip-flop predicted base probabilities (0-255) for each time step.
Move stores the number of bases the basecaller decided have been sequenced during each time step.
Fastq contains the sequence of predicted bases, output in FASTQ format.
Legacy event-based basecallers such as Albacore 1.0 and event-based tools such as Tombo will append additional fields which include event detection and segmentation data, which I will not cover here.
Multi-FAST5 files are incredibly similar to single-FAST5 files, with a few minor changes. Most importantly, multiple reads (usually 4000) are stored per file, reducing disk storage.
read_e6b2e0e3-e8ab-4158-8dec-9b594db0b508/
    context_tags/
    tracking_id/
    channel_id/
    Raw/
        Signal
    Analyses/
        ...
read_6eac07d9-74b5-483f-a5f9-3dcccb606bac/
    ...
Each read_<read_id> group stores run metadata. Within this group, most information is stored in the same format. The main notable difference that I have found is that the Raw group stores the same information previously stored under Raw/Reads/Read_<#>.
ONT provides a Python library called ont_fast5_api for working with FAST5 files, which includes a single_to_multi_fast5.py conversion script (which unfortunately doesn't work at the time of writing). Personally, I have found ont_fast5_api to be too inflexible for research purposes and have instead used h5py.
There exist several useful HDF5 command-line tools such as h5dump, in addition to graphical HDF5 viewers such as hdfview and vitables.
The FASTA format is used to represent amino or nucleic acid sequences. For each sequence:
The first line begins with > and is followed by a sequence identifier and optional description.
The following lines contain the sequence itself.
Here is a simple FASTA file:
>sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
>sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
A sample NCBI sequence obtained from GenInfo and GenBank might look as follows:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
Note that it is common practice to restrict FASTA files to at most 80 characters per line. Additionally, while the sequence identifier may contain arbitrary text, the initial > cannot be followed by a space. Some parsers will also consider any text after a space on this line to be a comment, which is why NCBI uses the | character to separate fields within the identifier.
There are several commonly-used FASTA file extensions, each with slightly different meanings.
|.fna||FASTA containing nucleic acids|
|.ffn||FASTA containing protein-coding nucleotides|
|.faa||FASTA containing amino acids|
|.frn||FASTA containing non-coding RNA|
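To make the FASTA layout described above concrete, here is a minimal parsing sketch using only the Python standard library. The function names (parse_fasta, split_header) are my own for illustration and are not taken from any existing tool. It assumes that the identifier ends at the first whitespace on the header line and that everything after it is a description, which matches the "text after a space is a comment" behaviour mentioned above but is not something every parser agrees on.

```python
from typing import Iterable, Iterator, Tuple


def split_header(header: str) -> Tuple[str, str]:
    """Split a header (without the leading '>') into identifier and description."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, description, sequence) for each FASTA record."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        if line.startswith(">"):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


example = [
    ">sequence1",
    "ACGTTAGAGAGCTTCAGG",
    "CAGCTCTCTTAGAG",
    ">sequence2 an optional description",
    "AGGGGCGCCC",
]

records = list(parse_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence), description)
```

Wrapped sequence lines are simply concatenated, so the 80-character convention has no effect on the parsed result.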
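FASTQ is an extension of the FASTA format (used for storing nucleotide or amino acid sequences), which includes quality scores. For each sequence:
The first line begins with @ and is followed by a sequence identifier and optional description.
The second line contains the sequence.
The third line begins with + and is optionally (usually not) followed by the same sequence identifier.
The fourth line contains one quality character per base of the sequence.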
Here is a simple FASTQ file:
@sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
+
!@#%!ADN#$#!@#%%sDFGN!#$%!GQWEWG
@sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
+
sahjdlfhasdGAEHKWEKNVLAEW?!@#($%#$(%c$(#$*2
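As a rough sketch of how the four-line FASTQ record can be read, the snippet below parses records and converts each quality character to an integer score. Two things here are assumptions of mine rather than anything stated above: the function name parse_fastq is made up for illustration, and the quality characters are assumed to use the common Phred+33 encoding (score = ASCII code minus 33). It also assumes each record occupies exactly four lines, with no wrapped sequences.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 ("Sanger") quality encoding


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, quality scores) for each four-line record."""
    cleaned = [line.rstrip("\r\n") for line in lines]
    cleaned = [line for line in cleaned if line]  # drop blank lines
    if len(cleaned) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(cleaned), 4):
        header, sequence, separator, quality = cleaned[i:i + 4]
        record_number = i // 4 + 1
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record {record_number}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in record {record_number}")
        parts = header[1:].split(None, 1)
        identifier = parts[0] if parts else ""
        scores = [ord(char) - PHRED_OFFSET for char in quality]
        yield identifier, sequence, scores


example = [
    "@sequence1",
    "ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG",
    "+",
    "!@#%!ADN#$#!@#%%sDFGN!#$%!GQWEWG",
]

fastq_records = list(parse_fastq(example))
for identifier, sequence, scores in fastq_records:
    print(identifier, len(sequence), scores[:5])
```

Note that the quality line may itself begin with @, so records have to be located by counting lines rather than by searching for the @ character.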
The SAM (Sequence Alignment/Map) format is a text-based format for storing the alignments of nucleotide sequences. The SAMv1 specification can be found here, and its compressed binary equivalent is BAM. A SAM file consists of an optional header section followed by an alignment section; I'll start by explaining the alignment section.
Each line in this section consists of 11 tab-delimited mandatory fields, followed by a variable number of optional tags. An aligned nucleotide sequence is referred to as a "query" or "read", whereas the corresponding section of the reference sequence is known as the "template". There may be multiple lines corresponding to a single read if there are multiple or chimeric mappings. A chimeric mapping occurs when there is a non-linear alignment of a query to the reference. This usually happens when a read spans a structural variant such as an inversion mutation.
|1||Query NAME||string||query name, non-unique since there may be multiple alignments of the same query.|
|2||bitwise FLAG||integer||16-bit FLAG, described in more detail below.|
|3||Ref sequence NAME||string||reference sequence name (which must be included in the @SQ SN tag).|
|4||mapping POSition||integer||1-based leftmost mapping position, 0 for an unmapped query.|
|5||MAPping Quality||integer||Quality score encoding mapping confidence.|
|6||CIGAR string||string||CIGAR string encoding alignment of read to reference.|
|7||Ref name of NEXT read||string||Name of the next read (or the read's mate).|
|8||Position of NEXT read||integer||1-based leftmost mapping position of the next read (or the read's mate).|
|9||Template LENgth||integer||The (inclusive) length of the reference section (template) the read is aligned to.|
|10||acid code SEQuence||string||A string representing the nucleic or amino acids which compose the read.|
|11||encoded QUALity scores||string||ASCII-encoded quality scores for each called base.|
Additional tab-delimited fields can be included which follow the TAG:TYPE:VALUE format. As with header tags, the TAG must be two letters, and lowercase letters are reserved for end users. The most common types are integer (i), float (f), character (A), and string (Z).
The FLAG field stores 12 boolean flags as a single 12-bit integer in a bitwise OR format. For example, if the 1st, 2nd, and 5th boolean flags are True and all other flags are False, then the FLAG field will be: 2^0 + 2^1 + 2^4 = 1 + 2 + 16 = 19.
The most important flags (according to me) in the following table are shown in bold.
|1||0000 0000 0001||1||template has multiple segments|
|2||0000 0000 0010||2||each segment aligns properly|
|3||0000 0000 0100||4||segment unmapped|
|4||0000 0000 1000||8||RNEXT segment unmapped|
|5||0000 0001 0000||16||SEQ is reverse complemented|
|6||0000 0010 0000||32||SEQ of RNEXT is reverse complemented|
|7||0000 0100 0000||64||the first segment in the template|
|8||0000 1000 0000||128||the last segment in the template|
|9||0001 0000 0000||256||secondary alignment (multiple alignments possible, this one isn't primary)|
|10||0010 0000 0000||512||failed filtering stage, such as average Qscore|
|11||0100 0000 0000||1024||PCR or optical duplicate|
|12||1000 0000 0000||2048||supplementary (chimeric) alignment|
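The bitwise encoding in the table above can be unpacked with a few lines of Python. This is only a sketch: the FLAG_BITS list and the decode_flag name are my own, and the descriptions are copied from the table rather than from the specification's wording.

```python
from typing import List

# (bit value, description) pairs, in the same order as the table above
FLAG_BITS = [
    (1, "template has multiple segments"),
    (2, "each segment aligns properly"),
    (4, "segment unmapped"),
    (8, "RNEXT segment unmapped"),
    (16, "SEQ is reverse complemented"),
    (32, "SEQ of RNEXT is reverse complemented"),
    (64, "the first segment in the template"),
    (128, "the last segment in the template"),
    (256, "secondary alignment"),
    (512, "failed filtering stage"),
    (1024, "PCR or optical duplicate"),
    (2048, "supplementary (chimeric) alignment"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the descriptions of every bit that is set in a SAM FLAG."""
    return [description for bit, description in FLAG_BITS if flag & bit]


# 19 = 1 + 2 + 16, the worked example from the text
for description in decode_flag(19):
    print(description)
```

Going the other way is just a bitwise OR (or sum) of the individual bit values, for example 1 | 2 | 16 for the same three flags.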
read001 89 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAGGATACTG *
read002 0 ref 9 30 2S6M1P1I4M * 0 0 AAAGATAAGGATA *
read003 0 ref 9 30 5S6M * 0 0 GCCTAAGCTAA * SA:Z:ref,29,-,6H5M,17,0
read004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
read003 2064 ref 29 17 6H5M * 0 0 TAGGC * SA:Z:ref,9,+,5H6M,30,1
read001 147 ref 37 30 9M = 7 -39 CAGCGGCAT * NM:i:1
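Here is a small sketch of how one alignment line could be split into its 11 mandatory fields plus optional TAG:TYPE:VALUE tags. The Alignment class and parse_alignment function are hypothetical names of mine, not part of any SAM library, and only the i and f tag types are converted to numbers; everything else is kept as a string. The example lines are written with explicit tab characters, since real SAM files are tab-delimited.

```python
from typing import Dict, NamedTuple, Union

TagValue = Union[int, float, str]


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: Dict[str, TagValue]


def parse_alignment(line: str) -> Alignment:
    """Parse one tab-delimited SAM alignment line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("an alignment line needs 11 mandatory fields")
    tags: Dict[str, TagValue] = {}
    for field in fields[11:]:
        tag, type_code, value = field.split(":", 2)
        if type_code == "i":
            tags[tag] = int(value)
        elif type_code == "f":
            tags[tag] = float(value)
        else:
            tags[tag] = value  # A, Z and other types are left as text
    return Alignment(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tags,
    )


supplementary = parse_alignment(
    "read003\t2064\tref\t29\t17\t6H5M\t*\t0\t0\tTAGGC\t*\tSA:Z:ref,9,+,5H6M,30,1"
)
mate = parse_alignment(
    "read001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1"
)
print(supplementary.qname, supplementary.flag, supplementary.tags)
print(mate.pos, mate.tlen, mate.tags["NM"])
```

Splitting each tag on at most two colons matters, because string values such as the SA tag may contain further punctuation of their own.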
Each header line begins with the character @ followed by one of the following two-letter codes:
|HD||HeaDer||File-level metadata.|
|SQ||SeQuence||Reference sequence information.|
|RG||Read Group||Read group information.|
|PG||ProGram||Program information.|
|CO||COmment||Single-line text comment, with no formatting otherwise enforced.|
Following the two-letter code, each header line (excluding comments) consists of tab-delimited fields in the format TAG:VALUE, where each TAG is two letters. Many standard tags (some of them mandatory) are defined in the SAM specification. You may also define your own (lower case) header tags. Here are the most commonly used header tags:
|@HD VN||VersioN||SAM specification version. Mandatory.|
|@HD SO||Sorting Order||Sorting order of alignments; one of: unknown, unsorted, queryname, or coordinate.|
|@SQ SN||Sequence Name||Reference sequence name. Mandatory.|
|@SQ LN||sequence LeNgth||Reference sequence length. Mandatory.|
|@RG ID||IDentifier||Read group identifier. Mandatory if RG is present.|
|@PG ID||IDentifier||Program identifier. Mandatory if PG is present.|
|@PG CL||Command Line||Command line arguments of program which generated SAM file.|
@HD VN:1.6 SO:coordinate
@SQ SN:chr20 LN:64444167 my:custom tag data here
@PG ID:aligner VN:1.2.3 CL:/usr/bin/aligner reads.fastq ref.fasta --eqx
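The header layout can be parsed in much the same way as the alignment lines. The sketch below is illustrative only: parse_header_line is a name I made up, and storing comment text under a "comment" key is my own convention, not something the SAM specification defines. The example uses explicit tabs between fields.

```python
from typing import Dict, Tuple


def parse_header_line(line: str) -> Tuple[str, Dict[str, str]]:
    """Return (record code, {tag: value}) for one SAM header line."""
    fields = line.rstrip("\r\n").split("\t")
    if not fields[0].startswith("@") or len(fields[0]) != 3:
        raise ValueError("header lines start with @ and a two-letter code")
    code = fields[0][1:]
    if code == "CO":
        # Comments are free text, so they are not split into tags.
        return code, {"comment": "\t".join(fields[1:])}
    tags: Dict[str, str] = {}
    for field in fields[1:]:
        tag, _, value = field.partition(":")  # split on the first colon only
        tags[tag] = value
    return code, tags


header = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr20\tLN:64444167\tmy:custom tag data here",
    "@PG\tID:aligner\tVN:1.2.3\tCL:/usr/bin/aligner reads.fastq ref.fasta --eqx",
]

parsed_header = [parse_header_line(line) for line in header]
for code, tags in parsed_header:
    print(code, tags)
```

Values are split on the first colon only, so a command line or path stored in CL keeps any colons it contains.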
The BAM (Binary Alignment Map) format is the compressed binary equivalent of the SAM format. As such, it is designed for computer (not human) manipulation, and I will not go into detail when the specification can be found here.
The BED (Browser Extensible Data) format is used to store/annotate regions of a reference FASTA/FASTQ file. It is a tab-separated format consisting of up to twelve fields, of which only the first three are mandatory.
|1||chromosome||chromosome, contig, or scaffold name|
|2||start||0-based start coordinate for region|
|3||end||non-inclusive end coordinate for region|
|4||name||name of the region|
|5||score||score (between 0 and 1000 for older formats)|
|6||strand||DNA strand orientation: positive (+), negative (-), or unknown (.)|
|7||thick_start||start coordinate for thicker GUI display of region|
|8||thick_end||end coordinate for thicker GUI display of region|
|9||RGB||RGB value in the format R,G,B (for example 255,0,0)|
|10||block_count||number of blocks in this region|
|11||block_sizes||comma-separated list of sizes for each block|
|12||block_starts||comma-separated list of block start positions, relative to start|
The number of columns used is occasionally part of the file extension: for example,
A powerful tool for working with BED files is bedtools, which only uses the first six columns.
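Because BED uses a 0-based start and a non-inclusive end, while SAM and VCF positions are 1-based, off-by-one errors are easy to make. The following sketch shows the arithmetic; the Region class and the helper names are hypothetical and only cover the first four columns.

```python
from typing import NamedTuple, Optional, Tuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, non-inclusive
    name: Optional[str] = None


def parse_bed_line(line: str) -> Region:
    """Parse the mandatory columns (and the optional name) of a BED line."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least three fields")
    name = fields[3] if len(fields) > 3 else None
    return Region(fields[0], int(fields[1]), int(fields[2]), name)


def region_length(region: Region) -> int:
    """Number of bases covered; no +1 is needed with a non-inclusive end."""
    return region.end - region.start


def to_one_based_inclusive(region: Region) -> Tuple[int, int]:
    """Convert to the 1-based, inclusive coordinates used by SAM and VCF."""
    return region.start + 1, region.end


region = parse_bed_line("chr20\t99\t200\tregionA")
print(region, region_length(region), to_one_based_inclusive(region))
```

With this convention, adjacent regions such as 0-100 and 100-200 do not overlap, which is one reason the half-open form is convenient.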
The VCF (Variant Call Format) is a text-based format used to store arbitrary genome and genotype variants with respect to a reference FASTA.
It consists of meta-information lines (prefixed with ##), a mandatory fixed header (prefixed with #), and tab-delimited data lines. Empty fields use . as a placeholder.
Since the contents of the meta-information and data lines are highly interdependent, it makes sense to first show an example VCF file:
##fileformat=VCFv4.3
##fileDate=20210115
##source=VariantCallerScript
##reference=file:///references/HG38.fasta
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens">
##phasing=partial
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002
20 14370 rs6 G A 29 PASS DP=14;DB; GT:DP:HQ 0|0:1:51,51 1|0:8:51,51
20 17330 . T A 3 q10 DP=11;AF=0.017 GT:DP:HQ 0|0:3:58,50 0|1:5:65,3
20 1110696 rs5 A G,T 67 PASS DP=10;AA=T;DB GT:DP:HQ 1|2:6:23,27 2|1:0:18,2
20 1230237 . T . 47 PASS DP=13;AA=T GT:DP:HQ 0|0:7:56,60 0|0:4:51,51
20 1234567 ms1 GTC G,GTCT 50 PASS DP=9;AA=G GT:DP 0/1:4 0/2:2
As you can see, the INFO, FILTER, and FORMAT "structured" lines in the meta-header are used to define corresponding fields in the data section.
Each meta-information line must be in the format ##key=value. They can be in any order with the exception of ##fileformat=VCFv4.3, which must be first.
Structured lines (such as INFO and FORMAT) require the following format: ##INFO=<ID=...,Number=...,Type=...,Description="...">
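ID is a unique identifier (often two characters) for the new field. Many commonly-used fields have standard abbreviations.
Number determines the number of comma-separated data values stored for each entry.
Type gives the data type of each value. The supported data types are Integer, Float, Flag, Character, and String. Flag is commonly used with Number=0, since a flag is either present or absent and carries no value.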
Ploidy is an example optional argument, which must have a
The header line is tab-delimited and names the 8 fixed mandatory columns:
#CHROM POS ID REF ALT QUAL FILTER INFO
|CHROM||The CHROMosome/contig of the variant.|
|POS||The 1-based POSition of the variant.|
|ID||An optional IDentifier for the variant.|
|REF||The REFerence allele (forward strand).|
|ALT||The ALTernate allele (forward strand).|
|QUAL||The Phred QUALity score for the variant.|
|FILTER||The name of any failed FILTERs, or the value PASS.|
|INFO||Optional additional INFOrmation.|
If genotype data is present, these columns are followed by an additional FORMAT column header and an arbitrary number of sample IDs (e.g. NA00001 and NA00002 in the example above).
Although most columns are explained sufficiently by the table above, others are more complicated. The following table illustrates how different variant types would be represented using the REF and ALT columns.
|A||C||Single-nucleotide polymorphism (SNP), or substitution.|
|A||ATTT||Insertion (of TTT), preceding A included.|
|AG||A||Deletion (of G), preceding A included.|
|AG||A,AGAG||Complex variant (diploid organism). Insertion for one allele, deletion for the other.|
|AG||CT||Multi-nucleotide polymorphism (MNP). Most tools now report as two SNPs.|
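The table above can be turned into a simple length-based heuristic for labelling a single REF/ALT pair. This is a sketch of my own, not how any particular variant caller classifies alleles: the classify_allele name is hypothetical, symbolic alleles such as <DEL> are not handled, and a multi-allelic ALT column has to be split on commas first.

```python
def classify_allele(ref: str, alt: str) -> str:
    """Label one REF/ALT pair using only the allele lengths and prefixes."""
    if alt == "." or alt == ref:
        return "no variant"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"


# The rows of the table above; a multi-allelic ALT is split on commas.
rows = [("A", "C"), ("A", "ATTT"), ("AG", "A"), ("AG", "A,AGAG"), ("AG", "CT")]
for ref, alt_column in rows:
    labels = [classify_allele(ref, alt) for alt in alt_column.split(",")]
    print(ref, alt_column, labels)
```

The prefix checks rely on the convention shown in the table, where the base preceding an insertion or deletion is included in both REF and ALT.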
Annotations in the INFO field are tag-value pairs, where tags and values are separated by an equals sign (=), and each pair is separated by a semicolon (;).
The FORMAT column contains 2-letter identifier values separated by colons (:), which correspond to data values in the SAMPLE column (also separated by colons). The Number of each data value (separated by commas) is given by the meta-information header. Here are some commonly used ones:
|AD||Unfiltered Allele Depth, number of reads supporting each allele.|
|GT||GenoType, described in more detail below.|
|DP||Filtered DePth, total number of reads covering the position.|
|GQ||Phred Genotype Quality, confidence that the GT call is correct.|
|PL||Phred Likelihoods of the possible genotypes.|
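To show how the INFO, FORMAT, and sample columns fit together, here is a minimal sketch that parses one data line from the example file. The function names parse_info and parse_samples are mine, all values are left as strings (a fuller parser would use the Type and Number from the meta-information lines), and flag-style INFO entries with no = sign are stored as True, which is my own choice.

```python
from typing import Dict, List, Union


def parse_info(info: str) -> Dict[str, Union[str, bool]]:
    """Split an INFO column into tag-value pairs; bare tags become True."""
    result: Dict[str, Union[str, bool]] = {}
    if info == ".":
        return result
    for item in info.split(";"):
        if not item:
            continue  # tolerate a trailing semicolon
        key, has_value, value = item.partition("=")
        result[key] = value if has_value else True
    return result


def parse_samples(format_column: str, sample_columns: List[str]) -> List[Dict[str, str]]:
    """Pair each FORMAT key with the matching colon-separated sample value."""
    keys = format_column.split(":")
    return [dict(zip(keys, sample.split(":"))) for sample in sample_columns]


line = "20\t14370\trs6\tG\tA\t29\tPASS\tDP=14;DB;\tGT:DP:HQ\t0|0:1:51,51\t1|0:8:51,51"
fields = line.split("\t")
info = parse_info(fields[7])
samples = parse_samples(fields[8], fields[9:])
print(info)
print(samples)
```

Values with a Number greater than one, such as HQ here, stay as a comma-separated string and can be split further once the expected count is known.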
With polyploid genomes, variants do not necessarily occur on every copy of the genome. The GT tag contains information regarding this.
|0/0||homozygous reference||No mutations (REF/REF).|
|0/1||heterozygous||Mutation on one copy (REF/ALT1).|
|1/1||homozygous alternate||Same mutation on each copy (ALT1/ALT1).|
|1/2||heterozygous alternate||Different mutation on each copy (ALT1/ALT2).|
Additionally, the delimiter typically changes from a slash (/) to a pipe (|) when the haplotypes have been phased. This means that instead of only knowing that the variant occurs on one of two haplotypes, you also know which haplotype it occurred on. The following table shows how this works.
|REF||ALT||GT||Hap 1||Hap 2|
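The relationship between GT, REF, and ALT can be sketched as a small lookup: index 0 refers to REF, and indices 1, 2, and so on refer to the comma-separated ALT alleles in order. The decode_genotype name below is hypothetical, and the sketch assumes a single delimiter style within one GT value.

```python
from typing import List, Optional, Tuple


def decode_genotype(gt: str, ref: str, alts: List[str]) -> Tuple[bool, List[Optional[str]]]:
    """Return (is_phased, allele on each haplotype) for one GT value."""
    phased = "|" in gt
    delimiter = "|" if phased else "/"
    alleles = [ref] + alts  # index 0 is REF, 1.. are the ALT alleles
    haplotypes = [
        None if index == "." else alleles[int(index)]
        for index in gt.split(delimiter)
    ]
    return phased, haplotypes


# Third data line of the example file: REF=A, ALT=G,T, first sample GT=1|2
print(decode_genotype("1|2", "A", ["G", "T"]))
# Unphased heterozygous call: the order of the two alleles carries no meaning
print(decode_genotype("0/1", "T", ["A"]))
```

For a phased call the position in the returned list identifies the haplotype; for an unphased call it is only a list of the alleles present.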
bcftools view <filename>.vcf -Oz -o <filename>.vcf.gz
bcftools index <filename>.vcf.gz

bgzip <filename>.vcf
tabix -p vcf <filename>.vcf.gz