BioInformatics File Formats

FAST5
Background

FAST5 is a data format developed by Oxford Nanopore Technologies (ONT). It is a specific HDF5 file structure designed to store raw nanopore current data along with flow cell metadata and optional basecalling results. The Hierarchical Data Format (HDF) was designed for efficient storage and access of arbitrarily structured datasets. ONT has defined two types of FAST5 files: single-FAST5 (a deprecated, space-inefficient format that stores a single read per file) and multi-FAST5 (which stores many reads, usually 4000, per file). Both are described in more detail below.

Plain Single-FAST5 Files

Here is an example raw single-FAST5 file:

PreviousReadInfo/		# useless, legacy
UniqueGlobalKey/
	context_tags/		# general metadata
	tracking_id/		# device metadata
	channel_id/			# channel metadata
Raw/
	Reads/
		Read_418/		# read-specific metadata
			Signal		# raw current data
  • The context_tags group stores general metadata such as the filename, experiment_kit, and experiment_type.
  • The tracking_id group stores device metadata such as the asic_version, device_type, flow_cell_id, heatsink_temp and lots of software versioning information.
  • The channel_id group stores channel metadata such as the channel_number, sampling_rate, digitisation, offset, and range.
  • The Read_<#> group stores read-specific metadata such as the duration and start_time, and contains the raw Signal data. These are the raw current values measured by the nanopore device.
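
The digitisation, offset, and range attributes in channel_id are what you need to turn the raw integer Signal values into a physical current. Here is a minimal sketch of the commonly used scaling, assuming the attributes have already been read out of the file (for example with h5py); the function name and the numeric values are made up for illustration.

```python
def raw_to_picoamps(raw, digitisation, offset, range_pa):
    """Convert raw integer ADC values into current in picoamperes.

    Assumes the usual scaling: pA = (raw + offset) * range / digitisation.
    """
    scale = range_pa / digitisation
    return [(value + offset) * scale for value in raw]


# Hypothetical channel_id values, chosen only to keep the arithmetic simple.
signal_pa = raw_to_picoamps(
    [500, 512, 480], digitisation=8192.0, offset=10.0, range_pa=1024.0
)
print(signal_pa)
```

Keeping the raw integers on disk and scaling them on demand is what keeps the Signal dataset compact.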

As far as I can tell, PreviousReadInfo contains no information (perhaps it's used to identify the complement in 1D2 sequencing?).

Supplementary Basecaller Data

If the file has been basecalled with Guppy (using --fast5_out), the single-FAST5 file may include the following additional groups:

Analyses/
	Segmentation_000/
		Summary/
			segmentation/			# read size
	Basecall_1D_000/
		Summary/
			basecall_1d_template/	# read summary
		BaseCalled_template/
			StateData				# basecaller last layer
			Trace					# base probabilities
			Move					# base movements
			Fastq					# output sequence

	Segmentation_001/
	Basecall_1D_001/
	...

The suffix appended to each group begins at 000 and counts upwards each time additional basecalling data is appended to the FAST5 file.

  • The segmentation group contains only the has_template, duration_template and first_sample_template attributes.
  • The Basecall_1D_000 group contains basecalling metadata such as the basecaller version, model_type, and a time_stamp.
  • The basecall_1d_template group includes basecalling result summaries such as the mean_qscore, sequence_length, and called_events, as well as the model's block_stride.

If the basecaller was called with --post_out, several additional fields will be present.

  • StateData contains the basecaller's last-layer transition probability predictions, as well as the scale and offset.
  • Trace contains the flip-flop predicted base probabilities (0-255) for each time step.
  • Move stores the number of bases the basecaller decided have been sequenced during each time step.
  • Fastq contains the sequence of predicted bases, output in FASTQ format.
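
To make the role of Move and block_stride more concrete, here is a rough sketch of how the move table can be used to find where each called base starts in the raw signal. It assumes that each Move entry covers block_stride raw samples and that the basecalled portion of the signal starts at first_sample_template; the function name and the toy values are mine, not part of any ONT API.

```python
def base_start_samples(moves, block_stride, first_sample_template=0):
    """Return the raw-signal sample index at which each called base starts."""
    starts = []
    for step, n_bases in enumerate(moves):
        sample = first_sample_template + step * block_stride
        # A move of 0 means no new base was emitted during this time step.
        starts.extend([sample] * n_bases)
    return starts


# Toy move table: three bases called over six time steps.
starts = base_start_samples(
    [1, 0, 0, 1, 1, 0], block_stride=2, first_sample_template=100
)
print(starts)
```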

Legacy event-based basecallers such as Albacore 1.0 and event-based tools such as Tombo will append additional fields which include event detection and segmentation data, which I will not cover here.

Multi-FAST5 Files

Multi-FAST5 files are very similar to single-FAST5 files, with a few minor changes. Most importantly, multiple reads (usually 4000) are stored per file, which reduces disk usage.

read_e6b2e0e3-e8ab-4158-8dec-9b594db0b508/
	context_tags/
	tracking_id/
	channel_id/
	Raw/
		Signal
	Analyses/
		...

read_6eac07d9-74b5-483f-a5f9-3dcccb606bac/
...

The read_<read_id> group stores run metadata such as the pore_type and run_id. Within this group, most information is stored in the same format. The main difference that I have found is that the Raw group stores the same information previously stored under Raw/Reads/Read_<#>.

Tools

ONT provides a Python library called ont_fast5_api for working with FAST5 files, which includes a single_to_multi_fast5.py conversion script (which unfortunately doesn't work at the time of writing). Personally, I have found ont_fast5_api to be too inflexible for research purposes and have instead used h5py. There exist several useful HDF5 command-line tools such as h5ls and h5dump, in addition to graphical HDF5 viewers such as hdfview and vitables.

FASTA
Basic FASTA

The FASTA format is used to represent amino or nucleic acid sequences. For each sequence:

  • Line 1 begins with > and is followed by a sequence identifier and optional description.
  • Line 2 is the sequence of nucleotide or amino acid single-letter codes (found here), which may be wrapped across several lines.

Here is a simple FASTA file:

>sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
>sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
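
Since the format is so simple, a FASTA reader only takes a few lines. The following is a minimal sketch (the function names are my own) which also handles sequences that are wrapped across several lines, and which treats any text after the first space on the header line as the description.

```python
def make_record(header, chunks):
    # Text after the first space is treated as the optional description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(lines):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # a sequence may span several lines
    if header is not None:
        yield make_record(header, chunks)


example = """>sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
>sequence2 wrapped over two lines
AGGGGCGCCCTCCTATATTATT
AATCGGCATACGACTACGCAT
"""
records = list(parse_fasta(example.splitlines()))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```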

Standard Practices

NCBI has defined a set of standard identifiers used to map sequences obtained from databases back to the original database record.

A sample NCBI sequence obtained from GenInfo and GenBank might look as follows:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Note that it is common practice to restrict FASTA files to at most 80 characters per line. Additionally, while the sequence identifier may contain arbitrary text, the initial > cannot be followed by a space. Some parsers will also consider any text after a space on this line to be a comment, which is why NCBI uses the | delimiter.

There are several commonly-used FASTA file extensions, each with slightly different meanings.

Extension | Meaning
.fasta | generic FASTA
.fa | generic FASTA
.fna | FASTA containing nucleic acids
.ffn | FASTA containing protein-coding nucleotides
.faa | FASTA containing amino acids
.frn | FASTA containing non-coding RNA

FASTQ
Basic FASTQ

FASTQ is an extension of the FASTA format (used for storing nucleotide or amino acid sequences), which includes quality scores. For each sequence:

  • Line 1 begins with @ and is followed by a sequence identifier and optional description.
  • Line 2 is the sequence of nucleotide single-letter codes (found here).
  • Line 3 begins with + and is optionally (usually not) followed by the same sequence identifier.
  • Line 4 encodes the quality scores for the sequence in Line 2, and must contain the same number of symbols.

Here is a simple FASTQ file:

@sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
+
!@#%!ADN#$#!@#%%sDFGN!#$%!GQWEWG
@sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
+
sahjdlfhasdGAEHKWEKNVLAEW?!@#($%#$(%c$(#$*2
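
Here is a small sketch of reading four-line FASTQ records and turning the quality string into numbers. Note the assumption: the text above does not say how Line 4 is encoded, so I am assuming the widespread Phred+33 convention, where each character's ASCII code minus 33 is the quality score. The function names and the tiny example record are made up.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, quality_string) from four-line records."""
    lines = [line.rstrip("\n") for line in lines if line.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, sequence, plus, quality = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header[1:]}")
        yield header[1:], sequence, quality


def phred33(quality):
    """Decode a quality string, ASSUMING Phred+33 (ASCII code minus 33)."""
    return [ord(symbol) - 33 for symbol in quality]


example = "@sequence1\nACGT\n+\n!5?I\n"
reads = list(parse_fastq(example.splitlines()))
scores = phred33(reads[0][2])
print(reads[0][0], scores)
```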

SAM

The SAM (Sequence Alignment Map) format is a text-based format for storing the alignments of nucleotide sequences. The SAMv1 specification can be found here, and its compressed binary equivalent is BAM. A SAM file consists of an optional header section followed by an alignment section; I'll start by explaining the alignment section.

SAM Alignment Data Section

Each line in this section consists of 11 tab-delimited mandatory fields, followed by a variable number of optional tags. An aligned nucleotide sequence is referred to as a "query" or "read", whereas the corresponding section of the reference sequence is known as the "template". There may be multiple lines corresponding to a single read if there are multiple or chimeric mappings. A chimeric mapping occurs when there is a non-linear alignment of a query to the reference. This usually happens when a read spans a structural variant such as an inversion.

Column | Field | Type | Description
1 | Query NAME | string | Query name, non-unique since there may be multiple alignments of the same query.
2 | bitwise FLAG | integer | 16-bit FLAG, described in more detail below.
3 | Ref sequence NAME | string | Reference sequence name (which must be included in the @SQ SN tag).
4 | mapping POSition | integer | 1-based leftmost mapping position, 0 for an unmapped query.
5 | MAPping Quality | integer | Quality score encoding mapping confidence.
6 | CIGAR string | string | CIGAR string encoding alignment of read to reference.
7 | Ref name of NEXT read | string | Reference sequence name of the next read (or the read's mate).
8 | Position of NEXT read | integer | 1-based leftmost mapping position of the next read (or the read's mate).
9 | Template LENgth | integer | Signed observed template length: the span on the reference from the leftmost to the rightmost mapped base of the template, negative for the rightmost read and 0 if unavailable.
10 | acid code SEQuence | string | A string representing the nucleic acids which compose the read.
11 | encoded QUALity scores | string | ASCII-encoded quality scores for each called base.
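
As a sketch of how these columns map onto code, here is one way to split an alignment line into the 11 mandatory fields plus any optional tags, using only the standard library. The Alignment class and parse_alignment function are names I made up; for real work, pysam does this (and much more) for you.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str
    tags: list  # optional fields, left as raw strings


def parse_alignment(line):
    """Split one SAM alignment line into its mandatory fields and tags."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, found {len(fields)}")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return Alignment(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     rnext, int(pnext), int(tlen), seq, qual, fields[11:])


line = "read001\t147\tref\t37\t30\t9M\t=\t7\t-39\tCAGCGGCAT\t*\tNM:i:1"
alignment = parse_alignment(line)
print(alignment.qname, alignment.pos, alignment.tlen, alignment.tags)
```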

Optional Alignment Fields

Additional tab-delimited fields can be included which follow the TAG:TYPE:VALUE format. As with header tags, the TAG must be two letters, and lowercase letters are reserved for end users. The most common types are integer (i), float (f), character (A), and string (Z).
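
A quick sketch of decoding a single optional field follows. It only converts the integer (i) and float (f) types and leaves everything else as a string; parse_tag is a hypothetical helper name. Splitting on at most two colons matters, because string values may themselves contain colons.

```python
def parse_tag(field):
    """Split a TAG:TYPE:VALUE field and convert numeric values."""
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    # Other types (such as A and Z) are kept as strings.
    return tag, value


print(parse_tag("NM:i:1"))
print(parse_tag("SA:Z:ref,29,-,6H5M,17,0;"))
```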

SAM Flags

The FLAG field stores 12 boolean flags in a single integer, combined using bitwise OR. For example, if the 1st, 2nd, and 5th boolean flags are True and all other flags are False, then the FLAG field will be: 2^0 + 2^1 + 2^4 = 1 + 2 + 16 = 19. The most important flags (according to me) in the following table are shown in bold.

Flag | Binary | Decimal | Description
1 | 0000 0000 0001 | 1 | template has multiple segments
2 | 0000 0000 0010 | 2 | each segment aligns properly
3 | 0000 0000 0100 | 4 | segment unmapped
4 | 0000 0000 1000 | 8 | RNEXT segment unmapped
5 | 0000 0001 0000 | 16 | SEQ is reverse complemented
6 | 0000 0010 0000 | 32 | SEQ of RNEXT is reverse complemented
7 | 0000 0100 0000 | 64 | the first segment in the template
8 | 0000 1000 0000 | 128 | the last segment in the template
9 | 0001 0000 0000 | 256 | secondary alignment (multiple alignments possible, this one isn't primary)
10 | 0010 0000 0000 | 512 | failed filtering stage, such as average Qscore
11 | 0100 0000 0000 | 1024 | PCR or optical duplicate
12 | 1000 0000 0000 | 2048 | supplementary (chimeric) alignment
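
Decoding a FLAG value is just a matter of testing each bit in turn. Here is a short sketch which reuses the descriptions from the table above; FLAG_BITS and decode_flag are my own names.

```python
FLAG_BITS = [
    (1, "template has multiple segments"),
    (2, "each segment aligns properly"),
    (4, "segment unmapped"),
    (8, "RNEXT segment unmapped"),
    (16, "SEQ is reverse complemented"),
    (32, "SEQ of RNEXT is reverse complemented"),
    (64, "the first segment in the template"),
    (128, "the last segment in the template"),
    (256, "secondary alignment"),
    (512, "failed filtering stage"),
    (1024, "PCR or optical duplicate"),
    (2048, "supplementary (chimeric) alignment"),
]


def decode_flag(flag):
    """Return the description of every bit that is set in a SAM FLAG."""
    return [description for bit, description in FLAG_BITS if flag & bit]


for value in (19, 2064):
    print(value, decode_flag(value))
```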

Example Alignments
read001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
read002	0	ref	9	30	2S6M1P1I4M	*	0	0	AAAGATAAGGATA	*
read003	0	ref	9	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0;
read004	0	ref	16	30	6M14N5M	*	0	0	ATAGCTTCAGC	*
read003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1;
read001	147	ref	37	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1

SAM Header Section

Each header line begins with the character @ followed by one of the following two-letter codes:

Code | Name | Description
HD | HeaDer | File metadata.
SQ | SeQuence | Reference sequence information.
RG | Read Group | Read group information.
PG | ProGram | Program information.
CO | COmment | Single-line text comment, with no formatting otherwise enforced.

Following the two-letter code, each header line (excluding comments) consists of two-letter tab-delimited tags in the format TAG:VALUE. Many standard tags (some of them mandatory) are defined in the SAM specification. You may also define your own (lower case) header tags. Here are the most commonly used header tags:

Tag | Name | Description
@HD VN | VersioN | SAM specification version. Mandatory.
@HD SO | Sorting Order | Sorting order of alignments; one of: unknown (default), unsorted, queryname, and coordinate.
@SQ SN | Sequence Name | Reference sequence name. Mandatory.
@SQ LN | sequence LeNgth | Reference sequence length. Mandatory.
@RG ID | IDentifier | Read group identifier. Mandatory if RG is present.
@PG ID | IDentifier | Program identifier. Mandatory if PG is present.
@PG CL | Command Line | Command line arguments of program which generated SAM file.

Example Header
@HD	VN:1.6	SO:coordinate
@SQ	SN:chr20	LN:64444167	my:custom tag data here
@PG	ID:aligner	VN:1.2.3	CL:/usr/bin/aligner reads.fastq ref.fasta --eqx
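
Header lines are easy to pick apart as long as you split on tabs first and then only on the first colon of each field, since values (such as CL) can contain spaces and further colons. A minimal sketch, with parse_header_line being a made-up name:

```python
def parse_header_line(line):
    """Return (code, tags) for a SAM header line; comments return raw text."""
    code, *fields = line.rstrip("\n").split("\t")
    if not code.startswith("@"):
        raise ValueError("header lines must begin with @")
    if code == "@CO":
        return "CO", "\t".join(fields)  # comments are free-form text
    # Split on the first colon only: values may contain colons themselves.
    tags = dict(field.split(":", 1) for field in fields)
    return code[1:], tags


header = "@PG\tID:aligner\tVN:1.2.3\tCL:/usr/bin/aligner reads.fastq ref.fasta --eqx"
code, tags = parse_header_line(header)
print(code, tags["CL"])
```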

BAM

The BAM (Binary Alignment Map) format is the compressed binary equivalent of the SAM format. As such, it is designed for computer (not human) manipulation, and I will not go into detail, since the specification can be found here.

For performing standard operations on BAM files, I would recommend using samtools. For custom functionality, consider either pysam for convenience or htslib for speed.

BED

The BED (Browser Extensible Data) format is used to store/annotate regions of a reference FASTA file. It is a tab-separated format consisting of twelve fields, of which only the first three are mandatory.

Column | Field | Description
1 | chromosome | chromosome, contig, or scaffold name
2 | start | 0-based start coordinate for region
3 | end | non-inclusive end coordinate for region
4 | name | name of the region
5 | score | score (between 0 and 1000 for older formats)
6 | strand | DNA strand orientation: positive (+), negative (-), or unknown (.)
7 | thick_start | start coordinate for thicker GUI display of region
8 | thick_end | end coordinate for thicker GUI display of region
9 | RGB | RGB value in the format R,G,B for GUI display of region
10 | block_count | number of blocks in this region
11 | block_sizes | comma-separated list of sizes for block_count blocks
12 | block_starts | comma-separated list of block_count block coordinates, relative to start

The number of columns used is occasionally part of the file extension: for example, .bed6. A powerful tool for working with BED files is bedtools, which only uses the first six columns.
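
The 0-based, non-inclusive coordinates are the part of BED that most often trips people up, so here is a small sketch that parses a line and shows the arithmetic. The field names follow the table above; parse_bed_line and the example region are invented for illustration.

```python
BED_FIELDS = [
    "chromosome", "start", "end", "name", "score", "strand",
    "thick_start", "thick_end", "rgb",
    "block_count", "block_sizes", "block_starts",
]


def parse_bed_line(line):
    """Parse a BED line with 3 to 12 tab-separated columns into a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) < 3:
        raise ValueError("BED lines need at least chromosome, start, and end")
    record = dict(zip(BED_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


region = parse_bed_line("chr20\t100\t150\texon1\t0\t+")
# With a 0-based start and a non-inclusive end, the length is end - start.
length = region["end"] - region["start"]
# The same region in 1-based, inclusive coordinates (as used by SAM and VCF).
one_based = (region["start"] + 1, region["end"])
print(region["name"], length, one_based)
```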

VCF

The VCF (Variant Call Format) is a text-based format used to store arbitrary genome and genotype variants with respect to a reference FASTA. It consists of meta-information lines (prefixed with ##), a mandatory fixed header (prefixed with #), and tab-delimited data lines. Empty fields use . as a placeholder. Since the contents of the meta-information and data lines are highly interdependent, it makes sense to first show an example VCF file:

##fileformat=VCFv4.3
##fileDate=20210115
##source=VariantCallerScript
##reference=file:///references/HG38.fasta
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens">
##phasing=partial
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID  REF ALT    QUAL FILTER INFO           FORMAT   NA00001     NA00002
20     14370   rs6 G   A      29   PASS   DP=14;DB       GT:DP:HQ 0|0:1:51,51 1|0:8:51,51
20     17330   .   T   A       3   q10    DP=11;AF=0.017 GT:DP:HQ 0|0:3:58,50 0|1:5:65,3
20     1110696 rs5 A   G,T    67   PASS   DP=10;AA=T;DB  GT:DP:HQ 1|2:6:23,27 2|1:0:18,2
20     1230237 .   T   .      47   PASS   DP=13;AA=T     GT:DP:HQ 0|0:7:56,60 0|0:4:51,51
20     1234567 ms1 GTC G,GTCT 50   PASS   DP=9;AA=G      GT:DP    0/1:4       0/2:2

As you can see, the INFO, FILTER, and FORMAT "structured" lines in the meta-header are used to define corresponding fields in the data section.

VCF Meta-Information Section

Each meta-information line must be in the format ##KEY=VALUE. They can be in any order with the exception of ##fileformat=VCFv4.3, which must be first. Structured lines (such as INFO, FILTER, and FORMAT) require the following format:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",Ploidy="3">

ID is a unique identifier for the new field (often, but not always, two characters: the example above also uses q10 and s50). Many commonly-used fields have standard abbreviations. Number gives the number of comma-separated data values stored for each entry; it is either an integer or a special value such as A (one value per alternate allele). The supported data Types are Integer, Float, Flag, Character, and String. Flag is commonly used with Number=0. Ploidy is an example optional argument, which must have a String value.
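
Structured lines cannot simply be split on commas, because a quoted Description may contain commas itself. Here is a sketch of a parser that copes with that, using a regular expression; it is a simplification (escaped quotes inside a value are not handled), and parse_meta_line is a name I made up.

```python
import re

STRUCTURED = re.compile(r'^##(\w+)=<(.*)>$')
# A value is either a double-quoted string or a run of non-comma characters.
PAIR = re.compile(r'(\w+)=("[^"]*"|[^,]*)')


def parse_meta_line(line):
    """Return (key, value); value is a dict for structured lines."""
    line = line.strip()
    match = STRUCTURED.match(line)
    if match is None:
        key, _, value = line[2:].partition("=")
        return key, value
    key, body = match.groups()
    return key, {name: value.strip('"') for name, value in PAIR.findall(body)}


key, fields = parse_meta_line(
    '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
)
print(key, fields)
print(parse_meta_line("##fileformat=VCFv4.3"))
```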

VCF Header Section

The header line is tab-delimited and names the 8 fixed mandatory columns:

#CHROM   POS   ID   REF   ALT   QUAL   FILTER   INFO

Column | Description
CHROM | The CHROMosome/contig of the variant.
POS | The 1-based POSition of variant.
ID | An optional IDentifier for the variant.
REF | The REFerence allele (forward strand).
ALT | The ALTernate allele (forward strand).
QUAL | The Phred QUALity score for the variant.
FILTER | The name of any failed FILTERs, or the value PASS.
INFO | Optional additional INFOrmation.

If genotype data is present, these columns are followed by an additional FORMAT column header and an arbitrary number of sample IDs (e.g. SAMPLE, HG001, HG002).

VCF Data Section

Although most columns are explained sufficiently by the table above, others are more complicated.

Variant Types: REF/ALT

The following table illustrates how different variant types would be represented using the REF and ALT columns.

REF | ALT | Description
A | C | Single-nucleotide polymorphism (SNP), or substitution.
A | ATTT | Insertion (of TTT), preceding A included.
AG | A | Deletion (of G), preceding A included.
AG | A,AGAG | Complex variant (diploid organism). Insertion for one allele, deletion for the other.
AG | CT | Multi-nucleotide polymorphism (MNP). Most tools now report as two SNPs.
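
The rows of this table can be told apart just by comparing the REF and ALT strings. Here is a deliberately simplified sketch that classifies one ALT allele at a time; it is a heuristic of my own for illustration, not how any particular variant caller does it, and it ignores symbolic alleles and missing values.

```python
def classify_variant(ref, alt):
    """Classify a single REF/ALT pair using only the allele strings."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"


# A multi-allelic record is classified allele by allele.
for ref, alts in [("A", "C"), ("A", "ATTT"), ("AG", "A,AGAG"), ("AG", "CT")]:
    print(ref, alts, [classify_variant(ref, alt) for alt in alts.split(",")])
```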

Annotations: INFO

Annotations in the INFO field are tag-value pairs, where tags and values are separated by an equals sign (=), and each pair is separated by a semicolon (;).
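
As a sketch, the INFO column can be parsed by splitting on semicolons and then on the first equals sign. Tags without a value (Type=Flag, such as DB in the example file) are stored as True here, and all other values are kept as lists of strings, since their Type and Number would have to be looked up in the meta-information lines. The function name is hypothetical.

```python
def parse_info(info):
    """Parse a VCF INFO column into a dict of tag -> list of values or True."""
    if info == ".":
        return {}  # placeholder for an empty field
    annotations = {}
    for item in info.split(";"):
        if not item:
            continue  # tolerate a stray trailing semicolon
        tag, has_value, value = item.partition("=")
        annotations[tag] = value.split(",") if has_value else True
    return annotations


print(parse_info("DP=10;AA=T;DB"))
```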

Sample Data: FORMAT/SAMPLE

The FORMAT column contains 2-letter identifier values separated by colons (:), which correspond to data values in the SAMPLE column (also separated by colons). The Type and Number of each data value (separated by commas), is given by the meta-information header. Here are some commonly used ones:

Tag | Description
AD | Unfiltered Allele Depth, number of reads supporting each allele.
GT | GenoType, described in more detail below.
DP | Filtered DePth, total number of reads covering the position for this sample.
GQ | Phred Genotype Quality, confidence that GT is correct, derived from PL.
PL | Phred Likelihoods of the possible genotypes.
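
Pairing the FORMAT keys with the values of one sample column is a one-liner with zip. A minimal sketch follows (parse_sample is my own name); values are left as lists of strings, because converting them properly requires the Type and Number from the matching ##FORMAT line.

```python
def parse_sample(format_field, sample_field):
    """Pair FORMAT keys with one sample's values, splitting values on commas."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip() stops at the shorter of the two lists.
    return {key: value.split(",") for key, value in zip(keys, values)}


sample = parse_sample("GT:DP:HQ", "1|0:8:51,51")
print(sample)
```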

Genotypes: GT

With polyploid genomes, variants do not necessarily occur on every copy of the genome. The GT tag contains information regarding this.

GT | Name | Description
0/0 | homozygous reference | No mutations (REF/REF).
0/1 | heterozygous | Mutation on one copy (REF/ALT1).
1/1 | homozygous alternate | Same mutation on each copy (ALT1/ALT1).
1/2 | heterozygous alternate | Different mutation on each copy (ALT1/ALT2).

Additionally, the delimiter typically changes from a slash (/) to a pipe (|) when the haplotypes have been phased. This means that instead of knowing that the variant occurs on one of two haplotypes, you also know which haplotype it occurred on. The following table shows how this works.

REF | ALT | GT | Hap 1 | Hap 2
A | C | 0|0 | A | A
A | C | 0|1 | A | C
A | C | 1|0 | C | A
A | C | 1|1 | C | C
A | C,T | 1|2 | C | T
A | C,T | 2|1 | T | C
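
The lookup in this table is simple to express in code: index 0 refers to REF, and indices 1, 2, ... refer to the comma-separated ALT alleles. Here is a short sketch (genotype_alleles is a hypothetical helper) which also reports whether the genotype is phased.

```python
import re


def genotype_alleles(ref, alt, gt):
    """Return (phased, alleles) for a GT value such as 0/1 or 2|1."""
    alleles = [ref] + alt.split(",")
    phased = "|" in gt
    calls = [
        alleles[int(index)] if index != "." else None  # "." is a missing call
        for index in re.split(r"[|/]", gt)
    ]
    return phased, calls


print(genotype_alleles("A", "C,T", "2|1"))
print(genotype_alleles("A", "C", "0/1"))
```

For an unphased genotype the order of the returned alleles carries no meaning; for a phased one, the first allele is on haplotype 1 and the second on haplotype 2.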

TBI

TBI (tabix index) is a binary file format commonly used for indexing bgzip-compressed VCF files. Its full specification can be found here. Compression and indexing can be performed either using bcftools:

bcftools view <filename>.vcf -Oz -o <filename>.vcf.gz
bcftools index -t <filename>.vcf.gz

or using the stand-alone bgzip and tabix programs:

bgzip <filename>.vcf
tabix -p vcf <filename>.vcf.gz