TimD.one

BioInformatics File Formats

FAST5

Background

FAST5 is a data format developed by Oxford Nanopore Technologies (ONT), a specific HDF5 file structure designed to store raw nanopore current data in addition to flow cell metadata and optional basecalling results. The Hierarchical Data Format (HDF) was designed to allow efficient storage and access of arbitrarily structured datasets. ONT has defined two types of FAST5 files: single-FAST5s (a deprecated space-inefficient format which stores a single read per file), and multi-FAST5s (which store many, usually 4000, reads per file). Both are described below in more detail:

Plain Single-FAST5 Files

Here is an example raw single-FAST5 file:

PreviousReadInfo/		# useless, legacy
UniqueGlobalKey/
	context_tags/		# general metadata
	tracking_id/		# device metadata
	channel_id/			# channel metadata
Raw/
	Reads/
		Read_418/		# read-specific metadata
			Signal		# raw current data

The context_tags group stores general metadata such as the filename, experiment_kit, and experiment_type.
The tracking_id group stores device metadata such as the asic_version, device_type, flow_cell_id, heatsink_temp and lots of software versioning information.
The channel_id group stores channel metadata such as the channel_number, sampling_rate, digitisation, offset, and range.
The Read_<#> group stores read-specific metadata such as the duration and start_time, and contains the raw Signal data. These are the raw current values measured by the nanopore device.
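The channel_id attributes are what you need to turn the raw integer Signal values into a current in picoamperes. Here is a minimal sketch, assuming the commonly used conversion (raw + offset) × range / digitisation; the function name and the numeric values are made up for illustration and do not come from a real file.

```python
def raw_to_picoamps(raw_signal, digitisation, offset, range_pa):
    """Convert raw integer samples to picoamperes.

    Assumes the usual scaling: (raw + offset) * range / digitisation.
    """
    scale = range_pa / digitisation
    return [(sample + offset) * scale for sample in raw_signal]


# Hypothetical channel_id attributes and raw samples (illustration only).
channel = {"digitisation": 8192.0, "offset": 10.0, "range": 1024.0}
raw = [490, 502, 515]

current = raw_to_picoamps(
    raw, channel["digitisation"], channel["offset"], channel["range"]
)
print(current)
```

In practice the raw samples and attributes would be read from the file (for example with h5py) rather than typed in by hand.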

As far as I can tell, PreviousReadInfo contains no information (perhaps it's used to identify the complement in 1D² sequencing?).

Supplementary Basecaller Data

If the file has been basecalled with Guppy (using --fast5-out), the single-FAST5 file may include the additional groups:

Analyses/
	Segmentation_000/
		Summary/
			segmentation/			# read size
	Basecall_1D_000/
		Summary/
			basecall_1d_template/	# read summary
		BaseCalled_template/
			StateData				# basecaller last layer
			Trace					# base probabilities
			Move					# base movements
			Fastq					# output sequence

	Segmentation_001/
	Basecall_1D_001/
	...

The suffix appended to each group begins at 000 and counts upwards each time additional basecalling data is appended to the FAST5 file.

The segmentation group contains only the has_template, duration_template and first_sample_template attributes.
The Basecall_1D_000 group contains basecalling metadata such as the basecaller version, model_type, and a time_stamp.
The basecall_1d_template group includes basecalling result summaries such as the mean_qscore, sequence_length, and called_events, as well as the model's block_stride.

If the basecaller was called with --post-out, several additional fields will be present.

StateData contains the basecaller's last-layer transition probability predictions, as well as the scale and offset.
Trace contains the flip-flop predicted base probabilities (0-255) for each time step.
Move stores the number of bases the basecaller decided have been sequenced during each time step.
Fastq contains the sequence of predicted bases, output in FASTQ format.
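To make the Move table more concrete, here is a rough sketch of how it could be used to find where each base starts in the raw signal. It assumes that each time step covers block_stride raw samples and that the first time step begins at first_sample_template; both assumptions, and the helper name base_start_samples, are mine rather than anything guaranteed by Guppy.

```python
def base_start_samples(move, block_stride, first_sample_template=0):
    """Return the (assumed) raw-signal start index of every called base."""
    starts = []
    for block_index, n_bases in enumerate(move):
        # Each time step is assumed to span block_stride raw samples.
        start = first_sample_template + block_index * block_stride
        # A step that emitted n bases contributes n entries.
        starts.extend([start] * n_bases)
    return starts


# Hypothetical Move table: bases emitted per time step.
move = [1, 0, 0, 1, 1, 0, 2]
starts = base_start_samples(move, block_stride=2, first_sample_template=100)
print(starts)
```

The number of entries returned equals the sum of the Move table, which should match the length of the called sequence.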

Legacy event-based basecallers such as Albacore 1.0 and event-based tools such as Tombo will append additional fields which include event detection and segmentation data, which I will not cover here.

Multi-FAST5 Files

Multi-FAST5 files are incredibly similar to single-FAST5 files, with a few minor changes. Most importantly, multiple reads (usually 4000) are stored per file, reducing disk storage.

read_e6b2e0e3-e8ab-4158-8dec-9b594db0b508/
	context_tags/
	tracking_id/
	channel_id/
	Raw/
		Signal
	Analyses/
		...

read_6eac07d9-74b5-483f-a5f9-3dcccb606bac/
...

The read_<read_id> group stores run metadata such as the pore_type and run_id. Within this group, most information is stored in the same format. The main notable difference that I have found is that the Raw group stores the same information previously stored under Raw/Reads/Read_<#>.

Tools

ONT provides a Python library called ont_fast5_api for working with FAST5 files, which includes a single_to_multi_fast5.py conversion script (which unfortunately doesn't work at the time of writing). Personally, I have found ont_fast5_api to be too inflexible for research purposes and have instead used h5py. There exist several useful HDF5 command-line tools such as h5ls and h5dump, in addition to graphical HDF5 viewers such as hdfview and vitables.

FASTA

Basic FASTA

The FASTA format is used to represent amino or nucleic acid sequences. For each sequence:

Line 1 begins with > and is followed by a sequence identifier and optional description.
Line 2 (and any following lines, up to the next >) is the sequence of nucleotide or amino acid single-letter codes (found here).

Here is a simple FASTA file:

>sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
>sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT

Standard Practices

NCBI has defined a set of standard identifiers used to map sequences obtained from databases back to the original database record.

A sample NCBI sequence obtained from GenInfo and GenBank might look as follows:

>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY

Note that it is common practice to restrict FASTA files to at most 80 characters per line. Additionally, while the sequence identifier may contain arbitrary text, the initial > cannot be followed by a space. Some parsers will also consider any text after a space on this line to be a comment, which is why NCBI uses the | delimiter.
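The rules above are simple enough to parse by hand. Here is a minimal sketch (the function name parse_fasta is my own) that joins wrapped sequence lines and treats everything after the first space on the header line as a description. It is not a replacement for a real parser.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header[0], header[1], "".join(chunks)))
            identifier, _, description = line[1:].partition(" ")
            header, chunks = (identifier, description), []
        elif header is not None:
            # Sequence lines may be wrapped; collect and join them later.
            chunks.append(line)
    if header is not None:
        records.append((header[0], header[1], "".join(chunks)))
    return records


example = """>sequence1
ACGTTAGAGAGCTTCAGG
CAGCTCTCTTAGAG
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSY
IENY
"""
records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, len(sequence), description)
```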

There are several commonly-used FASTA file extensions, each with slightly different meanings.

Extension	Meaning
`.fasta`	generic FASTA
`.fa`	generic FASTA
`.fna`	FASTA containing nucleic acids
`.ffn`	FASTA containing protein-coding nucleotides
`.faa`	FASTA containing amino acids
`.frn`	FASTA containing non-coding RNA

FASTQ

Basic FASTQ

FASTQ is an extension of the FASTA format (used for storing nucleotide or amino acid sequences), which includes quality scores. For each sequence:

Line 1 begins with @ and is followed by a sequence identifier and optional description.
Line 2 is the sequence of nucleotide single-letter codes (found here).
Line 3 begins with + and is optionally (usually not) followed by the same sequence identifier.
Line 4 encodes the quality scores for the sequence in Line 2, and must contain the same number of symbols.

Here is a simple FASTQ file:

@sequence1
ACGTTAGAGAGCTTCAGGCAGCTCTCTTAGAG
+
!@#%!ADN#$#!@#%%sDFGN!#$%!GQWEWG
@sequence2
AGGGGCGCCCTCCTATATTATTAATCGGCATACGACTACGCAT
+
sahjdlfhasdGAEHKWEKNVLAEW?!@#($%#$(%c$(#$*2
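Here is a small sketch of reading four-line FASTQ records and decoding the quality line. It assumes the widely used Phred+33 encoding (quality = ASCII code minus 33), which is not stated above, and it assumes that sequences are not wrapped over multiple lines. The function name parse_fastq is my own.

```python
def parse_fastq(text):
    """Return a list of (identifier, sequence, quality_scores) tuples."""
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"length mismatch in record: {header!r}")
        # Assumption: Phred+33, so '!' is quality 0 and 'I' is quality 40.
        scores = [ord(symbol) - 33 for symbol in quality]
        records.append((header[1:], sequence, scores))
    return records


records = parse_fastq("@sequence1\nACGT\n+\n!5I~\n")
print(records)
```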

SAM

The SAM (Sequence Alignment Map) format is a text-based format for storing the alignments of nucleotide sequences. The SAMv1 specification can be found here, and its compressed binary equivalent is BAM. Every SAM file contains both a header and alignment section; I'll start by explaining the alignment section.

SAM Alignment Data Section

Each line in this section consists of 11 tab-delimited mandatory fields, followed by a variable number of optional tags. An aligned nucleotide sequence is referred to as a "query" or "read", whereas the corresponding section of the reference sequence is known as the "template". There may be multiple lines corresponding to a single read if there are multiple or chimeric mappings. A chimeric mapping occurs when there is a non-linear alignment of a query to the reference. This usually happens when a read spans a structural variant such as an inversion mutation.

Column	Field	Type	Description
1	Query NAME	string	query name, non-unique since there may be multiple alignments of the same query.
2	bitwise FLAG	integer	16-bit FLAG, described in more detail below.
3	Ref sequence NAME	string	reference sequence name (which must be included in the @SQ SN tag).
4	mapping POSition	integer	1-based leftmost mapping position, 0 for an unmapped query.
5	MAPping Quality	integer	Quality score encoding mapping confidence.
6	CIGAR string	string	CIGAR string encoding alignment of read to reference.
7	Ref name of NEXT read	string	Reference sequence name of the alignment of the next read (or the read's mate).
8	Position of NEXT read	integer	1-based leftmost mapping position of the next read (or the read's mate).
9	Template LENgth	integer	The (inclusive) length of the reference section (template) the read is aligned to.
10	acid code SEQuence	string	A string representing the nucleic or amino acids which compose the read.
11	encoded QUALity scores	string	ASCII-encoded quality scores for each called base.

Optional Alignment Fields

Additional tab-delimited fields can be included which follow the TAG:TYPE:VALUE format. As with header tags, the TAG must be two letters, and lowercase letters are reserved for end users. The most common types are integer (i), float (f), character (A), and string (Z).
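A minimal sketch of splitting one optional field into its parts, converting only the integer and float types mentioned above. The helper name parse_tag is my own, and the other SAM types (such as B arrays) are simply left as strings here.

```python
def parse_tag(field):
    """Split a TAG:TYPE:VALUE field and convert i/f values to numbers."""
    # Split at most twice: string (Z) values may themselves contain colons.
    tag, type_code, value = field.split(":", 2)
    if type_code == "i":
        value = int(value)
    elif type_code == "f":
        value = float(value)
    return tag, type_code, value


print(parse_tag("NM:i:1"))
print(parse_tag("SA:Z:ref,29,-,6H5M,17,0"))
```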

SAM Flags

The FLAG field stores 12 boolean flags in a single integer, combined with a bitwise OR. For example, if the 1st, 2nd, and 5th boolean flags are True and all other flags are False, then the FLAG field will be: 2⁰+2¹+2⁴ = 1+2+16 = 19.

Flag	Binary	Decimal	Description
1	0000 0000 0001	1	template has multiple segments
2	0000 0000 0010	2	each segment aligns properly
3	0000 0000 0100	4	segment unmapped
4	0000 0000 1000	8	RNEXT segment unmapped
5	0000 0001 0000	16	SEQ is reverse complemented
6	0000 0010 0000	32	SEQ of RNEXT is reverse complemented
7	0000 0100 0000	64	the first segment in the template
8	0000 1000 0000	128	the last segment in the template
9	0001 0000 0000	256	secondary alignment (multiple alignments possible, this one isn't primary)
10	0010 0000 0000	512	failed filtering stage, such as average Qscore
11	0100 0000 0000	1024	PCR or optical duplicate
12	1000 0000 0000	2048	supplementary (chimeric) alignment
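Decoding a FLAG value is just a matter of testing each bit. Here is a short sketch that uses the descriptions from the table above; the names FLAG_NAMES and decode_flag are my own.

```python
# Descriptions in bit order: index 0 is the flag with decimal value 1.
FLAG_NAMES = [
    "template has multiple segments",
    "each segment aligns properly",
    "segment unmapped",
    "RNEXT segment unmapped",
    "SEQ is reverse complemented",
    "SEQ of RNEXT is reverse complemented",
    "the first segment in the template",
    "the last segment in the template",
    "secondary alignment",
    "failed filtering stage",
    "PCR or optical duplicate",
    "supplementary (chimeric) alignment",
]


def decode_flag(flag):
    """Return the descriptions of all bits that are set in flag."""
    return [name for bit, name in enumerate(FLAG_NAMES) if flag & (1 << bit)]


print(decode_flag(19))
print(decode_flag(2064))
```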

Example Alignments

read001	99	ref	7	30	8M2I4M1D3M	=	37	39	TTAGATAAAGGATACTG	*
read002	0	ref	9	30	2S6M1P1I4M	*	0	0	AAAGATAAGGATA	*
read003	0	ref	9	30	5S6M	*	0	0	GCCTAAGCTAA	*	SA:Z:ref,29,-,6H5M,17,0
read004	0	ref	16	30	6M14N5M	*	0	0	ATAGCTTCAGC	*
read003	2064	ref	29	17	6H5M	*	0	0	TAGGC	*	SA:Z:ref,9,+,5S6M,30,1
read001	147	ref	37	30	9M	=	7	-39	CAGCGGCAT	*	NM:i:1
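The CIGAR strings in these examples can be summarised by counting how many query and reference bases each operation consumes. The sketch below relies on the SAM specification's rule that M, I, S, = and X consume query bases, while M, D, N, = and X consume reference bases. The function name cigar_lengths is my own.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def cigar_lengths(cigar):
    """Return (query_length, reference_length) implied by a CIGAR string."""
    if re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar) is None:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    query_length = reference_length = 0
    for count, op in CIGAR_OP.findall(cigar):
        if op in CONSUMES_QUERY:
            query_length += int(count)
        if op in CONSUMES_REFERENCE:
            reference_length += int(count)
    return query_length, reference_length


for cigar in ["5S6M", "6M14N5M", "6H5M", "2S6M1P1I4M"]:
    print(cigar, cigar_lengths(cigar))
```

The query length should equal the length of SEQ. Hard-clipped (H) bases are not stored in SEQ, which is why 6H5M goes with a five-base sequence.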

SAM Header Section

Each header line begins with the character @ followed by one of the following two-letter codes:

Code	Name	Description
HD	HeaDer	File metadata.
SQ	SeQuence	Reference sequence information.
RG	Read Group	Read group information.
PG	ProGram	Program information.
CO	COmment	Single-line text comment, with no formatting otherwise enforced.

Following the two-letter code, each header line (excluding comments) consists of two-letter tab-delimited tags in the format TAG:VALUE. Many standard tags (some of them mandatory) are defined in the SAM specification. You may also define your own (lower case) header tags. Here are the most commonly used header tags:

Tag	Name	Description
@HD VN	VersioN	SAM specification version. Mandatory.
@HD SO	Sorting Order	Sorting order of alignments; one of: `unknown` (default), `unsorted`, `queryname`, and `coordinate`.
@SQ SN	Sequence Name	Reference sequence name. Mandatory.
@SQ LN	sequence LeNgth	Reference sequence length. Mandatory.
@RG ID	IDentifier	Read group identifier. Mandatory if RG is present.
@PG ID	IDentifier	Program identifier. Mandatory if PG is present.
@PG CL	Command Line	Command line arguments of program which generated SAM file.

Example Header

@HD	VN:1.6	SO:coordinate
@SQ	SN:chr20	LN:64444167	my:custom tag data here
@PG	ID:aligner	VN:1.2.3	CL:/usr/bin/aligner reads.fastq ref.fasta --eqx
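Header lines like these are easy to take apart. Here is a minimal sketch (parse_header_line is a name I made up) that returns the two-letter code and a dictionary of tags, and treats @CO lines as free text.

```python
def parse_header_line(line):
    """Return (code, tags) for a SAM header line; @CO yields (code, text)."""
    code, *fields = line.rstrip("\n").split("\t")
    if not code.startswith("@"):
        raise ValueError("header lines must start with @")
    code = code[1:]
    if code == "CO":
        # Comments have no TAG:VALUE structure.
        return code, "\t".join(fields)
    # Split on the first colon only: values such as CL may contain colons.
    tags = dict(field.split(":", 1) for field in fields)
    return code, tags


print(parse_header_line("@HD\tVN:1.6\tSO:coordinate"))
print(parse_header_line("@CO\tany text at all"))
```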

BAM

The BAM (Binary Alignment Map) format is the compressed binary equivalent of the SAM format. As such, it is designed for computer (not human) manipulation, and I will not go into detail here since the specification can be found here.

For performing standard operations on BAM files, I would recommend using samtools. For custom functionality, consider either pysam for convenience or htslib for speed.

BED

The BED (Browser Extensible Data) format is used to store/annotate regions of a reference sequence (typically a FASTA file). It is a tab-separated format consisting of up to twelve fields, of which only the first three are mandatory.

Column	Field	Description
1	chromosome	chromosome, contig, or scaffold name
2	start	0-based start coordinate for region
3	end	non-inclusive end coordinate for region
4	name	name of the region
5	score	score (between 0 and 1000 for older formats)
6	strand	DNA strand orientation: positive (`+`), negative (`-`), or unknown (`.`)
7	thick_start	start coordinate for thicker GUI display of region
8	thick_end	end coordinate for thicker GUI display of region
9	RGB	RGB value in the format `R,G,B` for GUI display of region
10	block_count	number of blocks in this region
11	block_sizes	comma-separated list of sizes for `block_count` blocks
12	block_starts	comma-separated list of `block_count` block coordinates, relative to `start`

The number of columns used is occasionally part of the file extension: for example, .bed6. A powerful tool for working with BED files is bedtools, which only uses the first six columns.
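Because start is 0-based and end is non-inclusive, the length of a region is simply end minus start, and converting to 1-based inclusive coordinates (as used by SAM and VCF) only requires adding one to start. Here is a small sketch; the function name and dictionary keys are my own.

```python
def parse_bed_line(line):
    """Parse the three mandatory BED columns and keep the rest as-is."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED lines need at least three columns")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return {
        "chrom": chrom,
        "start": start,                    # 0-based, inclusive
        "end": end,                        # non-inclusive
        "length": end - start,
        "one_based": (start + 1, end),     # 1-based, inclusive on both ends
        "extra": fields[3:],
    }


region = parse_bed_line("chr20\t99\t200\texon1\t0\t+")
print(region)
```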

VCF

The VCF (Variant Call Format) is a text-based format used to store arbitrary genome and genotype variants with respect to a reference FASTA. It consists of meta-information lines (prefixed with ##), a mandatory fixed header (prefixed with #), and tab-delimited data lines. Empty fields use . as a placeholder. Since the contents of the meta-information and data lines are highly interdependent, it makes sense to first show an example VCF file:

##fileformat=VCFv4.3
##fileDate=20210115
##source=VariantCallerScript
##reference=file:///references/HG38.fasta
##contig=<ID=20,length=62435964,assembly=B36,species="Homo sapiens">
##phasing=partial
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID  REF ALT    QUAL FILTER INFO           FORMAT   NA00001     NA00002
20     14370   rs6 G   A      29   PASS   DP=14;DB       GT:DP:HQ 0|0:1:51,51 1|0:8:51,51
20     17330   .   T   A      3    q10    DP=11;AF=0.017 GT:DP:HQ 0|0:3:58,50 0|1:5:65,3
20     1110696 rs5 A   G,T    67   PASS   DP=10;AA=T;DB  GT:DP:HQ 1|2:6:23,27 2|1:0:18,2
20     1230237 .   T   .      47   PASS   DP=13;AA=T     GT:DP:HQ 0|0:7:56,60 0|0:4:51,51
20     1234567 ms1 GTC G,GTCT 50   PASS   DP=9;AA=G      GT:DP    0/1:4       0/2:2

As you can see, the INFO, FILTER, and FORMAT "structured" lines in the meta-header are used to define corresponding fields in the data section.

VCF Meta-Information Section

Each meta-information line must be in the format ##KEY=VALUE. They can be in any order with the exception of ##fileformat=VCFv4.3, which must be first. Structured lines (such as INFO, FILTER, and FORMAT) require the following format:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype",Ploidy="3">

ID is a unique identifier for the new field (often a short two-letter abbreviation). Many commonly-used fields have standard abbreviations. Number determines the number of comma-separated data values stored for each entry; it may also be a special value such as A (one value per alternate allele). The supported data Types are Integer, Float, Flag, Character, and String. Flag is commonly used with Number=0. Ploidy is an example optional argument, which must have a String value.
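Here is a rough sketch of pulling the key-value pairs out of a structured line. A plain split on commas would break on descriptions that contain commas, so it uses a regular expression that understands double-quoted values. The names are my own, and escaped quotes inside a description are matched but not unescaped.

```python
import re

# A value is either a double-quoted string or a run of non-comma characters.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|[^,]*)')


def parse_structured_line(line):
    """Return (key, fields) for a line such as ##INFO=<ID=...,...>."""
    match = re.fullmatch(r"##(\w+)=<(.*)>", line.strip())
    if match is None:
        raise ValueError("not a structured meta-information line")
    key, body = match.groups()
    fields = {}
    for name, value in PAIR.findall(body):
        if value.startswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[name] = value
    return key, fields


line = '##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">'
print(parse_structured_line(line))
```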

VCF Header Section

The header line is tab-delimited and names the 8 fixed mandatory columns:

#CHROM   POS   ID   REF   ALT   QUAL   FILTER   INFO

Column	Description
CHROM	The CHROMosome/contig of the variant.
POS	The 1-based POSition of variant.
ID	An optional IDentifier for the variant.
REF	The REFerence allele (forward strand).
ALT	The ALTernate allele (forward strand).
QUAL	The Phred QUALity score for the variant.
FILTER	The name of any failed FILTERs, or the value `PASS`.
INFO	Optional additional INFOrmation.

If genotype data is present, these columns are followed by an additional FORMAT column header and an arbitrary number of sample IDs (e.g. SAMPLE, HG001, HG002).

VCF Data Section

Although most columns are explained sufficiently by the table above, others are more complicated.

Variant Types: `REF/ALT`

The following table illustrates how different variant types would be represented using the REF and ALT columns.

REF	ALT	Description
A	C	Single-nucleotide polymorphism (SNP), or substitution.
A	ATTT	Insertion (of TTT), preceding A included.
AG	A	Deletion (of G), preceding A included.
AG	A,AGAG	Complex variant (diploid organism). Insertion for one allele, deletion for the other.
AG	CT	Multi-nucleotide polymorphism (MNP). Most tools now report as two SNPs.
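The table above can be turned into a simple classifier that compares the lengths of REF and ALT. This is only a sketch with names I made up: it assumes that indels are written with the shared preceding base, as in the table, and it labels anything else as "complex" rather than trying to normalise it.

```python
def classify_allele(ref, alt):
    """Classify a single REF/ALT pair by comparing the two strings."""
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "multi-nucleotide polymorphism"
    if len(alt) > len(ref) and alt.startswith(ref):
        return "insertion"
    if len(ref) > len(alt) and ref.startswith(alt):
        return "deletion"
    return "complex"


def classify_variant(ref, alt_field):
    """Classify every comma-separated allele in the ALT column."""
    return [classify_allele(ref, alt) for alt in alt_field.split(",")]


for ref, alt in [("A", "C"), ("A", "ATTT"), ("AG", "A"), ("AG", "A,AGAG"), ("AG", "CT")]:
    print(ref, alt, classify_variant(ref, alt))
```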

Annotations: `INFO`

Annotations in the INFO field are tag-value pairs, where tags and values are separated by an equals sign (=), and each pair is separated by a semicolon (;).
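A minimal sketch of parsing an INFO string into a dictionary follows. Flags (tags with no value) are stored as True, values are kept as lists of strings since the real types live in the meta-information lines, and the function name parse_info is my own.

```python
def parse_info(info):
    """Parse an INFO column into a dict of tag -> list of values or True."""
    if info == ".":
        return {}
    result = {}
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate a stray trailing semicolon
        key, separator, value = entry.partition("=")
        # Tags without '=' are flags; everything else may hold several values.
        result[key] = value.split(",") if separator else True
    return result


print(parse_info("DP=10;AA=T;DB"))
```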

Sample Data: `FORMAT/SAMPLE`

The FORMAT column contains 2-letter identifier values separated by colons (:), which correspond to data values in the SAMPLE column (also separated by colons). The Type and Number of each data value (multiple values are separated by commas) are given by the meta-information header. Here are some commonly used ones:

Tag	Description
AD	Unfiltered Allele Depth, number of reads supporting each allele.
GT	GenoType, described in more detail below.
DP	Filtered DePth, total number of reads covering the site.
GQ	Phred Genotype Quality, confidence that `GT` is correct, derived from `PL`.
PL	Phred Likelihoods of the possible genotypes.
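Pairing the FORMAT keys with one sample's values is a one-liner. This sketch (parse_sample is my own name) keeps every value as a string and leaves type conversion to the caller.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT identifier with the matching SAMPLE value."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    # zip stops at the shorter list, so omitted trailing values are skipped.
    return dict(zip(keys, values))


sample = parse_sample("GT:DP:HQ", "0|0:1:51,51")
print(sample)
```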

Genotypes: `GT`

With polyploid genomes, variants do not necessarily occur on every copy of the genome. The GT tag contains information regarding this.

GT	Name	Description
0/0	homozygous reference	No mutations (REF/REF).
0/1	heterozygous	Mutation on one copy (REF/ALT1).
1/1	homozygous alternate	Same mutation on each copy (ALT1/ALT1).
1/2	heterozygous alternate	Different mutation on each copy (ALT1/ALT2).

Additionally, the delimiter typically changes from a slash (/) to a pipe (|) when the haplotypes have been phased. This means that instead of knowing that the variant occurs on one of two haplotypes, you also know which haplotype it occurred on. The following table shows how this works.

REF	ALT	GT	Hap 1	Hap 2
A	C	0|0	A	A
A	C	0|1	A	C
A	C	1|0	C	A
A	C	1|1	C	C
A	C,T	1|2	C	T
A	C,T	2|1	T	C
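The lookup in this table can be written out directly: index 0 refers to REF and index n refers to the n-th ALT allele. Here is a small sketch, with a function name of my own choosing, that also reports whether the genotype is phased.

```python
def genotype_alleles(ref, alt, gt):
    """Return (phased, alleles) for a GT value such as 0/1 or 2|1."""
    alleles = [ref] + alt.split(",")
    phased = "|" in gt
    indices = gt.replace("|", "/").split("/")
    # A '.' index means the allele was not called.
    called = [None if index == "." else alleles[int(index)] for index in indices]
    return phased, called


print(genotype_alleles("A", "C,T", "2|1"))
print(genotype_alleles("A", "C", "0/1"))
```

For an unphased genotype, the order of the returned alleles carries no meaning.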

TBI

TBI (the tabix index) is a binary file format commonly used for indexing compressed VCF files. Its full specification can be found here. Compression and indexing can be performed either using bcftools:

bcftools view <filename>.vcf -Oz -o <filename>.vcf.gz
bcftools index -t <filename>.vcf.gz

or using the stand-alone bgzip and tabix programs:

bgzip <filename>.vcf
tabix -p vcf <filename>.vcf.gz
