Lecture 17: DNA Sequencing Technologies

Learning Objectives

By the end of this lecture, you will be able to:

  • Explain the principle of sequencing-by-synthesis and how it differs from sequencing-by-ligation
  • Compare historical sequencing platforms (454, SOLiD, Ion Torrent) and explain why they were displaced
  • Describe Illumina’s dye chemistry evolution from 4-channel to 2-channel to 1-channel imaging
  • Explain Element Biosciences’ avidite chemistry and why decoupling detection from incorporation matters
  • Describe PacBio’s zero-mode waveguides and how circular consensus sequencing produces HiFi reads
  • Explain nanopore sequencing, direct RNA sequencing, and adaptive sampling
  • Choose an appropriate sequencing platform given a specific experimental goal

1. Introduction: The Sequencing Revolution

DNA sequencing has undergone extraordinary transformations since Frederick Sanger’s chain-termination method in 1977. That original approach – using dideoxynucleotides to terminate growing DNA chains at random positions – dominated for nearly three decades and was the workhorse behind the Human Genome Project (completed 2003). The human genome cost approximately $2.7 billion to sequence using Sanger technology.

The transition from Sanger to next-generation sequencing (NGS) began in 2005 and triggered a cost reduction that outpaced Moore’s Law by orders of magnitude. We now routinely sequence a human genome for under $200 in a matter of hours.

Generations of Sequencing

Generation Era Key Feature Examples
1st 1977-2004 Chain termination Sanger (ABI 3730)
2nd 2005-present Massively parallel short reads 454, Illumina, SOLiD, Ion Torrent, Element
3rd 2011-present Single-molecule, long reads PacBio, Oxford Nanopore
NoteThe Common Thread

Nearly all modern sequencing platforms share the sequencing-by-synthesis (SBS) principle: watch a DNA polymerase incorporate nucleotides one at a time and detect which base was added at each step. The platforms differ in how they detect incorporation – light, pH change, electrical current, or fluorescence duration. (SOLiD is a notable exception, using oligonucleotide ligation rather than polymerase activity.)


2. Historical Technologies

Three platforms dominated the early NGS era. All have been largely superseded, but understanding them illuminates the design trade-offs that shaped current technology.

2.1 454 Pyrosequencing

454 Life Sciences launched the first commercial NGS platform in 2005 (Margulies et al. 2005, Nature 437:376-380). Roche acquired the company in 2007 and discontinued the platform in 2013.

How it worked:

  1. Fragment DNA and ligate adapters
  2. Capture single fragments on beads via emulsion PCR – each bead carries ~10 million copies of one fragment
  3. Deposit beads into picotiter plate wells (one bead per well)
  4. Flow nucleotides sequentially (T, then A, then G, then C, repeat)
  5. When a nucleotide is incorporated, pyrophosphate (PPi) is released
  6. PPi is converted to ATP by sulfurylase; ATP drives luciferase to produce a flash of light
  7. A CCD camera captures the light; signal intensity is proportional to the number of bases incorporated

Key characteristics:

  • Read length: up to ~1000 bp (longest of any NGS platform at the time)
  • Throughput: ~700 Mb per run
  • Main error type: homopolymer insertion/deletions – runs of identical bases (e.g., AAAA) produce a single light pulse whose intensity must be quantified, and intensity estimation degrades for runs >6 bp
  • Run time: ~10 hours

454 Pyrosequencing Animation

Watch nucleotides flow over a well. When the correct nucleotide is incorporated, pyrophosphate triggers a light flash.

2.2 SOLiD (Sequencing by Oligonucleotide Ligation and Detection)

Applied Biosystems launched SOLiD in 2007, building on work by Shendure et al. 2005, Science 309:1728-1732.

How it worked:

  1. DNA fragments amplified on beads via emulsion PCR (similar to 454)
  2. Beads deposited on a glass slide
  3. Sequencing proceeds by ligation, not polymerase synthesis
  4. Short fluorescently-labeled di-base probes (8-mers) hybridize to the template adjacent to a primer
  5. DNA ligase joins probe to the primer
  6. Fluorophore is imaged, then cleaved
  7. Process repeats with offset primers – each base is interrogated twice from different ligation rounds

Two-base encoding (color space):

Each fluorescent label encodes a dinucleotide, not a single base. The read is recorded as a series of colors (0, 1, 2, 3) rather than bases. Decoding requires knowing the first base, and a single error propagates through the entire read. However, true SNPs change only one color, while sequencing errors change two consecutive colors – enabling powerful error correction.

Key characteristics:

  • Read length: ~75 bp (short)
  • Accuracy: very high (~99.94%) after color-space error correction
  • Throughput: up to 30 Gb per run
  • Main limitation: short reads, complex analysis pipeline, slow adoption

2.3 Ion Torrent

Life Technologies (now Thermo Fisher) launched Ion Torrent in 2010 (Rothberg et al. 2011, Nature 475:348-352). This was the first post-light sequencing technology.

How it worked:

  1. DNA fragments on beads via emulsion PCR, loaded into microwells on a semiconductor chip
  2. Nucleotides flowed sequentially (like 454)
  3. When a nucleotide is incorporated, a hydrogen ion (H+) is released, lowering the pH
  4. An ISFET (ion-sensitive field-effect transistor) sensor beneath each well detects the pH change
  5. No optics, no cameras – purely electronic detection on a standard CMOS chip

Key characteristics:

  • Read length: up to 400 bp (Ion S5)
  • Run time: 2-4 hours (fastest of early NGS)
  • Main error: homopolymer indels (same problem as 454 – voltage proportional to number of incorporations)
  • Still used in clinical settings (Ion GeneStudio S5) for targeted panels
ImportantWhy Homopolymers Are Hard

Both 454 and Ion Torrent measure the magnitude of a signal (light intensity or voltage) to determine how many identical bases were incorporated in a single flow. Distinguishing 5 incorporations from 6 is inherently imprecise because signal scales sub-linearly and noise increases. Illumina avoids this entirely by incorporating one base at a time using reversible terminators.

Comparison of Historical Platforms

Feature 454 SOLiD Ion Torrent
Year launched 2005 2007 2010
Detection method Light (pyrosequencing) Fluorescence (ligation) pH (semiconductor)
Read length ~1000 bp ~75 bp ~400 bp
Max throughput/run ~700 Mb ~30 Gb ~50 Gb
Primary error type Homopolymer indels Substitutions Homopolymer indels
Status Discontinued 2013 Discontinued ~2015 Still sold (clinical)

3. Illumina

Illumina has dominated sequencing since approximately 2010, commanding >80% of the global sequencing market at its peak. The company’s technology traces back to Solexa (founded 1998, acquired by Illumina 2007) and the foundational paper by Bentley et al. 2008, Nature 456:53-59.

Core Principle: Bridge Amplification + Reversible Terminators

Library preparation and cluster generation:

  1. Fragment DNA and ligate adapters containing P5 and P7 sequences
  2. Denature and load single-stranded fragments onto a flow cell coated with complementary oligos
  3. Each fragment hybridizes to a surface oligo and is copied by a polymerase
  4. The original template is washed away; the copied strand bends over and hybridizes to an adjacent oligo – forming a bridge
  5. Bridge amplification repeats ~35 cycles, creating a clonal cluster of ~1,000 identical copies
  6. Forward strands are cleaved (or reverse strands, depending on read); single-stranded clusters remain

Sequencing by synthesis:

  1. A sequencing primer hybridizes to the adapter
  2. All four nucleotides are added simultaneously – each carries:
    • A fluorescent dye (color identifies the base)
    • A 3’-O-azidomethyl blocking group (prevents further incorporation)
  3. After one nucleotide incorporates, excess is washed away
  4. The flow cell is imaged to capture fluorescence from every cluster
  5. Both the dye and the 3’-block are chemically cleaved
  6. The cycle repeats

This ensures exactly one base per cycle – no homopolymer ambiguity.

3.1 Four-Channel Chemistry (4-dye)

The original Illumina chemistry used on HiSeq, MiSeq, and GA platforms.

  • 4 distinct fluorescent dyes, one per base
  • 4 images per cycle (one filter per dye)
  • Each base has a unique emission wavelength

3.2 Two-Channel Chemistry (2-dye)

Introduced on NextSeq and NovaSeq 6000 to reduce imaging time.

  • Only 2 dyes are used
  • A = Red + Green (both channels bright)
  • C = Red only
  • T = Green only
  • G = No dye (“dark” – unlabeled)
  • 2 images per cycle instead of 4

Illumina Dye Chemistry Comparison

Compare how the same cluster appears under 4-channel, 2-channel, and 1-channel imaging. Click a base to see its signature.

3.3 One-Channel Chemistry (1-dye)

Used on the iSeq 100 with an integrated CMOS sensor:

  • Only 1 dye, but 2 chemistry steps per cycle
  • Step 1: A and T are labeled; image taken
  • Step 2: Chemically modify so only C and T are labeled; image taken
  • T = bright in both images; A = bright in step 1 only; C = bright in step 2 only; G = dark in both
  • Extremely compact instrument (benchtop, ~$20k list price)

3.4 XLEAP-SBS Chemistry

Illumina’s latest chemistry, deployed on NovaSeq X/X Plus and NextSeq 1000/2000:

  • Redesigned polymerase and dye-linker system
  • 2x faster incorporation kinetics
  • ~50% lower error rate than previous chemistry
  • Reduced phasing/pre-phasing artifacts
  • 2-channel imaging, but with different base-color assignments than standard 2-channel: C is now the dual-color base (both channels), A is single-channel (blue only), T = green only, G = dark. This change affects index color-balancing requirements
NoteWhat Is Phasing?

Over many cycles, some DNA strands in a cluster fall behind (incomplete extension = phasing) or jump ahead (incorporation without blocking = pre-phasing). This makes clusters progressively out of sync, degrading quality scores toward the end of the read. XLEAP’s improved blocking chemistry reduces both effects.

Current Illumina Platforms

Platform Chemistry Max Output Max Read Length Key Use Case
MiSeq 4-channel 15 Gb 2 x 300 bp Amplicons, small genomes
NextSeq 2000 2-channel (XLEAP) 360 Gb (P3, 2x150) 2 x 300 bp (P1/P2, up to 240 Gb) Mid-throughput, flexible
NovaSeq X Plus 2-channel (XLEAP) 16 Tb 2 x 150 bp Production-scale WGS

4. Element Biosciences

Element Biosciences (founded 2017, San Diego) introduced a fundamentally different approach to short-read sequencing with the AVITI system (launched 2022). The key innovation is described in Arslan et al. 2024, Nature Biotechnology 42:132-138.

Sequencing by Binding (SBB) / Avidite Chemistry

The central insight: decouple detection from incorporation.

In Illumina’s approach, the fluorescent dye is covalently attached to the nucleotide that gets incorporated. Even after cleaving the dye, a molecular “scar” remains on the growing strand. These scars accumulate over cycles and can interfere with polymerase activity and base calling.

Element’s approach uses a two-step process:

  1. Detection step: A multivalent construct called an avidite – consisting of a fluorescently-labeled streptavidin core connected via flexible linkers to multiple identical nucleotides – binds cooperatively (by avidity) to polymerase-template complexes across the polony. The avidite identifies the next correct base through polymerase selectivity. The cluster is imaged.
  2. Incorporation step: The avidite is washed away (it was never covalently incorporated). Then, an unlabeled 3’-blocked nucleotide is added and incorporated by the polymerase. The block is removed for the next cycle.

Avidite sequencing cycle. (a) Six-step cycle: avidites bind to polonies, unbound avidites are washed away, base identity is detected by fluorescence imaging, avidites are removed, a reversibly terminated nucleotide is incorporated, and the 3’ block is removed. (b) A single avidite interacts with multiple polymerase-template copies within one polony via multivalent binding (avidity). (c) Many avidites specifically bind to polonies across the flow cell surface. From Arslan et al. 2024, Fig. 1.
ImportantWhy This Matters

Because the incorporated nucleotide never carried a dye, there is no scarring. The DNA strand being synthesized contains only natural nucleotides with a removable 3’ block. This contributes to Element’s very high quality scores – most bases achieve Q40+ (99.99% accuracy).

Element quality scores. Q-Q plots of predicted vs recalibrated quality scores for R1 (a) and R2 (b) reads on the AVITI. The majority of bases cluster at Q40+, demonstrating the accuracy advantage of avidite chemistry. From Arslan et al. 2024, Fig. 3.

Homopolymer accuracy. Element’s avidite chemistry maintains high accuracy across homopolymer runs of varying length, outperforming traditional SBS approaches where dye scarring accumulates. From Arslan et al. 2024, Fig. 5.

Additional Innovations

  • Rolling Circle Amplification (RCA) instead of bridge PCR: Generates polony clusters from circular DNA templates. This eliminates PCR-induced errors, optical duplicates, and index hopping.
  • Two independent flow cells per AVITI run, each producing ~1 billion read pairs
  • No index hopping: Because RCA does not involve free adapter-ligated fragments in solution, cross-contamination between indexed samples is essentially zero

AVITI Specifications

Feature AVITI 24x24
Read pairs per flow cell ~1 billion
Max output per flow cell 300 Gb
Read length 2 x 150 bp (2 x 300 bp available)
Quality >90% bases Q30; Q40+ with Cloudbreak UltraQ chemistry
Run time ~24-48 hr
List price ~$289,000

In late 2024, Element launched the AVITI24, an upgraded platform delivering up to 1.5 billion reads per flow cell, multiomics capability (co-detection of RNA, protein, and morphology alongside sequencing), and Cloudbreak UltraQ kits achieving >70% of bases at Q50. List price: $424,000 ($150,000 as an upgrade from the original AVITI).


5. PacBio

Pacific Biosciences (founded 2004) pioneered single-molecule real-time (SMRT) sequencing, first described in Eid et al. 2009, Science 323:133-138. The enabling technology – the zero-mode waveguide – was developed by Levene et al. 2003, Science 299:682-686.

Zero-Mode Waveguides (ZMWs)

A ZMW is a cylindrical hole approximately 70 nm in diameter drilled through a ~100 nm metal film (aluminum) deposited on a glass substrate. The diameter is smaller than the wavelength of the excitation light (~500-600 nm), creating a condition where light cannot propagate through the hole. Instead, an evanescent field illuminates only the bottom ~30 nm of the well.

Zero-mode waveguide apparatus. A metal film on fused silica contains nanometer-scale holes. Illumination from below creates an evanescent field that excites only fluorescent molecules at the very bottom of the well, where a single enzyme (polymerase) is attached. From Levene et al. 2003, Fig. 1.

How sequencing works:

  1. A single DNA polymerase molecule is attached to the bottom of each ZMW
  2. The template DNA (with SMRTbell adapters) is loaded and binds to the polymerase
  3. Fluorescently labeled dNTPs (each base has a different color dye attached to the phosphate chain, not the base) diffuse freely in solution
  4. When a dNTP enters the ZMW and is incorporated, the polymerase holds it in the detection zone for milliseconds – long enough to capture the fluorescent signal
  5. Incorporation cleaves the phosphodiester bond, releasing the dye-labeled pyrophosphate
  6. The polymerase advances to the next position
  7. All of this is captured as a continuous movie of fluorescence pulses

Real-time single-molecule sequencing. Fluorescence pulses recorded as a polymerase incorporates labeled nucleotides in a ZMW. Each color corresponds to a different base (A, C, G, T). Pulse duration and inter-pulse intervals encode kinetic information. From Eid et al. 2009, Fig. 1.

Zero-Mode Waveguide (ZMW) Sequencing

A single polymerase at the bottom of the ZMW incorporates fluorescent nucleotides. The dye is on the phosphate group — released after incorporation.

Sequence: (press Start)
NoteDye on the Phosphate, Not the Base

Unlike Illumina (where the dye is attached to the base and must be cleaved), PacBio attaches the dye to the terminal phosphate of the dNTP. During incorporation, the natural phosphodiester bond formation releases the dye-labeled pyrophosphate. The incorporated nucleotide is completely natural – no cleavage step needed.

Circular Consensus Sequencing (CCS) and HiFi Reads

PacBio’s raw single-pass accuracy is approximately 85-90% (Q10-Q13), dominated by insertion and deletion errors. The breakthrough that made PacBio competitive for accuracy-sensitive applications is circular consensus sequencing (CCS), now marketed as HiFi reads.

How CCS works:

  1. SMRTbell library: Hairpin adapters are ligated to both ends of a double-stranded DNA fragment, creating a circular (topologically closed) molecule
  2. The polymerase begins at one adapter and reads through the insert, across the other adapter, back through the insert on the complementary strand, and so on
  3. Each complete traverse of the insert is called a subread
  4. With a 15 kb insert and a polymerase that reads ~100 kb before dissociation, you get ~6-7 passes
  5. A consensus algorithm aligns the subreads and calls a consensus base at each position
  6. Because insertion and deletion errors are essentially random, they cancel out across passes

Accuracy scaling:

\[Q_{\text{CCS}} \approx Q_{\text{subread}} + 10 \cdot \log_{10}(n)\]

where \(n\) is the number of passes. With ~10 passes and Q10 subreads:

\[Q_{\text{CCS}} \approx 10 + 10 \cdot \log_{10}(10) = 20 \text{ (per pass; actual consensus methods achieve Q30+)}\]

In practice, modern HiFi reads routinely achieve Q30-Q40 (99.9-99.99% accuracy) with median read lengths of 15-20 kb.

Circular consensus sequencing (CCS). SMRTbell adapters (hairpins) create a circular template. The polymerase reads around the circle multiple times, generating subreads that are combined into a single high-accuracy HiFi read. From Wenger et al. 2019, Fig. 1a.

Accuracy improves with passes. As the number of polymerase passes increases, the consensus quality rises sharply. With ~10 passes, accuracy reaches Q30+; with more passes, Q40+ is achievable. From Wenger et al. 2019, Fig. 2b.

Kinetics: Detecting Base Modifications

Because PacBio records the entire movie of incorporation, it captures two kinetic parameters:

  • IPD (inter-pulse duration): Time between consecutive incorporation events
  • PW (pulse width): Duration of each incorporation event

Modified bases (e.g., 5mC, 6mA, 4mC) alter the polymerase kinetics, changing IPD and PW in characteristic patterns. This enables direct detection of DNA methylation without bisulfite conversion.

Detecting modified bases via kinetics. Modified bases (top) cause the polymerase to pause differently compared to unmodified bases, altering the inter-pulse duration (IPD). These kinetic signatures allow direct detection of methylation without chemical conversion. From Flusberg et al. 2010, Fig. 1.

Current PacBio Platforms

Platform ZMWs per Cell HiFi Output Read Length Price
Revio 25 million ~90 Gb/cell (4 cells/run) 10-25 kb ~$779,000
Vega 1 SMRT Cell up to ~60 Gb 10-25 kb ~$169,000

6. Oxford Nanopore Technologies (ONT)

Oxford Nanopore Technologies (founded 2005) launched the MinION in 2014, introducing a radically different sequencing paradigm. Key references include Jain et al. 2015, Nature Methods 12:351-356 and Deamer et al. 2016, Nature Biotechnology 34:518-524.

Principle: Ionic Current Through a Protein Pore

Nanopore sequencing principle. A protein pore is embedded in a membrane. As molecules interact with the pore, they modulate ionic current flow in characteristic ways – from open pore (left) to partial blockade to full blockade (right). The current trace below each panel shows the corresponding signal.
  1. A protein nanopore (derived from CsgG, an E. coli curli secretion channel) is embedded in a synthetic polymer membrane
  2. A voltage is applied across the membrane, driving an ionic current through each pore
  3. A motor protein (helicase) is ligated to the DNA via sequencing adapters
  4. The motor protein ratchets single-stranded DNA through the pore one base at a time
  5. As each k-mer passes through the constriction(s) of the pore, it partially blocks the ionic current. The original R9 pore read ~5 bases at a time; the current R10 pore with dual constrictions reads ~9 bases simultaneously
  6. Each k-mer produces a characteristic current level – with \(4^9 = 262{,}144\) possible levels for R10
  7. The raw current signal (“squiggle”) is basecalled by a neural network (Dorado)
NotePore Versions: R9 to R10

The original R9 pore (CsgG mutant) had a single constriction point and ~90-97% raw simplex accuracy (depending on basecaller version). The current R10 pore features a longer barrel with dual constriction points, dramatically improving homopolymer resolution and pushing simplex accuracy above 99%. All current ONT flow cells ship with R10.4.1 chemistry.

Nanopore Sequencing Visualization

DNA is threaded through a protein pore. Each ~5-mer modulates the ionic current differently, producing a characteristic "squiggle" signal.

Ready. Press "Thread DNA" to begin.

Direct RNA Sequencing

One of ONT’s most distinctive capabilities is direct RNA sequencing – reading native RNA molecules without reverse transcription or PCR amplification.

How it works:

  1. A sequencing adapter with a motor protein is ligated directly to the poly(A) tail of mRNA (or to any RNA with an adapter)
  2. The motor protein feeds the RNA strand through the pore 3’ to 5’
  3. The current signal reflects the RNA bases (A, C, G, U) passing through the pore
  4. Dorado basecalls the RNA signal using RNA-specific models

RNA modifications detected:

Direct RNA sequencing preserves all chemical modifications on the RNA molecule. Current modification detection models can identify:

  • m6A (N6-methyladenosine) – the most abundant mRNA modification
  • m5C (5-methylcytidine)
  • pseudouridine (\(\Psi\))
  • inosine (from A-to-I editing)

The latest RNA004 chemistry (2025) provides significantly improved accuracy and throughput for direct RNA applications.

ImportantWhy Direct RNA Seq Matters

Every other RNA sequencing method requires reverse transcription to cDNA, which introduces biases (template switching, length bias, loss of modifications). Direct RNA sequencing reads the native molecule, preserving the full modification landscape and enabling study of the epitranscriptome at single-molecule resolution.

Adaptive Sampling (ReadUntil)

ONT’s software can make real-time decisions about individual molecules:

  1. As a read begins, the first few hundred bases are basecalled in real time
  2. The software maps this fragment against a reference or target list
  3. If the molecule is on-target, sequencing continues normally
  4. If the molecule is off-target, the voltage is briefly reversed, ejecting the strand from the pore
  5. The pore is immediately available for a new molecule

This enables targeted sequencing without physical enrichment – no PCR, no capture probes. In practice, adaptive sampling can achieve 5-10x enrichment of target regions.

Adaptive sampling (ReadUntil). Software maps the first portion of each read in real time. On-target reads continue sequencing; off-target reads are ejected by reversing voltage, freeing the pore for the next molecule.

Duplex Sequencing

When both strands of a double-stranded DNA molecule are sequenced consecutively through the same pore (the complement strand follows the template strand), the two reads can be combined into a duplex read with accuracy exceeding Q30 (>99.9%). This is analogous to PacBio’s CCS concept but achieved differently.

Duplex vs simplex sequencing. In simplex mode, only one strand is read. In duplex mode, both template and complement strands pass through the same pore consecutively, enabling consensus calling that dramatically improves accuracy.

Current ONT Devices

Device Flow Cells Channels per Cell Max Output Use Case
MinION 1 512 (2,048 pores) ~50 Gb Portable, field work, teaching
P2 Solo 2 2,675 ~290 Gb/cell Benchtop production
PromethION 48 48 2,675 ~14 Tb Large-scale production

Data Formats

  • POD5: Current raw signal format (Apache Arrow-based), replacing the legacy FAST5 format (which was HDF5-based)
  • Dorado: ONT’s basecaller, uses neural networks to convert raw signal to sequence
  • Output: BAM (with move table for signal-to-base mapping) or FASTQ

7. Technology Comparison

Feature Illumina (NovaSeq X) Element (AVITI) PacBio (Revio) ONT (PromethION)
Read length 2 x 150 bp 2 x 150 bp 10-25 kb (HiFi) Ultra-long possible (>100 kb)
Single-read accuracy ~99.9% (Q30) ~99.99% (Q40) ~99.9% (Q30 HiFi) ~99% (Q20 simplex); Q30+ duplex
Max output/run 16 Tb 600 Gb 360 Gb/day (4 cells) ~14 Tb
Cost per Gb ~$2 ~$2-5 ~$8 ~$8-15
Time to first results ~24-48 hr ~24-48 hr ~24 hr Minutes (real-time)
Primary error type Substitutions Substitutions Indels (raw); balanced (HiFi) Indels in homopolymers (improved with R10)
Detect base modifications No (need bisulfite/EM-seq) No Yes (kinetics) Yes (current signal)
Direct RNA No No No Yes
Adaptive sampling No No No Yes
Portability Lab only Lab only Lab only Field-deployable (MinION)
NoteNo Single “Best” Platform

Platform choice depends on the experiment. Illumina and Element dominate for high-throughput short-read applications (WGS, RNA-seq, ATAC-seq). PacBio excels at genome assembly, structural variant detection, and methylation. ONT uniquely enables real-time, portable, and direct RNA sequencing. Many projects now combine short and long reads for optimal results.


8. Practicalities: Data Formats and Exploration

Hands-on data exploration will be covered in companion Jupyter notebooks. Here we briefly introduce the key data formats you will encounter.

FASTQ (Illumina, Element, ONT after basecalling)

The universal short-read format. Four lines per read:

@read_id optional_description
ACGTACGTACGTACGT...
+
FFFFFFFFFFFFFFFF...
  • Line 1: Header (starts with @)
  • Line 2: DNA sequence
  • Line 3: Separator (+)
  • Line 4: Quality scores (ASCII-encoded Phred scores; F = Q37)
# Peek at a FASTQ file
import gzip
with gzip.open("sample.fastq.gz", "rt") as f:
    for i, line in enumerate(f):
        print(line.strip())
        if i >= 7:  # show 2 reads
            break

BAM (PacBio HiFi, also general alignment format)

Binary Alignment/Map format. For PacBio, unaligned BAM files store HiFi reads along with per-base kinetics (IPD and pulse width) in auxiliary tags.

# Peek at a PacBio HiFi BAM
samtools view -h sample.hifi.bam | head -20

Key PacBio BAM tags:

  • rq: Read quality (float, 0-1)
  • np: Number of full passes
  • fi / ri: Forward/reverse IPD values
  • fp / rp: Forward/reverse pulse width values

POD5 (ONT raw signal)

HDF5-based format storing the raw electrical signal from each nanopore read. This is the native ONT format, replacing the legacy FAST5 format.

# Peek at a POD5 file
import pod5
with pod5.Reader("sample.pod5") as reader:
    for read in reader.reads():
        print(f"Read ID: {read.read_id}")
        print(f"Signal length: {len(read.signal)} samples")
        print(f"Sample rate: {read.run_info.sample_rate} Hz")
        break
NoteSignal to Sequence

ONT raw data is not DNA sequence – it is a time series of electrical current measurements (picoamps) sampled at ~4-5 kHz. Basecalling with Dorado converts this signal into nucleotide sequence. You can re-basecall the same POD5 file with updated models to improve accuracy without re-sequencing.

Format Summary

Format Type Platform Contains Typical Size
FASTQ(.gz) Text (compressed) All Sequence + quality Varies widely (1 GB to hundreds of GB)
BAM Binary PacBio, alignments Sequence + quality + kinetics ~10-50 GB per cell
POD5 Apache Arrow ONT Raw electrical signal ~50-200 GB per flow cell
FAST5 HDF5 (legacy) ONT Raw signal (single/multi) Deprecated; convert to POD5

Choosing a Platform

  • Population-scale WGS or RNA-seq → Illumina NovaSeq X or Element AVITI
  • De novo genome assembly → PacBio HiFi (Revio)
  • Structural variant detection → PacBio HiFi or ONT long reads
  • Field diagnostics or rapid surveillance → ONT MinION
  • Epitranscriptomics / RNA modifications → ONT direct RNA sequencing
  • Targeted panels or amplicons → Illumina MiSeq or ONT adaptive sampling

9. On the Horizon: Roche Sequencing by Expansion (SBX)

Roche – which exited sequencing in 2013 after discontinuing 454 – re-entered the field in February 2025 with a radically different approach: Sequencing by Expansion (SBX) (Fuller et al. 2025, bioRxiv).

The Xpandomer Concept

SBX overcomes a fundamental limitation of nanopore sequencing: the difficulty of resolving individual bases as they pass through a pore. Instead of reading native DNA directly, SBX first converts each DNA template into a surrogate polymer called an Xpandomer that is ~50x longer than the original.

How it works:

  1. DNA polymerase copies the template using expandable nucleotide triphosphates (X-NTPs) – synthetic building blocks that each encode one base but contain a long, expandable linker and a high-signal reporter code
  2. After synthesis, the Xpandomer is chemically expanded, stretching the reporter codes apart
  3. Each reporter includes a translocation control element that pauses the Xpandomer in the nanopore – one reporter at a time
  4. The reporter blocks ionic current in one of four distinct levels, each uniquely identifying A, C, G, or T
  5. Voltage pulses advance the Xpandomer through the pore one step at a time
ImportantWhy Expansion Matters

In conventional nanopore sequencing (ONT), ~5-9 bases simultaneously occupy the pore constriction, creating a complex signal that requires neural network deconvolution and struggles with homopolymers. SBX eliminates this problem entirely: each base is read individually through a single, unambiguous current level. This is a fundamentally different approach to the signal-to-noise problem.

Hardware: CMOS Nanopore Arrays

The SBX platform uses a CMOS-based sensor chip containing ~8 million microwells, each housing a single nanopore. Each microwell integrates the electrode, detection circuit, and analog-to-digital converter on-chip – similar in concept to Ion Torrent’s semiconductor approach but reading current blockade rather than pH.

Performance

  • Accuracy: Demonstrated F1 scores >99.80% (SNV) and >99.7% (InDel) for whole-genome sequencing
  • Speed: In collaboration with Broad Clinical Labs and Boston Children’s Hospital, SBX achieved a Guinness World Record for fastest DNA sequencing – a complete genome in under 4 hours
  • Applications demonstrated: Whole-genome sequencing, bulk RNA-seq, methylation mapping, and spatial multiomics

Platform: AXELIOS 1

The first commercial instrument, AXELIOS 1, is planned for Research Use Only (RUO) launch in 2026. Roche aims to eventually bring SBX into clinical diagnostics, leveraging its existing global diagnostics infrastructure.

NoteFull Circle

Roche’s return to sequencing is notable: the company that pioneered commercial NGS with 454 pyrosequencing (2005-2013) has now developed a technology that combines biochemical innovation (Xpandomers) with nanopore readout and semiconductor manufacturing – drawing on ideas from all three generations of sequencing.


Key References

  1. Sanger F, Nicklen S, Coulson AR. DNA sequencing with chain-terminating inhibitors. PNAS. 1977;74(12):5463-5467.
  2. Margulies M, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437:376-380. doi:10.1038/nature03959
  3. Shendure J, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309:1728-1732. doi:10.1126/science.1117389
  4. Rothberg JM, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011;475:348-352. doi:10.1038/nature10242
  5. Bentley DR, et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008;456:53-59. doi:10.1038/nature07517
  6. Arslan S, et al. Sequencing by avidity enables high accuracy with low reagent consumption. Nature Biotechnology. 2024;42:132-138. doi:10.1038/s41587-023-01750-7
  7. Eid J, et al. Real-time DNA sequencing from single polymerase molecules. Science. 2009;323:133-138. doi:10.1126/science.1162986
  8. Levene MJ, et al. Zero-mode waveguides for single-molecule analysis at high concentrations. Science. 2003;299:682-686. doi:10.1126/science.1079700
  9. Jain M, et al. Improved data analysis for the MinION nanopore sequencer. Nature Methods. 2015;12:351-356. doi:10.1038/nmeth.3290
  10. Deamer D, Akeson M, Branton D. Three decades of nanopore sequencing. Nature Biotechnology. 2016;34:518-524. doi:10.1038/nbt.3423
  11. Fuller CW, et al. Sequencing by Expansion (SBX) – a novel, high-throughput single-molecule sequencing technology. bioRxiv. 2025. doi:10.1101/2025.02.19.639056