Lecture 14: Finding Sequence Matches Quickly

Learning Objectives

By the end of this lecture, you will be able to:

  • Explain why brute-force alignment is impractical for modern sequencing data
  • Distinguish between mapping and alignment
  • Build and search substring indexes
  • Understand hash functions and hash tables in the context of genomics
  • Describe how BLAST uses word lookup tables to accelerate homology search
  • Construct and search tries and suffix tries
  • Explain the relationship between suffix tries, suffix trees, and suffix arrays
  • Perform the Burrows-Wheeler Transform and understand its properties
  • Describe how the FM index enables fast, memory-efficient sequence search
  • Compare and contrast Bowtie2 and BWA-MEM alignment strategies


A bit more Claudification

  • “How Quickly Will A.I. Agents Rip Through the Economy?” – Jack Clark on The Ezra Klein Show (NYT, Feb 24 2026). Jack Clark, co-founder of Anthropic, discusses how AI agents are reshaping work. He notes that Claude is already writing “comfortably the majority” of Anthropic’s own code, and that the value of senior engineers with well-calibrated intuitions is increasing while routine coding tasks are increasingly automated.
  • Claude Code Agent Teams documentation – Agent teams let you coordinate multiple Claude Code instances working together – one session acts as team lead, spawning teammates that work independently in their own context windows, communicate with each other, and share a task list.

1. Introduction: The Challenge of Large Datasets

Modern sequencing instruments produce staggering volumes of data. An Illumina NovaSeq 6000 can generate up to 20 billion reads in a single run. Each of those reads must be placed onto a reference genome – for humans, that means searching through \(3 \times 10^9\) base pairs.

Why Dynamic Programming Fails at Scale

In earlier lectures we learned about dynamic programming alignment algorithms like Smith-Waterman and Needleman-Wunsch. These are beautiful algorithms, but they have a critical limitation: time complexity.

For a single pairwise alignment of a read of length \(m\) against a reference of length \(n\):

\[T = O(mn)\]

This is fine for comparing two sequences. But what happens at genome scale?

| Parameter | Value |
|---|---|
| Reference genome length (\(n\)) | \(3 \times 10^9\) bp (human) |
| Number of reads (\(d\)) | \(20 \times 10^9\) (NovaSeq run) |
| Average read length (\(m\)) | 150 bp |
| Total operations | \(d \times m \times n\) |

Let’s compute the total number of operations:

\[d \times m \times n = 20 \times 10^9 \times 150 \times 3 \times 10^9 = 9 \times 10^{21}\]

That is 9 sextillion operations. Even at \(10^9\) operations per second per CPU core, this would require:

\[\frac{9 \times 10^{21}}{10^9 \times 3.15 \times 10^7} \approx 285,000 \text{ CPU-years}\]

On a cluster of 1,000 CPUs, this would still take over 285 years. Clearly, we cannot align every read against the entire genome using dynamic programming.
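The arithmetic above is easy to sanity-check; a few lines of Python using the constants from the table reproduce it:

```python
# Back-of-envelope check of the CPU-years estimate above.
n = 3e9      # reference genome length (bp)
d = 20e9     # number of reads in a NovaSeq run
m = 150      # average read length (bp)

ops = d * m * n                  # total DP cell updates
ops_per_core = 1e9               # optimistic: 1e9 operations/s per core
seconds_per_year = 3.15e7

cpu_years = ops / (ops_per_core * seconds_per_year)
print(f"{ops:.1e} operations -> about {cpu_years:,.0f} CPU-years")
```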

Key Insight

We need a fundamentally different strategy. Instead of comparing every read against every position in the genome, we need a way to jump directly to the most promising locations. This is the idea behind indexing.

The rest of this lecture traces the intellectual journey from simple indexes to the sophisticated data structures used by modern aligners like Bowtie2 and BWA-MEM.

Roadmap

Here is the path we will follow – each step motivated by limitations of the previous one:

| Step | Data Structure / Tool | Limitation it addresses |
|---|---|---|
| 1 | Brute Force | baseline – too slow |
| 2 | Substring Index | fixed k, many false hits |
| 3 | Hash Tables | fast lookup for k-mers |
| 4 | BLAST | word lookup + extend – fast homology search |
| 5 | Tries | tree-based prefix lookup |
| 6 | Suffix Trie | any-length query, but O(n²) space |
| 7 | Suffix Tree | linear space, but 47 GB for human |
| 8 | Suffix Array | 12 GB, simpler |
| 9 | BWT | compressible and searchable |
| 10 | FM Index | ~1.25 GB, fast search |
| 11 | Bowtie2 / BWA-MEM | practical tools you will use |

2. Mapping vs. Alignment

Before diving into indexing, let’s clarify two terms that are often confused.

The tools we will study in this lecture were shaped by a handful of researchers. Heng Li, a student of Richard Durbin, created BWA, SAMtools, and minimap2 – his tools are used in virtually every genomics study published today. Ben Langmead, a student of Steven Salzberg, developed Bowtie and Bowtie2, among the most widely cited alignment tools in history. And Webb Miller pioneered the foundational sequence alignment algorithms (including BLASTZ and multiz) that made comparative genomics possible. You will encounter their work throughout this lecture.

Heng Li provides a useful distinction between two terms that are often confused:

| Concept | Definition | Goal |
|---|---|---|
| Mapping | Determining the most likely genomic region a read originates from | Find the location |
| Alignment | Determining the base-by-base correspondence between read and reference | Find the exact placement |

Mapping is like determining that your sequence likely came from chromosome 17. Alignment is determining the exact base-pair coordinates (e.g., chr17:7,668,402-7,668,551) and finding that there is a G>A substitution at position 120 of the read.

In practice, most tools do both: they first map (find candidate locations using an index) and then align (perform local alignment at those locations). This two-phase approach – often called seed-and-extend – is the foundation of nearly all modern read aligners.


3. Indexing with Substrings

The Inverted Index Concept

Think about the index at the back of a book. When you want to find where “transcription” is discussed, you don’t read the entire book – you look up the word in the index, which tells you the page numbers. This is an inverted index: given a term, it returns the locations where that term appears.

We can apply the same idea to DNA sequences:

  • Forward index: Given a position, what substring is there? (This is just the genome itself.)
  • Inverted index: Given a substring, where does it occur? (This is what we need for mapping.)

Building an Index from Substrings

To create an index from a text \(T\), we:

  1. Extract all substrings of a fixed length \(L\) (these are called L-mers or k-mers)
  2. Record each substring along with the position where it occurs
  3. Organize these entries so that lookups are fast

For example, with text CGTGCCTACTTACTTACAT and \(L = 5\):

| Position | 5-mer |
|---|---|
| 0 | CGTGC |
| 1 | GTGCC |
| 2 | TGCCT |
| 3 | GCCTA |

Searching the Index

To find where a pattern \(P\) occurs:

  1. Extract a substring of length \(L\) from \(P\)
  2. Look it up in the index to get candidate positions
  3. Extend the match at each candidate position to verify the full pattern
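The build and search steps above can be sketched in a few lines of Python (function names are our own), using the example text from earlier:

```python
from collections import defaultdict

def build_index(T, L):
    """Map every length-L substring of T to its start positions."""
    index = defaultdict(list)
    for i in range(len(T) - L + 1):
        index[T[i:i+L]].append(i)
    return index

def query(index, T, P, L):
    """Find occurrences of P: look up its first L characters,
    then verify the rest of P at each candidate position."""
    hits = []
    for pos in index.get(P[:L], []):
        if T[pos:pos+len(P)] == P:   # extension / verification step
            hits.append(pos)
    return hits

T = "CGTGCCTACTTACTTACAT"
idx = build_index(T, 5)
print(query(idx, T, "CTTACTTA", 5))   # → [8]
```

Note how the index prunes the search: only positions sharing the pattern's first 5-mer are ever verified.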

Trade-offs

The choice of \(L\) (substring length) and the interval between extracted substrings involve trade-offs:

| Parameter | Larger value | Smaller value |
|---|---|---|
| Substring length \(L\) | More specific (fewer false hits), more memory | Less specific (more false hits), less memory |
| Extraction interval | Faster indexing, less memory, may miss some matches | Slower indexing, more memory, more complete |

Specificity in Practice

How unique are k-mers in real genomes? Here are approximate numbers for human chromosome 1 (about 249 Mb):

| k-mer length | Unique k-mers | % of possible | Average occurrences |
|---|---|---|---|
| 8 | 65,536 | 100% | ~3,800 |
| 10 | 1,048,372 | 99.97% | ~238 |
| 12 | 16,543,411 | 98.5% | ~15 |
| 14 | 175,891,774 | 65.5% | ~1.4 |
| 16 | 237,533,060 | 5.5% | ~1.05 |

Notice that even with 16-mers, only about 5.5% of possible 16-mers appear, but almost every one that does appear is unique. This means longer k-mers give us very specific index hits. This is why most aligners use seed lengths of at least 16-22 bp – shorter seeds produce too many spurious hits, overwhelming the extension phase.

Interactive Substring Index Demo

Enter a text and pattern, set substring length, and see how the index is built and searched.


4. Hashing and Hash Functions

Before we go further with indexing, we need to understand a fundamental computer science concept that underlies fast lookups: hashing. If you have never taken a computer science course, don’t worry – the idea is intuitive once you see it.

What is a Hash Function?

Imagine you work in a library and need to file thousands of books so you can find any one instantly. One strategy: assign each book to a numbered shelf based on a simple rule applied to its title. For example, “count the letters in the title and take the remainder when dividing by 100.” Book titles with 312 letters go to shelf 12 (since 312 mod 100 = 12).

A hash function is exactly this kind of rule. It takes an input (of any size) and converts it to a fixed-size number:

\[h: \text{input} \rightarrow \text{integer in range } [0, N)\]

Properties of Good Hash Functions

| Property | What it means | Why it matters |
|---|---|---|
| Deterministic | Same input always gives same output | You can find things again! |
| Uniform distribution | Outputs are spread evenly across the range | No shelf gets overcrowded |
| Fast to compute | O(1) or O(k) for k-length input | The whole point is speed |

How Hash Tables Work

A hash table (also called a hash map or dictionary) uses a hash function to create a fast lookup structure:

  1. Create an array of \(N\) “buckets”
  2. To store a key-value pair: compute \(h(\text{key})\), put the pair in that bucket
  3. To retrieve a value: compute \(h(\text{key})\), go directly to that bucket

The result? O(1) average-case lookup – constant time regardless of how many items are stored. Compare this to searching a sorted list (\(O(\log n)\)) or an unsorted list (\(O(n)\)).

A Simple DNA Hash Function

For DNA k-mers, we can convert each base to a 2-bit number:

| Base | Binary |
|---|---|
| A | 00 |
| C | 01 |
| G | 10 |
| T | 11 |

Since we use 2 bits per base, each position can hold 4 values (A=0, C=1, G=2, T=3). We can read a k-mer as a base-4 number to get a unique integer. For a 3-mer, the first base is multiplied by \(4^2 = 16\), the second by \(4^1 = 4\), and the third by \(4^0 = 1\):

\[h(\text{ACG}) = 0 \times 4^2 + 1 \times 4^1 + 2 \times 4^0 = 0 + 4 + 2 = 6\]

This is perfect for DNA because it maps every possible k-mer to a unique integer (for small k), and it is extremely fast to compute.
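This base-4 encoding is a one-liner loop in Python (the `dna_hash` name is our own):

```python
def dna_hash(kmer):
    """Encode a DNA k-mer as a base-4 integer (A=0, C=1, G=2, T=3)."""
    code = {"A": 0, "C": 1, "G": 2, "T": 3}
    h = 0
    for base in kmer:
        h = h * 4 + code[base]   # shift left two bits, add the new base
    return h

print(dna_hash("ACG"))   # 0*16 + 1*4 + 2*1 = 6
```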

Collisions

What happens when two different keys hash to the same bucket? This is called a collision. For our library analogy, two books with titles of the same length get assigned to the same shelf.

Common collision resolution strategies:

  • Chaining: Each bucket holds a linked list of all items that hash there
  • Open addressing: If a bucket is full, try the next one (linear probing)

With a good hash function and a large enough table, collisions are rare and lookups remain fast.

Why This Matters for Genomics

Hash tables are everywhere in bioinformatics:

  • K-mer counting (jellyfish, KMC): hash each k-mer, increment its count
  • Sequence indexing (BLAST): hash seed words for fast lookup
  • Genome assembly (de Bruijn graphs): hash k-mers to find overlaps
  • Minimizers: hash k-mers to select representative seeds

Interactive Hash Table Demo

Type DNA k-mers to see them hashed and placed in buckets. Watch collisions happen!


5. How BLAST Works: Putting It All Together

In the previous lecture we learned about dynamic programming alignment (Smith-Waterman) and the seed-and-extend strategy used by LASTZ. Now let’s look at BLAST (Basic Local Alignment Search Tool) – arguably the most widely used program in all of biology – and see how it combines indexing, lookup tables, and local alignment into a remarkably fast homology search engine.

BLAST was introduced in 1990 by Altschul et al. and has been cited over 100,000 times. Its key insight: instead of aligning a query against every position in a database, first identify short exact or near-exact matches (seeds), then extend only the promising ones.

The BLAST Pipeline

BLAST operates in three stages, each progressively more expensive but applied to fewer candidates:

Stage 1: Word matching (very fast, many hits)
    ↓  filter
Stage 2: Ungapped extension (fast, fewer hits)
    ↓  filter
Stage 3: Gapped extension (slow, few hits → final alignments)

Stage 1: The Word Lookup Table

This is where indexing enters the picture. BLAST breaks the query sequence into overlapping words (short subsequences of length \(w\)). For protein BLAST (blastp), the default word size is \(w = 3\); for nucleotide BLAST (blastn), it is \(w = 11\).

For a query protein of length \(m\), this produces \(m - w + 1\) overlapping words:

Query:  M  E  T  H  I  O  N  I  N  E
Words:  MET  ETH  THI  HIO  ION  ONI  NIN  INE
        (w=3, so 10 - 3 + 1 = 8 words)

But here is BLAST’s clever trick: it doesn’t just look for exact word matches. For each query word, BLAST generates a neighborhood – all words that score above a threshold \(T\) when compared using a substitution matrix (like BLOSUM62, which you saw in the previous lecture).

For example, the word MET might have neighbors like MEA, MES, LET – any 3-mer that scores \(\geq T\) against MET using BLOSUM62. Lowering \(T\) increases sensitivity (more neighbors, more potential matches) but slows the search.

The Lookup Table

All neighborhood words are organized into a lookup table – essentially a hash table that maps each word to the query positions where it (or one of its neighbors) appears:

| Word | Query position(s) |
|---|---|
| MET | 0 |
| MEA | 0 |
| LET | 0 |
| ETH | 1 |

BLAST then scans the database sequence and checks each database word against this lookup table. Because lookup tables provide O(1) access (the same hash table concept from Section 4), this scan is extremely fast – the dominant cost is simply reading the database.

Connection to Section 4 (Hashing): The BLAST lookup table is a direct application of the hash tables we discussed earlier. Each word is hashed to a bucket, and checking whether a database word has a query match is a constant-time hash lookup. This is what makes BLAST so fast: the expensive Smith-Waterman alignment is never run unless the lookup table produces a hit.
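Here is a toy sketch of Stage 1 in Python; for simplicity it indexes only exact query words, whereas real BLAST also inserts neighborhood words scoring at least \(T\) (function names are our own):

```python
from collections import defaultdict

def build_lookup(query, w):
    """Stage 1 index: map each length-w word of the query to its positions.
    (Real BLAST also inserts neighborhood words scoring >= T; omitted here.)"""
    table = defaultdict(list)
    for i in range(len(query) - w + 1):
        table[query[i:i+w]].append(i)
    return table

def scan(database, table, w):
    """Slide a window over the database; every table hit is a seed
    (query_pos, db_pos) handed on to the extension stage."""
    seeds = []
    for j in range(len(database) - w + 1):
        for i in table.get(database[j:j+w], []):
            seeds.append((i, j))
    return seeds

table = build_lookup("METHIONINE", 3)
print(scan("XXMETYY", table, 3))   # → [(0, 2)]: MET at query pos 0, db pos 2
```

The dominant cost is the single pass over the database; each window check is a constant-time dictionary lookup.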

Stage 2: Ungapped Extension

Each lookup table hit (called a seed) is extended in both directions without gaps, computing a running alignment score. Extension continues as long as the score stays close to its running maximum. If the score drops by more than a threshold \(X\) below the maximum (the X-drop criterion), extension stops.

If the resulting ungapped alignment – called a High-scoring Segment Pair (HSP) – scores above a cutoff, it passes to the next stage. Most seeds are eliminated here.

Connection to Lecture 13: This is exactly the ungapped extension step you saw in the LASTZ pipeline. The X-drop criterion is the same idea.
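One direction of X-drop extension can be sketched as follows, with illustrative match/mismatch scores and threshold (not BLAST's actual defaults):

```python
def extend_right(query, db, qpos, dpos, x_drop=5, match=1, mismatch=-2):
    """Extend an ungapped alignment rightward from a seed, stopping when
    the running score falls more than x_drop below its running maximum."""
    score = best = 0
    best_len = 0
    length = 0
    while qpos + length < len(query) and dpos + length < len(db):
        score += match if query[qpos+length] == db[dpos+length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break              # X-drop criterion: give up on this seed
    return best, best_len

print(extend_right("ACGTACGT", "ACGTTTTT", 0, 0))   # → (4, 4)
```

A full implementation extends left as well and reports the High-scoring Segment Pair spanning both directions.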

Stage 3: Gapped Extension

Surviving HSPs are extended using gapped Smith-Waterman alignment (the full dynamic programming algorithm from Lecture 13). This is the most computationally expensive step, but by this point only a tiny fraction of the database remains as candidates.

The final alignments are scored, and statistical significance is assessed using E-values (expected number of alignments with this score or better that would occur by chance in a database of this size).

BLAST vs. Newer Aligners

| Feature | BLAST | Bowtie2 / BWA-MEM |
|---|---|---|
| Primary use | Query vs. database (homology search) | Reads vs. reference genome (mapping) |
| Index | Lookup table from query words | FM index of reference genome |
| Word size | 3 (protein), 11 (nucleotide) | 22 (Bowtie2), variable MEMs (BWA-MEM) |
| Index built on | Query sequence | Reference genome (built once) |
| Substitution matrix | BLOSUM62 (protein), match/mismatch (DNA) | Match/mismatch penalties |
| Sensitivity | High (neighborhood words) | Tuned for fast mapping |
| Speed | Fast for database search | Faster for read mapping |

The critical difference: BLAST indexes the query and scans the database, while Bowtie2/BWA-MEM index the reference (once) and map reads against it. When you have billions of short reads to map against a single reference, pre-indexing the reference is far more efficient.

Key Reference: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. “Basic local alignment search tool.” Journal of Molecular Biology 215, 403-410 (1990). DOI: 10.1016/S0022-2836(05)80360-2


6. Tries

Now let’s move from hash-based lookups to tree-based data structures. A trie (often pronounced to rhyme with “try” to distinguish it from “tree,” though both pronunciations are used – the name comes from retrieval) is a tree-based data structure where:

  • Each edge is labeled with a single character
  • Each node represents a prefix of some stored key
  • There is at most one outgoing edge per character from any node
  • Keys are spelled out along root-to-leaf paths

Why Tries?

Tries give us O(m) lookup time for a query of length \(m\) – independent of how many keys are stored. This is ideal for searching patterns in indexed text.

Why tries matter in biology: Tries are the foundation for tools like Kraken (for taxonomic classification of metagenomic reads), which builds a trie of k-mers labeled with taxonomic IDs to classify millions of reads per minute.

Using Tries for Substring Indexing

We can build a trie from all substrings of length \(L\) extracted from a text \(T\):

  1. Extract each L-mer and its position
  2. Insert each L-mer into the trie, storing position(s) at the leaf
  3. To query: walk down the trie following the pattern characters

If at any point there is no edge matching the next character, the pattern does not exist in the index.
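A trie with insert and search can be sketched with nested dictionaries (representation and names are our own; leaf positions live under a special "$pos" key that cannot collide with single-character edges):

```python
def trie_insert(root, key, pos):
    """Insert key into the trie (nested dicts); store positions at the leaf."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})      # follow or create the edge
    node.setdefault("$pos", []).append(pos)

def trie_search(root, pattern):
    """Walk edge by edge; return the subtree node, or None on fall-off."""
    node = root
    for ch in pattern:
        if ch not in node:
            return None                      # no matching edge: fall off
        node = node[ch]
    return node

T, L = "CGTGCCTA", 3
root = {}
for i in range(len(T) - L + 1):
    trie_insert(root, T[i:i+L], i)

print(trie_search(root, "GTG") is not None)   # True
print(trie_search(root, "AAA") is None)       # True
```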

Interactive Trie Builder

Enter a text and substring length to build a trie. Type a pattern in the search field to see the path traced through the trie.

Step-by-Step Trie Construction

Watch how the trie grows as each substring is inserted one at a time. New nodes and edges added at each step are highlighted in green.


7. Suffix Tries

The substring index approach requires choosing a fixed length \(L\). But what if we want to find any substring, of any length? The answer is to index all suffixes of the text. This is directly relevant to genomics: when aligning a read, we don’t know in advance how long the matching region will be (due to mutations, indels, and varying read lengths), so we need an index that can handle queries of any length.

What is a Suffix?

A suffix of a string \(T\) is any substring that extends to the end. For \(T =\) abaaba:

| Index | Suffix |
|---|---|
| 0 | abaaba |
| 1 | baaba |
| 2 | aaba |
| 3 | aba |
| 4 | ba |
| 5 | a |

The Sentinel Character

We append a special character $ to the end of the text. The $ is lexicographically smaller than any other character and appears nowhere else. This ensures:

  1. No suffix is a prefix of another suffix
  2. Every suffix ends at a unique leaf in the trie

So \(T =\) abaaba$ has suffixes: abaaba$, baaba$, aaba$, aba$, ba$, a$, $.

Building a Suffix Trie

A suffix trie is simply a trie built from all suffixes of \(T\). Insert each suffix one by one.

Operations on a Suffix Trie

Given a suffix trie of text \(T\) and a query pattern \(P\):

  1. Is \(P\) a substring of \(T\)? Follow the path for \(P\). If you can follow all characters, yes.
  2. Is \(P\) a suffix of \(T\)? Follow the path for \(P\). The path must end at a node with a $ edge leading to a leaf.
  3. How many times does \(P\) occur? Follow the path for \(P\), then count the number of leaves in the subtree below.
  4. What is the longest repeated substring? Find the deepest internal node (node with multiple children that is not a leaf).

Worked Example: abaaba$

Let’s trace through searching for aba in the suffix trie of abaaba$:

  • Start at root
  • Follow edge a – found, continue
  • Follow edge b – found, continue
  • Follow edge a – found, arrived at a node

Since we followed all characters successfully, aba is a substring. The subtree below has 2 leaves (positions 0 and 3), so aba occurs 2 times.

Now search for abc:

  • Follow edge a – found
  • Follow edge b – found
  • Follow edge c – no such edge! Fall off the trie. abc is not a substring.
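The walk-then-count-leaves logic can be sketched with a naive nested-dict suffix trie (quadratic space, fine for toy strings; names are our own):

```python
def build_suffix_trie(T):
    """Insert every suffix of T$ into a nested-dict trie."""
    T = T + "$"
    root = {}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node.setdefault(ch, {})
    return root

def count_occurrences(root, P):
    """Follow P from the root, then count the leaves in the subtree below."""
    node = root
    for ch in P:
        if ch not in node:
            return 0              # fell off the trie: P is not a substring
        node = node[ch]

    def leaves(n):
        # each '$' key marks the end of one suffix, i.e. one leaf
        return sum(1 if k == "$" else leaves(v) for k, v in n.items())

    return leaves(node) if node else 1   # empty dict: P consumed a full suffix

root = build_suffix_trie("abaaba")
print(count_occurrences(root, "aba"))   # → 2
print(count_occurrences(root, "abc"))   # → 0
```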

Interactive Suffix Trie

Enter a short string to build its suffix trie. Search for patterns to see the path traced (green = match, red = fall-off).

Step-by-Step Suffix Trie Construction

Watch each suffix being inserted one at a time. The newest suffix’s path is highlighted in green so you can see how the trie grows with each insertion.


8. From Tries to Trees (Suffix Trees)

The Problem with Suffix Tries

Suffix tries are conceptually clean, but they have a major practical problem: size. A suffix trie for a text of length \(n\) can have up to \(O(n^2)\) nodes. For a genome of 3 billion base pairs, that is simply too much.

The Solution: Collapse Non-Branching Paths

A suffix tree is obtained by taking a suffix trie and collapsing every path of non-branching nodes into a single edge. Instead of one character per edge, each edge is labeled with a substring.

For example, in a suffix trie if you have a chain: root –a–> node1 –b–> node2 –a–> leaf, and node1 and node2 each have only one child, we collapse this to: root –aba–> leaf.

Properties of Suffix Trees

| Property | Suffix Trie | Suffix Tree |
|---|---|---|
| Nodes | \(O(n^2)\) | \(O(n)\) |
| Edges | \(O(n^2)\) | \(O(n)\) |
| Edge labels | Single characters | Substrings |
| Build time | \(O(n^2)\) | \(O(n)\) (Ukkonen's algorithm) |

The suffix tree has at most \(2n\) nodes (for a text of length \(n\)), making it linear in size.

Same Operations, More Compact

All the operations we could do with a suffix trie work on suffix trees too:

  • Check if a pattern exists: follow edges, potentially matching partway through an edge
  • Count occurrences: count leaves below the match point
  • Find longest repeated substring: deepest internal node (counting total characters on path)

Suffix Trees in Practice: MUMmer

The MUMmer tool uses suffix trees to rapidly align whole genomes against each other. It finds Maximal Unique Matches (MUMs) – long stretches that occur exactly once in each genome – as anchors for whole-genome alignment. MUMmer was used to compare the first assembled human genomes and remains the go-to tool for bacterial genome comparisons, capable of aligning two 5 Mb bacterial genomes in seconds.

Step-by-Step: From Suffix Trie to Suffix Tree

Watch the suffix trie being built suffix by suffix, then see it collapse into a suffix tree. This shows both the quadratic growth of the trie and the dramatic compression when non-branching paths are merged.


Memory Limitation

Despite having \(O(n)\) nodes, suffix trees have large constant factors. Each node stores pointers, edge labels, and suffix links. In practice, a suffix tree for the human genome requires approximately 47 GB of RAM – feasible on a server, but not ideal.

Suffix Trie to Suffix Tree Comparison

See a suffix trie and its collapsed suffix tree side by side. Search for patterns in the tree.


9. Suffix Arrays

An Even More Compact Representation

Suffix trees solve the node-count problem, but they still carry a lot of overhead: pointers, edge labels, suffix links. Can we do better?

The trade-off we will accept: suffix arrays give up the \(O(m)\) query time of suffix trees, settling for \(O(m \log n)\) with binary search. But the memory savings – from ~47 GB down to ~12 GB for the human genome – make this trade-off worthwhile in practice.

Biological application: Suffix arrays are used internally by many alignment tools and are the basis for the enhanced suffix array used by STAR (the most popular RNA-seq aligner).

A suffix array is a beautifully simple data structure: it is just an array of integers representing the sorted order of all suffixes.

Construction

Given text \(T\) with sentinel $:

  1. List all suffixes with their starting positions
  2. Sort the suffixes lexicographically
  3. Store only the starting positions in sorted order

For \(T =\) abaaba$:

| Rank | Starting Position | Suffix |
|---|---|---|
| 0 | 6 | $ |
| 1 | 5 | a$ |
| 2 | 2 | aaba$ |
| 3 | 3 | aba$ |
| 4 | 0 | abaaba$ |
| 5 | 4 | ba$ |
| 6 | 1 | baaba$ |

The suffix array is: [6, 5, 2, 3, 0, 4, 1]
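Naive construction and binary-search lookup fit in a few lines of Python (names are our own; a production implementation would compare suffixes in place rather than materializing them as strings):

```python
from bisect import bisect_left, bisect_right

def suffix_array(T):
    """Sort suffix start positions by the suffix they begin (naive O(n^2 log n))."""
    T = T + "$"
    return sorted(range(len(T)), key=lambda i: T[i:])

def find(T, sa, P):
    """Binary search the suffix array for the block of suffixes starting with P."""
    Ts = T + "$"
    suffixes = [Ts[i:] for i in sa]           # already in sorted order
    lo = bisect_left(suffixes, P)
    hi = bisect_right(suffixes, P + "\x7f")   # '\x7f' sorts after any letter and $
    return sa[lo:hi]

sa = suffix_array("abaaba")
print(sa)                                  # → [6, 5, 2, 3, 0, 4, 1]
print(sorted(find("abaaba", sa, "aba")))   # → [0, 3]
```

Because matching suffixes occupy a contiguous block of the sorted order, two binary searches bound all occurrences at once, giving the \(O(m \log n)\) query time noted above.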

Memory

A suffix array for a text of length \(n\) requires just \(n\) integers. Using 4 bytes per integer:

\[\text{Human genome}: 3 \times 10^9 \times 4 \text{ bytes} = 12 \text{ GB}\]

This is a significant improvement over the 47 GB for a suffix tree.

Relationship to Suffix Trees

There is an elegant connection: if you perform a depth-first traversal of a suffix tree, visiting children in alphabetical order, the order in which you visit the leaves gives you the suffix array.

Interactive Suffix Array

Enter a text to see the suffix array constructed. Search for a pattern to see binary search in action.


10. Burrows-Wheeler Transform (BWT)

We’ve been building toward this: the BWT is the core data structure inside both Bowtie2 and BWA-MEM – the two aligners you will use most often in practice.

Why not just use suffix arrays?

Suffix arrays are compact compared to suffix trees (12 GB vs. 47 GB for the human genome), but 12 GB is still too large to fit in the RAM of most workstations. The BWT achieves something remarkable: it can be compressed to under 1 GB while still supporting fast pattern search. This is why it – not the suffix array directly – became the foundation for practical alignment tools.

The BWT is intimately connected to the suffix array we just studied. The sorted rotations in the BWT matrix are the same as the sorted suffixes – the suffix array gives the starting positions, and the BWT gives the last character of each rotation. The FM index, which we will build in Section 11, exploits this connection to combine the BWT’s compressibility with the suffix array’s position information.

The Burrows-Wheeler Transform is one of the most elegant algorithms in computer science. Originally invented for data compression (1994), it turned out to be the key ingredient for modern read aligners. The BWT rearranges a string in a way that is simultaneously compressible, reversible, and searchable.

Step-by-Step Construction

Given a string \(T\) (with $ appended – recall from Section 7, $ is a sentinel character smaller than all others, ensuring each suffix and each rotation is unique):

  1. Write all rotations of \(T\)
  2. Sort the rotations lexicographically
  3. Take the last column – this is the BWT

A rotation of a string means moving the first character to the end. So abaaba$ → baaba$a → aaba$ab → and so on, until you cycle back to the original string.

Worked Example

Let \(T =\) abaaba$:

Step 1: All rotations

| Rotation |
|---|
| abaaba$ |
| baaba$a |
| aaba$ab |
| aba$aba |
| ba$abaa |
| a$abaab |
| $abaaba |

Step 2: Sort lexicographically

| Row | Sorted Rotation | First | Last |
|---|---|---|---|
| 0 | $abaaba | $ | a |
| 1 | a$abaab | a | b |
| 2 | aaba$ab | a | b |
| 3 | aba$aba | a | a |
| 4 | abaaba$ | a | $ |
| 5 | ba$abaa | b | a |
| 6 | baaba$a | b | a |

Step 3: The BWT is the last column: abba$aa
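The three construction steps translate directly to Python (naive rotation-sorting version, fine for short strings):

```python
def bwt(T):
    """Burrows-Wheeler Transform: sort all rotations of T$, take the last column."""
    T = T + "$"
    rotations = sorted(T[i:] + T[:i] for i in range(len(T)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("abaaba"))   # → abba$aa
```

Real indexers never materialize the \(n \times n\) rotation matrix; they build the suffix array and read off the character preceding each suffix.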

Three Remarkable Properties

Property 1: Compressible

The BWT tends to group identical characters together. In our example, abba$aa has runs of repeated characters. This makes it highly amenable to run-length encoding – a key property for compression (used in bzip2).

Property 2: Reversible

Despite appearing to scramble the text, the BWT can be perfectly reversed to recover the original text. This uses the LF mapping property.

Property 3: Searchable

The BWT supports fast pattern search without decompressing – this is what makes it useful for genomics.

The LF Mapping Property

The LF mapping states: the \(i\)-th occurrence of character \(c\) in the Last column (L) corresponds to the \(i\)-th occurrence of \(c\) in the First column (F).

Concrete example – tracing one character step by step:

Look at row 0 of our BWT matrix. The L column has a at row 0. This is the 1st a in L (scanning top to bottom). By the LF mapping, it maps to the 1st a in F – which is at row 1.

         F               L
Row 0:   $  abaaba       a  ← 1st 'a' in L  ──→  maps to 1st 'a' in F (row 1)
Row 1:   a  $abaab       b
Row 2:   a  aba$ab       b
Row 3:   a  ba$aba       a  ← 2nd 'a' in L  ──→  maps to 2nd 'a' in F (row 2)
Row 4:   a  baaba$       $
Row 5:   b  a$abaa       a  ← 3rd 'a' in L  ──→  maps to 3rd 'a' in F (row 3)
Row 6:   b  aaba$a       a  ← 4th 'a' in L  ──→  maps to 4th 'a' in F (row 4)

Notice the pattern: the \(i\)-th occurrence of any character in L always maps to the \(i\)-th occurrence of that same character in F.

Why does this work? Let’s trace one example and then state the general rule.

Concrete trace: Look at the 2nd a in L, at row 3, whose rotation is aba$aba. Rotations wrap around, so moving that final a to the front gives another rotation of the same text: aaba$ab – which is exactly the rotation at row 2. And row 2 holds the 2nd a in F. In other words, the 2nd a in L and the 2nd a in F are the same occurrence of a in the original text.

The general argument in three steps:

  1. All rotations starting with a form a contiguous block in the sorted matrix (rows 1-4 in F). Within this block, they are sorted by what comes after the leading a.
  2. Each a in L represents the character just before its row’s rotation. The rotation that follows each of these a’s is a complete rotation of the original text.
  3. Sorting these “following rotations” gives the same relative order as sorting the F-block, because it’s the same set of strings. So the 1st a in L and the 1st a in F refer to the same occurrence, the 2nd to the 2nd, and so on.

This holds for every character, not just a.

Reversing the BWT

Using LF mapping, we can recover the original text:

  1. Start at the row containing $ in the F column (row 0)
  2. Read character from L column: a
  3. Use LF mapping to jump to the corresponding row
  4. Repeat until $ is reached
  5. Read characters in reverse to get the original text
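The reversal procedure above can be sketched as follows (names are our own; the `first` and `rank` tables together implement one LF-mapping step):

```python
def inverse_bwt(L):
    """Reverse the BWT using the LF mapping: the i-th occurrence of c in L
    corresponds to the i-th occurrence of c in F (the sorted column)."""
    # rank[i] = which occurrence (0-based) of character L[i] this is
    counts, rank = {}, []
    for ch in L:
        rank.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    # first[c] = row where c's block starts in the F column
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # walk backward from the '$' row ($ sorts first, so it is row 0)
    row, out = 0, []
    for _ in range(len(L) - 1):
        ch = L[row]
        out.append(ch)
        row = first[ch] + rank[row]   # LF mapping step
    return "".join(reversed(out))

print(inverse_bwt("abba$aa"))   # → abaaba
```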

Interactive Burrows-Wheeler Transform

Enter a string to see the BWT constructed step by step. Then try reversing it and searching for patterns.


11. FM Index

The BWT is powerful, but it lacks one critical piece of information: where in the original text do the matches occur? The BWT tells us how many matches exist (from the range size), but not their positions.

The FM index (Ferragina and Manzini, 2000) combines the BWT with auxiliary data structures to create a complete, practical search index.

Components of the FM Index

| Component | What it stores | Size (human genome) |
|---|---|---|
| F column | Just the counts of each character | Negligible (~16 bytes) |
| L column (BWT) | The Burrows-Wheeler Transform | ~750 MB (with compression) |
| Tally/Checkpoint table | Cumulative character counts at intervals | ~400 MB |
| SA sample | Every \(k\)-th suffix array value | ~100 MB |
| Total | | ~1.25 GB |

Compare this to: raw genome (3 GB), suffix array (12 GB), suffix tree (47 GB). The FM index is remarkably compact.

How the Tally Table Works

Instead of counting characters in L on every query (which would be \(O(n)\)), we precompute checkpoints: cumulative counts of each character at regular intervals (say, every 128 rows).

To count occurrences of character \(c\) in \(L[0..i]\):

  1. Find the nearest checkpoint before position \(i\)
  2. Read the precomputed count
  3. Scan the few rows between the checkpoint and \(i\)

This gives us \(O(1)\) effective lookup time.

How Position Lookup Works

The SA sample stores suffix array values at every \(k\)-th position. To find the text position for a BWT match at row \(i\):

  1. If \(SA[i]\) is stored, return it directly
  2. Otherwise, follow LF mapping steps until you reach a stored position
  3. Add the number of steps taken

Since we follow at most \(k\) LF steps, position lookup is \(O(k)\).

The Search Algorithm

The idea is beautifully simple: we process the pattern from right to left, one character at a time. Why backward? Because the F column is sorted, so each character narrows our search to a contiguous block of rows. Processing right-to-left means we are always extending a prefix in the sorted rotation matrix – which is exactly what binary search on the suffix array would do, but we achieve it in \(O(1)\) per character using the precomputed tables instead of \(O(\log n)\).

At each step, we maintain a range \([lo, hi]\) in the sorted rotation matrix. This range represents all rows whose corresponding suffixes start with the portion of the pattern we have processed so far. Adding one more character to the left narrows this range using two lookups in our precomputed tables.

The full FM index search for pattern \(P[0..m-1]\):

Initialize: lo = 0, hi = n - 1
For i = m-1 down to 0:
    c = P[i]
    lo = C[c] + Occ(c, lo - 1)
    hi = C[c] + Occ(c, hi) - 1
    If lo > hi: pattern not found
Return: hi - lo + 1 occurrences

Where:

  • \(C[c]\) = number of characters in \(T\) that are lexicographically smaller than \(c\) (the start of \(c\)’s block in F)
  • \(Occ(c, i)\) = number of occurrences of \(c\) in \(L[0..i]\) (computed using the tally table)

Note: By convention, \(Occ(c, -1) = 0\) – there are zero occurrences of any character before position 0. This handles the edge case when \(lo = 0\).
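The pseudocode translates almost line-for-line into Python. Here is a minimal sketch with a naive \(Occ\) (no checkpointing), fine for toy inputs but \(O(n)\) per lookup:

```python
def build_fm(t):
    """Build the BWT, the C table, and the suffix array for t (must end in '$')."""
    n = len(t)
    # '$' is the unique smallest character, so sorting suffixes sorts rotations too
    sa = sorted(range(n), key=lambda i: t[i:])
    bwt = ''.join(t[(i - 1) % n] for i in sa)   # char preceding each sorted suffix
    C, total = {}, 0
    for c in sorted(set(t)):                    # C[c] = # chars smaller than c
        C[c] = total
        total += t.count(c)
    return bwt, C, sa

def occ(bwt, c, i):
    """Naive Occ(c, i): occurrences of c in L[0..i]; Occ(c, -1) = 0."""
    return bwt[:i + 1].count(c) if i >= 0 else 0

def fm_search(t, p):
    """Return the text positions of all occurrences of pattern p in t."""
    bwt, C, sa = build_fm(t)
    lo, hi = 0, len(t) - 1
    for c in reversed(p):                       # process pattern right to left
        if c not in C:
            return []
        lo = C[c] + occ(bwt, c, lo - 1)
        hi = C[c] + occ(bwt, c, hi) - 1
        if lo > hi:
            return []                           # range emptied: pattern absent
    return sorted(sa[i] for i in range(lo, hi + 1))
```

Running `fm_search("abaaba$", "aba")` returns `[0, 3]`, in agreement with the worked example that follows.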

Worked Example: Searching for "aba" in abaaba$

Let’s trace the algorithm on our running example. From the BWT matrix in Section 10:

| Row | F | L | SA |
|---|---|---|---|
| 0 | $ | a | 6 |
| 1 | a | b | 5 |
| 2 | a | b | 2 |
| 3 | a | a | 3 |
| 4 | a | $ | 0 |
| 5 | b | a | 4 |
| 6 | b | a | 1 |

Precomputed tables:

  • \(C[\$] = 0\), \(C[a] = 1\), \(C[b] = 5\)   (count of characters lexicographically smaller)
  • \(L = [a, b, b, a, \$, a, a]\)   (the BWT)

Initialize: \(lo = 0,\; hi = 6\)

Iteration 1 (\(i = 2\), \(c = P[2] =\) a, the rightmost character):

\[lo = C[a] + Occ(a, -1) = 1 + 0 = 1\] \[hi = C[a] + Occ(a, 6) - 1 = 1 + 4 - 1 = 4\]

Range \([1, 4]\): these are the 4 rows whose suffixes start with a.

Iteration 2 (\(i = 1\), \(c = P[1] =\) b):

\(Occ(b, 0)\): count b in \(L[0..0] = [a]\) → 0.   \(Occ(b, 4)\): count b in \(L[0..4] = [a, b, b, a, \$]\) → 2.

\[lo = C[b] + Occ(b, 0) = 5 + 0 = 5\] \[hi = C[b] + Occ(b, 4) - 1 = 5 + 2 - 1 = 6\]

Range \([5, 6]\): these are the 2 rows whose suffixes start with ba.

Iteration 3 (\(i = 0\), \(c = P[0] =\) a, the leftmost character):

\(Occ(a, 4)\): count a in \(L[0..4] = [a, b, b, a, \$]\) → 2.   \(Occ(a, 6)\): count a in \(L[0..6]\) → 4.

\[lo = C[a] + Occ(a, 4) = 1 + 2 = 3\] \[hi = C[a] + Occ(a, 6) - 1 = 1 + 4 - 1 = 4\]

Range \([3, 4]\): 2 occurrences found. The suffix array tells us their text positions: \(SA[3] = 3\) and \(SA[4] = 0\). Indeed, aba appears at position 0 (abaaba$) and position 3 (aba$).

Each iteration took just two table lookups (\(C\) and \(Occ\)) – no character-by-character scanning of the text. For a genome-scale FM index, searching a 100 bp pattern takes at most 100 iterations regardless of genome size.

Interactive FM Index

See the full FM index data structure. Search for a pattern and watch checkpoints and tally lookups highlighted step by step.


12. Bowtie2 Algorithm

Now that we understand the FM index, let’s see how modern read aligners use it. Bowtie2 (Langmead & Salzberg, 2012) is one of the most widely used short-read aligners.

Overview

Bowtie2 uses a seed-and-extend strategy:

  1. Seed: Extract short substrings from the read and find exact matches using the FM index
  2. Extend: Use dynamic programming to extend seeds into full alignments

How Bowtie2 Uses the FM Index for Seeding

Each seed is looked up in the FM index using the backward search from Section 11. The algorithm processes the seed characters from right to left, narrowing the \([lo, hi]\) range at each step. If the range is non-empty at the end, the seed has exact matches in the reference, and the suffix array sample provides their genomic positions.

Bowtie2 also allows a small number of mismatches within seeds. It achieves this by trying multiple “seed variants” – backtracking during the FM index search to substitute alternative characters at certain positions. This is more expensive than pure exact matching but catches seeds that differ from the reference by one or two bases.

Seed Extraction

Bowtie2 extracts seeds of configurable length (default: 22 bp for end-to-end mode) from the read. Seeds can be extracted at regular intervals along the read.

For a 100 bp read with seed length 22 and interval 11:

Read:   ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT...
Seed 1: ACGTACGTACGTACGTACGTAC (pos 0-21)
Seed 2:            TACGTACGTACGTACGTACGTA (pos 11-32)
Seed 3:                       GTACGTACGTACGTACGTACGT (pos 22-43)
...
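The extraction step is simple enough to write out. A minimal sketch of fixed-length seeding at regular offsets (Bowtie2's real interval is a function of read length, not a constant):

```python
def extract_seeds(read, seed_len=22, interval=11):
    """Extract fixed-length seeds at regular offsets (Bowtie2-style sketch).

    Returns (offset, seed) pairs; only full-length seeds are kept.
    """
    return [(pos, read[pos:pos + seed_len])
            for pos in range(0, len(read) - seed_len + 1, interval)]
```

For a 100 bp read with the defaults above, this yields 8 seeds starting at offsets 0, 11, 22, ..., 77; each seed would then be looked up in the FM index by backward search.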

Worked Example: Aligning a Read with a SNP

Consider a 50 bp read that has a single G→A SNP at position 25:

Reference: CGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG
Read:      CGTACGTACGTACGTACGTACGTACATACGTACGTACGTACGTACGTACG
                                    ^
                                    SNP (G→A at pos 25)

Bowtie2 extracts seeds (length 22, interval 11):

Seed 1: CGTACGTACGTACGTACGTACG (pos 0-21)  → exact FM index match ✓
Seed 2: ACGTACGTACGTACATACGTAC (pos 11-32) → contains SNP, no exact match ✗
Seed 3: TACATACGTACGTACGTACGTA (pos 22-43) → contains SNP, no exact match ✗

Seed 1 hits the reference. Bowtie2 uses that hit as an anchor and extends the alignment in both directions using dynamic programming. The extension discovers the G→A mismatch at position 25 and produces a full alignment with one penalty.

Scoring Model

Bowtie2 uses a configurable scoring model:

| Event | Default Score |
|---|---|
| Match | 0 (no bonus in end-to-end mode) |
| Mismatch | -6 |
| Reference gap open | -5 |
| Reference gap extend | -3 |
| Read gap open | -5 |
| Read gap extend | -3 |

Why does a match score 0? Bowtie2 uses a penalty-based scoring system in end-to-end mode, where the maximum possible score is 0 (a perfect match with no mismatches or gaps). Every error subtracts from this score. This is different from the reward-based scoring you may be used to from Smith-Waterman. In local mode, matches do receive a bonus (+2 by default).
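The penalty arithmetic can be made concrete. A hypothetical helper using the default penalties above, where each gap of length \(L\) costs one open penalty plus \(L\) extend penalties (a simplification: real Bowtie2 also scales mismatch penalties by base quality):

```python
def end_to_end_score(mismatches=0, gap_lengths=()):
    """Best-case end-to-end score: 0 for a perfect match, penalties subtract.

    mismatches: number of mismatched bases (-6 each)
    gap_lengths: lengths of the gaps (-5 open, -3 per gapped base)
    """
    score = -6 * mismatches
    for g in gap_lengths:          # each gap pays open once, extend per base
        score -= 5 + 3 * g
    return score
```

So a perfect alignment scores 0, a single mismatch scores -6, and one mismatch plus a 2 bp gap scores -17; whether an alignment is reported depends on its score clearing the minimum-score threshold.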

Alignment Modes

| Mode | Description | Use when |
|---|---|---|
| End-to-end | Align entire read, penalize unaligned ends | Standard short-read alignment |
| Local | Allow soft-clipping of read ends | Reads may have adapter contamination |

Handling Multimapping and MAPQ

When a read maps to multiple genomic locations, Bowtie2 assigns a mapping quality (MAPQ) score reflecting confidence that the reported position is correct. MAPQ is a Phred-scaled probability: \(MAPQ = -10 \log_{10}(P_{\text{wrong}})\). A MAPQ of 0 means the read maps equally well to multiple locations; a MAPQ of 42 (Bowtie2’s maximum) means high confidence in a unique placement. Downstream tools like variant callers use MAPQ to weight or filter alignments.
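The Phred scaling is a one-liner worth internalizing. A small sketch (the hard part in a real aligner is estimating \(P_{\text{wrong}}\) from the gap between the best and second-best alignment scores, which this function takes as given):

```python
import math

def phred_mapq(p_wrong, cap=42):
    """MAPQ = -10 * log10(P(mapping is wrong)), capped at Bowtie2's max of 42."""
    if p_wrong <= 0:
        return cap                 # vanishing error probability -> maximum MAPQ
    return min(cap, round(-10 * math.log10(p_wrong)))
```

For example, a read that maps equally well to two locations has \(P_{\text{wrong}} = 0.5\), giving MAPQ 3; in practice aligners report 0 or 1 for such reads, and variant callers commonly filter at MAPQ ≥ 20 (\(P_{\text{wrong}} \le 1\%\)).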

Parameter Tuning Guide

| Parameter | Default | When to change |
|---|---|---|
| --seed-len (-L) | 22 | Lower (e.g., 16-20) for divergent references or high mutation rates; higher (e.g., 28-32) to increase speed at the cost of sensitivity |
| --seed-interval (-i) | f(read length) | Lower for higher sensitivity, higher for speed |
| --local | off | Use when reads may contain adapter sequence or when you want soft-clipping |
| -N | 0 | Set to 1 to allow one mismatch in seeds (slower but more sensitive) |
| -k | 1 (best) | Increase to report multiple alignments (e.g., -k 10 for multimapping analysis) |

Key Features

  • FM index for memory-efficient seed finding
  • Supports gapped alignment (unlike Bowtie1)
  • Multiple alignment modes (best, all valid, up to \(k\))
  • Paired-end aware – uses mate pair information to improve mapping
  • Fast for short reads – optimized for reads up to ~250 bp

13. BWA-MEM Algorithm

BWA-MEM (Li, 2013) is the other dominant short-read aligner. It also uses the FM index but with a different seeding strategy.

Maximal Exact Matches (MEMs)

Instead of fixed-length seeds, BWA-MEM finds Maximal Exact Matches (MEMs): the longest substrings of the read that match the reference exactly.

A MEM is maximal in the sense that it cannot be extended in either direction without introducing a mismatch.

How MEMs Are Found Using the FM Index

BWA-MEM finds MEMs by extending the FM index backward search until the match range becomes empty. Starting at each position \(i\) in the read:

  1. Begin a backward search from position \(i\), processing characters leftward
  2. At each step, the \([lo, hi]\) range narrows. Track the range size
  3. When the range becomes empty (no matches), stop – the longest match ending at \(i\) was the step just before the range emptied
  4. Then extend rightward from position \(i\) as well, checking if the match can grow
  5. The resulting longest bidirectional exact match is a MEM

This is more efficient than it sounds: BWA-MEM processes all positions in a single sweep using the FM index, reusing range information from adjacent positions.
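The core step – "extend a backward search until the range empties" – can be sketched with the toy FM-index machinery from Section 11 (naive \(Occ\), suffix sort; BWA-MEM's actual implementation uses a bidirectional index to extend rightward too):

```python
def build_fm(t):
    """BWT and C table for t (must end in '$'), as in Section 11."""
    n = len(t)
    sa = sorted(range(n), key=lambda i: t[i:])
    bwt = ''.join(t[(i - 1) % n] for i in sa)
    C, total = {}, 0
    for c in sorted(set(t)):
        C[c] = total
        total += t.count(c)
    return bwt, C

def longest_match_ending_at(read, i, bwt, C):
    """Length of the longest substring of `read` ending at index i (inclusive)
    that occurs exactly in the indexed text."""
    occ = lambda c, j: bwt[:j + 1].count(c) if j >= 0 else 0
    lo, hi, length = 0, len(bwt) - 1, 0
    for j in range(i, -1, -1):          # extend the match leftward
        c = read[j]
        if c not in C:
            break
        lo2 = C[c] + occ(c, lo - 1)
        hi2 = C[c] + occ(c, hi) - 1
        if lo2 > hi2:                   # range emptied: previous length was maximal
            break
        lo, hi, length = lo2, hi2, length + 1
    return length
```

For the text `abaaba$` and the read `bba`, the longest match ending at the final `a` is `ba` (length 2): extending to `bba` empties the range, since `bb` never occurs in the text.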

Super-Maximal Exact Matches (SMEMs)

BWA-MEM further refines MEMs into SMEMs: MEMs that are not contained within any longer MEM. An SMEM is the longest exact match covering each read position.

Consider aligning a read against a reference where there is a single SNP (T→G) at read position 12:

Reference: ACGTACGTACGT T CGTACGTAC
Read:      ACGTACGTACGT G CGTACGTAC
           |           |           |
           pos 0       pos 12      pos 21

SMEM 1: ACGTACGTACGT  (read pos 0-11,  length 12) — exact match left of SNP
SMEM 2: CGTACGTAC     (read pos 13-21, length 9)  — exact match right of SNP

Neither SMEM can be extended without hitting the mismatch at position 12. Together, they cover the entire read except the SNP position – giving BWA-MEM two anchor points for alignment.

Chaining

Once SMEMs are found, BWA-MEM groups them into chains: sets of SMEMs that are collinear (they appear in the same order and with consistent spacing in both the read and the reference).

Read:        [---SMEM1---]---[--SMEM2--]------[---SMEM3---]
Reference:   [---SMEM1---]---[--SMEM2--]------[---SMEM3---]
                ↑ same order, consistent gaps → forms one chain

Read:        [---SMEM1---]---[--SMEM2--]
Reference:       [--SMEM2--]---[---SMEM1---]
                ↑ reversed order → different chains (possible inversion)

Chaining is critical because it distinguishes true alignments from spurious seed hits. A chain of collinear SMEMs strongly suggests a real genomic origin, while isolated seeds may be coincidental matches in repetitive regions.
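The collinearity test itself is cheap. A hypothetical simplified predicate over `(read_pos, ref_pos, length)` seed tuples, where `max_gap_diff` is an illustrative tolerance that absorbs small indels (real chainers also score chains and allow seed overlap):

```python
def collinear(a, b, max_gap_diff=50):
    """Can seed `b` follow seed `a` in the same chain?

    Seeds are (read_pos, ref_pos, length) tuples: b must come after a in
    both the read and the reference, with roughly consistent spacing.
    """
    r1, f1, n1 = a
    r2, f2, n2 = b
    if r2 <= r1 or f2 <= f1:          # must advance in both sequences
        return False
    gap_read = r2 - (r1 + n1)         # distance between the seeds in the read
    gap_ref = f2 - (f1 + n1)          # distance in the reference
    return abs(gap_read - gap_ref) <= max_gap_diff
```

Under this test, two SMEMs in reversed reference order fail immediately (the inversion case in the diagram above), and seeds with wildly different read/reference spacing land in separate chains.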

The BWA-MEM Pipeline

  1. SMEM finding: Use FM index to find all SMEMs (minimum length: 19 bp by default)
  2. Re-seeding: For very long SMEMs (> 28 bp), re-seed with shorter fragments to handle mismatches within long matching regions
  3. Chaining: Group SMEMs that are collinear (consistent order and spacing) into chains
  4. Smith-Waterman: Extend the best chains using banded Smith-Waterman dynamic programming to produce full alignments with gaps and mismatches
  5. Pairing: For paired-end reads, resolve pairs using insert-size distribution

Handling Repetitive Regions

Repetitive sequences (e.g., Alu elements, LINE-1, centromeric repeats) are a major challenge. A seed from a repetitive region may match thousands of positions. BWA-MEM handles this by:

  • Dropping overrepresented seeds: If an SMEM matches more than 500 positions (configurable with -c), BWA-MEM discards it and relies on neighboring, more specific seeds
  • Re-seeding: Long SMEMs in repetitive regions are broken into shorter fragments that may be more specific
  • MAPQ scoring: Reads in repetitive regions receive low MAPQ scores, flagging uncertainty for downstream analysis

Key Features

  • Flexible read length: Works well from 70 bp to 1 Mbp
  • Split-read alignment: Can align chimeric reads (important for structural variants)
  • Better for longer reads: The MEM seeding strategy scales well with read length
  • Robust to errors: Re-seeding handles higher error rates

14. Bowtie2 vs. BWA-MEM Comparison

Both tools use the FM index at their core, but they differ in strategy and strengths.

Side-by-Side Comparison

| Feature | Bowtie2 | BWA-MEM |
|---|---|---|
| Index | FM index (BWT) | FM index (BWT) |
| Seeding | Fixed-length extracted seeds | Maximal Exact Matches (MEMs) |
| Seed length | Configurable (default 22) | Variable (adaptive) |
| Extension | Dynamic programming | Banded Smith-Waterman |
| Optimal read length | 50-250 bp | 70 bp - 1 Mbp |
| Gapped alignment | Yes | Yes |
| Split-read | Limited | Yes |
| Speed (short reads) | Very fast | Fast |
| Speed (long reads) | Slower | Better |
| Memory | ~3.2 GB (human) | ~5.4 GB (human) |
| Runtime (30x human WGS) | ~2-4 hours | ~2-6 hours |

When to Use Which

| Scenario | Recommended Tool |
|---|---|
| Short Illumina reads (< 250 bp), standard WGS/WES | Either works well |
| RNA-seq (needs a splice-aware aligner) | HISAT2 (Bowtie2 family) |
| Longer reads (PacBio HiFi, Nanopore) | BWA-MEM (or minimap2) |
| Structural variant detection | BWA-MEM |
| ChIP-seq, ATAC-seq | Bowtie2 |
| Very large cohort studies (speed critical) | Bowtie2 |
| Mixed read lengths | BWA-MEM |

Practical Recommendations

For most standard Illumina sequencing projects, BWA-MEM is the default choice recommended by the GATK Best Practices pipeline. Its MEM-based seeding is robust and efficient across a wide range of read lengths.

For specialized applications like ChIP-seq peak calling, Bowtie2’s scoring model and reporting options (e.g., reporting only unique alignments) may be preferable.

Bowtie2 vs. BWA-MEM: Seeding Strategy Comparison

See how the same read is processed differently by each algorithm's seeding approach.
Note: This is a simplified visualization. Real MEMs are found by matching against a reference genome using the FM index. Here, we approximate MEMs as contiguous non-N regions to illustrate the concept of variable-length vs. fixed-length seeds.


Summary

We have traced a remarkable intellectual journey:

| Data Structure / Tool | Space | Query Time | Key Idea |
|---|---|---|---|
| Brute force | O(1) | O(mn) per read | Compare everywhere |
| Substring index | O(nL) | O(L + hits) | Fixed-length k-mers |
| Hash table | O(n) | O(1) average | Direct address via hash |
| BLAST | O(query) | O(db) scan | Word lookup table + extend |
| Trie | O(nL) | O(m) | Tree-structured prefix lookup |
| Suffix trie | O(n²) | O(m) | All suffixes in a trie |
| Suffix tree | O(n) | O(m) | Collapsed suffix trie |
| Suffix array | O(n) | O(m log n) | Sorted suffix positions |
| FM index | O(n) compressed | O(m) | BWT + auxiliary tables |

BLAST uses hash-table-based word lookup to speed up homology search against large databases. The FM index goes further: compact storage, fast search, and exact matches in time proportional to the pattern length alone. This is why both Bowtie2 and BWA-MEM – the tools you will use most often for read alignment – are built on top of the FM index rather than BLAST-style lookup tables.

Important: The Key Lesson

In computational biology, the right data structure can make the difference between an analysis that takes centuries and one that takes minutes. Algorithmic thinking is not a luxury – it is a necessity.


References

  1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. “Basic local alignment search tool.” Journal of Molecular Biology 215, 403-410 (1990). DOI: 10.1016/S0022-2836(05)80360-2
  2. Langmead B, Salzberg SL. “Fast gapped-read alignment with Bowtie 2.” Nature Methods 9, 357-359 (2012).
  3. Li H. “Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM.” arXiv:1303.3997 (2013).
  4. Li H, Durbin R. “Fast and accurate short read alignment with Burrows-Wheeler transform.” Bioinformatics 25, 1754-1760 (2009).
  5. Burrows M, Wheeler DJ. “A block-sorting lossless data compression algorithm.” Technical Report 124, Digital Equipment Corporation (1994).
  6. Ferragina P, Manzini G. “Opportunistic data structures with applications.” Proceedings of FOCS 2000 (2000).
  7. Langmead B. Teaching materials and lecture notes, langmead-lab.org/teaching.
  8. Kurtz S et al. “Versatile and open software for comparing large genomes.” Genome Biology 5, R12 (2004). [MUMmer]