Lecture 15: K-mers, Minimizers, and Minimap2

Learning Objectives

By the end of this lecture, you will be able to:

  • Define k-mers and explain how they are extracted from sequences using a sliding window
  • Calculate the number of possible k-mers for a given k and explain canonical k-mers
  • Explain how k-mer counting tools handle billions of k-mers
  • Interpret a k-mer frequency histogram to estimate genome size, heterozygosity, and repeat content
  • Explain why full k-mer indexing does not scale and how minimizers solve this problem
  • Define minimizers and compute them for a given sequence, window size, and k
  • Describe the four stages of the minimap2 alignment pipeline
  • Explain the role of two-piece affine gap scoring and the Z-drop heuristic in minimap2
  • Explain how Mash uses MinHash sketches to rapidly estimate genome similarity


1. What is a K-mer?

In the previous lecture, we saw how substring indexes and hash tables enable fast sequence lookup. Now we formalize one of the most fundamental concepts in computational genomics: the k-mer.

A k-mer is a contiguous subsequence of length \(k\) from a longer sequence. Given a sequence of length \(L\), the number of k-mers is:

\[n = L - k + 1\]

For example, the sequence ACGTACG (length 7) contains \(7 - 3 + 1 = 5\) 3-mers (of which 4 are distinct):

| Position | 3-mer |
|----------|-------|
| 0 | ACG |
| 1 | CGT |
| 2 | GTA |
| 3 | TAC |
| 4 | ACG |

Notice that position 0 and position 4 yield the same 3-mer ACG. This is important – k-mers can repeat, and counting those repetitions is the foundation of k-mer analysis.
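The sliding-window extraction above takes only a couple of lines. A minimal sketch in Python (the function name `kmers` is ours):

```python
def kmers(seq: str, k: int) -> list[str]:
    # Slide a window of width k across the sequence, one base at a time
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(kmers("ACGTACG", 3))
# ['ACG', 'CGT', 'GTA', 'TAC', 'ACG'] -- 5 windows, ACG appears twice
```

Note the count: a sequence of length 7 yields exactly \(7 - 3 + 1 = 5\) windows.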

How Many Possible K-mers?

Since DNA has 4 bases (A, C, G, T), the total number of distinct k-mers of length \(k\) is:

\[4^k\]

| k | Possible k-mers (\(4^k\)) | Context |
|---|---|---|
| 3 | 64 | Codons |
| 5 | 1,024 | Small seeds |
| 11 | 4,194,304 | ~4 million |
| 15 | 1,073,741,824 | ~1 billion |
| 21 | 4,398,046,511,104 | ~4.4 trillion |
| 31 | ~4.6 \(\times 10^{18}\) | Exceeds any genome |

**Important: Key Insight**

For \(k \geq 18\), the number of possible k-mers (\(4^{18} \approx 6.9 \times 10^{10}\)) exceeds the length of the human genome (~\(3 \times 10^9\) bp). This means most k-mers of length 18 or longer appear at most once in a genome, making them excellent unique identifiers for genomic positions.

Reverse Complements and Canonical K-mers

DNA is double-stranded. A k-mer on the forward strand has a reverse complement on the reverse strand. For example:

  • Forward: ACGTAC
  • Reverse complement: GTACGT

When we count k-mers, we usually do not know which strand a read came from. To handle this, we use canonical k-mers: for each k-mer, we take the lexicographically smaller of the k-mer and its reverse complement.

\[\text{canonical}(s) = \min(s, \text{revcomp}(s))\]

For ACGTAC vs GTACGT: since ACGTAC < GTACGT lexicographically, the canonical form is ACGTAC.
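The canonical-form rule can be implemented directly from the definition; a small sketch (the helper names `revcomp` and `canonical` are ours):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    # Complement each base, then reverse
    return s.translate(COMPLEMENT)[::-1]

def canonical(s: str) -> str:
    # Lexicographically smaller of the k-mer and its reverse complement
    return min(s, revcomp(s))

print(canonical("ACGTAC"))  # ACGTAC (smaller than its revcomp GTACGT)
print(canonical("TTTACG"))  # CGTAAA
```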

**Note: Why Canonical K-mers Matter**

Using canonical k-mers ensures that a k-mer and its reverse complement are counted together. This halves the number of distinct k-mers we need to track and correctly handles reads from either strand. Most k-mer counting tools (Jellyfish, KMC) use canonical k-mers by default.

Interactive K-mer Extraction Demo

Enter a DNA sequence and k value to see the sliding window extract k-mers. Toggle canonical mode to see reverse complement handling.



2. K-mer Counting

Counting k-mers – determining how many times each k-mer appears in a dataset – is one of the most fundamental operations in genomics. It is the foundation for genome assembly, error correction, variant detection, and genome profiling. The challenge is scale: a typical Illumina sequencing run produces billions of k-mers that must be counted efficiently.

Tools like Jellyfish (Marçais and Kingsford, Bioinformatics 2011) and KMC (Deorowicz et al., Bioinformatics 2015; KMC 3: Kokot et al., Bioinformatics 2017) solve this problem using hash tables and disk-based sorting, respectively. We will revisit KMC’s clever partitioning strategy later, once we have covered minimizers.
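At lecture scale, canonical k-mer counting fits in an ordinary hash table. A toy in-memory version (real counters use far more engineering to stay within memory and disk budgets):

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def count_canonical_kmers(seq: str, k: int) -> Counter:
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(COMPLEMENT)[::-1]
        counts[min(kmer, rc)] += 1  # count the canonical form
    return counts

print(count_canonical_kmers("ACGTACG", 3))
# Counter({'ACG': 3, 'GTA': 2}) -- CGT and TAC collapse onto their revcomps
```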

Error K-mers vs True K-mers

Sequencing errors create k-mers that do not exist in the true genome. A single base error in a read produces up to \(k\) erroneous k-mers (since the error falls within \(k\) overlapping windows). These error k-mers typically appear only once in the dataset, while true genomic k-mers appear many times (proportional to sequencing depth).

**Important: Key Insight**

Error k-mers vastly outnumber true k-mers in count – a typical dataset may contain more distinct error k-mers than true k-mers. But error k-mers appear at frequency 1, while true k-mers from a haploid genome cluster around the average coverage (in a diploid genome, homozygous k-mers appear at ~\(C\) and heterozygous k-mers at ~\(C/2\)). This frequency separation is the basis of k-mer frequency histogram analysis and error correction.

Choosing k

The choice of \(k\) involves a fundamental trade-off:

| k too small | k too large |
|---|---|
| Many k-mers map to multiple locations (ambiguity) | Sequencing errors break k-mers (sensitivity loss) |
| Poor specificity for large genomes | Higher memory requirements |
| Shorter than repeat elements | May exceed read length |

A common rule of thumb: choose \(k\) such that \(4^k\) is much larger than the genome size, but \(k\) is small relative to the read length. Typical values: \(k = 21\) to \(31\) for genome assembly, \(k = 31\) for k-mer frequency histogram analysis. The tool KmerGenie (Chikhi and Medvedev, Bioinformatics 2014) automates this choice by sampling k-mer abundance histograms across a range of \(k\) values and selecting the one that optimizes the expected assembly quality.

Interactive K-mer Counting & Frequency Histogram

Enter a sequence (or use the default), set k, and see the k-mer frequency distribution. Notice how repeated k-mers form peaks.


3. The K-mer Frequency Histogram

The k-mer frequency histogram (also called the k-mer abundance histogram) is a plot of how many distinct k-mers appear at each frequency (multiplicity). It is one of the most informative views of a sequencing dataset – from this single plot, we can estimate genome size, heterozygosity, repeat content, and sequencing error rate, all without a reference genome.

**Note: Terminology (Spectrum vs Histogram)**

The term “k-mer spectrum” is sometimes used loosely to refer to this histogram, but strictly speaking the k-mer spectrum of a genome is the set of all distinct k-mers it contains – much richer information than just the histogram of their counts. In this lecture we use k-mer frequency histogram to refer specifically to the count distribution.

Reading the Histogram

A typical k-mer frequency histogram from whole-genome sequencing has these features:

| Feature | Position in histogram | Meaning |
|---|---|---|
| Error peak | Frequency 1-2 | K-mers caused by sequencing errors |
| Homozygous peak | ~Coverage | K-mers from regions identical on both haplotypes |
| Heterozygous peak | ~Coverage/2 | K-mers unique to one haplotype |
| Repeat peaks | 2x, 3x, … coverage | K-mers from duplicated regions |

Understanding the Histogram Shape

The k-mer frequency histogram is a powerful visual summary of a sequencing dataset. To interpret it properly, it helps to understand what generates each feature of the plot.

The error peak (multiplicity ~1). Sequencing errors create k-mers that do not exist in the true genome. Since errors are rare and roughly random, each erroneous k-mer typically appears only once. A single base error in a read corrupts up to \(k\) overlapping k-mers, so the total number of error k-mers is large even at low error rates. This produces a tall, sharp peak at multiplicity 1 that dominates the left side of the spectrum. At higher error rates, this peak grows taller and its tail extends slightly further to the right.

The heterozygous peak (multiplicity ~\(C/2\)). In a diploid genome, positions that differ between the two haplotypes produce k-mers that are present on only one copy. These k-mers are sampled at half the average coverage, producing a peak at approximately \(C/2\). The height of this peak is proportional to the heterozygosity rate – highly heterozygous genomes (like many marine invertebrates) show a prominent heterozygous peak, while inbred lines show almost none.

The homozygous peak (multiplicity ~\(C\)). K-mers from regions identical on both haplotypes are sampled from both copies and appear at approximately the full coverage \(C\). This is the main peak of the spectrum and represents the majority of unique genomic sequence. Its position directly indicates the average sequencing depth.

Repeat peaks (multiplicity ~\(2C\), \(3C\), …). K-mers from genomic regions that are duplicated \(n\) times appear at approximately \(n \times C\). Two-copy repeats produce a peak at \(2C\), three-copy repeats at \(3C\), and so on. Genomes with high repeat content (like many plant genomes) show a heavier tail with visible peaks at these higher multiplicities.

**Note: How Genome Properties Shape the Spectrum**

The relative sizes of these peaks encode genome properties: a large heterozygous peak relative to the homozygous peak indicates high heterozygosity. A heavy tail extending to high multiplicities indicates a repeat-rich genome. A narrow homozygous peak indicates uniform coverage, while a broad peak suggests coverage variation (which is why GenomeScope uses a negative binomial rather than a Poisson distribution). The choice of \(k\) also matters – smaller \(k\) values increase the chance that k-mers from different genomic locations collide, artificially inflating the higher-multiplicity peaks.

Interactive K-mer Frequency Histogram Simulator

Adjust genome parameters to see how they affect the k-mer frequency spectrum. The simulation uses a mixture of distributions to model error, heterozygous, homozygous, and repeat k-mers.

GenomeScope: Genome Profiling from K-mers

GenomeScope (Vurture et al., Bioinformatics 2017) fits a mixture model to the k-mer frequency histogram to estimate:

  • Genome size: Total number of unique k-mers divided by the average coverage
  • Heterozygosity: Ratio of the heterozygous peak to the homozygous peak
  • Repeat fraction: Area under the high-frequency tail

The model assumes k-mers are sampled from the genome following a negative binomial distribution (to account for coverage variation). GenomeScope 2.0 (Ranallo-Benavidez et al., Nature Communications 2020) extended this to polyploid genomes.

**Note: Genome Size Estimation**

The formula for genome size from k-mer counting is elegantly simple:

\[G = \frac{N}{C}\]

where \(N\) is the total number of k-mers (after error removal) and \(C\) is the average k-mer coverage (the position of the homozygous peak). This works because each base in the genome is covered by approximately \(C\) k-mers. In practice, GenomeScope fits a negative binomial mixture model to the spectrum to estimate these parameters more precisely, but the core intuition is this simple ratio.
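A toy illustration of the \(G = N/C\) ratio on a made-up histogram (all numbers below are invented for the example; real pipelines fit a model rather than reading the peak directly):

```python
# hist[m] = number of distinct k-mers observed m times (invented numbers)
hist = {1: 5_000_000,                         # error peak at multiplicity 1
        24: 40_000, 25: 120_000, 26: 40_000}  # homozygous peak near C = 25

# Drop the error peak, then apply G = N / C
true_kmers = {m: n for m, n in hist.items() if m > 2}
N = sum(m * n for m, n in true_kmers.items())  # total k-mer occurrences
C = max(true_kmers, key=true_kmers.get)        # peak position (coverage)
print(N // C)  # estimated genome size: 200000
```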

Merqury: Assembly Quality Assessment

Merqury (Rhie et al., Genome Biology 2020) uses k-mer set operations to evaluate genome assemblies without a reference:

  • Spectra-asm plot: Which read k-mers are found in the assembly? Missing k-mers indicate gaps; extra k-mers indicate errors.
  • QV estimation: The fraction of k-mers in the assembly not found in the reads gives an estimated base-level quality score.
  • Hap-mers: K-mers unique to one haplotype enable phasing assessment.

**Important: Key Insight**

The k-mer frequency histogram is like a fingerprint for your sequencing data. Before running any assembly or alignment, examining it tells you fundamental properties of your genome – size, ploidy, repeat content, error rate – and helps you choose appropriate parameters for downstream analysis.


4. The Scaling Problem

We have seen that k-mers are powerful tools for genome analysis. But there is a problem: storing all k-mers does not scale.

Consider indexing the human genome with 15-mers:

| Quantity | Value |
|---|---|
| Genome length | \(3 \times 10^9\) bp |
| Number of 15-mers | ~\(3 \times 10^9\) |
| Bytes per entry (k-mer + position) | ~12 |
| Total index size | ~36 GB |

This is manageable for a single genome. But what about mapping long reads? A 10 kb read generates ~10,000 k-mers. Each must be looked up in the index. With millions of reads, we need billions of lookups.

For more complex tasks – like all-vs-all comparison of reads for overlap detection in genome assembly – the situation is worse. With \(N\) reads, we need \(O(N^2)\) comparisons, each involving many k-mer lookups.

**Important: Key Insight**

We do not need to index every k-mer. If we can select a representative subset of k-mers that still covers the genome densely enough to find all true matches, we can dramatically reduce index size and lookup time. This is the key idea behind minimizers.


5. Minimizers

Minimizers (Roberts et al., Bioinformatics 2004) provide an elegant solution to the scaling problem. The idea is simple: instead of using every k-mer as a seed, select only the smallest k-mer within each window of consecutive k-mers.

Definition

Given parameters:

  • \(k\): the k-mer size
  • \(w\): the window size (number of consecutive k-mers in each window)

A minimizer is the lexicographically smallest k-mer among \(w\) consecutive k-mers. As the window slides by one position, the minimizer may stay the same or change.

For a sequence of length \(L\):

  • Total k-mers: \(L - k + 1\)
  • Total windows of \(w\) consecutive k-mers: \(L - k - w + 2\)
  • Each window selects exactly one minimizer (the smallest k-mer in it)
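The definition transcribes directly into code. In this sketch ties go to the leftmost smallest k-mer; real implementations typically order k-mers by a hash value rather than lexicographically, and handle both strands:

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    # All k-mers, then the smallest one per window of w consecutive k-mers
    kmer_list = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmer_list) - w + 1):
        window = kmer_list[start:start + w]
        j = min(range(w), key=lambda i: window[i])  # leftmost minimum
        selected.add((start + j, window[j]))        # (position, k-mer)
    return sorted(selected)

print(minimizers("ACGTACGTGC", k=3, w=3))
# [(0, 'ACG'), (1, 'CGT'), (4, 'ACG'), (5, 'CGT')] -- 4 of 8 k-mers kept
```

Adjacent windows often share the same minimum, which is why far fewer than one seed per window survives deduplication.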

Why Minimizers Work

The key property: if two sequences share a substring of length \(w + k - 1\), they must share at least one minimizer. This is because they share \(w\) consecutive k-mers, and the minimum of those k-mers must be the same.

This guarantee means minimizers preserve matching sensitivity while dramatically reducing the number of seeds:

| Approach | Seeds per genome | Relative size |
|---|---|---|
| Every k-mer | \(n\) | 1x |
| Every w-th k-mer | \(n/w\) | \(1/w\) (but misses matches) |
| Minimizers (window \(w\)) | ~\(2n/(w+1)\) | ~\(2/(w+1)\) (no missed matches) |

The density of minimizers – the fraction of k-mers selected – approaches \(2/(w+1)\) in random sequences. With \(w = 10\), this means only about 18% of k-mers are kept, an 82% reduction in index size with no loss of matching sensitivity.

**Tip: Quick Intuition**

Think of minimizers as street addresses for sequence regions. Rather than describing every nucleotide, you pick a few landmark k-mers that let you rapidly find where two sequences might align — then do the expensive exact comparison only there.

**Note: Practical Window Sizes**

Minimap2 uses \(k = 15\) and \(w = 10\) for long reads, and \(k = 21\) and \(w = 11\) for short reads. The Kraken 2 metagenomic classifier stores only the minimizers (default length \(\ell = 31\)) of its k-mers (default \(k = 35\)) rather than the full k-mers, achieving an ~85% memory reduction compared to Kraken 1.

Interactive Minimizer Selection Demo

See how minimizers are selected: the lexicographically smallest k-mer in each window of w consecutive k-mers. Minimizers are highlighted in red.


6. Minimap2: A Deep Dive

Minimap2 (Li, Bioinformatics 2018) is arguably the most versatile sequence alignment tool in genomics today. It handles short reads, long noisy reads (PacBio, Oxford Nanopore with ~15% error), cDNA, and genome-to-genome alignment – all in a single tool. It is 3-4x faster than BWA-MEM for short reads and 30x or more faster than previous long-read mappers.

The secret to minimap2’s speed and versatility is a four-stage pipeline built on minimizers.

Stage 1: Indexing

Minimap2 builds an index of the reference genome by:

  1. Extracting all \((w, k)\)-minimizers from the reference
  2. Storing them in a hash table: minimizer \(\rightarrow\) list of positions

| Preset | k | w | Use case |
|---|---|---|---|
| map-ont | 15 | 10 | Oxford Nanopore reads |
| map-pb | 19 | 19 | PacBio CLR reads |
| sr | 21 | 11 | Short reads (Illumina) |
| asm5 | 19 | 19 | Assembly-to-assembly (<5% divergence) |

The index for the human genome is typically ~6 GB in memory – much smaller than a full k-mer index.

Stage 2: Seeding (Anchor Finding)

For each query sequence (read):

  1. Extract the same \((w, k)\)-minimizers from the query
  2. Look up each minimizer in the reference index
  3. Each match produces an anchor: a pair \((x, y)\) where \(x\) is the reference position and \(y\) is the query position

The result is a set of anchors – potential match points between query and reference.

Stage 3: Chaining

Anchors must be grouped into chains representing co-linear matches. Minimap2 uses dynamic programming to find the highest-scoring chain:

\[f(i) = \max \left( s_i, \max_{j < i} \left[ f(j) + s_i - \text{gap\_cost}(i, j) \right] \right)\]

where \(s_i\) is the seed score and the gap cost penalizes large jumps in either the reference or query coordinate. The gap cost accounts for both distance between anchors and the difference between reference and query gaps (indels).
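The recurrence can be transcribed as a toy \(O(n^2)\) dynamic program. Here every anchor gets the same seed score, and the gap cost is reduced to the implied indel size \(|\Delta x - \Delta y|\); minimap2's actual cost is more elaborate (it also penalizes distance between anchors):

```python
def chain_score(anchors, seed_score=15):
    """Best chain score over anchors (x = ref pos, y = query pos)."""
    anchors = sorted(anchors)
    f = []
    for i, (xi, yi) in enumerate(anchors):
        best = seed_score                      # chain starting at anchor i
        for j, (xj, yj) in enumerate(anchors[:i]):
            if xj < xi and yj < yi:            # co-linear predecessors only
                gap = abs((xi - xj) - (yi - yj))
                best = max(best, f[j] + seed_score - gap)
        f.append(best)
    return max(f)

# Three co-linear anchors chain into one high-scoring chain:
print(chain_score([(100, 10), (150, 60), (200, 110)]))  # 45
```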

**Note: Chaining Complexity**

A naive implementation of chaining is \(O(n^2)\) in the number of anchors. Minimap2 optimizes this with a band-limited approach – it only considers anchors within a distance threshold, achieving practical near-linear performance.

Stage 4: Base-Level Alignment

For each chain, minimap2 performs base-level alignment between the chained anchors. Two key innovations make this practical for long, noisy reads:

Two-piece affine gap scoring:

Instead of the standard affine gap penalty \(\gamma(l) = o + e \cdot l\) (gap open \(o\) + gap extend \(e\) times length \(l\)), minimap2 uses:

\[\gamma(l) = \min(o_1 + e_1 \cdot l, \; o_2 + e_2 \cdot l)\]

where \((o_1, e_1)\) penalizes short gaps with a lower open cost, and \((o_2, e_2)\) penalizes long gaps with a higher open cost but lower extension cost. This two-piece model better handles the mix of short indels (from sequencing errors) and long gaps (from structural variants or introns).
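The two-piece cost is just the minimum of two affine lines. The default values below mirror minimap2's `-O 4,24 -E 2,1` (gap-open 4/24, gap-extend 2/1):

```python
def two_piece_gap(l: int, o1=4, e1=2, o2=24, e2=1) -> int:
    # Short gaps fall on the cheap-open line, long gaps on the cheap-extend line
    return min(o1 + e1 * l, o2 + e2 * l)

print(two_piece_gap(3))    # 10: a 3 bp indel is charged by the (o1, e1) piece
print(two_piece_gap(100))  # 124: a 100 bp gap is charged by the (o2, e2) piece
```

The crossover sits where the two lines meet: with these values, gaps longer than 20 bp are charged by the second piece.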

Z-drop heuristic:

Minimap2 stops extending an alignment when the score drops by more than \(Z\) below the maximum score seen so far:

\[\text{Stop if } \; S_{\max} - S_{\text{current}} > Z\]

This prevents wasting time on poor extensions at the ends of alignments, which is critical for long reads where only a portion may align.

Minimap2 Versatility

| Mode | What it aligns | Key feature |
|---|---|---|
| Long reads to reference | ONT/PacBio to genome | Handles 15% error rate |
| Short reads to reference | Illumina to genome | 3-4x faster than BWA-MEM |
| Splice-aware | cDNA/RNA to genome | Finds GT-AG splice sites |
| Assembly to assembly | Contigs to reference | Detects structural variants |

**Important: Key Insight**

Minimap2 demonstrates how minimizers transform sequence alignment. By replacing exact k-mer indexing with minimizer-based seeding, minimap2 achieves massive speedups while maintaining sensitivity. The four-stage pipeline – index, seed, chain, align – has become the dominant paradigm for modern sequence alignment.

Minimap2 Pipeline: Seed-Chain-Align Visualization

Step through the four stages of the minimap2 algorithm. See how minimizer seeds become anchors, chains, and alignments.


7. Mash: Fast Genome Distance Estimation

The Problem: Comparing Thousands of Genomes

Imagine you have 54,118 genomes from NCBI RefSeq and want to know which are closely related. Pairwise whole-genome alignment would require over 1.4 billion comparisons, each taking minutes to hours. Even with fast aligners, this is computationally prohibitive.

What if we could estimate genome similarity in microseconds per pair – without performing any alignment at all?

Genomes as K-mer Sets

We already know how to decompose a sequence into k-mers. A key insight is that we can treat an entire genome as a set of its constituent k-mers:

  • Two identical genomes share all k-mers
  • A single mutation destroys \(k\) overlapping k-mers
  • Unrelated genomes share almost no k-mers (for sufficiently large \(k\))

This transforms a sequence comparison problem into a set comparison problem. And set comparison has well-studied, efficient solutions.

Jaccard Index: Measuring Set Similarity

The Jaccard index measures the overlap between two sets:

\[J(A, B) = \frac{|A \cap B|}{|A \cup B|}\]

Where \(A\) and \(B\) are the k-mer sets of two genomes:

  • Numerator \(|A \cap B|\) (intersection): the number of k-mers that appear in both genomes – the shared k-mers.
  • Denominator \(|A \cup B|\) (union): the number of distinct k-mers that appear in either genome (or both), counting each unique k-mer only once.

For example, if genome A has k-mers {ACG, CGT, GTA, TAC} and genome B has {ACG, CGT, GTG, TGA}:

| | K-mers | Size |
|---|---|---|
| \(A \cap B\) (shared) | ACG, CGT | 2 |
| \(A \cup B\) (all unique) | ACG, CGT, GTA, TAC, GTG, TGA | 6 |

\[J = \frac{2}{6} = 0.333\]
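With Python sets, the whole worked example is three lines:

```python
A = {"ACG", "CGT", "GTA", "TAC"}
B = {"ACG", "CGT", "GTG", "TGA"}

J = len(A & B) / len(A | B)  # |intersection| / |union|
print(round(J, 3))  # 0.333
```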

Dividing by the union rather than the size of one set ensures the measure is symmetric (\(J(A,B) = J(B,A)\)) and bounded between 0 and 1:

  • \(J = 1\): the sets are identical
  • \(J = 0\): the sets share no elements
  • \(J \approx 0.5\): about half the elements are shared

For two genomes represented as k-mer sets, the Jaccard index captures their overall similarity. If two genomes are 95% identical, their k-mer sets overlap a lot, giving a high Jaccard. If they are distantly related, the Jaccard drops near zero. The problem? A single bacterial genome produces millions of k-mers. Comparing every k-mer between two whole genomes is prohibitively expensive.

K-mer Sketches

A sketch is a small, fixed-size summary of a sequence’s k-mer content that lets you estimate similarity without comparing full k-mer sets. Instead of storing hundreds of millions of k-mers, a sketch compresses them down to, say, 1000 representative values – and you compare those instead.

**Tip: Quick Intuition**

Think of a sketch like a fingerprint. You don’t need to compare every ridge of two fingerprints to know they match – a small set of distinctive features is enough. Sketches do the same for genomes: they capture just enough of the k-mer landscape to reliably estimate how similar two sequences are, at a tiny fraction of the cost.

MinHash Sketching

The most common sketching approach is MinHash (Broder, 1997). The idea:

  1. Hash every k-mer in the sequence to a number
  2. Keep only the \(s\) smallest hash values (the “bottom-\(s\) sketch”)
  3. Compare sketches instead of full sets
All k-mers:   hash → [0.72, 0.11, 0.95, 0.03, 0.44, 0.67, 0.28, ...]
                                          ↓
Keep bottom s=3:      [0.03, 0.11, 0.28]   ← this is your sketch

To estimate the Jaccard index between two sequences, count how many of the smallest hashes they share:

Sketch A: [0.03, 0.11, 0.28]
Sketch B: [0.03, 0.14, 0.28]

Shared: 0.03 and 0.28  →  J ≈ 2/4 = 0.5

The fraction of shared elements between two sketches approximates the Jaccard index:

\[J(A, B) \approx \frac{|\text{sketch}(A) \cap \text{sketch}(B)|}{|\text{sketch}(A) \cup \text{sketch}(B)|}\]

Why does this work? A random hash function maps set elements uniformly across the hash space. The probability that the minimum hash value comes from the intersection \(A \cap B\) equals \(\frac{|A \cap B|}{|A \cup B|}\) – exactly the Jaccard index. Keeping the bottom-\(s\) values rather than just the minimum gives us \(s\) independent estimates, improving accuracy.

With a sketch size of \(s = 1000\), each genome is reduced to a fingerprint of just a few kilobytes – regardless of whether the genome is 5 million or 3 billion bases long.
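A minimal bottom-\(s\) sketch (the hash choice and helper names are ours; md5 just keeps the toy deterministic). The estimator follows the Mash convention: take the bottom-\(s\) values of the merged sketches and ask how many appear in both:

```python
import hashlib

def h(kmer: str) -> int:
    # Stand-in for a fast k-mer hash; md5 keeps the example deterministic
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def sketch(kmers: set[str], s: int) -> set[int]:
    return set(sorted(h(k) for k in kmers)[:s])  # bottom-s hash values

def jaccard_estimate(sa: set[int], sb: set[int], s: int) -> float:
    merged = set(sorted(sa | sb)[:s])            # bottom-s of the union
    return len(merged & sa & sb) / len(merged)

A = {"ACG", "CGT", "GTA", "TAC", "ACT", "CTG"}
B = {"ACG", "CGT", "GTA", "TAC", "AGT", "GTC"}
sa, sb = sketch(A, 4), sketch(B, 4)
print(jaccard_estimate(sa, sb, 4))  # noisy estimate of the true J(A, B) = 0.5
```

With such tiny sets the estimate is noisy; accuracy improves as \(s\) grows, exactly as described above.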

Connection to Minimizers

A minimizer is essentially MinHash applied to a local window. The key differences:

| | MinHash sketch | Minimizers |
|---|---|---|
| Scope | Whole sequence | Local window |
| Purpose | Global similarity | Local alignment anchors |
| Output | A set of \(s\) hash values | A representative k-mer per window |

Both techniques rely on the same core insight – select a small number of representative k-mers using hash values – but they serve different goals. Minimizers find local anchors for alignment; MinHash sketches estimate global similarity.

What You Can Estimate From a Sketch

Sketches enable several types of queries:

  • Jaccard similarity – are these sequences similar?
  • Containment – is genome A contained within genome B? (useful for metagenomics)
  • Genome distance – Mash converts Jaccard to an evolutionary distance (see next section)

Tools Built on Sketching

| Tool | Sketch type | Use case |
|---|---|---|
| Mash | MinHash (bottom-\(s\)) | Genome distance, contamination detection |
| Sourmash | FracMinHash (scaled) | Metagenome classification, containment |
| Dashing | HyperLogLog | Ultra-fast cardinality and similarity |
| HULK | Histogram sketching | Streaming metagenomics |

From Jaccard to Mash Distance

The Jaccard index measures k-mer set overlap, but biologists think in terms of mutation rates. Mash (Ondov et al., 2016) converts the Jaccard index to a Mash distance \(D\) that approximates the average nucleotide identity (ANI):

\[D = \frac{-1}{k} \ln \frac{2J}{1+J}\]

The intuition: a single nucleotide mutation destroys \(k\) overlapping k-mers, so the relationship between mutation rate and k-mer similarity depends on \(k\). The formula accounts for this by scaling by \(\frac{1}{k}\).
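The formula is a one-liner; the `abs()` below only tidies the `-0.0` that floating point produces at \(J = 1\), and \(J = 0\) is mapped to the maximum distance by convention:

```python
import math

def mash_distance(j: float, k: int = 21) -> float:
    if j <= 0:
        return 1.0                    # no shared k-mers: maximal distance
    return abs(-1.0 / k * math.log(2 * j / (1 + j)))

print(mash_distance(1.0))            # 0.0 -- identical k-mer sets
print(round(mash_distance(0.9), 4))  # a high Jaccard gives a tiny distance
```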

| Mash distance \(D\) | Approximate ANI | Interpretation |
|---|---|---|
| 0.0 | 100% | Identical genomes |
| 0.005 | 99.5% | Same species (strains) |
| 0.05 | ~95% | Species boundary |
| 0.3+ | <70% | Different genera/families |

**Note: ANI and the Species Boundary**

An ANI of ~95% is widely used as a species delineation threshold for prokaryotes. Mash distances below ~0.05 reliably identify members of the same species, making Mash a practical tool for taxonomic classification.

The Mash Pipeline

The complete Mash pipeline has just three steps:

  1. K-merize each genome (default \(k = 21\))
  2. Hash and sketch: apply a hash function, keep the bottom-\(s\) values (default \(s = 1000\))
  3. Compare sketches: count shared hashes between any two sketches in \(O(s)\) time

The critical insight: sketch size is independent of genome size. Comparing two sketches of size 1000 takes the same time whether the genomes are 5 Mb bacteria or 3 Gb mammals. This makes all-vs-all comparison feasible: 54,118 genomes were clustered in just 33 CPU-hours by Ondov et al. (2016).

Interactive Mash Demo

Interactive MinHash Sketch & Mash Distance Demo

Enter two DNA sequences to see how MinHash sketches approximate the Jaccard index and compute Mash distance.

**Tip: Try It**

  1. Start with the default sequences that differ in the middle. Compare the exact Jaccard (purple panel) with the sketch estimate (blue panel) – notice the error in “Sketch vs Exact.”
  2. Make Sequence B identical to Sequence A – both Jaccard values should become 1.0 and the error should drop to 0.
  3. Increase \(k\) – what happens to the Jaccard index when sequences differ?
  4. Increase sketch size \(s\) – the sketch estimate becomes more accurate (error shrinks toward zero).
  5. Try completely different sequences – the Jaccard drops near 0 and Mash distance becomes large.

Sourmash and Scaled MinHash

The standard MinHash sketch keeps exactly \(s\) hashes. An alternative approach, FracMinHash (also called “scaled MinHash”), keeps all hash values below a threshold \(\frac{H}{f}\) where \(H\) is the maximum hash value and \(f\) is a scaling factor (Brown and Irber, 2016). This means sketch size scales proportionally with genome size, which enables containment queries: “what fraction of genome A’s k-mers appear in genome B?” This is particularly valuable for metagenomics, where you want to detect the presence of a small genome within a large mixed sample. The tool sourmash implements FracMinHash and has become widely used for metagenomic analysis.
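The FracMinHash idea fits in a few lines (the hash helper is ours; `f` plays the role of sourmash's scaling factor). Because membership in the sketch depends only on a k-mer's own hash value, sketches of different datasets remain directly comparable, and containment falls out naturally:

```python
import hashlib

H = 2 ** 64                        # size of the hash space

def h(kmer: str) -> int:
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def frac_sketch(kmers: set[str], f: int) -> set[int]:
    # Keep every k-mer whose hash lands below H/f (~1/f of all k-mers),
    # so sketch size grows in proportion to the input
    return {x for x in map(h, kmers) if x < H // f}

def containment(sa: set[int], sb: set[int]) -> float:
    # Estimated fraction of A's k-mers that appear in B
    return len(sa & sb) / len(sa) if sa else 0.0
```

With `f = 1` every k-mer is kept and containment is exact; larger `f` trades accuracy for a proportionally smaller sketch.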

References

  • Ondov BD, Treangen TJ, Melsted P, et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome Biology 17:132, 2016. DOI: 10.1186/s13059-016-0997-x
  • Broder AZ. “On the resemblance and containment of documents.” Proceedings of the Compression and Complexity of Sequences, 21-29, 1997.
  • Brown CT, Irber L. “sourmash: a library for MinHash sketching of DNA.” JOSS 1(5):27, 2016.

8. Summary

This lecture traced the journey from k-mers to modern sequence analysis:

| Concept | Key idea | Tools |
|---|---|---|
| K-mers | Sliding window of length k over a sequence; \(4^k\) possible | Foundation of all methods |
| K-mer counting | Hash tables and disk-based sorting to tally k-mers | Jellyfish, KMC 2/3 |
| K-mer frequency histogram | Count distribution reveals genome properties | GenomeScope, Merqury |
| Minimizers | Smallest k-mer in a window; reduces index by ~80% | Roberts et al. 2004 |
| Minimap2 | Index-seed-chain-align pipeline using minimizers | Li 2018 |
| Mash | MinHash sketches for fast genome distance estimation | Ondov et al. 2016 |

The thread connecting everything: k-mers are the atomic unit of sequence analysis. We count them to profile genomes. We subsample them with minimizers to build efficient indexes. We chain minimizer hits to find alignments. And we use k-mer set operations to evaluate assemblies.

**Important: The Big Picture**

The progression in this lecture mirrors a recurring theme in computational genomics: start with a brute-force approach, understand its limitations, then find a clever subsampling or indexing strategy that preserves the essential information while dramatically reducing computational cost. K-mers to minimizers is one of the clearest examples of this pattern.

Looking Ahead

K-mers and minimizers appear throughout modern genomics:

  • Genome assembly: De Bruijn graphs are built from k-mers (SPAdes, hifiasm). Efficient construction of compacted de Bruijn graphs has been advanced by algorithms like TwoPaCo (Minkin et al., Bioinformatics 2017) and BCALM 2 (Chikhi, Limasset, and Medvedev, Bioinformatics 2016), which use minimizer-based partitioning for scalable graph construction.
  • Metagenomic classification: Kraken 2 maps minimizers to taxonomic labels (building on the original Kraken by Wood and Salzberg, Genome Biology 2014)
  • Sequence comparison: Mash and sourmash (covered in Section 7) extend to metagenomics, environmental sampling, and outbreak tracking
  • Error correction: Tools like BFC correct reads based on k-mer frequencies
  • K-mer set representations: Rahman and Medvedev (RECOMB 2020) introduced spectrum-preserving string sets (SPSS), a compact way to represent a set of k-mers as a collection of strings that preserves all k-mer membership and abundance information. Their UST algorithm finds a near-optimal SPSS by computing a path cover of the compacted de Bruijn graph.
  • K-mer data structure surveys: A comprehensive review by Marchet, Boucher, Puglisi, Medvedev, Salson, and Chikhi (Genome Research 2021) surveys the landscape of data structures for representing and querying k-mer sets across large collections of sequencing datasets, classifying approaches by their supported operations and space-time trade-offs.

The minimizer concept, born from a 2004 paper on sequence comparison, has become one of the most influential ideas in bioinformatics – enabling tools like minimap2 and Kraken 2 that process data at scales unimaginable two decades ago.