Open In Colab

Lecture 6: FASTQ and Quality Scores

💡 Attribution: This material uses examples from notebooks developed by Ben Langmead

📚 Python concepts in this lecture

This lecture applies concepts from Think Python (3rd edition):

Chapter Concepts used in this lecture

Ch. 9 - Lists Creating lists, append(), slicing [1:], iteration with for, nested lists, list comprehensions

Ch. 10 - Dictionaries Dictionary syntax {'key': value}, lists as dictionary values (used in pandas DataFrames)

Ch. 11 - Tuples Creating tuples (a, b, c), tuple unpacking name, seq, qual = ..., tuple indexing read[2]

Look for (Ch. N) annotations in the code explanations below.

Chapter	Concepts used in this lecture
Ch. 9 - Lists	Creating lists, `append()`, slicing `[1:]`, iteration with `for`, nested lists, list comprehensions
Ch. 10 - Dictionaries	Dictionary syntax `{'key': value}`, lists as dictionary values (used in pandas DataFrames)
Ch. 11 - Tuples	Creating tuples `(a, b, c)`, tuple unpacking `name, seq, qual = ...`, tuple indexing `read[2]`

FASTQ

This notebook explores FASTQ, the most common format for storing sequencing reads.

FASTA and FASTQ are rather similar, but FASTQ is almost always used for storing sequencing reads (with associated quality values), whereas FASTA is used for storing all kinds of DNA, RNA or protein sequences (without associated quality values).

Reading FASTQ with python

Download a sample small FASTQ file:

#!curl -sLO https://zenodo.org/records/10602772/files/fastq_single_end_long.fq

!curl -sLO https://zenodo.org/records/10602772/files/fastq_single_end_short.fq

Now we will use a very simple Python function to read this file and load fastq data into a list.

Understanding the `parse_fastq()` function

📝 New Python concepts: This function introduces function definitions, while loops, string methods, and file handling.

📚 Think Python: Ch. 9 - Lists (lists, append) | Ch. 11 - Tuples (tuples)

Line-by-line breakdown:

Line 1: def parse_fastq(fh): — def creates a function named parse_fastq. fh is a parameter—a placeholder for whatever file handle we pass in when calling the function.
Line 3: reads = [] — Creates an empty list to store our results. (Ch. 9: Lists)
Line 4: while True: — Starts an infinite loop. We’ll use break to exit when done.
Line 5: first_line = fh.readline() — readline() reads one line from the file and moves to the next.
Line 6: if len(first_line) == 0: — When we reach end-of-file, readline() returns an empty string (length 0).
Line 7: break — Exits the while loop immediately.
Lines 8-9: if not first_line.startswith('@'): — startswith('@') returns True if the string begins with @. If not, something’s wrong with the file.
Line 10: name = first_line[1:].rstrip() — [1:] is slicing—takes everything from position 1 onward (skips the @). rstrip() removes trailing whitespace/newlines. (Ch. 9: Slicing)
Lines 11-13: seq = fh.readline().rstrip() — Reads the sequence line, the + line (ignored), and quality line.
Line 14: reads.append((name, seq, qual)) — Creates a tuple (name, seq, qual) and adds it to our list. (Ch. 9: append; Ch. 11: Tuples)
Line 15: return reads — Sends the completed list back to whoever called the function.
Line 17: with open(...) as fq: — Context manager—opens file safely and auto-closes it when done.
Line 20: for read in reads: — Loops through each tuple in the list. (Ch. 9: Iteration)

def parse_fastq(fh):
    """ Parse reads from a FASTQ filehandle.  For each read, we
        return a name, nucleotide-string, quality-string triple. """
    reads = []
    while True:
        first_line = fh.readline()
        if len(first_line) == 0:
            break  # end of file
        if not first_line.startswith('@'):
            raise ValueError(f"Expected '@' at start of FASTQ record, got: {first_line[:20]!r}")
        name = first_line[1:].rstrip()
        seq = fh.readline().rstrip()
        fh.readline()  # ignore line starting with +
        qual = fh.readline().rstrip()
        reads.append((name, seq, qual))
    return reads

with open('fastq_single_end_short.fq','r') as fq:
    reads = parse_fastq(fq)

for read in reads:
    print(read)

('ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1', 'AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG', 'BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF')
('ERR294379.136275489 HS24_09441:8:2311:1917:99340#42/1', 'CTTAAGTATTTTGAAAGTTAACATAAGTTATTCTCAGAGAGACTGCTTTT', '@@AHFF?EEDEAF?FEEGEFD?GGFEFGECGE?9H?EEABFAG9@CDGGF')
('ERR294379.97291341 HS24_09441:8:2201:10397:52549#42/1', 'GGCTGCCATCAGTGAGCAAGTAAGAATTTGCAGAAATTTATTAGCACACT', 'CDAF<FFDEHEFDDFEEFDGDFCHD=GHG<GEDHDGJFHEFFGEFEE@GH')

The nucleotide string can sometimes contain the character “N”. N essentially means “no confidence.” The sequencer knows there’s a nucleotide there but doesn’t know whether it’s an A, C, G or T.

📝 A note on while True: In Python, the while loop is used to repeatedly execute a block of code as long as a certain condition is true. The while True statement is a special case where the loop will run indefinitely until a break statement is encountered inside the loop. It is important to be careful when using while True loops, as they will run indefinitely if a break statement is not included. This can cause the program to crash or hang, if not handled properly.

💡 Production tip: In real-world applications, use BioPython’s SeqIO.parse() instead of writing your own parser. It handles edge cases and supports many file formats.

Read name

Read names often contain information about:

The scientific study for which the read was sequenced. E.g. the string ERR294379 (an SRA accession number) in the read names correspond to this study.
The sequencing instrument, and the exact part of the sequencing instrument, where the DNA was sequenced. See the FASTQ format Wikipedia article for specifics on how the Illumina software encodes this information.
Whether the read is part of a paired-end read and, if so, which end it is. The /1 you see at the end of the read names above indicates the read is the first end from a paired-end read.

Quality values

Quality values are probabilities. Each nucleotide in each sequencing read has an associated quality value. A nucleotide quality value encodes the probability that the nucleotide was incorrectly called by the sequencing instrument and its software. If the nucleotide is A, the corresponding quality value encodes the probability that the nucleotide at that position is actually not an A.

Quality values are encoded in two senses: first, the relevant probabilities are re-scaled using the Phred scale, which is a negative log scale. In other words if p is the probability that the nucleotide was incorrectly called, we encode this as Q where Q = -10 * log10(p).

💡 Why Phred scale? Error probabilities span a huge range (0.1 to 0.00001). The log transformation compresses this into manageable integers (10 to 50) that fit in a single ASCII character.

For example: - If Q = 30, then p = 0.001, a 1-in-1000 chance that the nucleotide is wrong - If Q = 20, then p = 0.01, a 1-in-100 chance - If Q = 10, then p = 0.1, a 1-in-10 chance

Second, scaled quality values are rounded to the nearest integer and encoded using ASCII printable characters. For example, using the Phred33 encoding (which is by far the most common), a Q of 30 is encoded as the ASCII character with code 33 + 30 = 63, which is “?”. A Q of 20 is encoded as the ASCII character with code 33 + 20 = 53, which is “5”.

Let’s define some relevant Python functions:

Understanding the quality conversion functions

These four functions convert between different representations of quality scores. Here’s what each one does:

📝 New Python concepts: ord() and chr() are built-in functions for working with ASCII characters. Exponentiation (**) raises a number to a power.

phred33_to_q() — Character → Q score. ord(qual) returns the ASCII code number for a character (e.g., ord('?') → 63). Subtract 33 to get Q.
q_to_phred33() — Q score → Character. chr(Q + 33) does the reverse—converts a number to its ASCII character (e.g., chr(63) → ‘?’).
q_to_p() — Q score → Probability. 10.0 ** (-0.1 * Q) is the formula p = 10^(-Q/10). We use 10.0 (float) to ensure decimal division.
p_to_q() — Probability → Q score. math.log10(p) computes log base 10. int(round(...)) rounds to nearest integer then converts to int.

Example walkthrough for Q=30:

q_to_p(30) = 10.0 ** (-0.1 * 30) = 10.0 ** (-3) = 0.001

This means a 1-in-1000 chance the base call is wrong.

def phred33_to_q(qual):
    """ Turn Phred+33 ASCII-encoded quality into Phred-scaled integer """
    return ord(qual)-33

def q_to_phred33(Q):
    """ Turn Phred-scaled integer into Phred+33 ASCII-encoded quality """
    return chr(Q + 33)

def q_to_p(Q):
    """ Turn Phred-scaled integer into error probability """
    return 10.0 ** (-0.1 * Q)

def p_to_q(p):
    """ Turn error probability into Phred-scaled integer """
    import math
    return int(round(-10.0 * math.log10(p)))

# Here are the examples discussed above

# Convert Qs into ps
q_to_p(30), q_to_p(20), q_to_p(10)

(0.001, 0.01, 0.1)

p_to_q(0.00011) # note that the result is rounded

q_to_phred33(30), q_to_phred33(20)

('?', '5')

To convert an entire string of Phred33-encoded quality values into the corresponding Q or p values, we can do the following.

Understanding this code block

📝 New Python concepts: StringIO lets you treat a string as if it were a file. Tuple unpacking assigns multiple variables at once. map() applies a function to every element.

📚 Think Python: Ch. 9 - Lists | Ch. 11 - Tuples (tuple assignment)

Line 1: from io import StringIO — Imports StringIO from the io module—lets us treat a string like a file.
Lines 3-7: fastq_string = '''...''' — Triple quotes create a multi-line string containing one FASTQ record.
Line 10: parse_fastq(StringIO(fastq_string)) — StringIO(fastq_string) wraps our string so parse_fastq() can read it like a file.
Line 10: ...[0] — Gets the first (and only) read from the returned list. (Ch. 9: Indexing)
Line 10: name, seq, qual = ... — Tuple unpacking—the tuple (name, seq, qual) is split into 3 separate variables. (Ch. 11: Tuple Assignment)
Line 11: list(map(phred33_to_q, qual)) — map() applies phred33_to_q to each character in qual. list() converts the result to a list.
Line 12: list(map(q_to_p, q_string)) — Same pattern—applies q_to_p to each Q score.

Why StringIO? Our parse_fastq() function expects a file handle (something it can call .readline() on). StringIO makes a string behave like a file, so we can test our parser without creating an actual file.

from io import StringIO

fastq_string = '''@ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1
AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG
+
BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF
'''

# Take the first read from the small example above
name, seq, qual = parse_fastq(StringIO(fastq_string))[0]
q_string = list(map(phred33_to_q, qual))
p_string = list(map(q_to_p, q_string))
print(q_string)
print(p_string)

[33, 35, 35, 36, 36, 37, 30, 37, 38, 37, 37, 37, 39, 38, 37, 37, 39, 39, 38, 39, 38, 38, 39, 34, 39, 31, 38, 39, 39, 39, 38, 37, 32, 39, 36, 38, 37, 36, 39, 38, 36, 37, 38, 39, 34, 34, 38, 38, 38, 37]
[0.000501187233627272, 0.00031622776601683794, 0.00031622776601683794, 0.00025118864315095795, 0.00025118864315095795, 0.00019952623149688788, 0.001, 0.00019952623149688788, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.0001258925411794166, 0.00015848931924611126, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0001258925411794166, 0.0007943282347242813, 0.00015848931924611126, 0.0001258925411794166, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.000630957344480193, 0.0001258925411794166, 0.00025118864315095795, 0.00015848931924611126, 0.00019952623149688788, 0.00025118864315095795, 0.0001258925411794166, 0.00015848931924611126, 0.00025118864315095795, 0.00019952623149688788, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0003981071705534969, 0.00015848931924611126, 0.00015848931924611126, 0.00015848931924611126, 0.00019952623149688788]

You might wonder how the sequencer and its software can know the probability that a nucleotide is incorrectly called. It can’t; this number is just an estimate. To describe exactly how it’s estimated is beyond the scope of this notebook; if you’re interested, search for academic papers with “base calling” in the title. Here’s a helpful video by Rafa Irizarry.

A final note: other ways of encoding quality values were proposed and used in the past. For example, Phred64 uses an ASCII offset of 64 instead of 33, and Solexa64 uses “odds” instead of the probability p. But Phred33 is by far the most common today and you will likely never have to worry about this.

📝 A note on map(): In Python, the map() function is used to apply a given function to all elements of an iterable (such as a list, tuple, or string) and return an iterator that yields the results. The map() function takes two arguments: a function that is to be applied to each element of the iterable, and an iterable on which the function is to be applied. Example: squared_numbers = map(lambda x: x**2, [1, 2, 3, 4, 5]) returns [1, 4, 9, 16, 25].

How good are my reads?

Let’s analyze the quality distribution across all reads in our FASTQ file. First, reload the reads:

# Reload reads from file
with open('fastq_single_end_short.fq','r') as fq:
    reads = parse_fastq(fq)

Understanding nested list comprehensions

The next cell uses a nested list comprehension—a compact way to create a list of lists. Let’s break it down:

📝 New Python concept: List comprehensions are a concise way to create lists. The syntax [expression for item in iterable] creates a new list by applying expression to each item.

📚 Think Python: Ch. 9 - Lists | Ch. 11 - Tuples (tuple indexing)

The code: [[phred33_to_q(q) for q in read[2]] for read in reads]

Step-by-step breakdown:

Outer loop: for read in reads — iterates through each read tuple (Ch. 9: Iteration)
Access quality string: read[2] — gets the 3rd element (quality string) from the tuple (Ch. 11: Tuple indexing)
Inner loop: for q in read[2] — iterates through each character in the quality string
Transform: phred33_to_q(q) — converts each character to a Q score

Equivalent using regular loops:

run_qualities = []
for read in reads:                    # outer loop
    qual_scores = []
    for q in read[2]:                 # inner loop over quality string
        qual_scores.append(phred33_to_q(q))
    run_qualities.append(qual_scores)

Result structure: A list of lists, where each inner list contains the Q scores for one read. (Ch. 9: Nested lists)

# Extract qualities and convert to numerical values using list comprehension
run_qualities = [[phred33_to_q(q) for q in read[2]] for read in reads]

Understanding NumPy arrays and transpose

We need to transpose this matrix so that instead of rows = reads and columns = positions, we get rows = positions and columns = reads. This lets us compute statistics per position across all reads:

Before (rows=reads):     After (rows=positions):
Read1: [Q1, Q2, Q3]      Pos1: [Q1_r1, Q1_r2, Q1_r3]
Read2: [Q1, Q2, Q3]  =>  Pos2: [Q2_r1, Q2_r2, Q2_r3]
Read3: [Q1, Q2, Q3]      Pos3: [Q3_r1, Q3_r2, Q3_r3]

📝 New Python concepts: NumPy is a library for numerical computing. np.array() converts lists to arrays. .T transposes (swaps rows and columns).

import numpy as np — Imports NumPy library, conventionally aliased as np.
np.array(run_qualities) — Converts our list of lists into a NumPy array—a grid-like structure optimized for math operations.
.T — The transpose attribute—swaps rows and columns. Equivalent to np.transpose(array).

Why NumPy? Python lists are flexible but slow for math. NumPy arrays are optimized for numerical operations and can compute statistics across entire rows/columns efficiently.

# Transpose using numpy
import numpy as np
base_qualities = np.array(run_qualities).T

Understanding pandas DataFrames

Now we’ll compute summary statistics for each position and store them in a pandas DataFrame—a table-like structure perfect for data analysis.

📝 New Python concepts: pandas is a data analysis library. pd.DataFrame() creates a table from a dictionary. range() generates a sequence of numbers.

📚 Think Python: Ch. 9 - Lists (list comprehensions) | Ch. 10 - Dictionaries (dict syntax)

import pandas as pd — Imports pandas, conventionally aliased as pd.
pd.DataFrame({...}) — Creates a table where each key becomes a column name and each value becomes column data. (Ch. 10: Dictionaries)
range(len(base_qualities)) — Generates numbers 0, 1, 2, … up to the number of positions. These become row indices.
[np.mean(b) for b in base_qualities] — List comprehension that computes the mean of each row (position) in our transposed array. (Ch. 9: Lists)
np.quantile(b, .25) — Computes the 25th percentile—the value below which 25% of data falls.

NumPy statistics functions:

np.mean(b) — Average (sum divided by count)
np.median(b) — Middle value when sorted
np.quantile(b, .25) — 25th percentile
np.quantile(b, .75) — 75th percentile
np.min(b), np.max(b) — Smallest and largest values

# Compute per-position statistics using pandas
import pandas as pd

stats_df = pd.DataFrame({
    'base': range(len(base_qualities)),
    'mean': [np.mean(b) for b in base_qualities],
    'median': [np.median(b) for b in base_qualities],
    'q25': [np.quantile(b, .25) for b in base_qualities],
    'q75': [np.quantile(b, .75) for b in base_qualities],
    'min': [np.min(b) for b in base_qualities],
    'max': [np.max(b) for b in base_qualities]
})

Understanding Altair visualization

The plot below shows quality distribution per position—similar to FastQC output. Each position shows:

Black line: min to max range
Green bar: interquartile range (25th to 75th percentile)
Red tick: median quality

📝 New Python concepts: Altair is a declarative visualization library. You describe what you want to show, not how to draw it. Charts are built by chaining methods.

import altair as alt — Imports Altair, conventionally aliased as alt.
alt.Chart(stats_df) — Creates a chart object from our DataFrame.
.encode(alt.X('base:Q', ...)) — Maps the base column to the x-axis. :Q means “quantitative” (numeric).
.properties(width=800, height=200) — Sets the chart dimensions in pixels.
.mark_tick() — Draws small horizontal ticks (for median).
.mark_rule() — Draws lines/bars (for ranges).
alt.Y('median:Q') — Maps median column to y-axis.
alt.Y2('q75:Q') — Sets the end point for a range (used with mark_rule).
median + q + min_max — The + operator layers multiple charts on top of each other.

# Plot quality per position
import altair as alt

base = alt.Chart(stats_df).encode(
    alt.X('base:Q', title="Position in the read")
).properties(
    width=800,
    height=200)

median = base.mark_tick(color='red',orient='horizontal').encode(
    alt.Y('median:Q', title="Phred quality score"),
)

q = base.mark_rule(color='green',opacity=.5,strokeWidth=10).encode(
    alt.Y('q25:Q'),
    alt.Y2('q75:Q')
)

min_max = base.mark_rule(color='black').encode(
        alt.Y('min:Q'),
        alt.Y2('max:Q')
)

median + q + min_max

Other comments

In all the examples above, the reads in the FASTQ file are all the same length. This is not necessarily the case though it is usually true for datasets generated by sequencing-by-synthesis instruments. FASTQ files can contain reads of various lengths.

FASTQ files often have the extension .fastq or .fq.

Homework

10 min to Pandas