#!curl -sLO https://zenodo.org/records/10602772/files/fastq_single_end_long.fq
!curl -sLO https://zenodo.org/records/10602772/files/fastq_single_end_short.fqLecture 6: FASTQ and Quality Scores
💡 Attribution: This material uses examples from notebooks developed by Ben Langmead
📚 Python concepts in this lecture
This lecture applies concepts from Think Python (3rd edition):
Chapter Concepts used in this lecture Ch. 9 - Lists Creating lists, append(), slicing[1:], iteration withfor, nested lists, list comprehensionsCh. 10 - Dictionaries Dictionary syntax {'key': value}, lists as dictionary values (used in pandas DataFrames)Ch. 11 - Tuples Creating tuples (a, b, c), tuple unpackingname, seq, qual = ..., tuple indexingread[2]Look for (Ch. N) annotations in the code explanations below.
FASTQ
This notebook explores FASTQ, the most common format for storing sequencing reads.
FASTA and FASTQ are rather similar, but FASTQ is almost always used for storing sequencing reads (with associated quality values), whereas FASTA is used for storing all kinds of DNA, RNA or protein sequences (without associated quality values).
Reading FASTQ with python
Download a sample small FASTQ file:
Now we will use a very simple Python function to read this file and load fastq data into a list.
Understanding the parse_fastq() function
📝 New Python concepts: This function introduces function definitions, while loops, string methods, and file handling.
📚 Think Python: Ch. 9 - Lists (lists, append) | Ch. 11 - Tuples (tuples)
Line-by-line breakdown:
- Line 1:
def parse_fastq(fh):—defcreates a function namedparse_fastq.fhis a parameter—a placeholder for whatever file handle we pass in when calling the function. - Line 3:
reads = []— Creates an empty list to store our results. (Ch. 9: Lists) - Line 4:
while True:— Starts an infinite loop. We’ll usebreakto exit when done. - Line 5:
first_line = fh.readline()—readline()reads one line from the file and moves to the next. - Line 6:
if len(first_line) == 0:— When we reach end-of-file,readline()returns an empty string (length 0). - Line 7:
break— Exits thewhileloop immediately. - Lines 8-9:
if not first_line.startswith('@'):—startswith('@')returnsTrueif the string begins with@. If not, something’s wrong with the file. - Line 10:
name = first_line[1:].rstrip()—[1:]is slicing—takes everything from position 1 onward (skips the@).rstrip()removes trailing whitespace/newlines. (Ch. 9: Slicing) - Lines 11-13:
seq = fh.readline().rstrip()— Reads the sequence line, the+line (ignored), and quality line. - Line 14:
reads.append((name, seq, qual))— Creates a tuple(name, seq, qual)and adds it to our list. (Ch. 9: append; Ch. 11: Tuples) - Line 15:
return reads— Sends the completed list back to whoever called the function. - Line 17:
with open(...) as fq:— Context manager—opens file safely and auto-closes it when done. - Line 20:
for read in reads:— Loops through each tuple in the list. (Ch. 9: Iteration)
def parse_fastq(fh):
""" Parse reads from a FASTQ filehandle. For each read, we
return a name, nucleotide-string, quality-string triple. """
reads = []
while True:
first_line = fh.readline()
if len(first_line) == 0:
break # end of file
if not first_line.startswith('@'):
raise ValueError(f"Expected '@' at start of FASTQ record, got: {first_line[:20]!r}")
name = first_line[1:].rstrip()
seq = fh.readline().rstrip()
fh.readline() # ignore line starting with +
qual = fh.readline().rstrip()
reads.append((name, seq, qual))
return reads
with open('fastq_single_end_short.fq','r') as fq:
reads = parse_fastq(fq)
for read in reads:
print(read)('ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1', 'AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG', 'BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF')
('ERR294379.136275489 HS24_09441:8:2311:1917:99340#42/1', 'CTTAAGTATTTTGAAAGTTAACATAAGTTATTCTCAGAGAGACTGCTTTT', '@@AHFF?EEDEAF?FEEGEFD?GGFEFGECGE?9H?EEABFAG9@CDGGF')
('ERR294379.97291341 HS24_09441:8:2201:10397:52549#42/1', 'GGCTGCCATCAGTGAGCAAGTAAGAATTTGCAGAAATTTATTAGCACACT', 'CDAF<FFDEHEFDDFEEFDGDFCHD=GHG<GEDHDGJFHEFFGEFEE@GH')
The nucleotide string can sometimes contain the character “N”. N essentially means “no confidence.” The sequencer knows there’s a nucleotide there but doesn’t know whether it’s an A, C, G or T.
📝 A note on
while True: In Python, the while loop is used to repeatedly execute a block of code as long as a certain condition is true. Thewhile Truestatement is a special case where the loop will run indefinitely until abreakstatement is encountered inside the loop. It is important to be careful when using while True loops, as they will run indefinitely if a break statement is not included. This can cause the program to crash or hang, if not handled properly.
💡 Production tip: In real-world applications, use BioPython’s
SeqIO.parse()instead of writing your own parser. It handles edge cases and supports many file formats.
Read name
Read names often contain information about:
- The scientific study for which the read was sequenced. E.g. the string
ERR294379(an SRA accession number) in the read names correspond to this study. - The sequencing instrument, and the exact part of the sequencing instrument, where the DNA was sequenced. See the FASTQ format Wikipedia article for specifics on how the Illumina software encodes this information.
- Whether the read is part of a paired-end read and, if so, which end it is. The
/1you see at the end of the read names above indicates the read is the first end from a paired-end read.
Quality values
Quality values are probabilities. Each nucleotide in each sequencing read has an associated quality value. A nucleotide quality value encodes the probability that the nucleotide was incorrectly called by the sequencing instrument and its software. If the nucleotide is A, the corresponding quality value encodes the probability that the nucleotide at that position is actually not an A.
Quality values are encoded in two senses: first, the relevant probabilities are re-scaled using the Phred scale, which is a negative log scale. In other words if p is the probability that the nucleotide was incorrectly called, we encode this as Q where Q = -10 * log10(p).
💡 Why Phred scale? Error probabilities span a huge range (0.1 to 0.00001). The log transformation compresses this into manageable integers (10 to 50) that fit in a single ASCII character.
For example: - If Q = 30, then p = 0.001, a 1-in-1000 chance that the nucleotide is wrong - If Q = 20, then p = 0.01, a 1-in-100 chance - If Q = 10, then p = 0.1, a 1-in-10 chance
Second, scaled quality values are rounded to the nearest integer and encoded using ASCII printable characters. For example, using the Phred33 encoding (which is by far the most common), a Q of 30 is encoded as the ASCII character with code 33 + 30 = 63, which is “?”. A Q of 20 is encoded as the ASCII character with code 33 + 20 = 53, which is “5”.
Let’s define some relevant Python functions:
Understanding the quality conversion functions
These four functions convert between different representations of quality scores. Here’s what each one does:
📝 New Python concepts:
ord()andchr()are built-in functions for working with ASCII characters. Exponentiation (**) raises a number to a power.
phred33_to_q()— Character → Q score.ord(qual)returns the ASCII code number for a character (e.g.,ord('?')→ 63). Subtract 33 to get Q.q_to_phred33()— Q score → Character.chr(Q + 33)does the reverse—converts a number to its ASCII character (e.g.,chr(63)→ ‘?’).q_to_p()— Q score → Probability.10.0 ** (-0.1 * Q)is the formula p = 10^(-Q/10). We use10.0(float) to ensure decimal division.p_to_q()— Probability → Q score.math.log10(p)computes log base 10.int(round(...))rounds to nearest integer then converts to int.
Example walkthrough for Q=30:
q_to_p(30) = 10.0 ** (-0.1 * 30) = 10.0 ** (-3) = 0.001
This means a 1-in-1000 chance the base call is wrong.
def phred33_to_q(qual):
""" Turn Phred+33 ASCII-encoded quality into Phred-scaled integer """
return ord(qual)-33
def q_to_phred33(Q):
""" Turn Phred-scaled integer into Phred+33 ASCII-encoded quality """
return chr(Q + 33)
def q_to_p(Q):
""" Turn Phred-scaled integer into error probability """
return 10.0 ** (-0.1 * Q)
def p_to_q(p):
""" Turn error probability into Phred-scaled integer """
import math
return int(round(-10.0 * math.log10(p)))# Here are the examples discussed above
# Convert Qs into ps
q_to_p(30), q_to_p(20), q_to_p(10)(0.001, 0.01, 0.1)
p_to_q(0.00011) # note that the result is rounded40
q_to_phred33(30), q_to_phred33(20)('?', '5')
To convert an entire string of Phred33-encoded quality values into the corresponding Q or p values, we can do the following.
Understanding this code block
📝 New Python concepts:
StringIOlets you treat a string as if it were a file. Tuple unpacking assigns multiple variables at once.map()applies a function to every element.📚 Think Python: Ch. 9 - Lists | Ch. 11 - Tuples (tuple assignment)
- Line 1:
from io import StringIO— ImportsStringIOfrom theiomodule—lets us treat a string like a file. - Lines 3-7:
fastq_string = '''...'''— Triple quotes create a multi-line string containing one FASTQ record. - Line 10:
parse_fastq(StringIO(fastq_string))—StringIO(fastq_string)wraps our string soparse_fastq()can read it like a file. - Line 10:
...[0]— Gets the first (and only) read from the returned list. (Ch. 9: Indexing) - Line 10:
name, seq, qual = ...— Tuple unpacking—the tuple(name, seq, qual)is split into 3 separate variables. (Ch. 11: Tuple Assignment) - Line 11:
list(map(phred33_to_q, qual))—map()appliesphred33_to_qto each character inqual.list()converts the result to a list. - Line 12:
list(map(q_to_p, q_string))— Same pattern—appliesq_to_pto each Q score.
Why StringIO? Our parse_fastq() function expects a file handle (something it can call .readline() on). StringIO makes a string behave like a file, so we can test our parser without creating an actual file.
from io import StringIO
fastq_string = '''@ERR294379.100739024 HS24_09441:8:2203:17450:94030#42/1
AGGGAGTCCACAGCACAGTCCAGACTCCCACCAGTTCTGACGAAATGATG
+
BDDEEF?FGFFFHGFFHHGHGGHCH@GHHHGFAHEGFEHGEFGHCCGGGF
'''
# Take the first read from the small example above
name, seq, qual = parse_fastq(StringIO(fastq_string))[0]
q_string = list(map(phred33_to_q, qual))
p_string = list(map(q_to_p, q_string))
print(q_string)
print(p_string)[33, 35, 35, 36, 36, 37, 30, 37, 38, 37, 37, 37, 39, 38, 37, 37, 39, 39, 38, 39, 38, 38, 39, 34, 39, 31, 38, 39, 39, 39, 38, 37, 32, 39, 36, 38, 37, 36, 39, 38, 36, 37, 38, 39, 34, 34, 38, 38, 38, 37]
[0.000501187233627272, 0.00031622776601683794, 0.00031622776601683794, 0.00025118864315095795, 0.00025118864315095795, 0.00019952623149688788, 0.001, 0.00019952623149688788, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.00019952623149688788, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.0001258925411794166, 0.00015848931924611126, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0001258925411794166, 0.0007943282347242813, 0.00015848931924611126, 0.0001258925411794166, 0.0001258925411794166, 0.0001258925411794166, 0.00015848931924611126, 0.00019952623149688788, 0.000630957344480193, 0.0001258925411794166, 0.00025118864315095795, 0.00015848931924611126, 0.00019952623149688788, 0.00025118864315095795, 0.0001258925411794166, 0.00015848931924611126, 0.00025118864315095795, 0.00019952623149688788, 0.00015848931924611126, 0.0001258925411794166, 0.0003981071705534969, 0.0003981071705534969, 0.00015848931924611126, 0.00015848931924611126, 0.00015848931924611126, 0.00019952623149688788]
You might wonder how the sequencer and its software can know the probability that a nucleotide is incorrectly called. It can’t; this number is just an estimate. To describe exactly how it’s estimated is beyond the scope of this notebook; if you’re interested, search for academic papers with “base calling” in the title. Here’s a helpful video by Rafa Irizarry.
A final note: other ways of encoding quality values were proposed and used in the past. For example, Phred64 uses an ASCII offset of 64 instead of 33, and Solexa64 uses “odds” instead of the probability p. But Phred33 is by far the most common today and you will likely never have to worry about this.
📝 A note on
map(): In Python, themap()function is used to apply a given function to all elements of an iterable (such as a list, tuple, or string) and return an iterator that yields the results. Themap()function takes two arguments: a function that is to be applied to each element of the iterable, and an iterable on which the function is to be applied. Example:squared_numbers = map(lambda x: x**2, [1, 2, 3, 4, 5])returns[1, 4, 9, 16, 25].
How good are my reads?
Let’s analyze the quality distribution across all reads in our FASTQ file. First, reload the reads:
# Reload reads from file
with open('fastq_single_end_short.fq','r') as fq:
reads = parse_fastq(fq)Understanding nested list comprehensions
The next cell uses a nested list comprehension—a compact way to create a list of lists. Let’s break it down:
📝 New Python concept: List comprehensions are a concise way to create lists. The syntax
[expression for item in iterable]creates a new list by applyingexpressionto eachitem.📚 Think Python: Ch. 9 - Lists | Ch. 11 - Tuples (tuple indexing)
The code: [[phred33_to_q(q) for q in read[2]] for read in reads]
Step-by-step breakdown:
- Outer loop:
for read in reads— iterates through each read tuple (Ch. 9: Iteration) - Access quality string:
read[2]— gets the 3rd element (quality string) from the tuple (Ch. 11: Tuple indexing) - Inner loop:
for q in read[2]— iterates through each character in the quality string - Transform:
phred33_to_q(q)— converts each character to a Q score
Equivalent using regular loops:
run_qualities = []
for read in reads: # outer loop
qual_scores = []
for q in read[2]: # inner loop over quality string
qual_scores.append(phred33_to_q(q))
run_qualities.append(qual_scores)Result structure: A list of lists, where each inner list contains the Q scores for one read. (Ch. 9: Nested lists)
# Extract qualities and convert to numerical values using list comprehension
run_qualities = [[phred33_to_q(q) for q in read[2]] for read in reads]Understanding NumPy arrays and transpose
We need to transpose this matrix so that instead of rows = reads and columns = positions, we get rows = positions and columns = reads. This lets us compute statistics per position across all reads:
Before (rows=reads): After (rows=positions):
Read1: [Q1, Q2, Q3] Pos1: [Q1_r1, Q1_r2, Q1_r3]
Read2: [Q1, Q2, Q3] => Pos2: [Q2_r1, Q2_r2, Q2_r3]
Read3: [Q1, Q2, Q3] Pos3: [Q3_r1, Q3_r2, Q3_r3]
📝 New Python concepts: NumPy is a library for numerical computing.
np.array()converts lists to arrays..Ttransposes (swaps rows and columns).
import numpy as np— Imports NumPy library, conventionally aliased asnp.np.array(run_qualities)— Converts our list of lists into a NumPy array—a grid-like structure optimized for math operations..T— The transpose attribute—swaps rows and columns. Equivalent tonp.transpose(array).
Why NumPy? Python lists are flexible but slow for math. NumPy arrays are optimized for numerical operations and can compute statistics across entire rows/columns efficiently.
# Transpose using numpy
import numpy as np
base_qualities = np.array(run_qualities).TUnderstanding pandas DataFrames
Now we’ll compute summary statistics for each position and store them in a pandas DataFrame—a table-like structure perfect for data analysis.
📝 New Python concepts: pandas is a data analysis library.
pd.DataFrame()creates a table from a dictionary.range()generates a sequence of numbers.📚 Think Python: Ch. 9 - Lists (list comprehensions) | Ch. 10 - Dictionaries (dict syntax)
import pandas as pd— Imports pandas, conventionally aliased aspd.pd.DataFrame({...})— Creates a table where each key becomes a column name and each value becomes column data. (Ch. 10: Dictionaries)range(len(base_qualities))— Generates numbers 0, 1, 2, … up to the number of positions. These become row indices.[np.mean(b) for b in base_qualities]— List comprehension that computes the mean of each row (position) in our transposed array. (Ch. 9: Lists)np.quantile(b, .25)— Computes the 25th percentile—the value below which 25% of data falls.
NumPy statistics functions:
np.mean(b)— Average (sum divided by count)np.median(b)— Middle value when sortednp.quantile(b, .25)— 25th percentilenp.quantile(b, .75)— 75th percentilenp.min(b),np.max(b)— Smallest and largest values
# Compute per-position statistics using pandas
import pandas as pd
stats_df = pd.DataFrame({
'base': range(len(base_qualities)),
'mean': [np.mean(b) for b in base_qualities],
'median': [np.median(b) for b in base_qualities],
'q25': [np.quantile(b, .25) for b in base_qualities],
'q75': [np.quantile(b, .75) for b in base_qualities],
'min': [np.min(b) for b in base_qualities],
'max': [np.max(b) for b in base_qualities]
})Understanding Altair visualization
The plot below shows quality distribution per position—similar to FastQC output. Each position shows:
- Black line: min to max range
- Green bar: interquartile range (25th to 75th percentile)
- Red tick: median quality
📝 New Python concepts: Altair is a declarative visualization library. You describe what you want to show, not how to draw it. Charts are built by chaining methods.
import altair as alt— Imports Altair, conventionally aliased asalt.alt.Chart(stats_df)— Creates a chart object from our DataFrame..encode(alt.X('base:Q', ...))— Maps thebasecolumn to the x-axis.:Qmeans “quantitative” (numeric)..properties(width=800, height=200)— Sets the chart dimensions in pixels..mark_tick()— Draws small horizontal ticks (for median)..mark_rule()— Draws lines/bars (for ranges).alt.Y('median:Q')— Mapsmediancolumn to y-axis.alt.Y2('q75:Q')— Sets the end point for a range (used withmark_rule).median + q + min_max— The+operator layers multiple charts on top of each other.
# Plot quality per position
import altair as alt
base = alt.Chart(stats_df).encode(
alt.X('base:Q', title="Position in the read")
).properties(
width=800,
height=200)
median = base.mark_tick(color='red',orient='horizontal').encode(
alt.Y('median:Q', title="Phred quality score"),
)
q = base.mark_rule(color='green',opacity=.5,strokeWidth=10).encode(
alt.Y('q25:Q'),
alt.Y2('q75:Q')
)
min_max = base.mark_rule(color='black').encode(
alt.Y('min:Q'),
alt.Y2('max:Q')
)
median + q + min_maxOther comments
In all the examples above, the reads in the FASTQ file are all the same length. This is not necessarily the case though it is usually true for datasets generated by sequencing-by-synthesis instruments. FASTQ files can contain reads of various lengths.
FASTQ files often have the extension .fastq or .fq.
