Open In Colab

Lecture 5: Parsing FASTA Files

Working with Biological Sequences

Working with biological sequence data using strings, conditionals, and iteration

This lecture demonstrates concepts from Think Python chapters 5, 7, and 8.

Methods vs Functions: A function is called by name: len(seq), print(x). A method is called on an object using dot notation: seq.count('A'), line.startswith('>'). Methods are functions that “belong to” a specific type—string methods work on strings, list methods work on lists, etc.

What is FASTA format?

FASTA is one of the most common formats for storing biological sequences. Each sequence entry consists of:

A header line starting with >, followed by the sequence name and optional description
One or more lines containing the sequence itself

Here’s an example with three DNA sequences:

fasta_data = """>seq1 E. coli promoter region
ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
>seq2 Yeast regulatory element
ATATATATGCATGCATATATGCATGCATATATATAT
>seq3 Human CpG island
GCGCGCATATATGCGCGCATATGCGCGCGCATATAT
"""

print(fasta_data)

>seq1 E. coli promoter region
ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
>seq2 Yeast regulatory element
ATATATATGCATGCATATATGCATGCATATATATAT
>seq3 Human CpG island
GCGCGCATATATGCGCGCATATGCGCGCGCATATAT

Note on triple quotes: In Python, triple quotes (""" or ''') allow you to create multi-line strings. This is perfect for storing FASTA data directly in your code without needing to read from a file.

Our Goal: A Dictionary

We want to convert this FASTA data into a Python dictionary where: - Keys are sequence names (e.g., 'seq1', 'seq2', 'seq3') - Values are the corresponding sequences

{
    'seq1': 'ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG',
    'seq2': 'ATATATATGCATGCATATATGCATGCATATATATAT',
    'seq3': 'GCGCGCATATATGCGCGCATATGCGCGCGCATATAT'
}

Why a dictionary? Because it gives us fast lookup by name. If we want the sequence for seq2, we simply write sequences['seq2'].

Formatted String Literals (f-strings)

Before we start parsing, let’s learn about f-strings—a convenient way to embed variables and expressions inside strings. An f-string starts with f before the opening quote, and uses curly braces {} to insert values:

# Basic f-string usage
name = "seq1"
length = 42

# Without f-strings (concatenation)
print("Sequence " + name + " has length " + str(length))

# With f-strings (cleaner!)
print(f"Sequence {name} has length {length}")

Sequence seq1 has length 42
Sequence seq1 has length 42

# F-strings can contain expressions, not just variables
sequence = "ATGCGATC"

print(f"Sequence: {sequence}")
print(f"Length: {len(sequence)}")
print(f"First 4 bases: {sequence[:4]}")
print(f"GC count: {sequence.count('G') + sequence.count('C')}")

Sequence: ATGCGATC
Length: 8
First 4 bases: ATGC
GC count: 4

Why f-strings? They’re more readable than string concatenation ("text" + var + "more"), and you don’t need to manually convert numbers to strings with str(). We’ll use f-strings throughout this lecture for output.

Parsing Logic: Step by Step

Step 1: Split into lines

First, we split our FASTA string into individual lines. We use two string methods chained together:

.strip() removes any leading/trailing whitespace (spaces, tabs, newlines) from the string. This is important because our multi-line string might have extra blank lines at the beginning or end.
.split('\n') breaks the string at each newline character (\n), returning a list of strings—one for each line.

The \n is an escape sequence representing a newline (line break). When you press Enter in a text file, it inserts a \n character.

# Let's see what strip() does
text_with_whitespace = "   \n\n  Hello World  \n\n   "
print(f"Before strip: '{text_with_whitespace}'")
print(f"After strip:  '{text_with_whitespace.strip()}'")

# And what split('\n') does
multiline = "Line 1\nLine 2\nLine 3"
print(f"\nOriginal: {multiline}")
print(f"After split: {multiline.split('\n')}")

Before strip: '   

  Hello World  

   '
After strip:  'Hello World'

Original: Line 1
Line 2
Line 3
After split: ['Line 1', 'Line 2', 'Line 3']

Now let’s apply this to our FASTA data. The enumerate() function gives us both the index and value as we loop:

lines = fasta_data.strip().split('\n')
for i, line in enumerate(lines):
    print(f"Line {i}: {line}")

Line 0: >seq1 E. coli promoter region
Line 1: ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
Line 2: >seq2 Yeast regulatory element
Line 3: ATATATATGCATGCATATATGCATGCATATATATAT
Line 4: >seq3 Human CpG island
Line 5: GCGCGCATATATGCGCGCATATGCGCGCGCATATAT

Step 2: Identify header lines

Header lines start with >. We can detect them using the .startswith() method, which returns True if a string begins with the specified prefix, and False otherwise.

This is cleaner than checking line[0] == '>' because: - It’s more readable - It won’t crash on empty strings (whereas line[0] would raise an IndexError)

for line in lines:
    if line.startswith('>'):
        print(f"HEADER: {line}")
    else:
        print(f"SEQUENCE: {line}")

HEADER: >seq1 E. coli promoter region
SEQUENCE: ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
HEADER: >seq2 Yeast regulatory element
SEQUENCE: ATATATATGCATGCATATATGCATGCATATATATAT
HEADER: >seq3 Human CpG island
SEQUENCE: GCGCGCATATATGCGCGCATATGCGCGCGCATATAT

Step 3: Extract the sequence name

The sequence name is the first “word” after the >. We can extract it by: 1. Removing the > character (using line[1:] to skip the first character) 2. Splitting on whitespace and taking the first element

About line[1:]: This is called slicing. The syntax string[start:end] extracts characters from index start up to (but not including) index end. When we omit end, it means “go to the end of the string.” So line[1:] means “everything from index 1 onwards”—i.e., skip the first character (the >).

About .split(): When called with no arguments, .split() splits on any whitespace (spaces, tabs, multiple spaces) and removes empty strings. This is more robust than .split(' ') which only splits on single spaces.

header = ">seq1 E. coli promoter region"

# Remove the '>' character
without_gt = header[1:]
print(f"After removing '>': {without_gt}")

# Split on whitespace
parts = without_gt.split()
print(f"Split into parts: {parts}")

# Take the first part (the name)
name = parts[0]
print(f"Sequence name: {name}")

After removing '>': seq1 E. coli promoter region
Split into parts: ['seq1', 'E.', 'coli', 'promoter', 'region']
Sequence name: seq1

First Attempt: A Naive Parser

Now let’s put it together. The basic idea is: - When we see a >, save the previous sequence (if any) and start a new one - When we see a sequence line, append it to the current sequence

Two important patterns to notice:

Truthiness test: if current_name: checks if current_name has a “truthy” value. In Python, None and empty strings "" are “falsy” (evaluate to False), while non-empty strings are “truthy”. This is cleaner than writing if current_name != None and current_name != "":
Augmented assignment: current_seq += line is shorthand for current_seq = current_seq + line. It appends line to the end of current_seq.

Here’s our first attempt:

# fasta_data = """>seq1 E. coli promoter region
# ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
# >seq2 Yeast regulatory element
# ATATATATGCATGCATATATGCATGCATATATATAT
# >seq3 Human CpG island
# GCGCGCATATATGCGCGCATATGCGCGCGCATATAT
# """

# CAUTION: This version has a bug!

sequences = {}
current_name = None
current_seq = ""

for line in fasta_data.strip().split('\n'):
    if line.startswith('>'):
        # Save the previous sequence (if there is one)
        if current_name:
            sequences[current_name] = current_seq
        # Start a new sequence
        current_name = line[1:].split()[0]
        current_seq = ""
    else:
        # Append to current sequence
        current_seq += line

print("Sequences found:", list(sequences.keys()))
print(f"Number of sequences: {len(sequences)}")

Sequences found: ['seq1', 'seq2']
Number of sequences: 2

The Last Sequence Problem

Wait! We have 3 sequences in our FASTA data, but only 2 were captured!

What happened to seq3?

Let’s trace through the logic:

Line	Action	`sequences` dict
`>seq1 E. coli promoter`	Start seq1	`{}`
`ATGCGATC...`	Append to seq1	`{}`
`>seq2 Yeast regulatory`	Save seq1, start seq2	`{'seq1': '...'}`
`ATATATA...`	Append to seq2	`{'seq1': '...'}`
`>seq3 Human CpG island`	Save seq2, start seq3	`{'seq1': '...', 'seq2': '...'}`
`GCGCGCA...`	Append to seq3	`{'seq1': '...', 'seq2': '...'}`
(end of file)	???	seq3 never saved!

The problem: We only save a sequence when we encounter the NEXT >. But the last sequence has no > after it!

The Fix: Don’t Forget the Last Sequence

After the loop ends, we need to check if there’s a pending sequence that hasn’t been saved yet:

sequences = {}
current_name = None
current_seq = ""

for line in fasta_data.strip().split('\n'):
    if line.startswith('>'):
        # Save the previous sequence (if there is one)
        if current_name:
            sequences[current_name] = current_seq
        # Start a new sequence
        current_name = line[1:].split()[0]
        current_seq = ""
    else:
        # Append to current sequence
        current_seq += line

# Don't forget the last sequence!
if current_name:
    sequences[current_name] = current_seq

print("Sequences found:", list(sequences.keys()))
print(f"Number of sequences: {len(sequences)}")

Sequences found: ['seq1', 'seq2', 'seq3']
Number of sequences: 3

Now we have all three sequences! Let’s verify by iterating through the dictionary.

About .items(): When iterating over a dictionary, for name, seq in sequences.items(): gives us both the key and value in each iteration. Other options: - for name in sequences: — just the keys - for seq in sequences.values(): — just the values - for name, seq in sequences.items(): — both (most useful)

for name, seq in sequences.items():
    print(f"{name}: {seq[:20]}... (length: {len(seq)})")

seq1: ATGCGATCGATCGATCGATC... (length: 41)
seq2: ATATATATGCATGCATATAT... (length: 36)
seq3: GCGCGCATATATGCGCGCAT... (length: 36)

Wrapping It in a Function

Let’s create a reusable function:

def parse_fasta(fasta_string):
    """
    Parse a FASTA-formatted string into a dictionary.
    
    Args:
        fasta_string: A string containing FASTA-formatted sequences
        
    Returns:
        A dictionary mapping sequence names to sequences
    """
    sequences = {}
    current_name = None
    current_seq = ""
    
    for line in fasta_string.strip().split('\n'):
        if line.startswith('>'):
            if current_name:
                sequences[current_name] = current_seq
            current_name = line[1:].split()[0]
            current_seq = ""
        else:
            current_seq += line
    
    # Don't forget the last sequence!
    if current_name:
        sequences[current_name] = current_seq
    
    return sequences

# Test our function
result = parse_fasta(fasta_data)
print(result)

{'seq1': 'ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG', 'seq2': 'ATATATATGCATGCATATATGCATGCATATATATAT', 'seq3': 'GCGCGCATATATGCGCGCATATGCGCGCGCATATAT'}

Handling Multi-line Sequences

Real FASTA files often split long sequences across multiple lines (typically 60 or 80 characters per line):

>seq1 A long sequence
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
>seq2 Another sequence
...

Good news: Our parser already handles this correctly! Here’s why:

The line current_seq += line appends each sequence line to current_seq. So when we encounter multiple consecutive sequence lines (lines that don’t start with >), they all get concatenated together:

Line read	Action	`current_seq`
`>seq1 ...`	Start new sequence	`""`
`ATGCGATC...`	Append	`"ATGCGATC..."`
`GATCGATC...`	Append	`"ATGCGATC...GATCGATC..."`
`GATCGATC...`	Append	`"ATGCGATC...GATCGATC...GATCGATC..."`
`>seq2 ...`	Save seq1, start seq2	`""`

The key insight: we only save a sequence when we hit the next > (or end of file), so all the intermediate lines get accumulated.

# Let's prove it works with multi-line sequences
multiline_example = """>gene1 Split across three lines
ATGCGATCGATCGATCGATC
GGGGCCCCAAAATTTT
ATATATAT
>gene2 Also multi-line
CCCCGGGG
AAAATTTT
"""

test_multiline = parse_fasta(multiline_example)

for name, seq in test_multiline.items():
    print(f"{name}: {seq}")
    print(f"  Length: {len(seq)} bp\n")

gene1: ATGCGATCGATCGATCGATCGGGGCCCCAAAATTTTATATATAT
  Length: 44 bp

gene2: CCCCGGGGAAAATTTT
  Length: 16 bp

String Operations for Sequence Analysis

Now that we can parse FASTA files, let’s explore string operations useful for analyzing sequences. These concepts come from Think Python chapters 5, 7, and 8.

String Indexing and Slicing (Chapter 8)

Strings are sequences of characters. We can access individual characters using indexing and extract substrings using slicing:

seq = result['seq1']
print(f"Full sequence: {seq}")
print(f"Length: {len(seq)}")

# Indexing: access individual characters
print(f"\nFirst nucleotide (index 0): {seq[0]}")
print(f"Last nucleotide (index -1): {seq[-1]}")
print(f"Fifth nucleotide (index 4): {seq[4]}")

# Slicing: extract substrings [start:end]
print(f"\nFirst 10 nucleotides [0:10]: {seq[0:10]}")
print(f"Last 10 nucleotides [-10:]: {seq[-10:]}")
print(f"Middle portion [15:25]: {seq[15:25]}")

Full sequence: ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
Length: 41

First nucleotide (index 0): A
Last nucleotide (index -1): G
Fifth nucleotide (index 4): G

First 10 nucleotides [0:10]: ATGCGATCGA
Last 10 nucleotides [-10:]: GGAATTCCGG
Middle portion [15:25]: CGATCGATCG

The `in` Operator and Searching (Chapter 7)

The in operator checks if a substring exists within a string. This is useful for finding motifs:

# The 'in' operator returns True or False
print("'ATG' in seq1:", 'ATG' in seq)
print("'TATA' in seq1:", 'TATA' in seq)
print("'GAATTC' in seq1:", 'GAATTC' in seq)  # EcoRI restriction site

# Search for restriction sites in all sequences
restriction_site = 'GAATTC'  # EcoRI
for name, sequence in result.items():
    if restriction_site in sequence:
        print(f"{name}: Contains EcoRI site ({restriction_site})")
    else:
        print(f"{name}: No EcoRI site found")

'ATG' in seq1: True
'TATA' in seq1: False
'GAATTC' in seq1: True
seq1: Contains EcoRI site (GAATTC)
seq2: No EcoRI site found
seq3: No EcoRI site found

String Methods (Chapter 7 & 8)

String methods are functions called using dot notation. Here are some useful ones for sequence analysis:

# .count() - count occurrences of a substring
print(f"Number of 'A' in seq1: {seq.count('A')}")
print(f"Number of 'GC' dinucleotides: {seq.count('GC')}")
print(f"Number of 'ATG' (start codons): {seq.count('ATG')}")

# .upper() and .lower() - case conversion
mixed_case = "AtGcGaTc"
print(f"\nOriginal: {mixed_case}")
print(f"Upper: {mixed_case.upper()}")
print(f"Lower: {mixed_case.lower()}")

# .replace() - substitute substrings
print(f"\nOriginal seq1: {seq[:20]}...")
# Convert DNA to RNA (T → U)
rna = seq.replace('T', 'U')
print(f"As RNA: {rna[:20]}...")

Number of 'A' in seq1: 10
Number of 'GC' dinucleotides: 1
Number of 'ATG' (start codons): 1

Original: AtGcGaTc
Upper: ATGCGATC
Lower: atgcgatc

Original seq1: ATGCGATCGATCGATCGATC...
As RNA: AUGCGAUCGAUCGAUCGAUC...

Conditionals and Boolean Operators (Chapter 5)

We can combine conditions using and, or, and not to create more complex filters:

# Example: combining conditions with and, or, not
for name, sequence in result.items():
    has_start = 'ATG' in sequence
    is_long = len(sequence) > 35
    
    # 'and' - both must be true
    if has_start and is_long:
        print(f"{name}: has start codon AND is long")
    
    # 'or' - at least one must be true
    if has_start or is_long:
        print(f"{name}: has start codon OR is long")
    
    # 'not' - negates the condition
    if not has_start:
        print(f"{name}: does NOT have start codon")

seq1: has start codon AND is long
seq1: has start codon OR is long
seq2: has start codon AND is long
seq2: has start codon OR is long
seq3: has start codon AND is long
seq3: has start codon OR is long

Counters and the Modulus Operator (Chapter 5 & 7)

A counter starts at 0 and increments when a condition is met. The modulus operator (%) returns the remainder after division—useful for checking divisibility.

# Count adenines using a counter variable
seq = result['seq1']

a_count = 0
for nucleotide in seq:
    if nucleotide == 'A':
        a_count += 1

print(f"Adenines in seq1: {a_count}")
print(f"(Same as seq.count('A'): {seq.count('A')})")

# Modulus: check if length is divisible by 3 (complete codons)
if len(seq) % 3 == 0:
    print(f"Length {len(seq)} is divisible by 3")
else:
    print(f"Length {len(seq)} has remainder {len(seq) % 3}")

Adenines in seq1: 10
(Same as seq.count('A'): 10)
Length 41 has remainder 2

Reading FASTA from Files (Chapter 7 & 8)

The with statement safely opens and closes files. We can adapt our parser to read from a file:

# Write our FASTA data to a file
with open('example.fasta', 'w') as f:
    f.write(fasta_data)
print("Created example.fasta")

# Read it back using our parse logic
sequences = {}
current_name = None
current_seq = ""

with open('example.fasta', 'r') as f:
    for line in f:
        line = line.strip()
        if line.startswith('>'):
            if current_name:
                sequences[current_name] = current_seq
            current_name = line[1:].split()[0]
            current_seq = ""
        else:
            current_seq += line
    if current_name:
        sequences[current_name] = current_seq

print("Read from file:", list(sequences.keys()))

Created example.fasta
Read from file: ['seq1', 'seq2', 'seq3']

Using the Dictionary

Now we can easily access any sequence by name:

# Get a specific sequence
print("E. coli promoter sequence:")
print(result['seq1'])

E. coli promoter sequence:
ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG

# Compare sequence lengths
for name, seq in result.items():
    print(f"{name}: {len(seq)} nucleotides")

seq1: 41 nucleotides
seq2: 36 nucleotides
seq3: 36 nucleotides

Visualizing GC Content Across Sequences

Now let’s apply our FASTA parser to a more interesting problem: analyzing GC content in sliding windows across longer sequences. First, let’s define some longer DNA sequences:

import random

random.seed(42)  # Makes results reproducible

def generate_biased_seq(length, gc_start, gc_end):
    """Generate a sequence with one GC-rich region."""
    seq = []
    for i in range(length):
        if gc_start <= i < gc_end:
            # GC-rich: 70% G/C, 30% A/T
            seq.append(random.choice('GCGCGCGCAT'))
        else:
            # AT-rich: 30% G/C, 70% A/T
            seq.append(random.choice('ATATATATGC'))
    return ''.join(seq)

# Create sequences with distinct GC patterns
seq_ecoli = generate_biased_seq(500, 100, 300)   # GC-rich in middle
seq_yeast = generate_biased_seq(500, 0, 150)     # GC-rich at start
seq_human = generate_biased_seq(500, 350, 500)   # GC-rich at end

multiline_fasta = f""">ecoli E. coli genomic region
{seq_ecoli}
>yeast Yeast promoter
{seq_yeast}
>human Human sequence
{seq_human}
"""

sequences = parse_fasta(multiline_fasta)

for name, seq in sequences.items():
    print(f"{name}: {len(seq)} bp")

ecoli: 500 bp
yeast: 500 bp
human: 500 bp

GC Content Function

Let’s define a function to calculate GC content (from Lecture 4):

def gc_content(sequence):
    """Calculate GC content as a percentage."""
    g = sequence.count('G')
    c = sequence.count('C')
    return (g + c) / len(sequence) * 100

Sliding Window Analysis

A sliding window moves across the sequence, calculating GC content at each position:

import pandas as pd

window_size = 50
gc_data = []

# For each sequence...
for name, seq in sequences.items():
    # ...slide a window across it
    for i in range(len(seq) - window_size + 1):
        window = seq[i:i + window_size]  # Extract 50 bp
        gc = gc_content(window)           # Calculate GC%
        gc_data.append({
            'sequence': name,
            'position': i,
            'gc_percent': gc
        })

df = pd.DataFrame(gc_data)
df.head()

	sequence	position	gc_percent
0	ecoli	0	20.0
1	ecoli	1	20.0
2	ecoli	2	22.0
3	ecoli	3	22.0
4	ecoli	4	24.0

Plotting GC Content Profiles

Now let’s visualize how GC content varies along each sequence:

import altair as alt

alt.Chart(df).mark_line().encode(
    x=alt.X('position:Q', title='Position (bp)'),
    y=alt.Y('gc_percent:Q', title='GC Content (%)', scale=alt.Scale(domain=[0, 100])),
    color=alt.Color('sequence:N', title='Sequence'),
    strokeWidth=alt.value(2)
).properties(
    width=600,
    height=300,
    title='GC Content in Sliding Windows (50 bp)'
).configure_legend(
    orient='bottom'
)

Summary

This lecture demonstrated Python concepts from Think Python chapters 5, 7, and 8:

From Chapter 5 (Conditionals): - Boolean expressions and comparisons - Logical operators: and, or, not - The modulus operator % for divisibility checks

From Chapter 7 (Iteration and Search): - For loops iterating through strings - The in operator for substring search - Counters for tracking occurrences - String methods: .strip(), .split()

From Chapter 8 (Strings): - String indexing: seq[0], seq[-1] - String slicing: seq[start:end] - String methods: .count(), .upper(), .lower(), .replace(), .startswith() - File operations: open(), reading line by line

Biological applications: - Parsing FASTA format files - Searching for motifs and restriction sites - Calculating nucleotide composition - Sliding window analysis for GC content - Visualization of sequence properties

Homework

Chapters 9, 10, 11 from “Think Python” 3rd edition