Working with biological sequence data using strings, conditionals, and iteration
This lecture demonstrates concepts from Think Python chapters 5, 7, and 8.
Methods vs Functions: A function is called by name: len(seq), print(x). A method is called on an object using dot notation: seq.count('A'), line.startswith('>'). Methods are functions that “belong to” a specific type—string methods work on strings, list methods work on lists, etc.
What is FASTA format?
FASTA is one of the most common formats for storing biological sequences. Each sequence entry consists of:
A header line starting with >, followed by the sequence name and optional description
One or more lines containing the sequence itself
Here’s an example with three DNA sequences:
fasta_data =""">seq1 E. coli promoter regionATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG>seq2 Yeast regulatory elementATATATATGCATGCATATATGCATGCATATATATAT>seq3 Human CpG islandGCGCGCATATATGCGCGCATATGCGCGCGCATATAT"""print(fasta_data)
>seq1 E. coli promoter region
ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
>seq2 Yeast regulatory element
ATATATATGCATGCATATATGCATGCATATATATAT
>seq3 Human CpG island
GCGCGCATATATGCGCGCATATGCGCGCGCATATAT
Note on triple quotes: In Python, triple quotes (""" or ''') allow you to create multi-line strings. This is perfect for storing FASTA data directly in your code without needing to read from a file.
Our Goal: A Dictionary
We want to convert this FASTA data into a Python dictionary where: - Keys are sequence names (e.g., 'seq1', 'seq2', 'seq3') - Values are the corresponding sequences
Why a dictionary? Because it gives us fast lookup by name. If we want the sequence for seq2, we simply write sequences['seq2'].
Formatted String Literals (f-strings)
Before we start parsing, let’s learn about f-strings—a convenient way to embed variables and expressions inside strings. An f-string starts with f before the opening quote, and uses curly braces {} to insert values:
# Basic f-string usagename ="seq1"length =42# Without f-strings (concatenation)print("Sequence "+ name +" has length "+str(length))# With f-strings (cleaner!)print(f"Sequence {name} has length {length}")
Sequence seq1 has length 42
Sequence seq1 has length 42
# F-strings can contain expressions, not just variablessequence ="ATGCGATC"print(f"Sequence: {sequence}")print(f"Length: {len(sequence)}")print(f"First 4 bases: {sequence[:4]}")print(f"GC count: {sequence.count('G') + sequence.count('C')}")
Why f-strings? They’re more readable than string concatenation ("text" + var + "more"), and you don’t need to manually convert numbers to strings with str(). We’ll use f-strings throughout this lecture for output.
Parsing Logic: Step by Step
Step 1: Split into lines
First, we split our FASTA string into individual lines. We use two string methods chained together:
.strip() removes any leading/trailing whitespace (spaces, tabs, newlines) from the string. This is important because our multi-line string might have extra blank lines at the beginning or end.
.split('\n') breaks the string at each newline character (\n), returning a list of strings—one for each line.
The \n is an escape sequence representing a newline (line break). When you press Enter in a text file, it inserts a \n character.
# Let's see what strip() doestext_with_whitespace =" \n\n Hello World \n\n "print(f"Before strip: '{text_with_whitespace}'")print(f"After strip: '{text_with_whitespace.strip()}'")# And what split('\n') doesmultiline ="Line 1\nLine 2\nLine 3"print(f"\nOriginal: {multiline}")print(f"After split: {multiline.split('\n')}")
Before strip: '
Hello World
'
After strip: 'Hello World'
Original: Line 1
Line 2
Line 3
After split: ['Line 1', 'Line 2', 'Line 3']
Now let’s apply this to our FASTA data. The enumerate() function gives us both the index and value as we loop:
lines = fasta_data.strip().split('\n')for i, line inenumerate(lines):print(f"Line {i}: {line}")
Line 0: >seq1 E. coli promoter region
Line 1: ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
Line 2: >seq2 Yeast regulatory element
Line 3: ATATATATGCATGCATATATGCATGCATATATATAT
Line 4: >seq3 Human CpG island
Line 5: GCGCGCATATATGCGCGCATATGCGCGCGCATATAT
Step 2: Identify header lines
Header lines start with >. We can detect them using the .startswith() method, which returns True if a string begins with the specified prefix, and False otherwise.
This is cleaner than checking line[0] == '>' because: - It’s more readable - It won’t crash on empty strings (whereas line[0] would raise an IndexError)
for line in lines:if line.startswith('>'):print(f"HEADER: {line}")else:print(f"SEQUENCE: {line}")
HEADER: >seq1 E. coli promoter region
SEQUENCE: ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
HEADER: >seq2 Yeast regulatory element
SEQUENCE: ATATATATGCATGCATATATGCATGCATATATATAT
HEADER: >seq3 Human CpG island
SEQUENCE: GCGCGCATATATGCGCGCATATGCGCGCGCATATAT
Step 3: Extract the sequence name
The sequence name is the first “word” after the >. We can extract it by: 1. Removing the > character (using line[1:] to skip the first character) 2. Splitting on whitespace and taking the first element
About line[1:]: This is called slicing. The syntax string[start:end] extracts characters from index start up to (but not including) index end. When we omit end, it means “go to the end of the string.” So line[1:] means “everything from index 1 onwards”—i.e., skip the first character (the >).
About .split(): When called with no arguments, .split() splits on any whitespace (spaces, tabs, multiple spaces) and removes empty strings. This is more robust than .split(' ') which only splits on single spaces.
header =">seq1 E. coli promoter region"# Remove the '>' characterwithout_gt = header[1:]print(f"After removing '>': {without_gt}")# Split on whitespaceparts = without_gt.split()print(f"Split into parts: {parts}")# Take the first part (the name)name = parts[0]print(f"Sequence name: {name}")
After removing '>': seq1 E. coli promoter region
Split into parts: ['seq1', 'E.', 'coli', 'promoter', 'region']
Sequence name: seq1
First Attempt: A Naive Parser
Now let’s put it together. The basic idea is: - When we see a >, save the previous sequence (if any) and start a new one - When we see a sequence line, append it to the current sequence
Two important patterns to notice:
Truthiness test:if current_name: checks if current_name has a “truthy” value. In Python, None and empty strings "" are “falsy” (evaluate to False), while non-empty strings are “truthy”. This is cleaner than writing if current_name != None and current_name != "":
Augmented assignment:current_seq += line is shorthand for current_seq = current_seq + line. It appends line to the end of current_seq.
Here’s our first attempt:
# fasta_data = """>seq1 E. coli promoter region# ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG# >seq2 Yeast regulatory element# ATATATATGCATGCATATATGCATGCATATATATAT# >seq3 Human CpG island# GCGCGCATATATGCGCGCATATGCGCGCGCATATAT# """# CAUTION: This version has a bug!sequences = {}current_name =Nonecurrent_seq =""for line in fasta_data.strip().split('\n'):if line.startswith('>'):# Save the previous sequence (if there is one)if current_name: sequences[current_name] = current_seq# Start a new sequence current_name = line[1:].split()[0] current_seq =""else:# Append to current sequence current_seq += lineprint("Sequences found:", list(sequences.keys()))print(f"Number of sequences: {len(sequences)}")
Sequences found: ['seq1', 'seq2']
Number of sequences: 2
The Last Sequence Problem
Wait! We have 3 sequences in our FASTA data, but only 2 were captured!
What happened to seq3?
Let’s trace through the logic:
Line
Action
sequences dict
>seq1 E. coli promoter
Start seq1
{}
ATGCGATC...
Append to seq1
{}
>seq2 Yeast regulatory
Save seq1, start seq2
{'seq1': '...'}
ATATATA...
Append to seq2
{'seq1': '...'}
>seq3 Human CpG island
Save seq2, start seq3
{'seq1': '...', 'seq2': '...'}
GCGCGCA...
Append to seq3
{'seq1': '...', 'seq2': '...'}
(end of file)
???
seq3 never saved!
The problem: We only save a sequence when we encounter the NEXT >. But the last sequence has no > after it!
The Fix: Don’t Forget the Last Sequence
After the loop ends, we need to check if there’s a pending sequence that hasn’t been saved yet:
sequences = {}current_name =Nonecurrent_seq =""for line in fasta_data.strip().split('\n'):if line.startswith('>'):# Save the previous sequence (if there is one)if current_name: sequences[current_name] = current_seq# Start a new sequence current_name = line[1:].split()[0] current_seq =""else:# Append to current sequence current_seq += line# Don't forget the last sequence!if current_name: sequences[current_name] = current_seqprint("Sequences found:", list(sequences.keys()))print(f"Number of sequences: {len(sequences)}")
Sequences found: ['seq1', 'seq2', 'seq3']
Number of sequences: 3
Now we have all three sequences! Let’s verify by iterating through the dictionary.
About .items(): When iterating over a dictionary, for name, seq in sequences.items(): gives us both the key and value in each iteration. Other options: - for name in sequences: — just the keys - for seq in sequences.values(): — just the values - for name, seq in sequences.items(): — both (most useful)
for name, seq in sequences.items():print(f"{name}: {seq[:20]}... (length: {len(seq)})")
Real FASTA files often split long sequences across multiple lines (typically 60 or 80 characters per line):
>seq1 A long sequence
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
GATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATC
>seq2 Another sequence
...
Good news: Our parser already handles this correctly! Here’s why:
The line current_seq += lineappends each sequence line to current_seq. So when we encounter multiple consecutive sequence lines (lines that don’t start with >), they all get concatenated together:
Line read
Action
current_seq
>seq1 ...
Start new sequence
""
ATGCGATC...
Append
"ATGCGATC..."
GATCGATC...
Append
"ATGCGATC...GATCGATC..."
GATCGATC...
Append
"ATGCGATC...GATCGATC...GATCGATC..."
>seq2 ...
Save seq1, start seq2
""
The key insight: we only save a sequence when we hit the next> (or end of file), so all the intermediate lines get accumulated.
# Let's prove it works with multi-line sequencesmultiline_example =""">gene1 Split across three linesATGCGATCGATCGATCGATCGGGGCCCCAAAATTTTATATATAT>gene2 Also multi-lineCCCCGGGGAAAATTTT"""test_multiline = parse_fasta(multiline_example)for name, seq in test_multiline.items():print(f"{name}: {seq}")print(f" Length: {len(seq)} bp\n")
gene1: ATGCGATCGATCGATCGATCGGGGCCCCAAAATTTTATATATAT
Length: 44 bp
gene2: CCCCGGGGAAAATTTT
Length: 16 bp
String Operations for Sequence Analysis
Now that we can parse FASTA files, let’s explore string operations useful for analyzing sequences. These concepts come from Think Python chapters 5, 7, and 8.
String Indexing and Slicing (Chapter 8)
Strings are sequences of characters. We can access individual characters using indexing and extract substrings using slicing:
Full sequence: ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
Length: 41
First nucleotide (index 0): A
Last nucleotide (index -1): G
Fifth nucleotide (index 4): G
First 10 nucleotides [0:10]: ATGCGATCGA
Last 10 nucleotides [-10:]: GGAATTCCGG
Middle portion [15:25]: CGATCGATCG
The in Operator and Searching (Chapter 7)
The in operator checks if a substring exists within a string. This is useful for finding motifs:
# The 'in' operator returns True or Falseprint("'ATG' in seq1:", 'ATG'in seq)print("'TATA' in seq1:", 'TATA'in seq)print("'GAATTC' in seq1:", 'GAATTC'in seq) # EcoRI restriction site# Search for restriction sites in all sequencesrestriction_site ='GAATTC'# EcoRIfor name, sequence in result.items():if restriction_site in sequence:print(f"{name}: Contains EcoRI site ({restriction_site})")else:print(f"{name}: No EcoRI site found")
'ATG' in seq1: True
'TATA' in seq1: False
'GAATTC' in seq1: True
seq1: Contains EcoRI site (GAATTC)
seq2: No EcoRI site found
seq3: No EcoRI site found
String Methods (Chapter 7 & 8)
String methods are functions called using dot notation. Here are some useful ones for sequence analysis:
# .count() - count occurrences of a substringprint(f"Number of 'A' in seq1: {seq.count('A')}")print(f"Number of 'GC' dinucleotides: {seq.count('GC')}")print(f"Number of 'ATG' (start codons): {seq.count('ATG')}")# .upper() and .lower() - case conversionmixed_case ="AtGcGaTc"print(f"\nOriginal: {mixed_case}")print(f"Upper: {mixed_case.upper()}")print(f"Lower: {mixed_case.lower()}")# .replace() - substitute substringsprint(f"\nOriginal seq1: {seq[:20]}...")# Convert DNA to RNA (T → U)rna = seq.replace('T', 'U')print(f"As RNA: {rna[:20]}...")
Number of 'A' in seq1: 10
Number of 'GC' dinucleotides: 1
Number of 'ATG' (start codons): 1
Original: AtGcGaTc
Upper: ATGCGATC
Lower: atgcgatc
Original seq1: ATGCGATCGATCGATCGATC...
As RNA: AUGCGAUCGAUCGAUCGAUC...
Conditionals and Boolean Operators (Chapter 5)
We can combine conditions using and, or, and not to create more complex filters:
# Example: combining conditions with and, or, notfor name, sequence in result.items(): has_start ='ATG'in sequence is_long =len(sequence) >35# 'and' - both must be trueif has_start and is_long:print(f"{name}: has start codon AND is long")# 'or' - at least one must be trueif has_start or is_long:print(f"{name}: has start codon OR is long")# 'not' - negates the conditionifnot has_start:print(f"{name}: does NOT have start codon")
seq1: has start codon AND is long
seq1: has start codon OR is long
seq2: has start codon AND is long
seq2: has start codon OR is long
seq3: has start codon AND is long
seq3: has start codon OR is long
Counters and the Modulus Operator (Chapter 5 & 7)
A counter starts at 0 and increments when a condition is met. The modulus operator (%) returns the remainder after division—useful for checking divisibility.
# Count adenines using a counter variableseq = result['seq1']a_count =0for nucleotide in seq:if nucleotide =='A': a_count +=1print(f"Adenines in seq1: {a_count}")print(f"(Same as seq.count('A'): {seq.count('A')})")# Modulus: check if length is divisible by 3 (complete codons)iflen(seq) %3==0:print(f"Length {len(seq)} is divisible by 3")else:print(f"Length {len(seq)} has remainder {len(seq) %3}")
Adenines in seq1: 10
(Same as seq.count('A'): 10)
Length 41 has remainder 2
Reading FASTA from Files (Chapter 7 & 8)
The with statement safely opens and closes files. We can adapt our parser to read from a file:
# Write our FASTA data to a filewithopen('example.fasta', 'w') as f: f.write(fasta_data)print("Created example.fasta")# Read it back using our parse logicsequences = {}current_name =Nonecurrent_seq =""withopen('example.fasta', 'r') as f:for line in f: line = line.strip()if line.startswith('>'):if current_name: sequences[current_name] = current_seq current_name = line[1:].split()[0] current_seq =""else: current_seq += lineif current_name: sequences[current_name] = current_seqprint("Read from file:", list(sequences.keys()))
Created example.fasta
Read from file: ['seq1', 'seq2', 'seq3']
Using the Dictionary
Now we can easily access any sequence by name:
# Get a specific sequenceprint("E. coli promoter sequence:")print(result['seq1'])
E. coli promoter sequence:
ATGCGATCGATCGATCGATCGATCGAATTCCGGAATTCCGG
# Compare sequence lengthsfor name, seq in result.items():print(f"{name}: {len(seq)} nucleotides")
Now let’s apply our FASTA parser to a more interesting problem: analyzing GC content in sliding windows across longer sequences. First, let’s define some longer DNA sequences:
import randomrandom.seed(42) # Makes results reproducibledef generate_biased_seq(length, gc_start, gc_end):"""Generate a sequence with one GC-rich region.""" seq = []for i inrange(length):if gc_start <= i < gc_end:# GC-rich: 70% G/C, 30% A/T seq.append(random.choice('GCGCGCGCAT'))else:# AT-rich: 30% G/C, 70% A/T seq.append(random.choice('ATATATATGC'))return''.join(seq)# Create sequences with distinct GC patternsseq_ecoli = generate_biased_seq(500, 100, 300) # GC-rich in middleseq_yeast = generate_biased_seq(500, 0, 150) # GC-rich at startseq_human = generate_biased_seq(500, 350, 500) # GC-rich at endmultiline_fasta =f""">ecoli E. coli genomic region{seq_ecoli}>yeast Yeast promoter{seq_yeast}>human Human sequence{seq_human}"""sequences = parse_fasta(multiline_fasta)for name, seq in sequences.items():print(f"{name}: {len(seq)} bp")
ecoli: 500 bp
yeast: 500 bp
human: 500 bp
GC Content Function
Let’s define a function to calculate GC content (from Lecture 4):
def gc_content(sequence):"""Calculate GC content as a percentage.""" g = sequence.count('G') c = sequence.count('C')return (g + c) /len(sequence) *100
Sliding Window Analysis
A sliding window moves across the sequence, calculating GC content at each position:
import pandas as pdwindow_size =50gc_data = []# For each sequence...for name, seq in sequences.items():# ...slide a window across itfor i inrange(len(seq) - window_size +1): window = seq[i:i + window_size] # Extract 50 bp gc = gc_content(window) # Calculate GC% gc_data.append({'sequence': name,'position': i,'gc_percent': gc })df = pd.DataFrame(gc_data)df.head()
sequence
position
gc_percent
0
ecoli
0
20.0
1
ecoli
1
20.0
2
ecoli
2
22.0
3
ecoli
3
22.0
4
ecoli
4
24.0
Plotting GC Content Profiles
Now let’s visualize how GC content varies along each sequence:
import altair as altalt.Chart(df).mark_line().encode( x=alt.X('position:Q', title='Position (bp)'), y=alt.Y('gc_percent:Q', title='GC Content (%)', scale=alt.Scale(domain=[0, 100])), color=alt.Color('sequence:N', title='Sequence'), strokeWidth=alt.value(2)).properties( width=600, height=300, title='GC Content in Sliding Windows (50 bp)').configure_legend( orient='bottom')
Summary
This lecture demonstrated Python concepts from Think Python chapters 5, 7, and 8:
From Chapter 5 (Conditionals): - Boolean expressions and comparisons - Logical operators: and, or, not - The modulus operator % for divisibility checks
From Chapter 7 (Iteration and Search): - For loops iterating through strings - The in operator for substring search - Counters for tracking occurrences - String methods: .strip(), .split()
From Chapter 8 (Strings): - String indexing: seq[0], seq[-1] - String slicing: seq[start:end] - String methods: .count(), .upper(), .lower(), .replace(), .startswith() - File operations: open(), reading line by line
Biological applications: - Parsing FASTA format files - Searching for motifs and restriction sites - Calculating nucleotide composition - Sliding window analysis for GC content - Visualization of sequence properties
Homework
Chapters 9, 10, 11 from “Think Python” 3rd edition