# Store a DNA sequence as a string variable
dna = "ATGCGATCGATCGATCGATCGATCG"
# Display the sequence
dna'ATGCGATCGATCGATCGATCGATCG'
This lecture demonstrates concepts from Think Python chapters 1-3 using biological data.
In Python, we can represent a DNA sequence as a string—a sequence of characters. Let’s start with a short sequence from the E. coli genome:
'ATGCGATCGATCGATCGATCGATCG'
We can use built-in functions to learn about our sequence:
<class 'str'>
25
We can concatenate (join) strings using + and repeat them using *:
Full sequence: TATAAAATGCGATCG
Poly-A tail: AAAAAAAAAAAAAAAAAAAA
A fundamental task in sequence analysis is counting how many of each nucleotide (A, T, G, C) appear in a sequence. We can use the .count() method:
A: 10
T: 10
G: 11
C: 10
The total of all nucleotides should equal the sequence length. Let’s use arithmetic operators to verify:
GC content is the percentage of nucleotides that are either G or C. This is biologically important because: - Higher GC content = more stable DNA (3 hydrogen bonds vs 2 for AT) - GC content varies between species and genomic regions - Used in gene prediction and genome analysis
Formula: GC% = (G + C) / total × 100
GC content: 51.21951219512195 %
We can use round() to clean up the output:
GC content: 51.2 %
The print() function can take multiple arguments, separated by spaces:
Python’s math module provides mathematical functions. We can use it to calculate Shannon entropy—a measure of sequence complexity from information theory.
Pi = 3.141592653589793
Square root of 16: 4.0
Log base 2 of 8: 3.0
Shannon entropy measures how “random” or “complex” a sequence is. A sequence with equal proportions of A, T, G, C has maximum entropy (2 bits), while a sequence of all A’s has entropy of 0.
Formula: H = -Σ p(x) × log₂(p(x))
Frequencies:
A: 0.244
T: 0.244
G: 0.268
C: 0.244
Shannon entropy: 1.999 bits
Maximum possible: 2.0 bits
A function is a reusable block of code. Let’s create a function to calculate GC content so we can easily apply it to any sequence:
Let’s break down the function: - def gc_content(sequence): — defines the function name and parameter - The docstring ("""...""") explains what the function does - The body is indented (4 spaces) - return specifies what value the function gives back
Test 1: 50.0
Test 2: 100.0
Test 3: 0.0
Now let’s create a function for Shannon entropy—wrapping our earlier calculation into a reusable form:
def shannon_entropy(sequence):
"""
Calculate Shannon entropy of a DNA sequence.
Args:
sequence: A DNA sequence string
Returns:
Entropy in bits (0 to 2 for DNA)
"""
length = len(sequence)
entropy = 0
for nucleotide in ['A', 'T', 'G', 'C']:
count = sequence.count(nucleotide)
if count > 0:
freq = count / length
entropy = entropy - freq * math.log2(freq)
return entropyTest 1 (balanced): 2.0
Test 2 (all A's): 0.0
Test 3 (AT only): 1.0
A for loop lets us examine each character in a sequence one at a time. Let’s implement our own nucleotide counter to understand what .count() does internally:
Let’s create a more flexible counting function that takes both the sequence and the target nucleotide as parameters:
def count_nucleotide(sequence, target):
"""
Count occurrences of a nucleotide in a sequence.
Args:
sequence: A DNA sequence string
target: The nucleotide to count ('A', 'T', 'G', or 'C')
Returns:
The count of the target nucleotide
"""
count = 0
for nucleotide in sequence:
if nucleotide == target:
count = count + 1
return countWe can build more complex functions by combining simpler ones. Let’s rewrite gc_content to use our count_nucleotide function:
GC content: 52.94117647058824
Variables created inside a function are local—they only exist within that function:
Let’s compare GC content across sequences from different organisms:
GC Content Comparison
========================================
E. coli: 51.5 %
Yeast: 30.8 %
Human: 61.5 %
Plasmodium: 0.0 %
The range() function generates a sequence of numbers, useful when you need to count:
Number: 0
Number: 1
Number: 2
Number: 3
Number: 4
Position GC%
0 60.0
1 60.0
2 60.0
3 50.0
4 40.0
5 30.0
6 20.0
7 20.0
8 20.0
9 30.0
10 40.0
11 50.0
12 60.0
13 60.0
14 60.0
15 60.0
16 60.0
17 50.0
18 40.0
19 30.0
20 20.0
To create charts, we need two libraries: - Pandas — for organizing data into tables (DataFrames) - Altair — for creating visualizations
The as keyword creates a shorter alias: instead of writing altair.Chart(...), we can write alt.Chart(...).
Before we can plot, we need to organize our results. We’ll build a list of dictionaries where each dictionary represents one row of data:
# Start with an empty list
data = []
# Loop through each organism and its sequence
for name, seq in sequences:
# Create a dictionary for this organism
row = {
"Organism": name,
"GC Content (%)": round(gc_content(seq), 1),
"Shannon Entropy (bits)": round(shannon_entropy(seq), 3)
}
# Add it to our list
data.append(row)
# Look at what we built
data[{'Organism': 'E. coli',
'GC Content (%)': 51.5,
'Shannon Entropy (bits)': 1.998},
{'Organism': 'Yeast', 'GC Content (%)': 30.8, 'Shannon Entropy (bits)': 1.89},
{'Organism': 'Human',
'GC Content (%)': 61.5,
'Shannon Entropy (bits)': 1.961},
{'Organism': 'Plasmodium',
'GC Content (%)': 0.0,
'Shannon Entropy (bits)': 1.0}]
Each dictionary has the same keys ("Organism", "GC Content (%)", "Shannon Entropy (bits)"), which will become column names.
A DataFrame is a table—like a spreadsheet. Pandas can convert our list of dictionaries directly into a DataFrame:
Altair uses a declarative approach: you describe what you want, not how to draw it: - alt.Chart(df) — use this DataFrame - .mark_bar() — draw bars - .encode(x=..., y=...) — map columns to visual properties
Let’s put them together: