Lecture 13: Sequence Alignment

Open In Colab

This lecture draws on material from: - Dynamic Programming and Edit Distance by Ben Langmead (JHU) - Global Alignment by Ben Langmead (JHU) - Local Alignment by Ben Langmead (JHU) - DP in Less Time and Space by Ben Langmead (JHU) - Computational Genomics notebooks by Ben Langmead

Sequence alignment is one of the most fundamental operations in bioinformatics. It allows us to compare DNA, RNA, or protein sequences to identify regions of similarity that may reflect functional, structural, or evolutionary relationships. In this lecture we will build up from the concept of edit distance to full global and local alignment algorithms.

Why align sequences?

Almost every task in genomics involves comparing sequences:

  • Read mapping — aligning short sequencing reads to a reference genome
  • Variant calling — finding differences between a sample and a reference
  • Homology search — finding genes in other species that are related to a query (e.g., BLAST)
  • Genome assembly — finding overlaps between reads
  • Phylogenetics — building evolutionary trees from multiple sequence alignments

All of these rely on the ability to optimally align two (or more) sequences.

Part 1: Edit Distance

Why this matters in biology: Edit distance is the foundation of sequence comparison — it underlies BLAST scoring, read mapping, and variant calling. Every time you align a sequencing read to a reference genome, an edit-distance-like computation is at work.

What is edit distance?

The three edit operations: substitution, insertion, and deletion

The edit distance (also called Levenshtein distance) between two strings is the minimum number of single-character operations required to transform one string into the other. The allowed operations are:

Operation Example Cost
Substitution (replace) AG 1
Insertion (gap in first sequence) -A 1
Deletion (gap in second sequence) A- 1

For example, to transform ACGT into AGT:

ACGT
A-GT    (delete C → cost 1)

The edit distance is 1.

Edit distance vs. Hamming distance

You may have encountered Hamming distance before — it counts the number of positions where two equal-length strings differ (substitutions only). Edit distance is more general because it allows insertions and deletions, so the strings can have different lengths.

Key property: For equal-length strings, \(\text{editDistance}(x, y) \le \text{hammingDistance}(x, y)\). This is because edit distance can always “do at least as well” by considering insertions and deletions alongside substitutions.

A useful lower bound: \(\text{editDistance}(x, y) \ge \big| |x| - |y| \big|\) — at minimum you need enough indels to account for any length difference.

import time

def hamming_distance(x, y):
    """Return Hamming distance between two equal-length strings."""
    assert len(x) == len(y), "Strings must be the same length"
    return sum(c1 != c2 for c1, c2 in zip(x, y))

print(hamming_distance('ACCT', 'ATGG'))  # 1 substitution (G→T)
3

But what if the strings have different lengths?

hamming_distance('ACGT', 'AGT')  # ERROR — different lengths!

This is where edit distance comes in.

A naive recursive solution

We can compute edit distance recursively. The idea: compare the last characters of both strings, and consider all three operations:

\[ \text{edDist}(x, y) = \min \begin{cases} \text{edDist}(x[:-1], y[:-1]) + \delta(x[-1], y[-1]) & \text{(substitution/match)} \\ \text{edDist}(x[:-1], y) + 1 & \text{(deletion from } x\text{)} \\ \text{edDist}(x, y[:-1]) + 1 & \text{(insertion into } x\text{)} \end{cases} \]

where \(\delta(a, b) = 0\) if \(a = b\), and \(1\) otherwise.

import time

def ed_recursive(x, y):
    """Compute edit distance between x and y recursively (naive)."""
    if len(x) == 0: return len(y)
    if len(y) == 0: return len(x)
    delt = 1 if x[-1] != y[-1] else 0
    return min(
        ed_recursive(x[:-1], y[:-1]) + delt,  # substitute / match
        ed_recursive(x[:-1], y) + 1,           # delete from x
        ed_recursive(x, y[:-1]) + 1            # insert into x
    )

t0 = time.time()
print(ed_recursive('ACGTTTTTAC', 'AGTGTTTTAG'))
print(f"Time: {time.time()-t0:.3f}s")
3
Time: 0.649s

Edit transcripts

An edit transcript records the operations that transform one string into another: M (match), R (replace/substitution), D (delete from \(x\)), I (insert into \(x\)).

For example, aligning GCGTATGC to GCTATGCC:

x: G C G T A T G - C
   | |   | | | |   |
y: G C - T A T G C C

The transcript is MMDMMMMIM and the edit distance is 2 (one D + one I).

How the recursion works

The algorithm works backwards from the last characters and considers three choices at each step:

  1. Match / Substitute — Align x[-1] with y[-1]. If they match, cost is 0; if they differ, cost is 1. Recurse on x[:-1] vs y[:-1].
  2. Delete from x — Drop the last character of x (cost 1). Recurse on x[:-1] vs y.
  3. Insert into x — Add y[-1] to x (cost 1). Recurse on x vs y[:-1].

Take the minimum of all three. The base cases: if either string is empty, the cost equals the length of the other string.

For ed("ACT", "AT"), the winning path is: match T → delete C → match A → done (edit distance = 1). But the algorithm evaluates all three branches at every level, which means the same subproblems (like ed("A","A")) get recomputed many times.

Why is this so slow?

Each call spawns three more calls, creating an exponential blowup — roughly \(3^{2n}\) calls for strings of length \(n\). The root cause is redundant computation: many identical subproblems are solved over and over. For example, when comparing two 9-character sequences, the subproblem ed("AC", "AC") alone is recomputed 48,639 times.

Dynamic programming fixes this by computing each subproblem exactly once and storing the results in a table.

Dynamic programming to the rescue

Dynamic programming (DP) avoids redundant computation by building a table of solutions to subproblems. We create an \((m+1) \times (n+1)\) matrix \(D\) where \(D[i][j]\) holds the edit distance between the first \(i\) characters of \(x\) and the first \(j\) characters of \(y\).

Initialization and recurrence

Base cases: \(D[i][0] = i\) (deleting \(i\) characters) and \(D[0][j] = j\) (inserting \(j\) characters).

Recurrence:

\[D[i][j] = \min \begin{cases} D[i-1][j-1] + \delta(x[i-1], y[j-1]) \\ D[i-1][j] + 1 \\ D[i][j-1] + 1 \end{cases}\]

The answer is in \(D[m][n]\). Each cell depends only on the three neighbors above, to the left, and diagonally above-left, so we fill the table row by row.

import time
import numpy as np

def edit_distance(x, y):
    """Compute edit distance between strings x and y using dynamic programming."""
    D = np.zeros((len(x) + 1, len(y) + 1), dtype=int)
    
    # Base cases
    D[0, 1:] = range(1, len(y) + 1)
    D[1:, 0] = range(1, len(x) + 1)
    
    # Fill the matrix
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            delt = 1 if x[i-1] != y[j-1] else 0
            D[i, j] = min(
                D[i-1, j-1] + delt,  # diagonal: substitution or match
                D[i-1, j] + 1,       # vertical: deletion
                D[i, j-1] + 1        # horizontal: insertion
            )
    return D

x, y = 'GCGTATGC', 'TATTGGCTATGCG'
t0 = time.time()
D = edit_distance(x, y)
elapsed = time.time() - t0
print(f"Edit distance between '{x}' and '{y}': {D[len(x), len(y)]}")
print(f"Time: {elapsed:.6f}s")
Edit distance between 'GCGTATGC' and 'TATTGGCTATGCG': 7
Time: 0.000197s
from IPython.display import HTML

HTML('''
<style>
.dp-anim-wrap{font-family:system-ui,sans-serif;max-width:900px}
.dp-anim-wrap table{border-collapse:collapse;margin:10px 0}
.dp-anim-wrap td,.dp-anim-wrap th{width:38px;height:32px;text-align:center;border:1px solid #999;font-size:14px;padding:2px}
.dp-anim-wrap th{background:#e8e8e8;font-weight:600}
.dp-anim-wrap .cell-unfilled{background:#eee;color:#ccc}
.dp-anim-wrap .cell-filled{background:#fff}
.dp-anim-wrap .cell-current{background:#ffd54f;font-weight:700}
.dp-anim-wrap .cell-diag{background:#90caf9}
.dp-anim-wrap .cell-above{background:#a5d6a7}
.dp-anim-wrap .cell-left{background:#ef9a9a}
.dp-anim-wrap .cell-answer{background:#66bb6a;color:#fff;font-weight:700}
.dp-anim-wrap .cell-trace{background:#ce93d8;font-weight:700}
.dp-anim-wrap .cell-trace-cur{background:#ab47bc;color:#fff;font-weight:700}
.dp-anim-wrap .controls{margin:8px 0;display:flex;align-items:center;gap:8px;flex-wrap:wrap}
.dp-anim-wrap button{padding:4px 14px;font-size:14px;cursor:pointer;border:1px solid #888;border-radius:4px;background:#f5f5f5}
.dp-anim-wrap button:hover{background:#e0e0e0}
.dp-anim-wrap .info-panel{background:#f9f9f9;border:1px solid #ddd;border-radius:6px;padding:10px 14px;margin-top:8px;font-size:13px;line-height:1.6;min-height:60px}
.dp-anim-wrap label{font-size:13px}
.dp-anim-wrap input[type=text]{padding:3px 6px;font-size:13px;width:90px;font-family:monospace}
.dp-anim-wrap input[type=range]{width:100px}
.dp-anim-wrap .step-counter{font-size:13px;color:#555;min-width:90px}
.dp-anim-wrap .alignment-box{font-family:monospace;font-size:14px;margin-top:6px;line-height:1.4}
</style>
<div class="dp-anim-wrap" id="dpAnimRoot">
  <div style="margin-bottom:8px">
    <label>x: <input type="text" id="dpInpX" value="ACGT"></label>
    <label style="margin-left:8px">y: <input type="text" id="dpInpY" value="AGT"></label>
    <button id="dpBtnReset" style="margin-left:8px">Reset</button>
  </div>
  <div id="dpGrid"></div>
  <div class="controls">
    <button id="dpBtnBack">Back</button>
    <button id="dpBtnNext">Next</button>
    <button id="dpBtnPlay">Play</button>
    <span class="step-counter" id="dpStepInfo">Step 0 / 0</span>
    <label>Speed: <input type="range" id="dpSpeed" min="1" max="10" value="5"></label>
  </div>
  <div class="info-panel" id="dpInfo">Press <b>Next</b> or <b>Play</b> to begin.</div>
</div>
<script>
(function(){
  /* helper to build closing tags without literal </ in the source */
  function endTag(t){return '<'+'/'+t+'>';}
  function esc(s){return s.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;');}

  var root=document.getElementById('dpAnimRoot');
  var gridDiv=document.getElementById('dpGrid');
  var infoDiv=document.getElementById('dpInfo');
  var stepInfo=document.getElementById('dpStepInfo');
  var inpX=document.getElementById('dpInpX');
  var inpY=document.getElementById('dpInpY');
  var btnBack=document.getElementById('dpBtnBack');
  var btnNext=document.getElementById('dpBtnNext');
  var btnPlay=document.getElementById('dpBtnPlay');
  var btnReset=document.getElementById('dpBtnReset');
  var speedSlider=document.getElementById('dpSpeed');

  var steps=[], curStep=-1, D=[], x='', y='', m=0, n=0;
  var playing=false, timer=null;

  function buildSteps(){
    x=inpX.value.toUpperCase(); y=inpY.value.toUpperCase();
    m=x.length; n=y.length;
    D=[];
    for(var i=0;i<=m;i++){D[i]=[];for(var j=0;j<=n;j++) D[i][j]=null;}
    steps=[];
    /* Phase 1: first row */
    for(var j=0;j<=n;j++){
      D[0][j]=j;
      steps.push({type:'init-row',i:0,j:j,val:j,
        msg:'Initialize D[0]['+j+'] = '+j+(j===0?' (both empty strings)':' (insert '+j+' character'+(j>1?'s':'')+')')});
    }
    /* Phase 2: first column (skip 0,0 already done) */
    for(var i=1;i<=m;i++){
      D[i][0]=i;
      steps.push({type:'init-col',i:i,j:0,val:i,
        msg:'Initialize D['+i+'][0] = '+i+' (delete '+i+' character'+(i>1?'s':'')+')'});
    }
    /* Phase 3: fill interior */
    for(var i=1;i<=m;i++){
      for(var j=1;j<=n;j++){
        var delt=(x[i-1]!==y[j-1])?1:0;
        var vDiag=D[i-1][j-1]+delt;
        var vAbove=D[i-1][j]+1;
        var vLeft=D[i][j-1]+1;
        var best=Math.min(vDiag,vAbove,vLeft);
        D[i][j]=best;
        var matchStr=delt===0?'match (cost 0)':'substitution (cost 1)';
        var winners=[];
        if(vDiag===best) winners.push('diagonal');
        if(vAbove===best) winners.push('above');
        if(vLeft===best) winners.push('left');
        steps.push({type:'fill',i:i,j:j,val:best,
          diag:{i:i-1,j:j-1,val:vDiag},
          above:{i:i-1,j:j,val:vAbove},
          left:{i:i,j:j-1,val:vLeft},
          delt:delt,winners:winners,
          msg:'D['+i+']['+j+']: diagonal('+esc(x[i-1])+'/'+esc(y[j-1])+' '+matchStr+')='+vDiag+
              ', above(delete)='+vAbove+', left(insert)='+vLeft+
              '  \u2192 min = <b>'+best+'</b> via '+winners.join(', ')});
      }
    }
    /* Phase 4: answer */
    steps.push({type:'answer',i:m,j:n,val:D[m][n],
      msg:'Edit distance D['+m+']['+n+'] = <b>'+D[m][n]+'</b>. Next: traceback to recover the alignment.'});

    /* Phase 5: traceback */
    var tracePath=[];
    var ti=m, tj=n;
    while(ti>0||tj>0){
      tracePath.push({i:ti,j:tj});
      if(ti>0&&tj>0){
        var delt2=(x[ti-1]!==y[tj-1])?1:0;
        if(D[ti][tj]===D[ti-1][tj-1]+delt2){
          ti--;tj--;continue;
        }
      }
      if(ti>0&&D[ti][tj]===D[ti-1][tj]+1){
        ti--;continue;
      }
      tj--;
    }
    tracePath.push({i:0,j:0});

    /* build alignment strings along the traceback */
    var axArr=[], ayArr=[], midArr=[];
    for(var k=tracePath.length-1;k>0;k--){
      var ci=tracePath[k].i, cj=tracePath[k].j;
      var ni=tracePath[k-1].i, nj=tracePath[k-1].j;
      if(ni===ci+1&&nj===cj+1){
        /* diagonal: match or substitution */
        axArr.push(x[ci]); ayArr.push(y[cj]);
        midArr.push(x[ci]===y[cj]?'|':' ');
      }else if(ni===ci+1&&nj===cj){
        /* vertical: deletion from x */
        axArr.push(x[ci]); ayArr.push('-'); midArr.push(' ');
      }else{
        /* horizontal: insertion into x */
        axArr.push('-'); ayArr.push(y[cj]); midArr.push(' ');
      }
    }

    /* generate one step per traceback cell.
       k moves have been completed after step k (0=start cell, no moves yet).
       axArr has length tracePath.length-1; axArr is in forward order.
       After k moves, the last k alignment columns are revealed (from right end). */
    for(var k=0;k<tracePath.length;k++){
      /* k moves completed: reveal last k columns of alignment (indices axArr.length-k .. end) */
      var alignHtml='';
      if(k>0){
        var startIdx=axArr.length-k;
        var pAx=axArr.slice(startIdx).join('');
        var pMid=midArr.slice(startIdx).join('');
        var pAy=ayArr.slice(startIdx).join('');
        alignHtml='<div class="alignment-box">'+
          'x: '+esc(pAx)+'<br>'+
          '&nbsp;&nbsp; '+pMid.replace(/ /g,'&nbsp;')+'<br>'+
          'y: '+esc(pAy)+'</div>';
      }

      var pi=tracePath[k].i, pj=tracePath[k].j;
      var opMsg='';
      if(k===0){
        opMsg='<b>Traceback</b>: start at D['+pi+']['+pj+'] = '+D[pi][pj];
      }else{
        var prv=tracePath[k-1];
        if(pi===prv.i-1&&pj===prv.j-1){
          if(x[pi]===y[pj]) opMsg='\u2196 Match: '+esc(x[pi])+'='+esc(y[pj]);
          else opMsg='\u2196 Substitute: '+esc(x[pi])+'\u2192'+esc(y[pj]);
        }else if(pi===prv.i-1&&pj===prv.j){
          opMsg='\u2191 Delete '+esc(x[pi])+' from x';
        }else{
          opMsg='\u2190 Insert '+esc(y[pj])+' into x';
        }
        opMsg='<b>Traceback</b> at D['+pi+']['+pj+']: '+opMsg;
      }

      steps.push({type:'trace',idx:k,path:tracePath,
        msg:opMsg+alignHtml});
    }

    /* final traceback summary step */
    steps.push({type:'trace-done',path:tracePath,
      msg:'<b>Traceback complete!</b> Edit distance = <b>'+D[m][n]+'</b>'+
        '<div class="alignment-box">'+
        'x: '+esc(axArr.join(''))+'<br>'+
        '&nbsp;&nbsp; '+midArr.join('').replace(/ /g,'&nbsp;')+'<br>'+
        'y: '+esc(ayArr.join(''))+'</div>'});
  }

  function render(){
    /* determine which cells are filled */
    var filled={};
    var highlight={};
    for(var s=0;s<=curStep&&s<steps.length;s++){
      var st=steps[s];
      if(st.type==='fill'||st.type==='init-row'||st.type==='init-col'||st.type==='answer'){
        filled[st.i+','+st.j]=st.val;
      }
    }
    /* once we reach traceback or beyond, all cells are filled */
    var inTraceback=false;
    if(curStep>=0&&steps[curStep].type&&(steps[curStep].type==='trace'||steps[curStep].type==='trace-done')){
      inTraceback=true;
      for(var i=0;i<=m;i++) for(var j=0;j<=n;j++) filled[i+','+j]=D[i][j];
    }

    /* apply highlights for current step */
    if(curStep>=0&&curStep<steps.length){
      var st=steps[curStep];
      if(st.type==='fill'){
        highlight[st.diag.i+','+st.diag.j]='cell-diag';
        highlight[st.above.i+','+st.above.j]='cell-above';
        highlight[st.left.i+','+st.left.j]='cell-left';
        highlight[st.i+','+st.j]='cell-current';
      }else if(st.type==='answer'){
        highlight[st.i+','+st.j]='cell-answer';
      }else if(st.type==='trace'){
        /* highlight full path visited so far, current cell brighter */
        var path=st.path;
        for(var k=0;k<=st.idx;k++){
          var pk=path[k].i+','+path[k].j;
          highlight[pk]=(k===st.idx)?'cell-trace-cur':'cell-trace';
        }
      }else if(st.type==='trace-done'){
        var path=st.path;
        for(var k=0;k<path.length;k++){
          highlight[path[k].i+','+path[k].j]='cell-trace';
        }
        highlight[path[0].i+','+path[0].j]='cell-answer';
        highlight[path[path.length-1].i+','+path[path.length-1].j]='cell-answer';
      }else{
        highlight[st.i+','+st.j]='cell-current';
      }
    }

    var h='<table><tr><th>'+endTag('th')+'<th>-'+endTag('th');
    for(var j=0;j<n;j++) h+='<th>'+esc(y[j])+endTag('th');
    h+=endTag('tr');
    for(var i=0;i<=m;i++){
      h+='<tr><th>'+(i===0?'-':esc(x[i-1]))+endTag('th');
      for(var j=0;j<=n;j++){
        var key=i+','+j;
        var cls='cell-unfilled';
        if(key in filled){
          cls=highlight[key]||'cell-filled';
        }
        var val=(key in filled)?filled[key]:'';
        h+='<td class="'+cls+'">'+val+endTag('td');
      }
      h+=endTag('tr');
    }
    h+=endTag('table');
    gridDiv.innerHTML=h;

    /* info */
    if(curStep<0){
      infoDiv.innerHTML='Press <b>Next</b> or <b>Play</b> to begin.';
      stepInfo.textContent='Step 0 / '+steps.length;
    }else{
      infoDiv.innerHTML=steps[curStep].msg;
      stepInfo.textContent='Step '+(curStep+1)+' / '+steps.length;
    }
  }

  function init(){
    stopPlay();
    curStep=-1;
    buildSteps();
    render();
  }

  function stepFwd(){
    if(curStep<steps.length-1){curStep++;render();return true;}
    return false;
  }
  function stepBack(){
    if(curStep>=0){curStep--;render();}
  }

  function getDelay(){return 600-(speedSlider.value-1)*55;}

  function startPlay(){
    playing=true;btnPlay.textContent='Pause';
    function tick(){
      if(!playing)return;
      if(!stepFwd()){stopPlay();return;}
      timer=setTimeout(tick,getDelay());
    }
    tick();
  }
  function stopPlay(){playing=false;btnPlay.textContent='Play';if(timer){clearTimeout(timer);timer=null;}}

  btnNext.addEventListener('click',function(){stopPlay();stepFwd();});
  btnBack.addEventListener('click',function(){stopPlay();stepBack();});
  btnPlay.addEventListener('click',function(){if(playing){stopPlay();}else{startPlay();}});
  btnReset.addEventListener('click',init);

  init();
})();
</script>
''')
Step 0 / 0
Press Next or Play to begin.

Part 2: Global Alignment (Needleman-Wunsch)

Why this matters in biology: Global alignment is used whenever you need to compare two sequences end-to-end — for example, aligning orthologous genes across species to study evolutionary divergence, or comparing full-length protein sequences to infer function.

The biology behind substitutions and indels

Transitions vs transversions in DNA

Not all mutations are equally likely:

  • Transitions (purine↔︎purine: A↔︎G, or pyrimidine↔︎pyrimidine: C↔︎T) are chemically more similar changes and occur roughly twice as often as transversions (purine↔︎pyrimidine) in the human genome. In coding regions the ratio is even higher.
  • Insertions and deletions (indels) are less common than substitutions — a single-nucleotide indel occurs approximately every 1,000 bases, and a two-nucleotide deletion about every 3,000 bases (Montgomery et al. 2013).

These biological realities directly inform how we design scoring and penalty functions for alignment algorithms.

Edit distance treats all operations equally (cost = 1). But in biology, not all changes are equally likely — as we saw above, transitions are more common than transversions, and indels are rarer than substitutions.

Global alignment aligns two sequences end-to-end using a penalty/scoring matrix that reflects these biological realities.

The Needleman-Wunsch algorithm (1970) solves this using dynamic programming — it is essentially edit distance with a more sophisticated cost function.

Defining a penalty function

We define a cost function where: - Match = 0 (no penalty) - Transition = 2 - Transversion = 4 - Gap = 8

This is a penalty-based scheme (lower is better), unlike the score-based scheme we will use for local alignment later.

Where do these numbers come from?

Penalty functions are informed by three complementary criteria:

  1. Mutational frequency — penalties inversely related to how often a mutation type occurs in nature (transitions are ~4× more common than expected by chance, so they are penalized less)
  2. Biochemical interchangeability — residues that serve similar roles (e.g., both hydrophobic) incur lower penalties when swapped
  3. Structural impact — substitutions that preserve 3D fold and function cost less than those that disrupt them

For DNA, the 0/2/4/8 scheme reflects the empirical observation that substitutions occur ~1 in 1,000 bases while indels occur ~1 in 3,000 bases (so gaps cost ~3× more), and transitions are more common than transversions.

For protein sequences, the widely used BLOSUM62 matrix encodes log-odds scores for all 20×20 amino acid substitutions, derived from observed substitution patterns in conserved protein blocks. Positive values mean the substitution is more common than expected (benign), while negative values indicate it is rare (disruptive).

def penalty(xc, yc):
    """Penalty function for global alignment.
    Match=0, transition=2, transversion=4, gap=8."""
    if xc == yc: 
        return 0                          # match
    if xc == '-' or yc == '-': 
        return 8                          # gap
    minc, maxc = min(xc, yc), max(xc, yc)
    if (minc == 'A' and maxc == 'G') or (minc == 'C' and maxc == 'T'):
        return 2                          # transition
    return 4                              # transversion

# Demonstrate
print(f"A vs A (match):       {penalty('A', 'A')}")
print(f"A vs G (transition):  {penalty('A', 'G')}")
print(f"A vs C (transversion):{penalty('A', 'C')}")
print(f"A vs - (gap):         {penalty('A', '-')}")
A vs A (match):       0
A vs G (transition):  2
A vs C (transversion):4
A vs - (gap):         8

The Needleman-Wunsch algorithm

The algorithm is almost identical to edit distance DP, but uses our custom penalty function instead of a uniform cost of 1:

\[D[i][j] = \min \begin{cases} D[i-1][j-1] + \text{penalty}(x[i-1], y[j-1]) & \text{(match/mismatch)} \\ D[i-1][j] + \text{penalty}(x[i-1], \texttt{'-'}) & \text{(gap in } y\text{)} \\ D[i][j-1] + \text{penalty}(\texttt{'-'}, y[j-1]) & \text{(gap in } x\text{)} \end{cases}\]

The optimal global alignment score is in \(D[m][n]\).

from IPython.display import HTML

HTML('''
<style>
.nw-anim-wrap{font-family:system-ui,sans-serif;max-width:960px}
.nw-anim-wrap table{border-collapse:collapse;margin:10px 0}
.nw-anim-wrap td,.nw-anim-wrap th{width:38px;height:32px;text-align:center;border:1px solid #999;font-size:13px;padding:2px}
.nw-anim-wrap th{background:#e8e8e8;font-weight:600}
.nw-anim-wrap .cell-unfilled{background:#eee;color:#ccc}
.nw-anim-wrap .cell-filled{background:#fff}
.nw-anim-wrap .cell-current{background:#ffd54f;font-weight:700}
.nw-anim-wrap .cell-diag{background:#90caf9}
.nw-anim-wrap .cell-above{background:#a5d6a7}
.nw-anim-wrap .cell-left{background:#ef9a9a}
.nw-anim-wrap .cell-answer{background:#66bb6a;color:#fff;font-weight:700}
.nw-anim-wrap .cell-trace{background:#ce93d8;font-weight:700}
.nw-anim-wrap .cell-trace-cur{background:#ab47bc;color:#fff;font-weight:700}
.nw-anim-wrap .controls{margin:8px 0;display:flex;align-items:center;gap:8px;flex-wrap:wrap}
.nw-anim-wrap button{padding:4px 14px;font-size:14px;cursor:pointer;border:1px solid #888;border-radius:4px;background:#f5f5f5}
.nw-anim-wrap button:hover{background:#e0e0e0}
.nw-anim-wrap .info-panel{background:#f9f9f9;border:1px solid #ddd;border-radius:6px;padding:10px 14px;margin-top:8px;font-size:13px;line-height:1.6;min-height:60px}
.nw-anim-wrap label{font-size:13px}
.nw-anim-wrap input[type=text]{padding:3px 6px;font-size:13px;width:110px;font-family:monospace}
.nw-anim-wrap input[type=range]{width:100px}
.nw-anim-wrap .step-counter{font-size:13px;color:#555;min-width:90px}
.nw-anim-wrap .alignment-box{font-family:monospace;font-size:14px;margin-top:6px;line-height:1.4}
.nw-anim-wrap .legend{font-size:12px;color:#555;margin-bottom:6px}
.nw-anim-wrap .legend span{display:inline-block;width:14px;height:14px;vertical-align:middle;margin:0 3px 0 8px;border:1px solid #999;border-radius:2px}
</style>
<div class="nw-anim-wrap" id="nwAnimRoot">
  <div style="margin-bottom:8px">
    <label>x: <input type="text" id="nwInpX" value="TACGTCAGC"></label>
    <label style="margin-left:8px">y: <input type="text" id="nwInpY" value="TATGTCATGC"></label>
    <button id="nwBtnReset" style="margin-left:8px">Reset</button>
  </div>
  <div class="legend">Penalties: match=0, transition(A/G, C/T)=2, transversion=4, gap=8</div>
  <div id="nwGrid"></div>
  <div class="controls">
    <button id="nwBtnBack">Back</button>
    <button id="nwBtnNext">Next</button>
    <button id="nwBtnPlay">Play</button>
    <span class="step-counter" id="nwStepInfo">Step 0 / 0</span>
    <label>Speed: <input type="range" id="nwSpeed" min="1" max="10" value="5"></label>
  </div>
  <div class="info-panel" id="nwInfo">Press <b>Next</b> or <b>Play</b> to begin.</div>
</div>
<script>
(function(){
  function endTag(t){return '<'+'/'+t+'>';}
  function esc(s){return s.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;');}

  var gridDiv=document.getElementById('nwGrid');
  var infoDiv=document.getElementById('nwInfo');
  var stepInfo=document.getElementById('nwStepInfo');
  var inpX=document.getElementById('nwInpX');
  var inpY=document.getElementById('nwInpY');
  var btnBack=document.getElementById('nwBtnBack');
  var btnNext=document.getElementById('nwBtnNext');
  var btnPlay=document.getElementById('nwBtnPlay');
  var btnReset=document.getElementById('nwBtnReset');
  var speedSlider=document.getElementById('nwSpeed');

  var steps=[], curStep=-1, D=[], x='', y='', m=0, n=0;
  var playing=false, timer=null;

  /* Needleman-Wunsch penalty function: match=0, transition=2, transversion=4, gap=8 */
  function pen(a,b){
    if(a===b) return 0;
    if(a==='-'||b==='-') return 8;
    var lo=a<b?a:b, hi=a<b?b:a;
    if((lo==='A'&&hi==='G')||(lo==='C'&&hi==='T')) return 2;
    return 4;
  }
  function penLabel(a,b){
    if(a===b) return 'match';
    if(a==='-'||b==='-') return 'gap';
    var lo=a<b?a:b, hi=a<b?b:a;
    if((lo==='A'&&hi==='G')||(lo==='C'&&hi==='T')) return 'transition';
    return 'transversion';
  }

  function buildSteps(){
    x=inpX.value.toUpperCase(); y=inpY.value.toUpperCase();
    m=x.length; n=y.length;
    D=[];
    for(var i=0;i<=m;i++){D[i]=[];for(var j=0;j<=n;j++) D[i][j]=null;}
    steps=[];

    /* Phase 1: first row — cumulative gap penalties */
    D[0][0]=0;
    steps.push({type:'init-row',i:0,j:0,val:0,
      msg:'Initialize D[0][0] = 0 (both empty strings)'});
    for(var j=1;j<=n;j++){
      D[0][j]=D[0][j-1]+pen('-',y[j-1]);
      steps.push({type:'init-row',i:0,j:j,val:D[0][j],
        msg:'Initialize D[0]['+j+'] = D[0]['+(j-1)+'] + gap('+esc(y[j-1])+') = '+D[0][j-1]+' + 8 = '+D[0][j]});
    }
    /* Phase 2: first column */
    for(var i=1;i<=m;i++){
      D[i][0]=D[i-1][0]+pen(x[i-1],'-');
      steps.push({type:'init-col',i:i,j:0,val:D[i][0],
        msg:'Initialize D['+i+'][0] = D['+(i-1)+'][0] + gap('+esc(x[i-1])+') = '+D[i-1][0]+' + 8 = '+D[i][0]});
    }
    /* Phase 3: fill interior */
    for(var i=1;i<=m;i++){
      for(var j=1;j<=n;j++){
        var pDiag=pen(x[i-1],y[j-1]);
        var pAbove=pen(x[i-1],'-');
        var pLeft=pen('-',y[j-1]);
        var vDiag=D[i-1][j-1]+pDiag;
        var vAbove=D[i-1][j]+pAbove;
        var vLeft=D[i][j-1]+pLeft;
        var best=Math.min(vDiag,vAbove,vLeft);
        D[i][j]=best;
        var diagDesc=esc(x[i-1])+'/'+esc(y[j-1])+' '+penLabel(x[i-1],y[j-1])+'='+pDiag;
        var winners=[];
        if(vDiag===best) winners.push('diagonal');
        if(vAbove===best) winners.push('above');
        if(vLeft===best) winners.push('left');
        steps.push({type:'fill',i:i,j:j,val:best,
          diag:{i:i-1,j:j-1,val:vDiag},
          above:{i:i-1,j:j,val:vAbove},
          left:{i:i,j:j-1,val:vLeft},
          msg:'D['+i+']['+j+']: diag('+diagDesc+')='+vDiag+
              ', above(gap)='+vAbove+', left(gap)='+vLeft+
              ' \u2192 min = <b>'+best+'</b> via '+winners.join(', ')});
      }
    }
    /* Phase 4: answer */
    steps.push({type:'answer',i:m,j:n,val:D[m][n],
      msg:'Global alignment penalty D['+m+']['+n+'] = <b>'+D[m][n]+'</b>. Next: traceback to recover the alignment.'});

    /* Phase 5: traceback */
    var tracePath=[];
    var ti=m, tj=n;
    while(ti>0||tj>0){
      tracePath.push({i:ti,j:tj});
      if(ti>0&&tj>0){
        if(D[ti][tj]===D[ti-1][tj-1]+pen(x[ti-1],y[tj-1])){
          ti--;tj--;continue;
        }
      }
      if(ti>0&&D[ti][tj]===D[ti-1][tj]+pen(x[ti-1],'-')){
        ti--;continue;
      }
      tj--;
    }
    tracePath.push({i:0,j:0});

    /* build alignment strings */
    var axArr=[], ayArr=[], midArr=[];
    for(var k=tracePath.length-1;k>0;k--){
      var ci=tracePath[k].i, cj=tracePath[k].j;
      var ni=tracePath[k-1].i, nj=tracePath[k-1].j;
      if(ni===ci+1&&nj===cj+1){
        axArr.push(x[ci]); ayArr.push(y[cj]);
        midArr.push(x[ci]===y[cj]?'|':' ');
      }else if(ni===ci+1&&nj===cj){
        axArr.push(x[ci]); ayArr.push('-'); midArr.push(' ');
      }else{
        axArr.push('-'); ayArr.push(y[cj]); midArr.push(' ');
      }
    }

    /* traceback steps */
    for(var k=0;k<tracePath.length;k++){
      var alignHtml='';
      if(k>0){
        var startIdx=axArr.length-k;
        var pAx=axArr.slice(startIdx).join('');
        var pMid=midArr.slice(startIdx).join('');
        var pAy=ayArr.slice(startIdx).join('');
        alignHtml='<div class="alignment-box">'+
          'x: '+esc(pAx)+'<br>'+
          '&nbsp;&nbsp; '+pMid.replace(/ /g,'&nbsp;')+'<br>'+
          'y: '+esc(pAy)+'</div>';
      }
      var pi=tracePath[k].i, pj=tracePath[k].j;
      var opMsg='';
      if(k===0){
        opMsg='<b>Traceback</b>: start at D['+pi+']['+pj+'] = '+D[pi][pj];
      }else{
        var prv=tracePath[k-1];
        if(pi===prv.i-1&&pj===prv.j-1){
          var lbl=penLabel(x[pi],y[pj]);
          if(lbl==='match') opMsg='\u2196 Match: '+esc(x[pi])+'='+esc(y[pj]);
          else opMsg='\u2196 '+lbl.charAt(0).toUpperCase()+lbl.slice(1)+': '+esc(x[pi])+'\u2192'+esc(y[pj])+' (cost '+pen(x[pi],y[pj])+')';
        }else if(pi===prv.i-1&&pj===prv.j){
          opMsg='\u2191 Gap in y: delete '+esc(x[pi])+' (cost 8)';
        }else{
          opMsg='\u2190 Gap in x: insert '+esc(y[pj])+' (cost 8)';
        }
        opMsg='<b>Traceback</b> at D['+pi+']['+pj+']: '+opMsg;
      }
      steps.push({type:'trace',idx:k,path:tracePath,
        msg:opMsg+alignHtml});
    }

    /* final summary */
    steps.push({type:'trace-done',path:tracePath,
      msg:'<b>Traceback complete!</b> Alignment penalty = <b>'+D[m][n]+'</b>'+
        '<div class="alignment-box">'+
        'x: '+esc(axArr.join(''))+'<br>'+
        '&nbsp;&nbsp; '+midArr.join('').replace(/ /g,'&nbsp;')+'<br>'+
        'y: '+esc(ayArr.join(''))+'</div>'});
  }

  function render(){
    var filled={};
    var highlight={};
    for(var s=0;s<=curStep&&s<steps.length;s++){
      var st=steps[s];
      if(st.type==='fill'||st.type==='init-row'||st.type==='init-col'||st.type==='answer'){
        filled[st.i+','+st.j]=st.val;
      }
    }
    if(curStep>=0&&steps[curStep].type&&(steps[curStep].type==='trace'||steps[curStep].type==='trace-done')){
      for(var i=0;i<=m;i++) for(var j=0;j<=n;j++) filled[i+','+j]=D[i][j];
    }

    if(curStep>=0&&curStep<steps.length){
      var st=steps[curStep];
      if(st.type==='fill'){
        highlight[st.diag.i+','+st.diag.j]='cell-diag';
        highlight[st.above.i+','+st.above.j]='cell-above';
        highlight[st.left.i+','+st.left.j]='cell-left';
        highlight[st.i+','+st.j]='cell-current';
      }else if(st.type==='answer'){
        highlight[st.i+','+st.j]='cell-answer';
      }else if(st.type==='trace'){
        var path=st.path;
        for(var k=0;k<=st.idx;k++){
          var pk=path[k].i+','+path[k].j;
          highlight[pk]=(k===st.idx)?'cell-trace-cur':'cell-trace';
        }
      }else if(st.type==='trace-done'){
        var path=st.path;
        for(var k=0;k<path.length;k++){
          highlight[path[k].i+','+path[k].j]='cell-trace';
        }
        highlight[path[0].i+','+path[0].j]='cell-answer';
        highlight[path[path.length-1].i+','+path[path.length-1].j]='cell-answer';
      }else{
        highlight[st.i+','+st.j]='cell-current';
      }
    }

    var h='<table><tr><th>'+endTag('th')+'<th>-'+endTag('th');
    for(var j=0;j<n;j++) h+='<th>'+esc(y[j])+endTag('th');
    h+=endTag('tr');
    for(var i=0;i<=m;i++){
      h+='<tr><th>'+(i===0?'-':esc(x[i-1]))+endTag('th');
      for(var j=0;j<=n;j++){
        var key=i+','+j;
        var cls='cell-unfilled';
        if(key in filled){
          cls=highlight[key]||'cell-filled';
        }
        var val=(key in filled)?filled[key]:'';
        h+='<td class="'+cls+'">'+val+endTag('td');
      }
      h+=endTag('tr');
    }
    h+=endTag('table');
    gridDiv.innerHTML=h;

    if(curStep<0){
      infoDiv.innerHTML='Press <b>Next</b> or <b>Play</b> to begin.';
      stepInfo.textContent='Step 0 / '+steps.length;
    }else{
      infoDiv.innerHTML=steps[curStep].msg;
      stepInfo.textContent='Step '+(curStep+1)+' / '+steps.length;
    }
  }

  function init(){
    stopPlay();
    curStep=-1;
    buildSteps();
    render();
  }
  function stepFwd(){
    if(curStep<steps.length-1){curStep++;render();return true;}
    return false;
  }
  function stepBack(){
    if(curStep>=0){curStep--;render();}
  }
  function getDelay(){return 600-(speedSlider.value-1)*55;}
  function startPlay(){
    playing=true;btnPlay.textContent='Pause';
    function tick(){
      if(!playing)return;
      if(!stepFwd()){stopPlay();return;}
      timer=setTimeout(tick,getDelay());
    }
    tick();
  }
  function stopPlay(){playing=false;btnPlay.textContent='Play';if(timer){clearTimeout(timer);timer=null;}}

  btnNext.addEventListener('click',function(){stopPlay();stepFwd();});
  btnBack.addEventListener('click',function(){stopPlay();stepBack();});
  btnPlay.addEventListener('click',function(){if(playing){stopPlay();}else{startPlay();}});
  btnReset.addEventListener('click',init);

  init();
})();
</script>
''')
Penalties: match=0, transition(A/G, C/T)=2, transversion=4, gap=8
Step 0 / 0
Press Next or Play to begin.
def global_alignment(x, y, s):
    """Needleman-Wunsch global alignment.
    
    Args:
        x, y: sequences to align
        s: penalty function s(char1, char2) -> cost
    
    Returns:
        D: the filled DP matrix
        score: the optimal alignment score (bottom-right cell)
    """
    D = np.zeros((len(x) + 1, len(y) + 1), dtype=int)
    
    # Initialize first row and column with gap penalties
    for j in range(1, len(y) + 1):
        D[0, j] = D[0, j-1] + s('-', y[j-1])
    for i in range(1, len(x) + 1):
        D[i, 0] = D[i-1, 0] + s(x[i-1], '-')
    
    # Fill the matrix
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i, j] = min(
                D[i-1, j-1] + s(x[i-1], y[j-1]),  # diagonal
                D[i-1, j]   + s(x[i-1], '-'),      # vertical (gap in y)
                D[i, j-1]   + s('-', y[j-1])       # horizontal (gap in x)
            )
    
    return D, D[len(x), len(y)]
import time

x = 'TACGTCAGC'
y = 'TATGTCATGC'

t0 = time.time()
D, score = global_alignment(x, y, penalty)
elapsed = time.time() - t0
print(f"Global alignment penalty for '{x}' vs '{y}': {score}")
print(f"The penalty of {score} represents the best possible end-to-end alignment.")
print(f"The low penalty is driven by the many matches — explore the interactive matrix below.")
print(f"Time: {elapsed:.6f}s")
Global alignment penalty for 'TACGTCAGC' vs 'TATGTCATGC': 10
The penalty of 10 represents the best possible end-to-end alignment.
The low penalty is driven by the many matches — explore the interactive matrix below.
Time: 0.000163s

Traceback: recovering the alignment

The DP matrix tells us the score, but not the actual alignment. To recover the alignment, we trace back from \(D[m][n]\) to \(D[0][0]\), following the path that produced each cell’s value.

import time

def global_alignment_traceback(D, x, y, s):
    """Traceback through a global alignment DP matrix to recover the alignment."""
    i, j = len(x), len(y)
    aligned_x, aligned_y, midline = [], [], []
    
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i, j] == D[i-1, j-1] + s(x[i-1], y[j-1]):
            # Diagonal: match or substitution
            aligned_x.append(x[i-1])
            aligned_y.append(y[j-1])
            midline.append('|' if x[i-1] == y[j-1] else ' ')
            i -= 1; j -= 1
        elif i > 0 and D[i, j] == D[i-1, j] + s(x[i-1], '-'):
            # Vertical: gap in y
            aligned_x.append(x[i-1])
            aligned_y.append('-')
            midline.append(' ')
            i -= 1
        else:
            # Horizontal: gap in x
            aligned_x.append('-')
            aligned_y.append(y[j-1])
            midline.append(' ')
            j -= 1
    
    # Reverse (we traced back from end to start)
    aligned_x = ''.join(reversed(aligned_x))
    aligned_y = ''.join(reversed(aligned_y))
    midline   = ''.join(reversed(midline))
    
    return aligned_x, midline, aligned_y

t0 = time.time()
ax, mid, ay = global_alignment_traceback(D, x, y, penalty)
elapsed = time.time() - t0
print(f"  x: {ax}")
print(f"     {mid}")
print(f"  y: {ay}")
print(f"Time: {elapsed:.6f}s")
  x: TACGTCA-GC
     || |||| ||
  y: TATGTCATGC
Time: 0.000065s

The | symbols mark matching positions. Gaps (-) represent insertions or deletions.

Try it: a longer example

import time

t0 = time.time()
seq1 = 'ATAGACGACATACAGACAGCATACAGACAGCATACAGA'
seq2 = 'TTTAGCATGCGCATATCAGCAATACAGCAGATACG'

D, score = global_alignment(seq1, seq2, penalty)
ax, mid, ay = global_alignment_traceback(D, seq1, seq2, penalty)

print(f"Penalty: {score}\n")
print(f"  x: {ax}")
print(f"     {mid}")
print(f"  y: {ay}")
print(f"\nTime: {time.time()-t0:.6f}s")
Penalty: 74

  x: ATAGACGACATACAGACAGCATACAGACAGCATACAGA
      |  | | ||| |  | | ||   | |||||| | |  
  y: -TTTA-G-CATGCGCATATCAGCAATACAGCAGATACG

Time: 0.001854s

Part 3: Local Alignment (Smith-Waterman)

Why this matters in biology: Local alignment is what BLAST does — it finds the best matching region between two sequences, ignoring poorly matching flanks. This is essential for finding conserved domains in proteins, identifying homologous regions between divergent genomes, and mapping reads that only partially overlap a reference.

Global alignment forces the entire length of both sequences to participate in the alignment. But what if we have two long sequences that share only a short region of high similarity?

Local alignment finds the best matching subsequences within two longer sequences. The Smith-Waterman algorithm (1981) is the standard method.

Key differences from global alignment

Feature Global (Needleman-Wunsch) Local (Smith-Waterman)
Aligns Entire sequences end-to-end Best matching subregions
Scoring Penalties (lower is better) Rewards (higher is better)
Matrix init First row/column = gap costs First row/column = 0
Cell minimum Can be negative Clamped to 0 (reset)
Answer location Bottom-right cell Maximum value anywhere in matrix
Traceback From bottom-right to top-left From max cell until reaching 0

Scoring function

For local alignment, we use a reward-based scoring scheme (higher is better): - Match = +2 (reward) - Mismatch = -4 (penalty) - Gap = -6 (penalty)

The positive match reward is critical — it allows good regions to accumulate a high score.

Why must matches be positive and edits negative?

The zero floor in Smith-Waterman only works if: - Matches push the score above zero (positive bonus) - Mismatches and gaps push it below zero (negative penalty)

If matches were not positive, all cells would equal 0 — the floor would always dominate. If edits were not negative, the floor would never be reached and the algorithm would degenerate to global alignment. The zeros in the matrix act as a “background level” that allows peaks of similarity to rise above.

Note on duality: Penalty-based (minimize) and reward-based (maximize) scoring produce identical tracebacks for global alignment — they are equivalent formulations related by a sign flip plus constant. However, local alignment requires the reward-based formulation so the zero-floor reset is meaningful.

def local_score(xc, yc):
    """Scoring function for local alignment.
    Match=+2, mismatch=-4, gap=-6."""
    if xc == yc:
        return 2    # match reward
    if xc == '-' or yc == '-':
        return -6   # gap penalty
    return -4       # mismatch penalty

The Smith-Waterman algorithm

The recurrence is:

\[V[i][j] = \max \begin{cases} V[i-1][j-1] + s(x[i-1], y[j-1]) & \text{(diagonal)} \\ V[i-1][j] + s(x[i-1], \texttt{'-'}) & \text{(vertical)} \\ V[i][j-1] + s(\texttt{'-'}, y[j-1]) & \text{(horizontal)} \\ 0 & \text{(reset — start a new alignment)} \end{cases}\]

The 0 option is the key innovation: it allows the algorithm to abandon a poor alignment and start fresh, thereby finding local regions of high similarity.

from IPython.display import HTML

HTML('''
<style>
.sw-anim-wrap{font-family:system-ui,sans-serif;max-width:960px}
.sw-anim-wrap table{border-collapse:collapse;margin:10px 0}
.sw-anim-wrap td,.sw-anim-wrap th{width:34px;height:30px;text-align:center;border:1px solid #999;font-size:12px;padding:1px}
.sw-anim-wrap th{background:#e8e8e8;font-weight:600}
.sw-anim-wrap .cell-unfilled{background:#eee;color:#ccc}
.sw-anim-wrap .cell-filled{background:#fff}
.sw-anim-wrap .cell-zero{background:#f5f5f5;color:#aaa}
.sw-anim-wrap .cell-current{background:#ffd54f;font-weight:700}
.sw-anim-wrap .cell-diag{background:#90caf9}
.sw-anim-wrap .cell-above{background:#a5d6a7}
.sw-anim-wrap .cell-left{background:#ef9a9a}
.sw-anim-wrap .cell-answer{background:#66bb6a;color:#fff;font-weight:700}
.sw-anim-wrap .cell-trace{background:#ce93d8;font-weight:700}
.sw-anim-wrap .cell-trace-cur{background:#ab47bc;color:#fff;font-weight:700}
.sw-anim-wrap .controls{margin:8px 0;display:flex;align-items:center;gap:8px;flex-wrap:wrap}
.sw-anim-wrap button{padding:4px 14px;font-size:14px;cursor:pointer;border:1px solid #888;border-radius:4px;background:#f5f5f5}
.sw-anim-wrap button:hover{background:#e0e0e0}
.sw-anim-wrap .info-panel{background:#f9f9f9;border:1px solid #ddd;border-radius:6px;padding:10px 14px;margin-top:8px;font-size:13px;line-height:1.6;min-height:60px}
.sw-anim-wrap label{font-size:13px}
.sw-anim-wrap input[type=text]{padding:3px 6px;font-size:13px;width:130px;font-family:monospace}
.sw-anim-wrap input[type=range]{width:100px}
.sw-anim-wrap .step-counter{font-size:13px;color:#555;min-width:90px}
.sw-anim-wrap .alignment-box{font-family:monospace;font-size:14px;margin-top:6px;line-height:1.4}
.sw-anim-wrap .legend{font-size:12px;color:#555;margin-bottom:6px}
</style>
<div class="sw-anim-wrap" id="swAnimRoot">
  <div style="margin-bottom:8px">
    <label>x: <input type="text" id="swInpX" value="GGTATGCTGGCGCTA"></label>
    <label style="margin-left:8px">y: <input type="text" id="swInpY" value="TATATGCGGCGTTT"></label>
    <button id="swBtnReset" style="margin-left:8px">Reset</button>
  </div>
  <div class="legend">Scores: match=+2, mismatch=-4, gap=-6 | Cell minimum clamped to 0</div>
  <div id="swGrid"></div>
  <div class="controls">
    <button id="swBtnBack">Back</button>
    <button id="swBtnNext">Next</button>
    <button id="swBtnPlay">Play</button>
    <span class="step-counter" id="swStepInfo">Step 0 / 0</span>
    <label>Speed: <input type="range" id="swSpeed" min="1" max="10" value="5"></label>
  </div>
  <div class="info-panel" id="swInfo">Press <b>Next</b> or <b>Play</b> to begin.</div>
</div>
<script>
(function(){
  function endTag(t){return '<'+'/'+t+'>';}
  function esc(s){return s.replace(/&/g,'&amp;').replace(/</g,'&lt;').replace(/>/g,'&gt;');}

  var gridDiv=document.getElementById('swGrid');
  var infoDiv=document.getElementById('swInfo');
  var stepInfo=document.getElementById('swStepInfo');
  var inpX=document.getElementById('swInpX');
  var inpY=document.getElementById('swInpY');
  var btnBack=document.getElementById('swBtnBack');
  var btnNext=document.getElementById('swBtnNext');
  var btnPlay=document.getElementById('swBtnPlay');
  var btnReset=document.getElementById('swBtnReset');
  var speedSlider=document.getElementById('swSpeed');

  var steps=[], curStep=-1, V=[], x='', y='', m=0, n=0;
  var playing=false, timer=null;

  /* Smith-Waterman scoring: match=+2, mismatch=-4, gap=-6 */
  function sc(a,b){
    if(a===b) return 2;
    if(a==='-'||b==='-') return -6;
    return -4;
  }

  function buildSteps(){
    x=inpX.value.toUpperCase(); y=inpY.value.toUpperCase();
    m=x.length; n=y.length;
    V=[];
    for(var i=0;i<=m;i++){V[i]=[];for(var j=0;j<=n;j++) V[i][j]=0;}
    steps=[];

    /* Phase 1: first row — all zeros */
    for(var j=0;j<=n;j++){
      steps.push({type:'init-row',i:0,j:j,val:0,
        msg:'Initialize V[0]['+j+'] = 0'+(j===0?' (local alignment: first row and column are all 0)':'')});
    }
    /* Phase 2: first column (skip 0,0) */
    for(var i=1;i<=m;i++){
      steps.push({type:'init-col',i:i,j:0,val:0,
        msg:'Initialize V['+i+'][0] = 0'});
    }

    /* Phase 3: fill interior */
    var maxVal=0, maxI=0, maxJ=0;
    for(var i=1;i<=m;i++){
      for(var j=1;j<=n;j++){
        var sDiag=sc(x[i-1],y[j-1]);
        var vDiag=V[i-1][j-1]+sDiag;
        var vAbove=V[i-1][j]+sc(x[i-1],'-');
        var vLeft=V[i][j-1]+sc('-',y[j-1]);
        var best=Math.max(vDiag,vAbove,vLeft,0);
        V[i][j]=best;
        if(best>maxVal){maxVal=best;maxI=i;maxJ=j;}

        var diagLabel=esc(x[i-1])+'/'+esc(y[j-1])+(x[i-1]===y[j-1]?' match':' mismatch');
        var winners=[];
        if(best===0&&vDiag<0&&vAbove<0&&vLeft<0) winners.push('reset (0)');
        else{
          if(vDiag===best) winners.push('diagonal');
          if(vAbove===best) winners.push('above');
          if(vLeft===best) winners.push('left');
          if(best===0) winners.push('reset (0)');
        }
        var cands='diag('+diagLabel+'='+vDiag+'), above(gap='+vAbove+'), left(gap='+vLeft+'), reset=0';
        steps.push({type:'fill',i:i,j:j,val:best,
          diag:{i:i-1,j:j-1,val:vDiag},
          above:{i:i-1,j:j,val:vAbove},
          left:{i:i,j:j-1,val:vLeft},
          msg:'V['+i+']['+j+']: '+cands+' \u2192 max = <b>'+best+'</b> via '+winners.join(', ')});
      }
    }

    /* Phase 4: highlight answer — max cell */
    steps.push({type:'answer',i:maxI,j:maxJ,val:maxVal,
      msg:'Best local alignment score: <b>'+maxVal+'</b> at V['+maxI+']['+maxJ+']. Next: traceback until we reach 0.'});

    /* Phase 5: traceback from max cell until V=0 */
    var tracePath=[];
    var ti=maxI, tj=maxJ;
    while(ti>0&&tj>0&&V[ti][tj]!==0){
      tracePath.push({i:ti,j:tj});
      var dg=V[ti-1][tj-1]+sc(x[ti-1],y[tj-1]);
      var ab=V[ti-1][tj]+sc(x[ti-1],'-');
      var lt=V[ti][tj-1]+sc('-',y[tj-1]);
      if(dg>=ab&&dg>=lt){
        ti--;tj--;
      }else if(ab>=lt){
        ti--;
      }else{
        tj--;
      }
    }
    tracePath.push({i:ti,j:tj});

    /* build alignment strings */
    var axArr=[], ayArr=[], midArr=[];
    for(var k=tracePath.length-1;k>0;k--){
      var ci=tracePath[k].i, cj=tracePath[k].j;
      var ni=tracePath[k-1].i, nj=tracePath[k-1].j;
      if(ni===ci+1&&nj===cj+1){
        axArr.push(x[ci]); ayArr.push(y[cj]);
        midArr.push(x[ci]===y[cj]?'|':' ');
      }else if(ni===ci+1&&nj===cj){
        axArr.push(x[ci]); ayArr.push('-'); midArr.push(' ');
      }else{
        axArr.push('-'); ayArr.push(y[cj]); midArr.push(' ');
      }
    }

    /* traceback steps */
    for(var k=0;k<tracePath.length;k++){
      var alignHtml='';
      if(k>0){
        var startIdx=axArr.length-k;
        var pAx=axArr.slice(startIdx).join('');
        var pMid=midArr.slice(startIdx).join('');
        var pAy=ayArr.slice(startIdx).join('');
        alignHtml='<div class="alignment-box">'+
          'x: '+esc(pAx)+'<br>'+
          '&nbsp;&nbsp; '+pMid.replace(/ /g,'&nbsp;')+'<br>'+
          'y: '+esc(pAy)+'</div>';
      }
      var pi=tracePath[k].i, pj=tracePath[k].j;
      var opMsg='';
      if(k===0){
        opMsg='<b>Traceback</b>: start at max cell V['+pi+']['+pj+'] = '+V[pi][pj];
      }else{
        var prv=tracePath[k-1];
        if(pi===prv.i-1&&pj===prv.j-1){
          if(x[pi]===y[pj]) opMsg='\u2196 Match: '+esc(x[pi])+'='+esc(y[pj])+' (+2)';
          else opMsg='\u2196 Mismatch: '+esc(x[pi])+'/'+esc(y[pj])+' (-4)';
        }else if(pi===prv.i-1&&pj===prv.j){
          opMsg='\u2191 Gap in y: '+esc(x[pi])+'/- (-6)';
        }else{
          opMsg='\u2190 Gap in x: -/'+esc(y[pj])+' (-6)';
        }
        if(V[pi][pj]===0) opMsg+='. Reached 0 \u2014 traceback ends.';
        opMsg='<b>Traceback</b> at V['+pi+']['+pj+'] = '+V[pi][pj]+': '+opMsg;
      }
      steps.push({type:'trace',idx:k,path:tracePath,
        msg:opMsg+alignHtml});
    }

    /* final summary */
    steps.push({type:'trace-done',path:tracePath,maxI:maxI,maxJ:maxJ,
      msg:'<b>Traceback complete!</b> Best local alignment score = <b>'+maxVal+'</b>'+
        '<div class="alignment-box">'+
        'x: '+esc(axArr.join(''))+'<br>'+
        '&nbsp;&nbsp; '+midArr.join('').replace(/ /g,'&nbsp;')+'<br>'+
        'y: '+esc(ayArr.join(''))+'</div>'});
  }

  function render(){
    var filled={};
    var highlight={};
    for(var s=0;s<=curStep&&s<steps.length;s++){
      var st=steps[s];
      if(st.type==='fill'||st.type==='init-row'||st.type==='init-col'||st.type==='answer'){
        filled[st.i+','+st.j]=st.val;
      }
    }
    if(curStep>=0&&steps[curStep].type&&(steps[curStep].type==='trace'||steps[curStep].type==='trace-done')){
      for(var i=0;i<=m;i++) for(var j=0;j<=n;j++) filled[i+','+j]=V[i][j];
    }

    if(curStep>=0&&curStep<steps.length){
      var st=steps[curStep];
      if(st.type==='fill'){
        highlight[st.diag.i+','+st.diag.j]='cell-diag';
        highlight[st.above.i+','+st.above.j]='cell-above';
        highlight[st.left.i+','+st.left.j]='cell-left';
        highlight[st.i+','+st.j]='cell-current';
      }else if(st.type==='answer'){
        highlight[st.i+','+st.j]='cell-answer';
      }else if(st.type==='trace'){
        var path=st.path;
        for(var k=0;k<=st.idx;k++){
          var pk=path[k].i+','+path[k].j;
          highlight[pk]=(k===st.idx)?'cell-trace-cur':'cell-trace';
        }
      }else if(st.type==='trace-done'){
        var path=st.path;
        for(var k=0;k<path.length;k++){
          highlight[path[k].i+','+path[k].j]='cell-trace';
        }
        highlight[path[0].i+','+path[0].j]='cell-answer';
        highlight[path[path.length-1].i+','+path[path.length-1].j]='cell-answer';
      }else{
        highlight[st.i+','+st.j]='cell-current';
      }
    }

    var h='<table><tr><th>'+endTag('th')+'<th>-'+endTag('th');
    for(var j=0;j<n;j++) h+='<th>'+esc(y[j])+endTag('th');
    h+=endTag('tr');
    for(var i=0;i<=m;i++){
      h+='<tr><th>'+(i===0?'-':esc(x[i-1]))+endTag('th');
      for(var j=0;j<=n;j++){
        var key=i+','+j;
        var cls='cell-unfilled';
        if(key in filled){
          var val=filled[key];
          cls=highlight[key]||(val===0?'cell-zero':'cell-filled');
        }
        var val=(key in filled)?filled[key]:'';
        h+='<td class="'+cls+'">'+val+endTag('td');
      }
      h+=endTag('tr');
    }
    h+=endTag('table');
    gridDiv.innerHTML=h;

    if(curStep<0){
      infoDiv.innerHTML='Press <b>Next</b> or <b>Play</b> to begin.';
      stepInfo.textContent='Step 0 / '+steps.length;
    }else{
      infoDiv.innerHTML=steps[curStep].msg;
      stepInfo.textContent='Step '+(curStep+1)+' / '+steps.length;
    }
  }

  function init(){
    stopPlay();
    curStep=-1;
    buildSteps();
    render();
  }
  function stepFwd(){
    if(curStep<steps.length-1){curStep++;render();return true;}
    return false;
  }
  function stepBack(){
    if(curStep>=0){curStep--;render();}
  }
  function getDelay(){return 600-(speedSlider.value-1)*55;}
  function startPlay(){
    playing=true;btnPlay.textContent='Pause';
    function tick(){
      if(!playing)return;
      if(!stepFwd()){stopPlay();return;}
      timer=setTimeout(tick,getDelay());
    }
    tick();
  }
  function stopPlay(){playing=false;btnPlay.textContent='Play';if(timer){clearTimeout(timer);timer=null;}}

  btnNext.addEventListener('click',function(){stopPlay();stepFwd();});
  btnBack.addEventListener('click',function(){stopPlay();stepBack();});
  btnPlay.addEventListener('click',function(){if(playing){stopPlay();}else{startPlay();}});
  btnReset.addEventListener('click',init);

  init();
})();
</script>
''')
Scores: match=+2, mismatch=-4, gap=-6 | Cell minimum clamped to 0
Step 0 / 0
Press Next or Play to begin.
def smith_waterman(x, y, s):
    """Smith-Waterman local alignment.
    
    Args:
        x, y: sequences to align
        s: scoring function s(char1, char2) -> score
    
    Returns:
        V: the filled DP matrix
        max_score: the best local alignment score
    """
    V = np.zeros((len(x) + 1, len(y) + 1), dtype=int)
    
    # Note: first row and column stay 0 (no gap initialization)
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            V[i, j] = max(
                V[i-1, j-1] + s(x[i-1], y[j-1]),  # diagonal
                V[i-1, j]   + s(x[i-1], '-'),      # vertical
                V[i, j-1]   + s('-', y[j-1]),       # horizontal
                0                                   # reset
            )
    
    max_score = int(V.max())
    return V, max_score
import time

x = 'GGTATGCTGGCGCTA'
y = 'TATATGCGGCGTTT'

t0 = time.time()
V, max_score = smith_waterman(x, y, local_score)
elapsed = time.time() - t0
print(f"Best local alignment score: {max_score}")
print("The zeros scattered throughout the matrix are positions where the alignment 'restarted.'")
print("The maximum value indicates the best local alignment — explore the interactive matrix below.")
print(f"Time: {elapsed:.6f}s")
Best local alignment score: 12
The zeros scattered throughout the matrix are positions where the alignment 'restarted.'
The maximum value indicates the best local alignment — explore the interactive matrix below.
Time: 0.000152s

Traceback for local alignment

We start at the cell with the maximum score and trace back until we hit a cell with value 0.

def sw_traceback(V, x, y, s):
    """Traceback through a Smith-Waterman matrix to recover the local alignment."""
    # Find the position of the maximum score
    i, j = np.unravel_index(np.argmax(V), V.shape)
    
    aligned_x, aligned_y, midline = [], [], []
    
    # Trace back until we hit 0
    while i > 0 and j > 0 and V[i, j] != 0:
        diag = V[i-1, j-1] + s(x[i-1], y[j-1]) if (i > 0 and j > 0) else -1
        vert = V[i-1, j]   + s(x[i-1], '-')     if i > 0 else -1
        horz = V[i, j-1]   + s('-', y[j-1])     if j > 0 else -1
        
        if diag >= vert and diag >= horz:
            match = x[i-1] == y[j-1]
            midline.append('|' if match else ' ')
            aligned_x.append(x[i-1])
            aligned_y.append(y[j-1])
            i -= 1; j -= 1
        elif vert >= horz:
            aligned_x.append(x[i-1])
            aligned_y.append('-')
            midline.append(' ')
            i -= 1
        else:
            aligned_x.append('-')
            aligned_y.append(y[j-1])
            midline.append(' ')
            j -= 1
    
    aligned_x = ''.join(reversed(aligned_x))
    aligned_y = ''.join(reversed(aligned_y))
    midline   = ''.join(reversed(midline))
    
    return aligned_x, midline, aligned_y
import time

t0 = time.time()
ax, mid, ay = sw_traceback(V, x, y, local_score)
elapsed = time.time() - t0

print(f"Best local alignment (score = {max_score}):\n")
print(f"  x: {ax}")
print(f"     {mid}")
print(f"  y: {ay}")
print(f"Time: {elapsed:.6f}s")
Best local alignment (score = 12):

  x: TATGCTGGCG
     ||||| ||||
  y: TATGC-GGCG
Time: 0.000050s

The local alignment picks out just the high-similarity region, ignoring the poorly-matching flanks.

Let’s see how global and local alignment handle the same pair of sequences.

import time

t0 = time.time()
seq_x = 'AACCTTGGCTAACTGG'
seq_y = 'ACACTGTGA'

# --- Global alignment ---
D_g, score_g = global_alignment(seq_x, seq_y, penalty)
ax_g, mid_g, ay_g = global_alignment_traceback(D_g, seq_x, seq_y, penalty)

print("=== Global Alignment (Needleman-Wunsch) ===")
print(f"Penalty: {score_g}\n")
print(f"  x: {ax_g}")
print(f"     {mid_g}")
print(f"  y: {ay_g}")

print()

# --- Local alignment ---
V_l, score_l = smith_waterman(seq_x, seq_y, local_score)
ax_l, mid_l, ay_l = sw_traceback(V_l, seq_x, seq_y, local_score)

print("=== Local Alignment (Smith-Waterman) ===")
print(f"Score: {score_l}\n")
print(f"  x: {ax_l}")
print(f"     {mid_l}")
print(f"  y: {ay_l}")
print(f"\nTime: {time.time()-t0:.6f}s")
=== Global Alignment (Needleman-Wunsch) ===
Penalty: 62

  x: AACCTTGGCTAACTGG
      | |    ||   || 
  y: -A-C---ACT-G-TGA

=== Local Alignment (Smith-Waterman) ===
Score: 8

  x: ACTG
     ||||
  y: ACTG

Time: 0.000346s

Notice the difference: - Global alignment forces both entire sequences to align, introducing gaps to accommodate the length difference - Local alignment finds the best-matching subregion and ignores the rest

Practical considerations and extensions

As you apply these algorithms to real biological data, several practical issues arise:

  • Complexity: Both Needleman-Wunsch and Smith-Waterman run in \(O(mn)\) time and space. For two 1,000-bp sequences that’s fine, but genome-scale comparisons (millions of bases) require faster approaches.

  • Heuristic tools: BLAST, Minimap2, BWA-MEM, and LASTZ use seed-and-extend strategies — they find short exact matches first, then run DP only in promising regions. This trades some optimality for enormous speed gains.

  • Glocal alignment: For whole-genome comparison, glocal alignment chains multiple local alignments to cover both genomes while allowing rearrangements, inversions, and unaligned regions (used by MUMmer, MultiZ).

  • Space optimization: The full \(O(mn)\) matrix is only needed for traceback. To compute the score alone, a two-row trick reduces space to \(O(\min(m,n))\). Hirschberg’s algorithm recovers the full alignment in linear space.

  • Banded alignment: When sequences are known to be similar (edit distance \(\leq k\)), only a diagonal band of width \(2k+1\) needs to be filled, reducing cost to \(O(kn)\).

  • Gap penalty models: Our implementations use linear gap penalties, but affine gap penalties (\(g_{\text{open}} + k \cdot g_{\text{extend}}\)) are more biologically realistic — a single long indel event is more likely than many independent ones, so opening a gap is expensive but extending it is cheap.

  • Approximate matching: A variant of edit-distance DP can find where a short pattern occurs in a longer text with at most \(k\) edits, by initializing the first row to all zeros (allowing matches to start anywhere).

LASTZ: Whole-Genome Pairwise Alignment in Practice

The algorithms above — edit distance, Needleman-Wunsch, Smith-Waterman — are exact but limited to short sequences. Aligning two whole genomes (millions to billions of bases) with Smith-Waterman alone would take weeks. LASTZ (Large-Scale Genome Alignment Tool) solves this by using a seed-and-extend pipeline that applies the same DP ideas we’ve learned, but only where they’re likely to matter.

LASTZ is the workhorse behind the UCSC multi-species genome alignments. It is the successor to BLASTZ and is specifically designed for pairwise whole-genome comparison between related species.

The LASTZ pipeline

LASTZ finds alignments in stages, each one narrowing the search space before invoking expensive DP. We’ll walk through the pipeline using an example from the LASTZ documentation: aligning the human and chicken alpha-globin regions (~70 Kbp each).

Stage 1: Seeding — find candidate matches

The target genome is indexed into a lookup table of short seed words. The query is scanned with the same seed pattern, and wherever a word matches, that position becomes a seed hit.

LASTZ uses spaced seeds rather than requiring consecutive matching bases (as BLAST does). The default pattern is 1110100110010101111 — 12 out of 19 positions must match, but the 12 are spread out. This is more sensitive to diverged sequences because real mutations are scattered, not clustered.

The dot-plot below shows all 338 seed hits in a 3 Kbp window of the alpha-globin comparison. Each dot marks a position where the human and chicken sequences share a matching seed word:

Seed hits in a 3 Kbp window of the alpha-globin region

Most of these are noise — random matches in non-homologous regions. The next stages separate signal from noise.

We will discuss the data structures behind indexing and seed search (hash tables, suffix arrays, FM-index) in the next lecture.

Stage 2: Gap-free extension — find high-scoring segment pairs (HSPs)

Each seed hit is extended in both directions without gaps, using an X-drop criterion (stop when the score drops too far below the running maximum). If the extended segment scores above a threshold, it becomes an HSP — a high-scoring segment pair.

This is the same concept as the ungapped extension in BLAST. It’s fast because there’s no DP matrix — just a linear scan.

From the 338 seeds above, only 11 HSPs survive:

11 HSPs after gap-free extension

Stage 3: Gapped extension — full dynamic programming

Each surviving HSP is reduced to a single anchor point (its highest-scoring position), and from that anchor, LASTZ runs full dynamic programming alignment in both directions using affine gap penalties (a large cost to open a gap, a smaller cost to extend it). This is the same DP recurrence we saw in Needleman-Wunsch and Smith-Waterman, but applied as a bounded extension from an anchor rather than filling the entire matrix.

A Y-drop threshold limits how far the DP explores: extension halts when the score drops too far below the peak, keeping computation tractable.

The 11 HSPs collapse to 4 gapped alignment blocks:

4 gapped alignment blocks after DP extension

This is where the algorithms from this lecture do their real work — but only on the small regions identified by seeding, not the entire genome.

Stage 4: Chaining — enforce synteny

At the whole-region scale, many alignment blocks may exist, including spurious hits to paralogous (duplicated) regions. Chaining selects the best-scoring subset of blocks that are collinear — appearing in the same order and orientation in both genomes. This models the biological expectation that orthologous regions preserve their relative arrangement.

Here is the full alpha-globin comparison after chaining. The pipeline has reduced millions of potential comparisons down to a clean set of syntenic alignment blocks:

Final chained alignments for the alpha-globin region

Stage 5: Interpolation — fill the gaps

Finally, LASTZ can re-search the gaps between alignment blocks at higher sensitivity (using shorter seeds) to recover weak alignments that the initial pass missed. This “fills in” conserved regions that were too diverged for the standard seed to catch.

How LASTZ scores alignments

LASTZ uses the HOXD70 substitution matrix by default, derived from alignments at ~70% identity between human and other mammals:

A C G T
A +91 -114 -31 -123
C -114 +100 -125 -31
G -31 -125 +100 -114
T -123 -31 -114 +91

Notice the asymmetry: transitions (A↔︎G = -31, C↔︎T = -31) are penalized far less than transversions (e.g., A↔︎C = -114), reflecting their higher biological frequency — the same principle behind the penalty function we defined in Part 2. Gaps use affine penalties: open = 400, extend = 30.

The raw score of an alignment is the sum of substitution matrix values at each aligned column, minus gap penalties. A higher raw score means a better alignment.

Worked example

Consider this short gapped alignment between human and chicken:

human:   A C G T - A G
chicken: A C G G G A T

Scoring each column with the HOXD70 matrix:

Column human chicken Event Score
1 A A match +91
2 C C match +100
3 G G match +100
4 T G transversion -114
5 - G gap open -400
6 A A match +91
7 G T transversion -114

Raw score = 91 + 100 + 100 + (-114) + (-400) + 91 + (-114) = -246

A negative raw score indicates a poor alignment — the gap and transversions overwhelm the matches. Now compare with a gap-free alignment of a conserved region:

human:   A C G T A G
chicken: A C G T A G

Raw score = 91 + 100 + 100 + 91 + 91 + 100 = +573

This strongly positive score reflects genuine homology.

From raw scores to bit scores

Raw scores depend on the specific matrix and gap penalties used, so they’re hard to compare across different scoring schemes. The bit score normalizes this:

\[S' = \lambda \cdot S_{\text{raw}} - \ln(K)\]

where \(\lambda\) and \(K\) are statistical parameters estimated from the scoring matrix. The bit score measures alignment quality in information-theoretic units — how many bits of information the alignment provides above what you’d expect from random sequences of the same composition. A bit score of \(n\) means the alignment is \(2^n\) times more likely to arise from genuine homology than from chance.

LASTZ uses a quick approximation (borrowed from the UCSC Genome Browser codebase):

\[S'_{\text{bits}} \approx 0.0205 \times S_{\text{raw}}\]

For our two examples: - Poor alignment: \(0.0205 \times (-246) = -5.0\) bits (worse than random) - Perfect match: \(0.0205 \times 573 = 11.7\) bits (\(2^{11.7} \approx 3300\times\) more likely than chance)

E-values: is this alignment real?

The E-value (expected value) converts a bit score into the number of alignments with that score or better you’d expect by chance in a database of a given size:

\[E = \frac{m \cdot n}{2^{S'_{\text{bits}}}}\]

where \(m \times n\) is the search space (target length × query length). LASTZ approximates this with a genome-scale search space of \(3 \times 10^9\) (roughly the human genome):

\[E \approx 3 \times 10^9 \times 2^{-S'_{\text{bits}}}\]

For the perfect 6-bp match: \(E = 3 \times 10^9 \times 2^{-11.7} \approx 9 \times 10^5\) — still a large E-value because 6 bp is very short. A typical LASTZ HSP threshold of 3000 gives \(0.0205 \times 3000 = 61.5\) bits, which corresponds to \(E = 3 \times 10^9 \times 2^{-61.5} \approx 10^{-9}\) — essentially zero chance of being a false positive.

import time

# HOXD70 substitution matrix used by LASTZ
HOXD70 = {
    ('A','A'):  91, ('A','C'):-114, ('A','G'): -31, ('A','T'):-123,
    ('C','A'):-114, ('C','C'): 100, ('C','G'):-125, ('C','T'): -31,
    ('G','A'): -31, ('G','C'):-125, ('G','G'): 100, ('G','T'):-114,
    ('T','A'):-123, ('T','C'): -31, ('T','G'):-114, ('T','T'):  91,
}
GAP_OPEN = 400
GAP_EXTEND = 30

def lastz_score(aligned_x, aligned_y):
    """Score a gapped alignment using HOXD70 + affine gap penalties."""
    score = 0
    in_gap = False
    for a, b in zip(aligned_x, aligned_y):
        if a == '-' or b == '-':
            score -= GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            score += HOXD70[(a, b)]
            in_gap = False
    return score

def raw_to_bits(raw):
    return raw * 0.0205

def bits_to_evalue(bits, search_space=3e9):
    return search_space * 2**(-bits)

# Example 1: alignment with a gap and transversions
ax1 = "ACGT-AG"
ay1 = "ACGGGAT"
raw1 = lastz_score(ax1, ay1)
bits1 = raw_to_bits(raw1)
print(f"Alignment 1:  {ax1}")
print(f"              {ay1}")
print(f"  Raw score:  {raw1}")
print(f"  Bit score:  {bits1:.1f}")
print()

# Example 2: perfect match
ax2 = "ACGTAG"
ay2 = "ACGTAG"
raw2 = lastz_score(ax2, ay2)
bits2 = raw_to_bits(raw2)
print(f"Alignment 2:  {ax2}")
print(f"              {ay2}")
print(f"  Raw score:  {raw2}")
print(f"  Bit score:  {bits2:.1f}")
print(f"  E-value:    {bits_to_evalue(bits2):.2e}")
print()

# Example 3: a realistic 30-bp conserved region with a few transitions
ax3 = "ATGCGTACCTGAAGTTCGGATCCTAAGCCA"
ay3 = "ATGCGTACCTGAAGTTTGGATCCTAAGGCA"
#                      ^ C>T transition        ^ C>G transversion
raw3 = lastz_score(ax3, ay3)
bits3 = raw_to_bits(raw3)
print(f"Alignment 3:  {ax3}")
print(f"              {ay3}")
print(f"  Raw score:  {raw3}")
print(f"  Bit score:  {bits3:.1f}")
print(f"  E-value:    {bits_to_evalue(bits3):.2e}")
Alignment 1:  ACGT-AG
              ACGGGAT
  Raw score:  -246
  Bit score:  -5.0

Alignment 2:  ACGTAG
              ACGTAG
  Raw score:  573
  Bit score:  11.7
  E-value:    8.73e+05

Alignment 3:  ATGCGTACCTGAAGTTCGGATCCTAAGCCA
              ATGCGTACCTGAAGTTTGGATCCTAAGGCA
  Raw score:  2509
  Bit score:  51.4
  E-value:    9.86e-07

Why this pipeline matters

The key insight is that exact DP alignment is only applied where seeds suggest homology exists. This seed-and-extend strategy is shared by BLAST, Minimap2, BWA-MEM, and most practical sequence alignment tools. The differences lie in the details:

Feature BLAST LASTZ
Seeds Consecutive exact matches Spaced seeds (12-of-19) — more sensitive to diverged sequences
Use case One query vs. a database Pairwise whole-genome alignment
Chaining Not built in Enforces collinearity (synteny)
Gap penalties Simple Affine (large open + small extend)
Repeat handling Pre-filtering Soft masking (seeds excluded, extensions allowed)

LASTZ can align two mammalian-sized genomes in hours rather than the years that Smith-Waterman alone would require — while still using the same DP recurrence at its core.

LASTZ documentation and figures: Harris 2007; source code: github.com/lastz/lastz.

Homework: Create an Account on Galaxy

Galaxy is a free, open-source platform for accessible, reproducible, and transparent genomic research. Many of the alignment tools we discussed in this lecture — LASTZ, BWA-MEM, BLAST, Minimap2 — are available as point-and-click tools on Galaxy, so you can run real analyses without writing any code or installing software.

In the coming lectures we will use Galaxy for hands-on exercises. To prepare, create an account now.

Steps

  1. Go to https://usegalaxy.org in your browser.

  2. Click “Login or Register” in the top navigation bar.

  3. Click “Register here” at the bottom of the login form.

  4. Fill in the registration form:

    • Email: use your Penn State email (xyz123@psu.edu)
    • Password: choose a strong password
    • Public name: pick a username (this will be visible if you share workflows or histories)
  5. Check your email for a confirmation message from Galaxy and click the activation link.

That’s it — you’re ready for the next lecture.