Mapping Reads II
mapping SAM/BAM
Prep the environment
Install conda, configure path and channles:
!wget -c https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
!chmod +x Anaconda3-5.1.0-Linux-x86_64.sh
!bash ./Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.6/site-packages/')
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge
Mount the drive and copy duplex files into notebook filesystem (in my specific case I created bmmb554_data
directiory at the root of my drive):
cp /content/drive/My\ Drive/bmmb554_data/r1* ./
Mapping data with bwa
and processing with samtools
ls
Install bwa [also see Li:2012] :
!conda install -y bwa
Upload sequence of pBR322 from Zenodo
!wget https://zenodo.org/record/3637681/files/pbr322.fa
Create bwa index files:
!bwa index -p plasmid pbr322.fa
Let look at bwa command line options:
!bwa
!bwa mem
We can speed up process to give bwa
more threads to work with. How many threads does our system have:
cat /proc/cpuinfo
Our paired end data is represented as two files: forward and reverse. We can feed these to bwa manually by typing their names but since we are inside a python envirtonment we can do this programmatically.
# Get the list of `fq.gz` files
import os
files = []
for file in os.listdir():
if file.endswith('fq.gz'):
files.append(file)
files = sorted(files)
# It is a list:
files
# We need toi organize these file names in pairs. In this particular case we want to have a list of two pairs.
# A flat list can be organzed into a 2 x 2 list using numpy
import numpy as np
pairs = np.reshape(files,(2,2))
pairs
# To make it more generic we can do this:
pairs = np.reshape(files,(len(files)//2,2))
pairs
Now we can run bwa to produce sam files:
# Now let's create a command line and execute it
for pair in pairs:
cmd = 'bwa mem -t 2 plasmid {} {} > {}.sam'.format(pair[0],pair[1],pair[0][:5])
print(cmd)
!{cmd}
This generates two sam files:
ls
Sam format looks like this:
!head r1-s0.sam
Let’s look at SAM/BAM format specification. Before we can use BAM file they need to be softed and indexed:
!conda install -y samtools
!samtools view -bh r1-s0.sam > r1-s0.bam
!samtools sort r1-s0.bam > r1-s0.sorted.bam
!samtools index r1-s0.sorted.bam
ls
Alternatively, this can be done in one go using piping. First, let’s delete sam and bam files we’ve produced so far:
rm *.sam *.bam*
ls
# Rerunning bwa piping output directly into samtools:
for pair in pairs:
cmd = 'bwa mem -t 2 plasmid {0} {1} | samtools view -bh - | samtools sort - > {2}.bam; samtools index {2}.bam'.format(pair[0],pair[1],pair[0][:5])
print(cmd)
!{cmd}
Assessing mapping results
deepTools
is an excellent set of tools for manipulation of mapping data
pip install deeptools
import os
for file in os.listdir():
if file.endswith('bam'):
cmd = 'bamCoverage -b {0}.bam -o {0}.bw'.format(file[:-4])
!{cmd}
pip install pygenometracks
Upload annotaion of pBR322 from Zenodo
Pull in annotaion of pBR322 from Zenodo
!wget https://zenodo.org/record/3637681/files/pBR322.bed
The bed file we just uploaded is not sorted. To sort it let’s install and use bedtools:
!sudo apt-get install bedtools
!bedtools sort -h -i pBR322.bed > a; mv a pBR322.bed
Now, create configuration file for pyGenomeTracks:
!make_tracks_file --trackFiles pBR322.bed *.bw -o tracks.ini
And visualize!
!pyGenomeTracks --tracks tracks.ini --region pBR322:1-3000 --outFileName nice_image.svg
from IPython.core.display import SVG
SVG(filename='nice_image.svg')