Defining project and getting data
python conda bioconda sra duplex sequencing HIV

Warm-up with Python Strings
- Open this notebook
- Make a copy of it in your Drive
- Go through it and execute all cells
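To give a flavor of what the warm-up covers, here are a few basic string operations of the kind you will see there (just an illustrative sketch; the notebook itself goes into much more detail):
# A few basic Python string operations (illustrative only)
seq = 'ACGTACGT'
print(len(seq))          # length of the string
print(seq.lower())       # change case
print(seq.count('AC'))   # count occurrences of a substring
print(seq[0:4])          # slicing: first four characters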
The project
The goal of our project is to find the perfect variant caller for non-diploid data (e.g., viral, bacterial, or organellar genomes).
For this we will use two datasets:
- HIV resequencing data from the DC cohort study
- A duplex sequencing dataset from an experimental evolution study
The idea is to use the duplex data as the “gold standard” and then apply the best caller we find to the HIV data to see if we can find anything interesting.
Prep the system
- Log in to Google Drive using your Penn State credentials (if you are outside Penn State, just use your own Google login or whatever works)
- Start a new Python 3 notebook in Colaboratory
- Name it in an intelligent way (e.g., seq_data_QC)
Install Conda
Run a cell with the following code:
!wget -c https://repo.continuum.io/archive/Anaconda3-5.1.0-Linux-x86_64.sh
!chmod +x Anaconda3-5.1.0-Linux-x86_64.sh
!bash ./Anaconda3-5.1.0-Linux-x86_64.sh -b -f -p /usr/local
import sys
sys.path.append('/usr/local/lib/python3.6/site-packages/')
!conda config --add channels defaults
!conda config --add channels bioconda
!conda config --add channels conda-forge
This code will install and configure Conda within the Colaboratory virtual machine.
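Once the cell finishes, you can confirm that Conda is available and that the channels were added (an optional sanity check):
!conda --version
!conda config --show channels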
Get the data from Zenodo
We will start by downloading the duplex data from the Zenodo repository.
First, use wget to get just one file:
!wget https://zenodo.org/record/3565911/files/pbr322.fa
This is not convenient because you would have to do this five times. So let’s install zenodo-get – a tool for sucking things out of Zenodo (remember that this instance of Jupyter runs in your little personal virtual machine where you can install whatever you want):
!pip install zenodo-get
Now we can download everything from this Zenodo project:
!zenodo_get.py -d 10.5281/zenodo.3565911
Once it is done, use ls to check that the files are there.
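In addition to eyeballing the listing, zenodo-get normally writes an md5sums.txt file alongside the downloads; if it is there, you can verify the files (optional, and hedged on that file actually being present):
import os
# Verify downloads against checksums if zenodo-get produced md5sums.txt
if os.path.exists('md5sums.txt'):
    !md5sum -c md5sums.txt
else:
    !ls -lh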
Get the data from SRA
The DC cohort re-sequencing data has been deposited to SRA at NCBI. We will download just one dataset for today’s lecture: SRR8525907. We will do this programmatically using sra-toolkit, which we first need to install. This is really why we installed Conda in the first place:
!conda install -y sra-tools
Once it is finished, look at the command-line options of the tool we will be using, fasterq-dump:
!fasterq-dump -h
The option we need is -S (write reads into different files). We also remember that we need to download the data with accession SRR8525907:
!fasterq-dump -S -p SRR8525907
This will produce two files with the extension .fastq. Use ls to look at them.
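If you want to know how many reads you just downloaded, remember that a read in FASTQ format takes four lines, so a quick count looks like this (a sketch that assumes the two files are named SRR8525907_1.fastq and SRR8525907_2.fastq):
# Count reads in the downloaded FASTQ files (4 lines per read)
for name in ['SRR8525907_1.fastq', 'SRR8525907_2.fastq']:
    with open(name) as fh:
        n_lines = sum(1 for _ in fh)
    print(name, n_lines // 4, 'reads')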
Harmonize filenames
First, the files we downloaded from SRA are not compressed, and there is no good reason to keep them uncompressed. But before we compress them, check their sizes with ls -lh. Now let’s compress them (* is the wildcard):
!gzip *.fastq
Check the sizes of the files with ls -lh again to appreciate how much space is saved by compression.
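If you prefer to check the sizes from Python rather than with ls, something like this will do (a small sketch covering both extensions present at this point):
import os
# Report the size of each compressed FASTQ file in megabytes
for name in sorted(os.listdir()):
    if name.endswith(('.fastq.gz', '.fq.gz')):
        print(f'{name}\t{os.path.getsize(name) / 1e6:.1f} MB')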
Now let’s make all fastq files in our directory have the same extension:
import os
# Rename every *.fastq.gz file to *.fq.gz
for file in os.listdir():
    if file.endswith('.fastq.gz'):
        new_name = file.replace('fastq.gz', 'fq.gz')
        !mv {file} {new_name}
Now all files have the extension fq.gz!
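If you want to double-check the result from Python, a quick listing (a small sketch) will do:
import os
# Show every file that now carries the fq.gz extension
print(sorted(f for f in os.listdir() if f.endswith('.fq.gz')))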
QCing
To get some idea about the quality of the reads we will install and run fastqc. First install:
!conda install -y fastqc
And now run it on all *.fq.gz files:
!fastqc *.fq.gz
fastqc generates a separate HTML report for each file. It is tedious to look through every one of them, so we can aggregate the results with multiqc. Install multiqc with this command:
!conda install -y multiqc
And run it:
!multiqc ./
It will generate a web page containing a report. You can visualize this web page directly in your notebook:
from IPython.display import HTML
HTML(filename="multiqc_report.html")
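If you would rather download the report and open it in your own browser, Colab’s files helper can do that (optional; this sketch assumes the default multiqc_report.html name):
from google.colab import files
# Download the aggregated report to your own computer
files.download('multiqc_report.html')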
Adapter trimming
It really looks like we need to trim adapters off of some of the sequences. For this purpose we will use trim-galore. Let’s install it first:
!conda install -y trim-galore
Once it is installed, look at the command-line options using the --help flag. Now we can trim the adapters using the following command:
!trim_galore --paired *.fq.gz
This will generate a new set of fastq files with the extension _val_{1|2}.fq.gz. Before we do anything with these files, let’s summarize what trim-galore actually did. This is again done with multiqc:
!multiqc -n tg *trimming_report.txt
And to look at the report, execute the following:
from IPython.display import HTML
HTML(filename="tg.html")
You can see that the HIV samples were trimmed to a much higher extent than the other datasets.
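Before moving on, it is worth confirming which trimmed files you actually have. A quick check (assuming the default *_val_1.fq.gz / *_val_2.fq.gz naming from trim_galore):
import glob
# List the trimmed FASTQ files produced by trim_galore
print(sorted(glob.glob('*_val_*.fq.gz')))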
Saving your work
Now we need to save the trimmed files to Google Drive so we can use them in the next class. To do this, click on the folder icon in the upper left and click “Mount drive”. This will ask you to execute a code cell and enter an authorization code. Now go to your Drive and create an appropriately named folder, for example, bmmb554_data. Now copy all trimmed files into this folder:
!cp *val* /content/drive/My\ Drive/bmmb554_data
Your files will magically appear there!
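If you prefer to mount Drive from code instead of clicking the folder icon, the standard Colab helper works too (an alternative to the steps above):
# Mount Google Drive programmatically (alternative to the folder-icon button)
from google.colab import drive
drive.mount('/content/drive')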