Academia is absolutely the best place to be if you are creative and curious. But to successfully navigate the academic world you need to understand some of its nuances. In these slides I will try to introduce you to some of them.
A textbook representation of a chromosome is a stick. Yet in vivo chromosomes appear to be folded in a variety of ways. What determines their folding? Is it random? Do non-homologous chromosomes intertwine or stay within well defined “chromosomal territories”? A number of sequencing-enabled techniques provide answers to these questions.
An overview of ChIP-seq This material is based on a variety of sources acknowledged throughout this document. It follows the logic of a comprehensive overview of ChIP-seq methodology compiled by Shaun Mahony at Penn State. Shaun also has an excellent presentation covering this topic. Many proteins interact with DNA A variety of proteins interact with genomic DNA, including histones, various transcription factors, and related proteins:
Previously we learned about applications of PacBio technology: the assembly problem was featured prominently in that presentation. Today we will cover introductory concepts behind assembly. Genome assembly is a difficult task. In trying to explain it I will be relying on two highly regarded sources: Ben Langmead’s Teaching Materials and the Pevzner and Compeau Bioinformatics Book. Genomes and reads: Strings and k-mers k-mer composition Genomes are strings of text.
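To make the idea of k-mer composition concrete, here is a minimal sketch that treats a genome as a plain string and counts every substring of length k. The function name `kmer_composition` and the toy sequence are illustrative assumptions, not taken from the lecture materials.

```python
from collections import Counter


def kmer_composition(text, k):
    """Count every length-k substring (k-mer) of text, overlaps included."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A string of length n contains n - k + 1 overlapping k-mers
    # (none at all if k is larger than n).
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


# Made-up toy "genome", used only for illustration.
genome = "GATTACAGATTA"
composition = kmer_composition(genome, 3)

for kmer, count in sorted(composition.items()):
    print(kmer, count)
```

Here the 3-mers GAT, ATT, and TTA each occur twice because the prefix GATTA is repeated at the end of the toy string; repeats of this kind are one reason assembly is hard.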
RNAseq: Reference-based This tutorial is inspired by an exceptional RNAseq course at the Weill Cornell Medical College compiled by Friederike Dündar, Luce Skrabanek, and Paul Zumbo and by tutorials produced by Björn Grüning (@bgruening) for the Freiburg Galaxy instance. Many of the Galaxy-related features described in this section have been developed by Björn Grüning (@bgruening) and configured by Dave Bouvier (@davebx). RNAseq can be roughly divided into two “types”:
This page explains how to discover low frequency variants in duplex sequencing data. As an example we use the ABL1 dataset published by Schmitt and colleagues (SRA accession SRR1799908). Background Calling low frequency variants from next generation sequencing (NGS) data is challenging due to the significant amount of noise characteristic of these technologies. Duplex sequencing (DS) was designed to address this problem by increasing sequencing accuracy by over four orders of magnitude.
Tibet is high Many residents of the Tibetan Plateau live above 4,000 meters where oxygen concentration is approximately 40% lower than at sea level: Tibetan Plateau and surrounding areas (from Wikipedia). Tibetans exhibit a number of adaptations to high altitudes manifesting in the following phenotypic differences: lower hemoglobin concentration Wu:2005; higher arterial oxygen saturation Niermeyer:1995; more efficient pulmonary gas exchange Zhuang:1996. What are the genetic causes of these adaptations?
The majority of life on Earth is non-diploid and represented by prokaryotes, viruses, and their derivatives such as our own mitochondria or plants’ chloroplasts. In non-diploid systems allele frequencies can range anywhere between 0 and 100% and there can be multiple (not just two) alleles per locus. The main challenge associated with non-diploid variant calling is the difficulty in distinguishing between sequencing noise (abundant in all NGS platforms) and true low frequency variants.
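The point about allele frequencies can be illustrated with a small sketch that computes the fraction of reads supporting each allele at a single site. The function name `allele_frequencies` and the read counts are hypothetical, chosen only to show a site with more than two alleles; this is not a variant caller.

```python
from collections import Counter


def allele_frequencies(bases):
    """Return the fraction of reads supporting each allele at one site."""
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth == 0:
        return {}  # no reads cover this site
    return {allele: n / depth for allele, n in counts.items()}


# Hypothetical pileup column: 1,000 reads covering a single position.
column = "A" * 970 + "G" * 25 + "T" * 5
freqs = allele_frequencies(column)

for allele, freq in sorted(freqs.items()):
    print(f"{allele}: {freq:.1%}")
```

In this made-up column three alleles are present. The frequencies alone do not say whether G at 2.5% or T at 0.5% is a true variant or sequencing noise, which is exactly the difficulty described above.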
Today we hear a lot about personalized medicine. Yet the personalization is defined by the genetic makeup of the individual. Today we will discuss how this information can be uncovered from genomic sequencing data. The figure above shows the distribution of rare and common variants in 1,092 human genomes described by the 1000 Genomes Consortium. Calling variants Variant calling is a complex field that was significantly propelled by advances in DNA sequencing and the efforts of large scientific consortia such as the 1000 Genomes Project.
In this section we will look at practical aspects of manipulating next-generation sequencing data. We will start with the FASTQ format produced by most sequencing machines and will finish with the SAM/BAM format representing mapped reads. The cover image above shows a screen dump of a SAM dataset. Getting NGS data in You can get data into Galaxy in one of five ways: From your computer This works well for small files because web browsers do not like lengthy file transfers:
Introduction to Galaxy In this lecture we will introduce you to the bare basics of Galaxy: Getting data from external databases such as UCSC; Performing simple data manipulation; Understanding Galaxy’s History system; Creating and running a workflow. What are we trying to do? Suppose you get the following question: Mom (or Dad) … Which coding exon has the highest number of single nucleotide polymorphisms on chromosome 22?
Speeding things up The topics we discussed in the past lecture explain fundamental concepts behind the analysis of biological sequences. Today, we will be talking about algorithms that allow aligning billions of sequencing reads against reference genomes. As in the previous lecture, I have borrowed heavily from the course taught by Ben Langmead at Johns Hopkins. The cover image is from the Wikipedia article on the Burrows-Wheeler transform. The challenge of really large datasets In the previous lecture we have seen how dynamic programming helps align sequences.
Sequence alignment In the previous lecture we have seen the principle behind dynamic programming. This approach is extremely useful for comparing biological sequences, which is coincidentally one of the main points of this course. This lecture explains how this is done. In writing this text I relied heavily on the wonderful course taught by Ben Langmead at Johns Hopkins. The cover image shows pairwise alignments for human, mouse, and dog KIF3 locus from Dubchak et al.
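As a rough illustration of how dynamic programming compares two sequences, the sketch below computes edit distance by filling a table of subproblem solutions. It assumes unit costs for substitutions, insertions, and deletions; the function name `edit_distance` and the example strings are illustrative choices and may differ from the scoring scheme used in the lecture itself.

```python
def edit_distance(x, y):
    """Minimum number of substitutions, insertions, and deletions turning x into y."""
    rows, cols = len(x) + 1, len(y) + 1
    # D[i][j] holds the edit distance between x[:i] and y[:j].
    D = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        D[i][0] = i  # delete all i characters of x[:i]
    for j in range(cols):
        D[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(
                D[i - 1][j - 1] + mismatch,  # match or substitution
                D[i - 1][j] + 1,             # deletion from x
                D[i][j - 1] + 1,             # insertion into x
            )
    return D[-1][-1]


print(edit_distance("kitten", "sitting"))
```

Each cell is computed once from three already-filled neighbors, so the table takes time proportional to the product of the two sequence lengths.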
Introduction In the previous lecture we have seen several ways in which DNA sequence data can be accumulated (the reason for having Manhattan in the figure above will be apparent a bit later). Because sequencing machines (especially the ones made by Illumina) generate billions of sequences (called reads) from every run, the real challenge is what one does with all this data once sequencing is done. So before we get into details of technology and its application we need to introduce some basic algorithmic concepts related to sequence analysis.
The 60s and the 70s The first complete nucleic acids to be sequenced were RNAs (tRNAs in particular; see pioneering work of Robert Holley and colleagues). The work on finding approaches to sequencing DNA molecules began in the late 60s and early 70s. One of the earliest contributions was made by Ray Wu from Cornell, who used E. coli DNA polymerase to incorporate radioactively labelled nucleotides into protruding ends of bacteriophage lambda.
Why history? Knowing history is essential for understanding how we arrived at the current state of affairs in our field. It is also full of accidental discoveries and dramatic relationships, making it quite interesting to read about. I strongly advise you to take a look at the manuscripts below. Classical publications 1965 | A history of genetics 1943 | Delbruck & Luria 1944 | Avery, MacLeod, & McCarty 1952 | Hershey & Chase 1953 | Watson & Crick 1958 | Meselson & Stahl 1960 | Jacob and Monod Popular (yet very informative) literature Get one of these and read it on a plane:
Instructor Anton Nekrutenko email@example.com Wartik 505 Office hours by appointment only ⚠ When contacting the instructor use the above e-mail and include “BMMB554” in the subject line (simply click on the e-mail address). Course description This course is designed as a preparation routine for graduate students in Life Sciences. It has several focus areas including the evolution of life sciences as well as an in-depth overview of sequencing technologies and their applications.