import pandas as pd

Lecture 7: Data Manipulation with Pandas
This tutorial draws on material from:
- Justin Bois
- The BIOS821 course at Duke
- The Pandas documentation
Pandas (from “Panel Data”) is a library for manipulating tabular data in Python. Its central object—the DataFrame—lets you filter, transform, aggregate, and reshape tables with concise, readable code. We will use a small genome statistics dataset throughout this lecture.
Creating a DataFrame
The simplest way to create a DataFrame is from a Python dictionary. Each key becomes a column name, and the corresponding list becomes the column values.
small_df = pd.DataFrame({
'organism': ['E. coli', 'S. cerevisiae', 'H. sapiens'],
'genome_size_mb': [4.6, 12.1, 3100.0],
'num_genes': [4300, 6000, 20000]
})
small_df

| | organism | genome_size_mb | num_genes |
|---|---|---|---|
| 0 | E. coli | 4.6 | 4300 |
| 1 | S. cerevisiae | 12.1 | 6000 |
| 2 | H. sapiens | 3100.0 | 20000 |
print('Shape:', small_df.shape)
print()
print(small_df.dtypes)

Shape: (3, 3)
organism object
genome_size_mb float64
num_genes int64
dtype: object
Exercise
Create a DataFrame called viruses with 3 columns: name, genome_size_kb, and host. Include at least 3 rows of made-up or real virus data. Print its shape.
# Your code here

Reading CSV files
In practice, data lives in files. pd.read_csv() reads a CSV into a DataFrame. Here we load a genome statistics dataset:
df = pd.read_csv('https://raw.githubusercontent.com/nekrut/bda/main/lectures/data/genome_stats.csv')
df.head()

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
print('Shape:', df.shape)
print()
print(df.dtypes)

Shape: (20, 6)
organism object
kingdom object
genome_size_mb float64
gc_percent float64
num_genes int64
habitat object
dtype: object
Exercise
Use .tail() to show the last 3 rows of df. Then use .describe() to get summary statistics. Which organism has the largest genome?
# Your code here

Indexing
There are three main ways to select data from a DataFrame:
- df['col'] — select a single column (returns a Series)
- df.loc[row_label, col_label] — label-based indexing (uses actual row/column names)
- df.iloc[row_pos, col_pos] — integer position-based indexing
# Select a single column
df['organism'].head()

0 Escherichia coli
1 Bacillus subtilis
2 Mycobacterium tuberculosis
3 Staphylococcus aureus
4 Pseudomonas aeruginosa
Name: organism, dtype: object
# Label-based: row 3, column 'gc_percent'
df.loc[3, 'gc_percent']

np.float64(32.9)
# Integer position-based: rows 2 through 4
df.iloc[2:5]

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
Exercise
Using df.loc, extract the organism and habitat columns for rows 5 through 9. Then use df.iloc to get the value in the 4th row, 3rd column.
# Your code here

Filtering
Boolean indexing lets you select rows that match a condition. The expression inside the brackets produces a True/False Series, and only rows where the value is True are kept.
Note: df[condition] is shorthand for df.loc[condition] — both are acceptable. We use the shorter form here for readability.
# All bacteria
df[df['kingdom'] == 'Bacteria']

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
| 5 | Streptococcus pneumoniae | Bacteria | 2.2 | 39.7 | 2100 | host-associated |
| 6 | Vibrio cholerae | Bacteria | 4.0 | 47.5 | 3800 | aquatic |
| 7 | Clostridioides difficile | Bacteria | 4.3 | 29.1 | 3800 | gut |
# Bacteria with GC content above 50%
df[(df['kingdom'] == 'Bacteria') & (df['gc_percent'] > 50)]

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
Exercise
Filter df to show only organisms that live in "aquatic" habitats. Then find all Eukarya with more than 20,000 genes.
# Your code here

Adding columns
You can create new columns from existing ones. Pandas applies operations element-wise (vectorization)—no for loop needed.
Note: This permanently adds the column to df. All subsequent cells will see 7 columns instead of 6.
df['genes_per_mb'] = df['num_genes'] / df['genome_size_mb']
df[['organism', 'genome_size_mb', 'num_genes', 'genes_per_mb']].head()

| | organism | genome_size_mb | num_genes | genes_per_mb |
|---|---|---|---|---|
| 0 | Escherichia coli | 4.6 | 4300 | 934.782609 |
| 1 | Bacillus subtilis | 4.2 | 4100 | 976.190476 |
| 2 | Mycobacterium tuberculosis | 4.4 | 4000 | 909.090909 |
| 3 | Staphylococcus aureus | 2.8 | 2600 | 928.571429 |
| 4 | Pseudomonas aeruginosa | 6.3 | 5700 | 904.761905 |
Exercise
Add a column gc_category that is "high" when gc_percent > 50 and "low" otherwise. Hint: use np.where() — you’ll need import numpy as np.
# Your code here

Writing CSV
Use df.to_csv() to save a DataFrame back to disk. Setting index=False prevents Pandas from writing the row numbers as an extra unnamed column—without it, the CSV gets cluttered with a meaningless index.
df.to_csv('genome_stats_with_density.csv', index=False)
# Verify round-trip
pd.read_csv('genome_stats_with_density.csv').head(3)

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat | genes_per_mb |
|---|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut | 934.782609 |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil | 976.190476 |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated | 909.090909 |
Exploring categories
Before discussing data organization principles, let’s explore what categories exist in our dataset.
df['kingdom'].unique()

array(['Bacteria', 'Eukarya', 'Archaea'], dtype=object)
df['kingdom'].value_counts()

kingdom
Bacteria 8
Eukarya 8
Archaea 4
Name: count, dtype: int64
df['habitat'].value_counts()

habitat
aquatic 5
terrestrial 5
soil 4
host-associated 4
gut 2
Name: count, dtype: int64
Exercise
How many unique habitats are there? Which habitat has the fewest organisms? Use .unique() and .value_counts().
# Your code here

Tidy data
Hadley Wickham defined tidy data with three rules:
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit forms its own table.
Tidy data is far easier to filter, aggregate, and plot.
Our genome_stats.csv is already tidy: each row is one organism (an observation), each column is a single variable (kingdom, genome size, etc.), and all rows describe the same type of thing—genome statistics.
Wide vs long format
Data often arrives in wide format—one column per experimental condition. This is convenient for spreadsheets but violates tidy principles because the column headers contain data (condition names).
Below is a small gene expression dataset. The values represent expression levels measured in FPKM (Fragments Per Kilobase of transcript per Million mapped reads)—a common unit for RNA-seq experiments.
expr = pd.read_csv('https://raw.githubusercontent.com/nekrut/bda/main/lectures/data/gene_expression_wide.csv')
expr

| | gene | control | heat_shock | starvation |
|---|---|---|---|---|
| 0 | geneA | 5.2 | 12.1 | 3.8 |
| 1 | geneB | 8.7 | 7.9 | 2.1 |
| 2 | geneC | 1.3 | 9.4 | 6.5 |
| 3 | geneD | 4.0 | 4.2 | 11.3 |
The condition columns (control, heat_shock, starvation) are really values of a variable we might call condition. We use pd.melt() to reshape this into tidy (long) format.
Key parameters of pd.melt():
- id_vars — columns to keep as identifiers (not melted)
- var_name — name for the new column created from the old column headers
- value_name — name for the new column holding the cell values
expr_long = pd.melt(
expr,
id_vars=['gene'],
var_name='condition',
value_name='expression'
)
expr_long

| | gene | condition | expression |
|---|---|---|---|
| 0 | geneA | control | 5.2 |
| 1 | geneB | control | 8.7 |
| 2 | geneC | control | 1.3 |
| 3 | geneD | control | 4.0 |
| 4 | geneA | heat_shock | 12.1 |
| 5 | geneB | heat_shock | 7.9 |
| 6 | geneC | heat_shock | 9.4 |
| 7 | geneD | heat_shock | 4.2 |
| 8 | geneA | starvation | 3.8 |
| 9 | geneB | starvation | 2.1 |
| 10 | geneC | starvation | 6.5 |
| 11 | geneD | starvation | 11.3 |
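By default, pd.melt() melts every column not listed in id_vars. When you only want a subset, pass value_vars. A minimal sketch using the first two rows of expr, rebuilt inline so the snippet is self-contained:

```python
import pandas as pd

# First two rows of expr, rebuilt inline for a self-contained example
expr_mini = pd.DataFrame({
    'gene': ['geneA', 'geneB'],
    'control': [5.2, 8.7],
    'heat_shock': [12.1, 7.9],
    'starvation': [3.8, 2.1],
})

# value_vars restricts the melt to the listed columns;
# columns in neither id_vars nor value_vars are dropped from the result
subset_long = pd.melt(
    expr_mini,
    id_vars=['gene'],
    value_vars=['control', 'heat_shock'],  # starvation is left out
    var_name='condition',
    value_name='expression',
)
print(subset_long)
```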
To go back from long to wide, use .pivot():
expr_long.pivot(index='gene', columns='condition', values='expression')

| condition | control | heat_shock | starvation |
|---|---|---|---|
| gene | |||
| geneA | 5.2 | 12.1 | 3.8 |
| geneB | 8.7 | 7.9 | 2.1 |
| geneC | 1.3 | 9.4 | 6.5 |
| geneD | 4.0 | 4.2 | 11.3 |
Exercise
The DataFrame below is in wide format. Melt it into tidy (long) format, then pivot it back to wide.
temps = pd.DataFrame({
'city': ['NYC', 'LA', 'Chicago'],
'jan': [-1, 14, -5],
'jul': [28, 24, 27]
})

# Your code here

Sorting
Use sort_values() to order rows by one or more columns. By default sorting is ascending; pass ascending=False to reverse.
df.sort_values('genome_size_mb')

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat | genes_per_mb |
|---|---|---|---|---|---|---|---|
| 17 | Methanocaldococcus jannaschii | Archaea | 1.7 | 31.3 | 1700 | aquatic | 1000.000000 |
| 19 | Thermococcus kodakarensis | Archaea | 2.1 | 52.0 | 2300 | aquatic | 1095.238095 |
| 5 | Streptococcus pneumoniae | Bacteria | 2.2 | 39.7 | 2100 | host-associated | 954.545455 |
| 18 | Sulfolobus acidocaldarius | Archaea | 2.2 | 36.7 | 2200 | terrestrial | 1000.000000 |
| 16 | Halobacterium salinarum | Archaea | 2.6 | 65.9 | 2600 | aquatic | 1000.000000 |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated | 928.571429 |
| 6 | Vibrio cholerae | Bacteria | 4.0 | 47.5 | 3800 | aquatic | 950.000000 |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil | 976.190476 |
| 7 | Clostridioides difficile | Bacteria | 4.3 | 29.1 | 3800 | gut | 883.720930 |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated | 909.090909 |
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut | 934.782609 |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil | 904.761905 |
| 8 | Saccharomyces cerevisiae | Eukarya | 12.1 | 38.3 | 6000 | soil | 495.867769 |
| 9 | Candida auris | Eukarya | 12.4 | 44.5 | 5400 | host-associated | 435.483871 |
| 11 | Caenorhabditis elegans | Eukarya | 100.3 | 35.4 | 20000 | soil | 199.401795 |
| 12 | Arabidopsis thaliana | Eukarya | 135.0 | 36.0 | 27000 | terrestrial | 200.000000 |
| 10 | Drosophila melanogaster | Eukarya | 144.0 | 42.5 | 14000 | terrestrial | 97.222222 |
| 15 | Danio rerio | Eukarya | 1400.0 | 36.6 | 25000 | aquatic | 17.857143 |
| 14 | Mus musculus | Eukarya | 2700.0 | 42.0 | 22000 | terrestrial | 8.148148 |
| 13 | Homo sapiens | Eukarya | 3100.0 | 40.9 | 20000 | terrestrial | 6.451613 |
# Multi-column sort: kingdom ascending, genome size descending
df.sort_values(
by=['kingdom', 'genome_size_mb'],
ascending=[True, False]
)

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat | genes_per_mb |
|---|---|---|---|---|---|---|---|
| 16 | Halobacterium salinarum | Archaea | 2.6 | 65.9 | 2600 | aquatic | 1000.000000 |
| 18 | Sulfolobus acidocaldarius | Archaea | 2.2 | 36.7 | 2200 | terrestrial | 1000.000000 |
| 19 | Thermococcus kodakarensis | Archaea | 2.1 | 52.0 | 2300 | aquatic | 1095.238095 |
| 17 | Methanocaldococcus jannaschii | Archaea | 1.7 | 31.3 | 1700 | aquatic | 1000.000000 |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil | 904.761905 |
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut | 934.782609 |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated | 909.090909 |
| 7 | Clostridioides difficile | Bacteria | 4.3 | 29.1 | 3800 | gut | 883.720930 |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil | 976.190476 |
| 6 | Vibrio cholerae | Bacteria | 4.0 | 47.5 | 3800 | aquatic | 950.000000 |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated | 928.571429 |
| 5 | Streptococcus pneumoniae | Bacteria | 2.2 | 39.7 | 2100 | host-associated | 954.545455 |
| 13 | Homo sapiens | Eukarya | 3100.0 | 40.9 | 20000 | terrestrial | 6.451613 |
| 14 | Mus musculus | Eukarya | 2700.0 | 42.0 | 22000 | terrestrial | 8.148148 |
| 15 | Danio rerio | Eukarya | 1400.0 | 36.6 | 25000 | aquatic | 17.857143 |
| 10 | Drosophila melanogaster | Eukarya | 144.0 | 42.5 | 14000 | terrestrial | 97.222222 |
| 12 | Arabidopsis thaliana | Eukarya | 135.0 | 36.0 | 27000 | terrestrial | 200.000000 |
| 11 | Caenorhabditis elegans | Eukarya | 100.3 | 35.4 | 20000 | soil | 199.401795 |
| 9 | Candida auris | Eukarya | 12.4 | 44.5 | 5400 | host-associated | 435.483871 |
| 8 | Saccharomyces cerevisiae | Eukarya | 12.1 | 38.3 | 6000 | soil | 495.867769 |
Exercise
Sort df by num_genes in descending order. Which organism has the most genes? Then sort by habitat ascending and gc_percent descending.
# Your code here

Split-apply-combine
Many analyses follow a three-step pattern described by Hadley Wickham:
- Split the data into groups (e.g., by kingdom)
- Apply a function to each group (e.g., compute the mean)
- Combine the results into a new table
In Pandas this is done with groupby().
Aggregation
df.groupby('kingdom')['genome_size_mb'].mean()

kingdom
Archaea 2.150
Bacteria 4.100
Eukarya 950.475
Name: genome_size_mb, dtype: float64
df.groupby('kingdom')['genome_size_mb'].describe()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| kingdom | ||||||||
| Archaea | 4.0 | 2.150 | 0.369685 | 1.7 | 2.000 | 2.15 | 2.30 | 2.6 |
| Bacteria | 8.0 | 4.100 | 1.227076 | 2.2 | 3.700 | 4.25 | 4.45 | 6.3 |
| Eukarya | 8.0 | 950.475 | 1291.848037 | 12.1 | 78.325 | 139.50 | 1725.00 | 3100.0 |
# Group by two columns
df.groupby(['kingdom', 'habitat'])['num_genes'].mean()

kingdom   habitat
Archaea aquatic 2200.0
terrestrial 2200.0
Bacteria aquatic 3800.0
gut 4050.0
host-associated 2900.0
soil 4900.0
Eukarya aquatic 25000.0
host-associated 5400.0
soil 13000.0
terrestrial 20750.0
Name: num_genes, dtype: float64
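.mean() and .describe() cover the common cases; .agg() lets you compute several custom statistics at once. The summary at the end of this lecture mentions it, so here is a minimal sketch using named aggregation on a toy three-row table (not the full lecture dataset):

```python
import pandas as pd

# Toy stand-in for the lecture dataset
df_mini = pd.DataFrame({
    'kingdom': ['Bacteria', 'Bacteria', 'Archaea'],
    'genome_size_mb': [4.6, 4.2, 1.7],
    'num_genes': [4300, 4100, 1700],
})

# Named aggregation: output_column=(input_column, function)
summary = df_mini.groupby('kingdom').agg(
    mean_size=('genome_size_mb', 'mean'),
    max_genes=('num_genes', 'max'),
    n_organisms=('num_genes', 'size'),
)
print(summary)
```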
Exercise
Use groupby to find the maximum gc_percent for each kingdom. Then compute the mean num_genes grouped by habitat.
# Your code here

Working with multiple tables
Real analyses often require combining information from different sources. pd.merge() joins two DataFrames on a shared key—similar to SQL joins.
Let’s create two small DataFrames to demonstrate different join types.
taxonomy = pd.DataFrame({
'organism': ['Escherichia coli', 'Saccharomyces cerevisiae',
'Homo sapiens', 'Halobacterium salinarum'],
'phylum': ['Pseudomonadota', 'Ascomycota',
'Chordata', 'Euryarchaeota']
})
taxonomy

| | organism | phylum |
|---|---|---|
| 0 | Escherichia coli | Pseudomonadota |
| 1 | Saccharomyces cerevisiae | Ascomycota |
| 2 | Homo sapiens | Chordata |
| 3 | Halobacterium salinarum | Euryarchaeota |
isolation = pd.DataFrame({
'organism': ['Escherichia coli', 'Bacillus subtilis',
'Homo sapiens', 'Drosophila melanogaster'],
'first_sequenced': [1997, 1997, 2001, 2000]
})
isolation

| | organism | first_sequenced |
|---|---|---|
| 0 | Escherichia coli | 1997 |
| 1 | Bacillus subtilis | 1997 |
| 2 | Homo sapiens | 2001 |
| 3 | Drosophila melanogaster | 2000 |
Inner join
Keeps only rows whose key appears in both tables.
pd.merge(taxonomy, isolation, on='organism')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Escherichia coli | Pseudomonadota | 1997 |
| 1 | Homo sapiens | Chordata | 2001 |
Left join
Keeps all rows from the left table; fills missing matches with NaN.
pd.merge(taxonomy, isolation, on='organism', how='left')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Escherichia coli | Pseudomonadota | 1997.0 |
| 1 | Saccharomyces cerevisiae | Ascomycota | NaN |
| 2 | Homo sapiens | Chordata | 2001.0 |
| 3 | Halobacterium salinarum | Euryarchaeota | NaN |
What is NaN? NaN stands for “Not a Number” and represents missing data. When a left join finds no matching row in the right table, Pandas fills those cells with NaN.
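NaN values can also be counted and handled explicitly with .isna(), .fillna(), and .dropna(). A minimal sketch with a hypothetical two-row merge result (column names mirror the left-join output above):

```python
import pandas as pd

# Hypothetical left-join result with one missing match
merged = pd.DataFrame({
    'organism': ['Escherichia coli', 'Saccharomyces cerevisiae'],
    'first_sequenced': [1997.0, float('nan')],
})

# .isna() flags missing values; summing the booleans counts them
n_missing = merged['first_sequenced'].isna().sum()

# .fillna() substitutes a placeholder; .dropna() would drop those rows instead
filled = merged['first_sequenced'].fillna(0)
print(n_missing, filled.tolist())
```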
Right join
Keeps all rows from the right table.
pd.merge(taxonomy, isolation, on='organism', how='right')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Escherichia coli | Pseudomonadota | 1997 |
| 1 | Bacillus subtilis | NaN | 1997 |
| 2 | Homo sapiens | Chordata | 2001 |
| 3 | Drosophila melanogaster | NaN | 2000 |
Outer join
Keeps all rows from both tables.
pd.merge(taxonomy, isolation, on='organism', how='outer')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Bacillus subtilis | NaN | 1997.0 |
| 1 | Drosophila melanogaster | NaN | 2000.0 |
| 2 | Escherichia coli | Pseudomonadota | 1997.0 |
| 3 | Halobacterium salinarum | Euryarchaeota | NaN |
| 4 | Homo sapiens | Chordata | 2001.0 |
| 5 | Saccharomyces cerevisiae | Ascomycota | NaN |
Exercise
Create two DataFrames and merge them:
habitats = pd.DataFrame({
'habitat': ['gut', 'soil', 'aquatic'],
'temperature_c': [37, 20, 15]
})

Merge habitats with df using a left join on 'habitat'. How many rows have NaN for temperature_c? Why?
# Your code here

Visualization with Altair
Let’s make a quick bar chart of mean genome size by kingdom using groupby + Altair.
Altair uses single-letter encoding type suffixes:
- :N — nominal (categorical, unordered) data like kingdom names
- :Q — quantitative (numeric, continuous) data like genome size
- :O — ordinal (ordered categories)
- :T — temporal (dates/times)
import altair as alt
kingdom_means = df.groupby('kingdom')['genome_size_mb'].mean().reset_index()
alt.Chart(kingdom_means).mark_bar().encode(
x=alt.X('kingdom:N', title='Kingdom'),
y=alt.Y('genome_size_mb:Q', title='Mean genome size (Mb)'),
color='kingdom:N'
).properties(
width=300,
title='Mean Genome Size by Kingdom'
)

Eukarya genomes are orders of magnitude larger than Bacteria and Archaea, making the smaller kingdoms nearly invisible on a linear scale. A logarithmic y-axis lets us compare all three kingdoms meaningfully.
alt.Chart(kingdom_means).mark_bar().encode(
x=alt.X('kingdom:N', title='Kingdom'),
y=alt.Y('genome_size_mb:Q', title='Mean genome size (Mb)', scale=alt.Scale(type='log', domainMin=1)),
color='kingdom:N'
).properties(
width=300,
title='Mean Genome Size by Kingdom (log scale)'
)

Exercise
Create a scatter plot of genome_size_mb (x) vs num_genes (y), colored by kingdom. Use mark_point() instead of mark_bar().
# Your code here

Summary
- DataFrame basics: create from dict, read CSV, inspect with .head(), .shape, .dtypes
- Indexing: [], .loc (label), .iloc (position)
- Filtering: boolean indexing with & for compound conditions
- New columns: vectorized arithmetic—no loops needed
- Tidy data: each variable a column, each observation a row
- Reshaping: melt() (wide → long), pivot() (long → wide)
- Sorting: sort_values() with multi-column and mixed ascending/descending
- Split-apply-combine: groupby() → .mean(), .describe(), .agg()
- Joins: pd.merge() with inner, left, right, outer
- Visualization: Altair integrates directly with DataFrames
A more realistic example
The Sequence Read Archive (SRA) is the largest public repository of sequencing data, mirrored by the European Nucleotide Archive (ENA). Here we analyze SARS-CoV-2 metadata to understand how sequencing platforms and library protocols were used during the pandemic.
Setup
import pandas as pd

We use a pre-compiled ENA metadata snapshot hosted on Zenodo. The file contains ~800k records; we load the first 100k for speed.
sra = pd.read_csv(
"https://zenodo.org/records/10680776/files/ena.tsv.gz",
compression='gzip',
sep="\t",
low_memory=False,
nrows=100000
)

Explore the data
len(sra)

100000
sra.sample(5)

| | study_accession | base_count | accession | collection_date | country | culture_collection | description | sample_collection | sample_title | sequencing_method | ... | library_name | library_construction_protocol | library_layout | instrument_model | instrument_platform | isolation_source | isolate | investigation_type | collection_date_submitted | center_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98517 | PRJEB37886 | 1.021151e+09 | SAMEA10938693 | 2021-10-29 | United Kingdom | NaN | Illumina NovaSeq 6000 sequencing; Illumina Nov... | NaN | COG-UK/NEWC-28D0557 | NaN | ... | NT1705557N / HT-122795:H9 | NaN | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | NaN | NaN | NaN | 2021-10-29 | SC |
| 30031 | PRJEB43060 | 2.903531e+08 | SAMEA12516416 | 2021-08-04 | Norway | NaN | Illumina NovaSeq 6000 sequencing; Raw reads: S... | NaN | SARS-CoV-2/human/Norway/16523/2021/1 | NaN | ... | NaN | NaN | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | upper respiratory secretion | SARS-CoV-2/human/Norway/16523/2021 | NaN | 2021-08-04 | Norwegian Institute of Public Health (NIPH) |
| 60679 | PRJEB37886 | 8.712552e+08 | SAMEA8237191 | 2021-02-24 | United Kingdom | NaN | Illumina NovaSeq 6000 sequencing; Illumina Nov... | NaN | COG-UK/CAMC-1359F50 | NaN | ... | NT1665242V / DN785905B:A10 | NaN | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | NaN | NaN | NaN | 2021-02-24 | SC |
| 8597 | PRJNA736036 | 7.197057e+07 | SAMN27058926 | 2020-11-13 | USA: Iowa | NaN | Illumina MiSeq sequencing; Amplicon sequencing... | NaN | This sample has been submitted by pda|wesley-h... | NaN | ... | 1425959-IA-M05216 | NaN | PAIRED | Illumina MiSeq | ILLUMINA | clinical | IA-SHL-1425959 | NaN | 2020-11-13 | SUB11247375 |
| 52307 | PRJNA716984 | 2.460988e+06 | SAMN23448018 | 2021-11-11 | USA: New Jersey | NaN | Sequel II sequencing | NaN | CDC Sars CoV2 Sequencing Baseline Constellation | NaN | ... | Unknown | Freed primers | PAIRED | Sequel II | PACBIO_SMRT | Nasal Swabs | SARS-CoV-2/Human/USA/NJ-CDC-LC0374318/2021 | NaN | 2021-11-11 | NaN |
5 rows × 32 columns
sra.columns.tolist()

['study_accession',
'base_count',
'accession',
'collection_date',
'country',
'culture_collection',
'description',
'sample_collection',
'sample_title',
'sequencing_method',
'sample_material',
'sample_description',
'sample_accession',
'sample_capture_status',
'sample_alias',
'library_selection',
'location',
'run_accession',
'read_count',
'project_name',
'library_source',
'library_strategy',
'library_name',
'library_construction_protocol',
'library_layout',
'instrument_model',
'instrument_platform',
'isolation_source',
'isolate',
'investigation_type',
'collection_date_submitted',
'center_name']
sra['instrument_platform'].value_counts()

instrument_platform
ILLUMINA 81692
PACBIO_SMRT 8431
OXFORD_NANOPORE 8066
ION_TORRENT 1790
BGISEQ 18
DNBSEQ 3
Name: count, dtype: int64
Clean dates
# Convert collection_date to datetime
# errors='coerce' turns unparseable dates into NaT (Not a Time)
sra = sra.assign(collection_date=pd.to_datetime(sra['collection_date'], errors='coerce'))

print('Earliest entry:', sra['collection_date'].min())
print('Latest entry:', sra['collection_date'].max())

Earliest entry: 2019-12-30 00:00:00
Latest entry: 2023-01-20 00:00:00
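One way to spot suspicious dates is to count records per collection year with the .dt accessor. A minimal sketch on a hypothetical four-row stand-in for sra:

```python
import pandas as pd

# Hypothetical four-row stand-in for the parsed sra table
sra_mini = pd.DataFrame({
    'collection_date': pd.to_datetime(
        ['2020-03-01', '2020-11-13', '2021-08-04', '2021-10-29']
    )
})

# .dt exposes datetime components (year, month, day, ...) on a Series
per_year = sra_mini['collection_date'].dt.year.value_counts().sort_index()
print(per_year)
```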
⚠️ Data Quality: Don’t be surprised — the metadata is only as good as the person who entered it. When you submit metadata for your own sequencing data, pay attention!
# Filter to valid date range
sra = sra[
(sra['collection_date'] >= pd.Timestamp('2020-01-01'))
&
(sra['collection_date'] <= pd.Timestamp('2023-12-31'))
]

Aggregate for visualization
heatmap_2d = sra.groupby(
['instrument_platform', 'library_strategy']
).agg(
{'run_accession': 'nunique'}
).reset_index()
heatmap_2d

| | instrument_platform | library_strategy | run_accession |
|---|---|---|---|
| 0 | BGISEQ | AMPLICON | 1 |
| 1 | BGISEQ | OTHER | 13 |
| 2 | BGISEQ | RNA-Seq | 2 |
| 3 | BGISEQ | Targeted-Capture | 2 |
| 4 | DNBSEQ | AMPLICON | 3 |
| 5 | ILLUMINA | AMPLICON | 78262 |
| 6 | ILLUMINA | OTHER | 2 |
| 7 | ILLUMINA | RNA-Seq | 524 |
| 8 | ILLUMINA | Targeted-Capture | 272 |
| 9 | ILLUMINA | WCS | 2 |
| 10 | ILLUMINA | WGA | 1389 |
| 11 | ILLUMINA | WGS | 977 |
| 12 | ILLUMINA | miRNA-Seq | 3 |
| 13 | ION_TORRENT | AMPLICON | 1752 |
| 14 | ION_TORRENT | RNA-Seq | 3 |
| 15 | ION_TORRENT | WGA | 11 |
| 16 | ION_TORRENT | WGS | 24 |
| 17 | OXFORD_NANOPORE | AMPLICON | 7305 |
| 18 | OXFORD_NANOPORE | RNA-Seq | 292 |
| 19 | OXFORD_NANOPORE | WGA | 134 |
| 20 | OXFORD_NANOPORE | WGS | 227 |
| 21 | PACBIO_SMRT | AMPLICON | 8379 |
| 22 | PACBIO_SMRT | RNA-Seq | 13 |
| 23 | PACBIO_SMRT | Targeted-Capture | 6 |
| 24 | PACBIO_SMRT | WGS | 1 |
Visualize with Altair
import altair as alt

back = alt.Chart(heatmap_2d).mark_rect(opacity=1).encode(
x=alt.X(
"instrument_platform:N",
title="Instrument"
),
y=alt.Y(
"library_strategy:N",
title="Strategy",
axis=alt.Axis(orient='right')
),
color=alt.Color(
"run_accession:Q",
title="# Samples",
scale=alt.Scale(
scheme="goldred",
type="log"
),
),
tooltip=[
alt.Tooltip("instrument_platform:N", title="Machine"),
alt.Tooltip("run_accession:Q", title="Number of runs"),
alt.Tooltip("library_strategy:N", title="Protocol")
]
).properties(
width=500,
height=150,
title={
"text": ["Breakdown of datasets from ENA",
"by Platform and Library Strategy"],
"subtitle": "(Sample of 100k records)"
}
)
back

# Add text labels
front = back.mark_text(
align="center",
baseline="middle",
fontSize=12,
fontWeight="bold",
).encode(
text=alt.Text("run_accession:Q", format=",.0f"),
color=alt.condition(
alt.datum.run_accession > 200,
alt.value("white"),
alt.value("black")
)
)
# Combine layers
back + front