
Lecture 7: Data Manipulation with Pandas

This tutorial draws on material from:

  • Justin Bois
  • BIOS821 course at Duke
  • Pandas documentation

Pandas (from “Panel Data”) is a library for manipulating tabular data in Python. Its central object—the DataFrame—lets you filter, transform, aggregate, and reshape tables with concise, readable code. We will use a small genome statistics dataset throughout this lecture.

import pandas as pd

Creating a DataFrame

The simplest way to create a DataFrame is from a Python dictionary. Each key becomes a column name, and the corresponding list becomes the column values.

small_df = pd.DataFrame({
    'organism': ['E. coli', 'S. cerevisiae', 'H. sapiens'],
    'genome_size_mb': [4.6, 12.1, 3100.0],
    'num_genes': [4300, 6000, 20000]
})
small_df
organism genome_size_mb num_genes
0 E. coli 4.6 4300
1 S. cerevisiae 12.1 6000
2 H. sapiens 3100.0 20000
print('Shape:', small_df.shape)
print()
print(small_df.dtypes)
Shape: (3, 3)

organism           object
genome_size_mb    float64
num_genes           int64
dtype: object

Exercise

Create a DataFrame called viruses with 3 columns: name, genome_size_kb, and host. Include at least 3 rows of made-up or real virus data. Print its shape.

# Your code here

Reading CSV files

In practice, data lives in files. pd.read_csv() reads a CSV into a DataFrame. Here we load a genome statistics dataset:

df = pd.read_csv('https://raw.githubusercontent.com/nekrut/bda/main/lectures/data/genome_stats.csv')
df.head()
organism kingdom genome_size_mb gc_percent num_genes habitat
0 Escherichia coli Bacteria 4.6 50.8 4300 gut
1 Bacillus subtilis Bacteria 4.2 43.5 4100 soil
2 Mycobacterium tuberculosis Bacteria 4.4 65.6 4000 host-associated
3 Staphylococcus aureus Bacteria 2.8 32.9 2600 host-associated
4 Pseudomonas aeruginosa Bacteria 6.3 66.6 5700 soil
print('Shape:', df.shape)
print()
print(df.dtypes)
Shape: (20, 6)

organism           object
kingdom            object
genome_size_mb    float64
gc_percent        float64
num_genes           int64
habitat            object
dtype: object

Exercise

Use .tail() to show the last 3 rows of df. Then use .describe() to get summary statistics. Which organism has the largest genome?

# Your code here

Indexing

There are three main ways to select data from a DataFrame:

  • df['col'] — select a single column (returns a Series)
  • df.loc[row_label, col_label] — label-based indexing (uses actual row/column names)
  • df.iloc[row_pos, col_pos] — integer position-based indexing
# Select a single column
df['organism'].head()
0              Escherichia coli
1             Bacillus subtilis
2    Mycobacterium tuberculosis
3         Staphylococcus aureus
4        Pseudomonas aeruginosa
Name: organism, dtype: object
# Label-based: row 3, column 'gc_percent'
df.loc[3, 'gc_percent']
np.float64(32.9)
# Integer position-based: rows 2 through 4
df.iloc[2:5]
organism kingdom genome_size_mb gc_percent num_genes habitat
2 Mycobacterium tuberculosis Bacteria 4.4 65.6 4000 host-associated
3 Staphylococcus aureus Bacteria 2.8 32.9 2600 host-associated
4 Pseudomonas aeruginosa Bacteria 6.3 66.6 5700 soil

Exercise

Using df.loc, extract the organism and habitat columns for rows 5 through 9. Then use df.iloc to get the value in the 4th row, 3rd column.

# Your code here

Filtering

Boolean indexing lets you select rows that match a condition. The expression inside the brackets produces a True/False Series, and only rows where the value is True are kept.

Note: df[condition] is shorthand for df.loc[condition]—both are acceptable. We use the shorter form here for readability.
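It can help to look at the intermediate boolean Series on its own before using it to index. A minimal sketch with a toy frame (not the lecture dataset):

```python
import pandas as pd

# Toy frame mirroring a few rows of the lecture dataset
toy = pd.DataFrame({
    'organism': ['Escherichia coli', 'Bacillus subtilis', 'Homo sapiens'],
    'kingdom': ['Bacteria', 'Bacteria', 'Eukarya'],
})

# The comparison alone yields a boolean Series, one True/False per row...
mask = toy['kingdom'] == 'Bacteria'
print(mask.tolist())                   # [True, True, False]

# ...which then keeps only the rows where the mask is True
print(toy[mask]['organism'].tolist())  # ['Escherichia coli', 'Bacillus subtilis']
```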

# All bacteria
df[df['kingdom'] == 'Bacteria']
organism kingdom genome_size_mb gc_percent num_genes habitat
0 Escherichia coli Bacteria 4.6 50.8 4300 gut
1 Bacillus subtilis Bacteria 4.2 43.5 4100 soil
2 Mycobacterium tuberculosis Bacteria 4.4 65.6 4000 host-associated
3 Staphylococcus aureus Bacteria 2.8 32.9 2600 host-associated
4 Pseudomonas aeruginosa Bacteria 6.3 66.6 5700 soil
5 Streptococcus pneumoniae Bacteria 2.2 39.7 2100 host-associated
6 Vibrio cholerae Bacteria 4.0 47.5 3800 aquatic
7 Clostridioides difficile Bacteria 4.3 29.1 3800 gut
# Bacteria with GC content above 50%
df[(df['kingdom'] == 'Bacteria') & (df['gc_percent'] > 50)]
organism kingdom genome_size_mb gc_percent num_genes habitat
0 Escherichia coli Bacteria 4.6 50.8 4300 gut
2 Mycobacterium tuberculosis Bacteria 4.4 65.6 4000 host-associated
4 Pseudomonas aeruginosa Bacteria 6.3 66.6 5700 soil

Exercise

Filter df to show only organisms that live in "aquatic" habitats. Then find all Eukarya with more than 20,000 genes.

# Your code here

Adding columns

You can create new columns from existing ones. Pandas applies operations element-wise (vectorization)—no for loop needed.

Note: This permanently adds the column to df. All subsequent cells will see 7 columns instead of 6.

df['genes_per_mb'] = df['num_genes'] / df['genome_size_mb']
df[['organism', 'genome_size_mb', 'num_genes', 'genes_per_mb']].head()
organism genome_size_mb num_genes genes_per_mb
0 Escherichia coli 4.6 4300 934.782609
1 Bacillus subtilis 4.2 4100 976.190476
2 Mycobacterium tuberculosis 4.4 4000 909.090909
3 Staphylococcus aureus 2.8 2600 928.571429
4 Pseudomonas aeruginosa 6.3 5700 904.761905

Exercise

Add a column gc_category that is "high" when gc_percent > 50 and "low" otherwise. Hint: use np.where() — you’ll need import numpy as np.

# Your code here

Writing CSV

Use df.to_csv() to save a DataFrame back to disk. Setting index=False prevents Pandas from writing the row numbers as an extra unnamed column—without it, the CSV gets cluttered with a meaningless index.

df.to_csv('genome_stats_with_density.csv', index=False)
# Verify round-trip
pd.read_csv('genome_stats_with_density.csv').head(3)
organism kingdom genome_size_mb gc_percent num_genes habitat genes_per_mb
0 Escherichia coli Bacteria 4.6 50.8 4300 gut 934.782609
1 Bacillus subtilis Bacteria 4.2 43.5 4100 soil 976.190476
2 Mycobacterium tuberculosis Bacteria 4.4 65.6 4000 host-associated 909.090909

Exploring categories

Before discussing data organization principles, let’s explore what categories exist in our dataset.

df['kingdom'].unique()
array(['Bacteria', 'Eukarya', 'Archaea'], dtype=object)
df['kingdom'].value_counts()
kingdom
Bacteria    8
Eukarya     8
Archaea     4
Name: count, dtype: int64
df['habitat'].value_counts()
habitat
aquatic            5
terrestrial        5
soil               4
host-associated    4
gut                2
Name: count, dtype: int64

Exercise

How many unique habitats are there? Which habitat has the fewest organisms? Use .unique() and .value_counts().

# Your code here

Tidy data

Hadley Wickham defined tidy data with three rules:

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each type of observational unit has its own separate data frame.

Tidy data is far easier to filter, aggregate, and plot.

Our genome_stats.csv is already tidy: each row is one organism (an observation), each column is a single variable (kingdom, genome size, etc.), and all rows describe the same type of thing—genome statistics.

Wide vs long format

Data often arrives in wide format—one column per experimental condition. This is convenient for spreadsheets but violates tidy principles because the column headers contain data (condition names).

Below is a small gene expression dataset. The values represent expression levels measured in FPKM (Fragments Per Kilobase of transcript per Million mapped reads)—a common unit for RNA-seq experiments.

expr = pd.read_csv('https://raw.githubusercontent.com/nekrut/bda/main/lectures/data/gene_expression_wide.csv')
expr
gene control heat_shock starvation
0 geneA 5.2 12.1 3.8
1 geneB 8.7 7.9 2.1
2 geneC 1.3 9.4 6.5
3 geneD 4.0 4.2 11.3

The condition columns (control, heat_shock, starvation) are really values of a variable we might call condition. We use pd.melt() to reshape this into tidy (long) format.

Pandas melt operation diagram

Key parameters of pd.melt():

  • id_vars — columns to keep as identifiers (not melted)
  • var_name — name for the new column created from the old column headers
  • value_name — name for the new column holding the cell values

expr_long = pd.melt(
    expr,
    id_vars=['gene'],
    var_name='condition',
    value_name='expression'
)
expr_long
gene condition expression
0 geneA control 5.2
1 geneB control 8.7
2 geneC control 1.3
3 geneD control 4.0
4 geneA heat_shock 12.1
5 geneB heat_shock 7.9
6 geneC heat_shock 9.4
7 geneD heat_shock 4.2
8 geneA starvation 3.8
9 geneB starvation 2.1
10 geneC starvation 6.5
11 geneD starvation 11.3

To go back from long to wide, use .pivot():

Pandas pivot operation diagram
expr_long.pivot(index='gene', columns='condition', values='expression')
condition control heat_shock starvation
gene
geneA 5.2 12.1 3.8
geneB 8.7 7.9 2.1
geneC 1.3 9.4 6.5
geneD 4.0 4.2 11.3
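Note that .pivot() places gene in the index rather than in a regular column. If you need a flat table again (for example, to save as CSV), chain .reset_index(). A sketch with a toy long-format frame:

```python
import pandas as pd

# Toy long-format expression data
expr_long = pd.DataFrame({
    'gene': ['geneA', 'geneA', 'geneB', 'geneB'],
    'condition': ['control', 'heat_shock', 'control', 'heat_shock'],
    'expression': [5.2, 12.1, 8.7, 7.9],
})

wide = expr_long.pivot(index='gene', columns='condition', values='expression')
print(wide.index.tolist())    # ['geneA', 'geneB'] — gene is the index, not a column

# reset_index() turns the index back into an ordinary column
flat = wide.reset_index()
print(flat.columns.tolist())  # ['gene', 'control', 'heat_shock']
```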

Exercise

The DataFrame below is in wide format. Melt it into tidy (long) format, then pivot it back to wide.

temps = pd.DataFrame({
    'city': ['NYC', 'LA', 'Chicago'],
    'jan': [-1, 14, -5],
    'jul': [28, 24, 27]
})
# Your code here

Sorting

Use sort_values() to order rows by one or more columns. By default sorting is ascending; pass ascending=False to reverse.

df.sort_values('genome_size_mb')
organism kingdom genome_size_mb gc_percent num_genes habitat genes_per_mb
17 Methanocaldococcus jannaschii Archaea 1.7 31.3 1700 aquatic 1000.000000
19 Thermococcus kodakarensis Archaea 2.1 52.0 2300 aquatic 1095.238095
5 Streptococcus pneumoniae Bacteria 2.2 39.7 2100 host-associated 954.545455
18 Sulfolobus acidocaldarius Archaea 2.2 36.7 2200 terrestrial 1000.000000
16 Halobacterium salinarum Archaea 2.6 65.9 2600 aquatic 1000.000000
3 Staphylococcus aureus Bacteria 2.8 32.9 2600 host-associated 928.571429
6 Vibrio cholerae Bacteria 4.0 47.5 3800 aquatic 950.000000
1 Bacillus subtilis Bacteria 4.2 43.5 4100 soil 976.190476
7 Clostridioides difficile Bacteria 4.3 29.1 3800 gut 883.720930
2 Mycobacterium tuberculosis Bacteria 4.4 65.6 4000 host-associated 909.090909
0 Escherichia coli Bacteria 4.6 50.8 4300 gut 934.782609
4 Pseudomonas aeruginosa Bacteria 6.3 66.6 5700 soil 904.761905
8 Saccharomyces cerevisiae Eukarya 12.1 38.3 6000 soil 495.867769
9 Candida auris Eukarya 12.4 44.5 5400 host-associated 435.483871
11 Caenorhabditis elegans Eukarya 100.3 35.4 20000 soil 199.401795
12 Arabidopsis thaliana Eukarya 135.0 36.0 27000 terrestrial 200.000000
10 Drosophila melanogaster Eukarya 144.0 42.5 14000 terrestrial 97.222222
15 Danio rerio Eukarya 1400.0 36.6 25000 aquatic 17.857143
14 Mus musculus Eukarya 2700.0 42.0 22000 terrestrial 8.148148
13 Homo sapiens Eukarya 3100.0 40.9 20000 terrestrial 6.451613
# Multi-column sort: kingdom ascending, genome size descending
df.sort_values(
    by=['kingdom', 'genome_size_mb'],
    ascending=[True, False]
)
organism kingdom genome_size_mb gc_percent num_genes habitat genes_per_mb
16 Halobacterium salinarum Archaea 2.6 65.9 2600 aquatic 1000.000000
18 Sulfolobus acidocaldarius Archaea 2.2 36.7 2200 terrestrial 1000.000000
19 Thermococcus kodakarensis Archaea 2.1 52.0 2300 aquatic 1095.238095
17 Methanocaldococcus jannaschii Archaea 1.7 31.3 1700 aquatic 1000.000000
4 Pseudomonas aeruginosa Bacteria 6.3 66.6 5700 soil 904.761905
0 Escherichia coli Bacteria 4.6 50.8 4300 gut 934.782609
2 Mycobacterium tuberculosis Bacteria 4.4 65.6 4000 host-associated 909.090909
7 Clostridioides difficile Bacteria 4.3 29.1 3800 gut 883.720930
1 Bacillus subtilis Bacteria 4.2 43.5 4100 soil 976.190476
6 Vibrio cholerae Bacteria 4.0 47.5 3800 aquatic 950.000000
3 Staphylococcus aureus Bacteria 2.8 32.9 2600 host-associated 928.571429
5 Streptococcus pneumoniae Bacteria 2.2 39.7 2100 host-associated 954.545455
13 Homo sapiens Eukarya 3100.0 40.9 20000 terrestrial 6.451613
14 Mus musculus Eukarya 2700.0 42.0 22000 terrestrial 8.148148
15 Danio rerio Eukarya 1400.0 36.6 25000 aquatic 17.857143
10 Drosophila melanogaster Eukarya 144.0 42.5 14000 terrestrial 97.222222
12 Arabidopsis thaliana Eukarya 135.0 36.0 27000 terrestrial 200.000000
11 Caenorhabditis elegans Eukarya 100.3 35.4 20000 soil 199.401795
9 Candida auris Eukarya 12.4 44.5 5400 host-associated 435.483871
8 Saccharomyces cerevisiae Eukarya 12.1 38.3 6000 soil 495.867769

Exercise

Sort df by num_genes in descending order. Which organism has the most genes? Then sort by habitat ascending and gc_percent descending.

# Your code here

Split-apply-combine

Many analyses follow a three-step pattern described by Hadley Wickham:

  1. Split the data into groups (e.g., by kingdom)
  2. Apply a function to each group (e.g., compute the mean)
  3. Combine the results into a new table

In Pandas this is done with groupby().

Aggregation

df.groupby('kingdom')['genome_size_mb'].mean()
kingdom
Archaea       2.150
Bacteria      4.100
Eukarya     950.475
Name: genome_size_mb, dtype: float64
df.groupby('kingdom')['genome_size_mb'].describe()
count mean std min 25% 50% 75% max
kingdom
Archaea 4.0 2.150 0.369685 1.7 2.000 2.15 2.30 2.6
Bacteria 8.0 4.100 1.227076 2.2 3.700 4.25 4.45 6.3
Eukarya 8.0 950.475 1291.848037 12.1 78.325 139.50 1725.00 3100.0
# Group by two columns
df.groupby(['kingdom', 'habitat'])['num_genes'].mean()
kingdom   habitat        
Archaea   aquatic             2200.0
          terrestrial         2200.0
Bacteria  aquatic             3800.0
          gut                 4050.0
          host-associated     2900.0
          soil                4900.0
Eukarya   aquatic            25000.0
          host-associated     5400.0
          soil               13000.0
          terrestrial        20750.0
Name: num_genes, dtype: float64
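Beyond .mean() and .describe(), .agg() computes several statistics per group in one pass (it appears again in the SRA example at the end of this lecture). A minimal sketch with a toy frame:

```python
import pandas as pd

# Toy frame with two kingdoms
toy = pd.DataFrame({
    'kingdom': ['Bacteria', 'Bacteria', 'Archaea', 'Archaea'],
    'genome_size_mb': [4.6, 4.2, 2.6, 1.7],
})

# One row per group, one column per requested statistic
stats = toy.groupby('kingdom')['genome_size_mb'].agg(['min', 'max', 'mean'])
print(stats)
```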

Exercise

Use groupby to find the maximum gc_percent for each kingdom. Then compute the mean num_genes grouped by habitat.

# Your code here

Working with multiple tables

Real analyses often require combining information from different sources. pd.merge() joins two DataFrames on a shared key—similar to SQL joins.

Pandas merge left join diagram

Let’s create two small DataFrames to demonstrate different join types.

taxonomy = pd.DataFrame({
    'organism': ['Escherichia coli', 'Saccharomyces cerevisiae',
                 'Homo sapiens', 'Halobacterium salinarum'],
    'phylum': ['Pseudomonadota', 'Ascomycota',
               'Chordata', 'Euryarchaeota']
})
taxonomy
organism phylum
0 Escherichia coli Pseudomonadota
1 Saccharomyces cerevisiae Ascomycota
2 Homo sapiens Chordata
3 Halobacterium salinarum Euryarchaeota
isolation = pd.DataFrame({
    'organism': ['Escherichia coli', 'Bacillus subtilis',
                 'Homo sapiens', 'Drosophila melanogaster'],
    'first_sequenced': [1997, 1997, 2001, 2000]
})
isolation
organism first_sequenced
0 Escherichia coli 1997
1 Bacillus subtilis 1997
2 Homo sapiens 2001
3 Drosophila melanogaster 2000

Inner join

Keeps only rows whose key appears in both tables.

pd.merge(taxonomy, isolation, on='organism')
organism phylum first_sequenced
0 Escherichia coli Pseudomonadota 1997
1 Homo sapiens Chordata 2001

Left join

Keeps all rows from the left table; fills missing matches with NaN.

pd.merge(taxonomy, isolation, on='organism', how='left')
organism phylum first_sequenced
0 Escherichia coli Pseudomonadota 1997.0
1 Saccharomyces cerevisiae Ascomycota NaN
2 Homo sapiens Chordata 2001.0
3 Halobacterium salinarum Euryarchaeota NaN

What is NaN? NaN stands for “Not a Number” and represents missing data. When a left join finds no matching row in the right table, Pandas fills those cells with NaN.
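A quick way to find these gaps is .isna(), which flags missing cells; summing the flags counts them. A sketch with two toy frames (smaller than the ones above):

```python
import pandas as pd

left = pd.DataFrame({
    'organism': ['Escherichia coli', 'Saccharomyces cerevisiae'],
    'phylum': ['Pseudomonadota', 'Ascomycota'],
})
right = pd.DataFrame({
    'organism': ['Escherichia coli'],
    'first_sequenced': [1997],
})

merged = pd.merge(left, right, on='organism', how='left')

# .isna() returns a boolean Series; True counts as 1 when summed
print(merged['first_sequenced'].isna().sum())  # 1
```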

Right join

Keeps all rows from the right table.

pd.merge(taxonomy, isolation, on='organism', how='right')
organism phylum first_sequenced
0 Escherichia coli Pseudomonadota 1997
1 Bacillus subtilis NaN 1997
2 Homo sapiens Chordata 2001
3 Drosophila melanogaster NaN 2000

Outer join

Keeps all rows from both tables.

pd.merge(taxonomy, isolation, on='organism', how='outer')
organism phylum first_sequenced
0 Bacillus subtilis NaN 1997.0
1 Drosophila melanogaster NaN 2000.0
2 Escherichia coli Pseudomonadota 1997.0
3 Halobacterium salinarum Euryarchaeota NaN
4 Homo sapiens Chordata 2001.0
5 Saccharomyces cerevisiae Ascomycota NaN

Exercise

Create two DataFrames and merge them:

habitats = pd.DataFrame({
    'habitat': ['gut', 'soil', 'aquatic'],
    'temperature_c': [37, 20, 15]
})

Merge habitats with df using a left join on 'habitat'. How many rows have NaN for temperature_c? Why?

# Your code here

Visualization with Altair

Let’s make a quick bar chart of mean genome size by kingdom using groupby + Altair.

Altair uses single-letter encoding type suffixes:

  • :N — nominal (categorical, unordered) data like kingdom names
  • :Q — quantitative (numeric, continuous) data like genome size
  • :O — ordinal (ordered categories)
  • :T — temporal (dates/times)

import altair as alt

kingdom_means = df.groupby('kingdom')['genome_size_mb'].mean().reset_index()

alt.Chart(kingdom_means).mark_bar().encode(
    x=alt.X('kingdom:N', title='Kingdom'),
    y=alt.Y('genome_size_mb:Q', title='Mean genome size (Mb)'),
    color='kingdom:N'
).properties(
    width=300,
    title='Mean Genome Size by Kingdom'
)

Eukarya genomes are orders of magnitude larger than Bacteria and Archaea, making the smaller kingdoms nearly invisible on a linear scale. A logarithmic y-axis lets us compare all three kingdoms meaningfully.

alt.Chart(kingdom_means).mark_bar().encode(
    x=alt.X('kingdom:N', title='Kingdom'),
    y=alt.Y('genome_size_mb:Q', title='Mean genome size (Mb)', scale=alt.Scale(type='log', domainMin=1)),
    color='kingdom:N'
).properties(
    width=300,
    title='Mean Genome Size by Kingdom (log scale)'
)

Exercise

Create a scatter plot of genome_size_mb (x) vs num_genes (y), colored by kingdom. Use mark_point() instead of mark_bar().

# Your code here

Summary

  • DataFrame basics: create from dict, read CSV, inspect with .head(), .shape, .dtypes
  • Indexing: [], .loc (label), .iloc (position)
  • Filtering: boolean indexing with & for compound conditions
  • New columns: vectorized arithmetic—no loops needed
  • Tidy data: each variable a column, each observation a row
  • Reshaping: melt() (wide → long), pivot() (long → wide)
  • Sorting: sort_values() with multi-column and mixed ascending/descending
  • Split-apply-combine: groupby().mean(), .describe(), .agg()
  • Joins: pd.merge() with inner, left, right, outer
  • Visualization: Altair integrates directly with DataFrames

A more realistic example

The Sequence Read Archive (SRA) is the largest public repository of sequencing data, mirrored by the European Nucleotide Archive (ENA). Here we analyze SARS-CoV-2 metadata to understand how sequencing platforms and library protocols were used during the pandemic.

Setup

import pandas as pd

We use a pre-compiled ENA metadata snapshot hosted on Zenodo. The file contains ~800k records; we load the first 100k for speed.

sra = pd.read_csv(
    "https://zenodo.org/records/10680776/files/ena.tsv.gz",
    compression='gzip',
    sep="\t",
    low_memory=False,
    nrows=100000
)

Explore the data

len(sra)
100000
sra.sample(5)
study_accession base_count accession collection_date country culture_collection description sample_collection sample_title sequencing_method ... library_name library_construction_protocol library_layout instrument_model instrument_platform isolation_source isolate investigation_type collection_date_submitted center_name
98517 PRJEB37886 1.021151e+09 SAMEA10938693 2021-10-29 United Kingdom NaN Illumina NovaSeq 6000 sequencing; Illumina Nov... NaN COG-UK/NEWC-28D0557 NaN ... NT1705557N / HT-122795:H9 NaN PAIRED Illumina NovaSeq 6000 ILLUMINA NaN NaN NaN 2021-10-29 SC
30031 PRJEB43060 2.903531e+08 SAMEA12516416 2021-08-04 Norway NaN Illumina NovaSeq 6000 sequencing; Raw reads: S... NaN SARS-CoV-2/human/Norway/16523/2021/1 NaN ... NaN NaN PAIRED Illumina NovaSeq 6000 ILLUMINA upper respiratory secretion SARS-CoV-2/human/Norway/16523/2021 NaN 2021-08-04 Norwegian Institute of Public Health (NIPH)
60679 PRJEB37886 8.712552e+08 SAMEA8237191 2021-02-24 United Kingdom NaN Illumina NovaSeq 6000 sequencing; Illumina Nov... NaN COG-UK/CAMC-1359F50 NaN ... NT1665242V / DN785905B:A10 NaN PAIRED Illumina NovaSeq 6000 ILLUMINA NaN NaN NaN 2021-02-24 SC
8597 PRJNA736036 7.197057e+07 SAMN27058926 2020-11-13 USA: Iowa NaN Illumina MiSeq sequencing; Amplicon sequencing... NaN This sample has been submitted by pda|wesley-h... NaN ... 1425959-IA-M05216 NaN PAIRED Illumina MiSeq ILLUMINA clinical IA-SHL-1425959 NaN 2020-11-13 SUB11247375
52307 PRJNA716984 2.460988e+06 SAMN23448018 2021-11-11 USA: New Jersey NaN Sequel II sequencing NaN CDC Sars CoV2 Sequencing Baseline Constellation NaN ... Unknown Freed primers PAIRED Sequel II PACBIO_SMRT Nasal Swabs SARS-CoV-2/Human/USA/NJ-CDC-LC0374318/2021 NaN 2021-11-11 NaN

5 rows × 32 columns

sra.columns.tolist()
['study_accession',
 'base_count',
 'accession',
 'collection_date',
 'country',
 'culture_collection',
 'description',
 'sample_collection',
 'sample_title',
 'sequencing_method',
 'sample_material',
 'sample_description',
 'sample_accession',
 'sample_capture_status',
 'sample_alias',
 'library_selection',
 'location',
 'run_accession',
 'read_count',
 'project_name',
 'library_source',
 'library_strategy',
 'library_name',
 'library_construction_protocol',
 'library_layout',
 'instrument_model',
 'instrument_platform',
 'isolation_source',
 'isolate',
 'investigation_type',
 'collection_date_submitted',
 'center_name']
sra['instrument_platform'].value_counts()
instrument_platform
ILLUMINA           81692
PACBIO_SMRT         8431
OXFORD_NANOPORE     8066
ION_TORRENT         1790
BGISEQ                18
DNBSEQ                 3
Name: count, dtype: int64

Clean dates

# Convert collection_date to datetime
# errors='coerce' turns unparseable dates into NaT (Not a Time)
sra = sra.assign(collection_date=pd.to_datetime(sra['collection_date'], errors='coerce'))
print('Earliest entry:', sra['collection_date'].min())
print('Latest entry:', sra['collection_date'].max())
Earliest entry: 2019-12-30 00:00:00
Latest entry: 2023-01-20 00:00:00

⚠️ Data Quality: Don’t be surprised — the metadata is only as good as the person who entered it. When you enter metadata for your own sequencing data, pay attention!
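One way to gauge how messy the dates are is to count how many were coerced to NaT. A sketch with made-up dates (the same pattern works on sra['collection_date']):

```python
import pandas as pd

# Toy dates with one typo, as might appear in hand-entered metadata
dates = pd.Series(['2020-03-15', 'not a date', '2021-07-01'])
parsed = pd.to_datetime(dates, errors='coerce')

# Coerced failures become NaT; .isna() counts them
print(parsed.isna().sum())  # 1
```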

# Filter to valid date range
sra = sra[
    (sra['collection_date'] >= pd.Timestamp('2020-01-01'))
    &
    (sra['collection_date'] <= pd.Timestamp('2023-12-31'))
]

Aggregate for visualization

heatmap_2d = sra.groupby(
    ['instrument_platform', 'library_strategy']
).agg(
    {'run_accession': 'nunique'}
).reset_index()

heatmap_2d
instrument_platform library_strategy run_accession
0 BGISEQ AMPLICON 1
1 BGISEQ OTHER 13
2 BGISEQ RNA-Seq 2
3 BGISEQ Targeted-Capture 2
4 DNBSEQ AMPLICON 3
5 ILLUMINA AMPLICON 78262
6 ILLUMINA OTHER 2
7 ILLUMINA RNA-Seq 524
8 ILLUMINA Targeted-Capture 272
9 ILLUMINA WCS 2
10 ILLUMINA WGA 1389
11 ILLUMINA WGS 977
12 ILLUMINA miRNA-Seq 3
13 ION_TORRENT AMPLICON 1752
14 ION_TORRENT RNA-Seq 3
15 ION_TORRENT WGA 11
16 ION_TORRENT WGS 24
17 OXFORD_NANOPORE AMPLICON 7305
18 OXFORD_NANOPORE RNA-Seq 292
19 OXFORD_NANOPORE WGA 134
20 OXFORD_NANOPORE WGS 227
21 PACBIO_SMRT AMPLICON 8379
22 PACBIO_SMRT RNA-Seq 13
23 PACBIO_SMRT Targeted-Capture 6
24 PACBIO_SMRT WGS 1

Visualize with Altair

import altair as alt
back = alt.Chart(heatmap_2d).mark_rect(opacity=1).encode(
    x=alt.X(
        "instrument_platform:N",
        title="Instrument"
    ),
    y=alt.Y(
        "library_strategy:N",
        title="Strategy",
        axis=alt.Axis(orient='right')
    ),
    color=alt.Color(
        "run_accession:Q",
        title="# Samples",
        scale=alt.Scale(
            scheme="goldred",
            type="log"
        ),
    ),
    tooltip=[
        alt.Tooltip("instrument_platform:N", title="Machine"),
        alt.Tooltip("run_accession:Q", title="Number of runs"),
        alt.Tooltip("library_strategy:N", title="Protocol")
    ]
).properties(
    width=500,
    height=150,
    title={
        "text": ["Breakdown of datasets from ENA",
                 "by Platform and Library Strategy"],
        "subtitle": "(Sample of 100k records)"
    }
)

back
# Add text labels
front = back.mark_text(
    align="center",
    baseline="middle",
    fontSize=12,
    fontWeight="bold",
).encode(
    text=alt.Text("run_accession:Q", format=",.0f"),
    color=alt.condition(
        alt.datum.run_accession > 200,
        alt.value("white"),
        alt.value("black")
    )
)

# Combine layers
back + front