import pandas as pd

Lecture 7: Data Manipulation with Pandas
This tutorial draws on material from:
- Justin Bois
- The BIOS821 course at Duke
- The Pandas documentation
Pandas (from “Panel Data”) is a library for manipulating tabular data in Python. Its central object—the DataFrame—lets you filter, transform, aggregate, and reshape tables with concise, readable code. We will use a small genome statistics dataset throughout this lecture.
Creating a DataFrame
The simplest way to create a DataFrame is from a Python dictionary. Each key becomes a column name, and the corresponding list becomes the column values.
small_df = pd.DataFrame({
'organism': ['E. coli', 'S. cerevisiae', 'H. sapiens'],
'genome_size_mb': [4.6, 12.1, 3100.0],
'num_genes': [4300, 6000, 20000]
})
small_df

| | organism | genome_size_mb | num_genes |
|---|---|---|---|
| 0 | E. coli | 4.6 | 4300 |
| 1 | S. cerevisiae | 12.1 | 6000 |
| 2 | H. sapiens | 3100.0 | 20000 |
print('Shape:', small_df.shape)
print()
print(small_df.dtypes)

Shape: (3, 3)
organism object
genome_size_mb float64
num_genes int64
dtype: object
Exercise
Create a DataFrame called viruses with 3 columns: name, genome_size_kb, and host. Include at least 3 rows of made-up or real virus data. Print its shape.
# Your code here

Reading CSV files
In practice, data lives in files. pd.read_csv() reads a CSV into a DataFrame. Here we load a genome statistics dataset:
df = pd.read_csv('https://raw.githubusercontent.com/nekrut/bda/main/lectures/data/genome_stats.csv')
df.head()

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
print('Shape:', df.shape)
print()
print(df.dtypes)

Shape: (20, 6)
organism object
kingdom object
genome_size_mb float64
gc_percent float64
num_genes int64
habitat object
dtype: object
Exercise
Use .tail() to show the last 3 rows of df. Then use .describe() to get summary statistics. Which organism has the largest genome?
# Your code here

Indexing
There are three main ways to select data from a DataFrame:
- df['col'] — select a single column (returns a Series)
- df.loc[row_label, col_label] — label-based indexing (uses actual row/column names)
- df.iloc[row_pos, col_pos] — integer position-based indexing
# Select a single column
df['organism'].head()

0 Escherichia coli
1 Bacillus subtilis
2 Mycobacterium tuberculosis
3 Staphylococcus aureus
4 Pseudomonas aeruginosa
Name: organism, dtype: object
# Label-based: row 3, column 'gc_percent'
df.loc[3, 'gc_percent']

np.float64(32.9)
# Integer position-based: rows 2 through 4
df.iloc[2:5]

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
Exercise
Using df.loc, extract the organism and habitat columns for rows 5 through 9. Then use df.iloc to get the value in the 4th row, 3rd column.
# Your code here

Filtering
Boolean indexing lets you select rows that match a condition. The expression inside the brackets produces a True/False Series, and only rows where the value is True are kept.
Note: df[condition] is shorthand for df.loc[condition] — both are acceptable. We use the shorter form here for readability.
# All bacteria
df[df['kingdom'] == 'Bacteria']

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
| 5 | Streptococcus pneumoniae | Bacteria | 2.2 | 39.7 | 2100 | host-associated |
| 6 | Vibrio cholerae | Bacteria | 4.0 | 47.5 | 3800 | aquatic |
| 7 | Clostridioides difficile | Bacteria | 4.3 | 29.1 | 3800 | gut |
# Bacteria with GC content above 50%
df[(df['kingdom'] == 'Bacteria') & (df['gc_percent'] > 50)]

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat |
|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil |
Exercise
Filter df to show only organisms that live in "aquatic" habitats. Then find all Eukarya with more than 20,000 genes.
# Your code here

Adding columns
You can create new columns from existing ones. Pandas applies operations element-wise (vectorization)—no for loop needed.
Note: This permanently adds the column to df. All subsequent cells will see 7 columns instead of 6.
df['genes_per_mb'] = df['num_genes'] / df['genome_size_mb']
df[['organism', 'genome_size_mb', 'num_genes', 'genes_per_mb']].head()

| | organism | genome_size_mb | num_genes | genes_per_mb |
|---|---|---|---|---|
| 0 | Escherichia coli | 4.6 | 4300 | 934.782609 |
| 1 | Bacillus subtilis | 4.2 | 4100 | 976.190476 |
| 2 | Mycobacterium tuberculosis | 4.4 | 4000 | 909.090909 |
| 3 | Staphylococcus aureus | 2.8 | 2600 | 928.571429 |
| 4 | Pseudomonas aeruginosa | 6.3 | 5700 | 904.761905 |
Exercise
Add a column gc_category that is "high" when gc_percent > 50 and "low" otherwise. Hint: use np.where() — you’ll need import numpy as np.
# Your code here

Writing CSV
Use df.to_csv() to save a DataFrame back to disk. Setting index=False prevents Pandas from writing the row numbers as an extra unnamed column—without it, the CSV gets cluttered with a meaningless index.
df.to_csv('genome_stats_with_density.csv', index=False)
# Verify round-trip
pd.read_csv('genome_stats_with_density.csv').head(3)

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat | genes_per_mb |
|---|---|---|---|---|---|---|---|
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut | 934.782609 |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil | 976.190476 |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated | 909.090909 |
Exploring categories
Before discussing data organization principles, let’s explore what categories exist in our dataset.
df['kingdom'].unique()

array(['Bacteria', 'Eukarya', 'Archaea'], dtype=object)
df['kingdom'].value_counts()

kingdom
Bacteria 8
Eukarya 8
Archaea 4
Name: count, dtype: int64
df['habitat'].value_counts()

habitat
aquatic 5
terrestrial 5
soil 4
host-associated 4
gut 2
Name: count, dtype: int64
Exercise
How many unique habitats are there? Which habitat has the fewest organisms? Use .unique() and .value_counts().
# Your code here

Tidy data
Hadley Wickham defined tidy data with three rules:
- Each variable is a column.
- Each observation is a row.
- Each type of observational unit forms its own table.
Tidy data is far easier to filter, aggregate, and plot.
Our genome_stats.csv is already tidy: each row is one organism (an observation), each column is a single variable (kingdom, genome size, etc.), and all rows describe the same type of thing—genome statistics.
Wide vs long format
Data often arrives in wide format—one column per experimental condition. This is convenient for spreadsheets but violates tidy principles because the column headers contain data (condition names).
Below is a small gene expression dataset. The values represent expression levels measured in FPKM (Fragments Per Kilobase of transcript per Million mapped reads)—a common unit for RNA-seq experiments.
expr = pd.read_csv('https://raw.githubusercontent.com/nekrut/bda/main/lectures/data/gene_expression_wide.csv')
expr

| | gene | control | heat_shock | starvation |
|---|---|---|---|---|
| 0 | geneA | 5.2 | 12.1 | 3.8 |
| 1 | geneB | 8.7 | 7.9 | 2.1 |
| 2 | geneC | 1.3 | 9.4 | 6.5 |
| 3 | geneD | 4.0 | 4.2 | 11.3 |
The condition columns (control, heat_shock, starvation) are really values of a variable we might call condition. We use pd.melt() to reshape this into tidy (long) format.
Key parameters of pd.melt():
- id_vars — columns to keep as identifiers (not melted)
- var_name — name for the new column created from the old column headers
- value_name — name for the new column holding the cell values
expr_long = pd.melt(
expr,
id_vars=['gene'],
var_name='condition',
value_name='expression'
)
expr_long

| | gene | condition | expression |
|---|---|---|---|
| 0 | geneA | control | 5.2 |
| 1 | geneB | control | 8.7 |
| 2 | geneC | control | 1.3 |
| 3 | geneD | control | 4.0 |
| 4 | geneA | heat_shock | 12.1 |
| 5 | geneB | heat_shock | 7.9 |
| 6 | geneC | heat_shock | 9.4 |
| 7 | geneD | heat_shock | 4.2 |
| 8 | geneA | starvation | 3.8 |
| 9 | geneB | starvation | 2.1 |
| 10 | geneC | starvation | 6.5 |
| 11 | geneD | starvation | 11.3 |
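By default, pd.melt() melts every column not listed in id_vars. When you only want a subset, pass value_vars. A minimal sketch using the first two rows of expr, rebuilt inline so the snippet is self-contained:

```python
import pandas as pd

# First two rows of expr, rebuilt inline for a self-contained example
expr_mini = pd.DataFrame({
    'gene': ['geneA', 'geneB'],
    'control': [5.2, 8.7],
    'heat_shock': [12.1, 7.9],
    'starvation': [3.8, 2.1],
})

# value_vars restricts the melt to the listed columns;
# columns in neither id_vars nor value_vars are dropped from the result
subset_long = pd.melt(
    expr_mini,
    id_vars=['gene'],
    value_vars=['control', 'heat_shock'],  # starvation is left out
    var_name='condition',
    value_name='expression',
)
print(subset_long)
```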
To go back from long to wide, use .pivot():
expr_long.pivot(index='gene', columns='condition', values='expression')

| condition | control | heat_shock | starvation |
|---|---|---|---|
| gene | |||
| geneA | 5.2 | 12.1 | 3.8 |
| geneB | 8.7 | 7.9 | 2.1 |
| geneC | 1.3 | 9.4 | 6.5 |
| geneD | 4.0 | 4.2 | 11.3 |
Exercise
The DataFrame below is in wide format. Melt it into tidy (long) format, then pivot it back to wide.
temps = pd.DataFrame({
'city': ['NYC', 'LA', 'Chicago'],
'jan': [-1, 14, -5],
'jul': [28, 24, 27]
})

# Your code here

Sorting
Use sort_values() to order rows by one or more columns. By default sorting is ascending; pass ascending=False to reverse.
df.sort_values('genome_size_mb')

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat | genes_per_mb |
|---|---|---|---|---|---|---|---|
| 17 | Methanocaldococcus jannaschii | Archaea | 1.7 | 31.3 | 1700 | aquatic | 1000.000000 |
| 19 | Thermococcus kodakarensis | Archaea | 2.1 | 52.0 | 2300 | aquatic | 1095.238095 |
| 5 | Streptococcus pneumoniae | Bacteria | 2.2 | 39.7 | 2100 | host-associated | 954.545455 |
| 18 | Sulfolobus acidocaldarius | Archaea | 2.2 | 36.7 | 2200 | terrestrial | 1000.000000 |
| 16 | Halobacterium salinarum | Archaea | 2.6 | 65.9 | 2600 | aquatic | 1000.000000 |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated | 928.571429 |
| 6 | Vibrio cholerae | Bacteria | 4.0 | 47.5 | 3800 | aquatic | 950.000000 |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil | 976.190476 |
| 7 | Clostridioides difficile | Bacteria | 4.3 | 29.1 | 3800 | gut | 883.720930 |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated | 909.090909 |
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut | 934.782609 |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil | 904.761905 |
| 8 | Saccharomyces cerevisiae | Eukarya | 12.1 | 38.3 | 6000 | soil | 495.867769 |
| 9 | Candida auris | Eukarya | 12.4 | 44.5 | 5400 | host-associated | 435.483871 |
| 11 | Caenorhabditis elegans | Eukarya | 100.3 | 35.4 | 20000 | soil | 199.401795 |
| 12 | Arabidopsis thaliana | Eukarya | 135.0 | 36.0 | 27000 | terrestrial | 200.000000 |
| 10 | Drosophila melanogaster | Eukarya | 144.0 | 42.5 | 14000 | terrestrial | 97.222222 |
| 15 | Danio rerio | Eukarya | 1400.0 | 36.6 | 25000 | aquatic | 17.857143 |
| 14 | Mus musculus | Eukarya | 2700.0 | 42.0 | 22000 | terrestrial | 8.148148 |
| 13 | Homo sapiens | Eukarya | 3100.0 | 40.9 | 20000 | terrestrial | 6.451613 |
# Multi-column sort: kingdom ascending, genome size descending
df.sort_values(
by=['kingdom', 'genome_size_mb'],
ascending=[True, False]
)

| | organism | kingdom | genome_size_mb | gc_percent | num_genes | habitat | genes_per_mb |
|---|---|---|---|---|---|---|---|
| 16 | Halobacterium salinarum | Archaea | 2.6 | 65.9 | 2600 | aquatic | 1000.000000 |
| 18 | Sulfolobus acidocaldarius | Archaea | 2.2 | 36.7 | 2200 | terrestrial | 1000.000000 |
| 19 | Thermococcus kodakarensis | Archaea | 2.1 | 52.0 | 2300 | aquatic | 1095.238095 |
| 17 | Methanocaldococcus jannaschii | Archaea | 1.7 | 31.3 | 1700 | aquatic | 1000.000000 |
| 4 | Pseudomonas aeruginosa | Bacteria | 6.3 | 66.6 | 5700 | soil | 904.761905 |
| 0 | Escherichia coli | Bacteria | 4.6 | 50.8 | 4300 | gut | 934.782609 |
| 2 | Mycobacterium tuberculosis | Bacteria | 4.4 | 65.6 | 4000 | host-associated | 909.090909 |
| 7 | Clostridioides difficile | Bacteria | 4.3 | 29.1 | 3800 | gut | 883.720930 |
| 1 | Bacillus subtilis | Bacteria | 4.2 | 43.5 | 4100 | soil | 976.190476 |
| 6 | Vibrio cholerae | Bacteria | 4.0 | 47.5 | 3800 | aquatic | 950.000000 |
| 3 | Staphylococcus aureus | Bacteria | 2.8 | 32.9 | 2600 | host-associated | 928.571429 |
| 5 | Streptococcus pneumoniae | Bacteria | 2.2 | 39.7 | 2100 | host-associated | 954.545455 |
| 13 | Homo sapiens | Eukarya | 3100.0 | 40.9 | 20000 | terrestrial | 6.451613 |
| 14 | Mus musculus | Eukarya | 2700.0 | 42.0 | 22000 | terrestrial | 8.148148 |
| 15 | Danio rerio | Eukarya | 1400.0 | 36.6 | 25000 | aquatic | 17.857143 |
| 10 | Drosophila melanogaster | Eukarya | 144.0 | 42.5 | 14000 | terrestrial | 97.222222 |
| 12 | Arabidopsis thaliana | Eukarya | 135.0 | 36.0 | 27000 | terrestrial | 200.000000 |
| 11 | Caenorhabditis elegans | Eukarya | 100.3 | 35.4 | 20000 | soil | 199.401795 |
| 9 | Candida auris | Eukarya | 12.4 | 44.5 | 5400 | host-associated | 435.483871 |
| 8 | Saccharomyces cerevisiae | Eukarya | 12.1 | 38.3 | 6000 | soil | 495.867769 |
Exercise
Sort df by num_genes in descending order. Which organism has the most genes? Then sort by habitat ascending and gc_percent descending.
# Your code here

Split-apply-combine
Many analyses follow a three-step pattern described by Hadley Wickham:
- Split the data into groups (e.g., by kingdom)
- Apply a function to each group (e.g., compute the mean)
- Combine the results into a new table
In Pandas this is done with groupby().
Aggregation
df.groupby('kingdom')['genome_size_mb'].mean()

kingdom
Archaea 2.150
Bacteria 4.100
Eukarya 950.475
Name: genome_size_mb, dtype: float64
df.groupby('kingdom')['genome_size_mb'].describe()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| kingdom | ||||||||
| Archaea | 4.0 | 2.150 | 0.369685 | 1.7 | 2.000 | 2.15 | 2.30 | 2.6 |
| Bacteria | 8.0 | 4.100 | 1.227076 | 2.2 | 3.700 | 4.25 | 4.45 | 6.3 |
| Eukarya | 8.0 | 950.475 | 1291.848037 | 12.1 | 78.325 | 139.50 | 1725.00 | 3100.0 |
# Group by two columns
df.groupby(['kingdom', 'habitat'])['num_genes'].mean()

kingdom   habitat
Archaea aquatic 2200.0
terrestrial 2200.0
Bacteria aquatic 3800.0
gut 4050.0
host-associated 2900.0
soil 4900.0
Eukarya aquatic 25000.0
host-associated 5400.0
soil 13000.0
terrestrial 20750.0
Name: num_genes, dtype: float64
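.mean() and .describe() cover the common cases; .agg() lets you compute several custom statistics at once. The summary at the end of this lecture mentions it, so here is a minimal sketch using named aggregation on a toy three-row table (not the full lecture dataset):

```python
import pandas as pd

# Toy stand-in for the lecture dataset
df_mini = pd.DataFrame({
    'kingdom': ['Bacteria', 'Bacteria', 'Archaea'],
    'genome_size_mb': [4.6, 4.2, 1.7],
    'num_genes': [4300, 4100, 1700],
})

# Named aggregation: output_column=(input_column, function)
summary = df_mini.groupby('kingdom').agg(
    mean_size=('genome_size_mb', 'mean'),
    max_genes=('num_genes', 'max'),
    n_organisms=('num_genes', 'size'),
)
print(summary)
```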
Exercise
Use groupby to find the maximum gc_percent for each kingdom. Then compute the mean num_genes grouped by habitat.
# Your code here

Working with multiple tables
Real analyses often require combining information from different sources. pd.merge() joins two DataFrames on a shared key—similar to SQL joins.
Let’s create two small DataFrames to demonstrate different join types.
taxonomy = pd.DataFrame({
'organism': ['Escherichia coli', 'Saccharomyces cerevisiae',
'Homo sapiens', 'Halobacterium salinarum'],
'phylum': ['Pseudomonadota', 'Ascomycota',
'Chordata', 'Euryarchaeota']
})
taxonomy

| | organism | phylum |
|---|---|---|
| 0 | Escherichia coli | Pseudomonadota |
| 1 | Saccharomyces cerevisiae | Ascomycota |
| 2 | Homo sapiens | Chordata |
| 3 | Halobacterium salinarum | Euryarchaeota |
isolation = pd.DataFrame({
'organism': ['Escherichia coli', 'Bacillus subtilis',
'Homo sapiens', 'Drosophila melanogaster'],
'first_sequenced': [1997, 1997, 2001, 2000]
})
isolation

| | organism | first_sequenced |
|---|---|---|
| 0 | Escherichia coli | 1997 |
| 1 | Bacillus subtilis | 1997 |
| 2 | Homo sapiens | 2001 |
| 3 | Drosophila melanogaster | 2000 |
Inner join
Keeps only rows whose key appears in both tables.
pd.merge(taxonomy, isolation, on='organism')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Escherichia coli | Pseudomonadota | 1997 |
| 1 | Homo sapiens | Chordata | 2001 |
Left join
Keeps all rows from the left table; fills missing matches with NaN.
pd.merge(taxonomy, isolation, on='organism', how='left')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Escherichia coli | Pseudomonadota | 1997.0 |
| 1 | Saccharomyces cerevisiae | Ascomycota | NaN |
| 2 | Homo sapiens | Chordata | 2001.0 |
| 3 | Halobacterium salinarum | Euryarchaeota | NaN |
What is NaN? NaN stands for “Not a Number” and represents missing data. When a left join finds no matching row in the right table, Pandas fills those cells with NaN.
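NaN values can also be counted and handled explicitly with .isna(), .fillna(), and .dropna(). A minimal sketch with a hypothetical two-row merge result (column names mirror the left-join output above):

```python
import pandas as pd

# Hypothetical left-join result with one missing match
merged = pd.DataFrame({
    'organism': ['Escherichia coli', 'Saccharomyces cerevisiae'],
    'first_sequenced': [1997.0, float('nan')],
})

# .isna() flags missing values; summing the booleans counts them
n_missing = merged['first_sequenced'].isna().sum()

# .fillna() substitutes a placeholder; .dropna() would drop those rows instead
filled = merged['first_sequenced'].fillna(0)
print(n_missing, filled.tolist())
```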
Right join
Keeps all rows from the right table.
pd.merge(taxonomy, isolation, on='organism', how='right')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Escherichia coli | Pseudomonadota | 1997 |
| 1 | Bacillus subtilis | NaN | 1997 |
| 2 | Homo sapiens | Chordata | 2001 |
| 3 | Drosophila melanogaster | NaN | 2000 |
Outer join
Keeps all rows from both tables.
pd.merge(taxonomy, isolation, on='organism', how='outer')

| | organism | phylum | first_sequenced |
|---|---|---|---|
| 0 | Bacillus subtilis | NaN | 1997.0 |
| 1 | Drosophila melanogaster | NaN | 2000.0 |
| 2 | Escherichia coli | Pseudomonadota | 1997.0 |
| 3 | Halobacterium salinarum | Euryarchaeota | NaN |
| 4 | Homo sapiens | Chordata | 2001.0 |
| 5 | Saccharomyces cerevisiae | Ascomycota | NaN |
Exercise
Create two DataFrames and merge them:
habitats = pd.DataFrame({
'habitat': ['gut', 'soil', 'aquatic'],
'temperature_c': [37, 20, 15]
})

Merge habitats with df using a left join on 'habitat'. How many rows have NaN for temperature_c? Why?
# Your code here

Visualization with Altair
Let’s make a quick bar chart of mean genome size by kingdom using groupby + Altair.
Altair uses single-letter encoding type suffixes:
- :N — nominal (categorical, unordered) data like kingdom names
- :Q — quantitative (numeric, continuous) data like genome size
- :O — ordinal (ordered categories)
- :T — temporal (dates/times)
import altair as alt
kingdom_means = df.groupby('kingdom')['genome_size_mb'].mean().reset_index()
alt.Chart(kingdom_means).mark_bar().encode(
x=alt.X('kingdom:N', title='Kingdom'),
y=alt.Y('genome_size_mb:Q', title='Mean genome size (Mb)'),
color='kingdom:N'
).properties(
width=300,
title='Mean Genome Size by Kingdom'
)

Eukarya genomes are orders of magnitude larger than Bacteria and Archaea, making the smaller kingdoms nearly invisible on a linear scale. A logarithmic y-axis lets us compare all three kingdoms meaningfully.
alt.Chart(kingdom_means).mark_bar().encode(
x=alt.X('kingdom:N', title='Kingdom'),
y=alt.Y('genome_size_mb:Q', title='Mean genome size (Mb)', scale=alt.Scale(type='log', domainMin=1)),
color='kingdom:N'
).properties(
width=300,
title='Mean Genome Size by Kingdom (log scale)'
)

Exercise
Create a scatter plot of genome_size_mb (x) vs num_genes (y), colored by kingdom. Use mark_point() instead of mark_bar().
# Your code here

Summary
- DataFrame basics: create from dict, read CSV, inspect with .head(), .shape, .dtypes
- Indexing: [], .loc (label), .iloc (position)
- Filtering: boolean indexing with & for compound conditions
- New columns: vectorized arithmetic—no loops needed
- Tidy data: each variable a column, each observation a row
- Reshaping: melt() (wide → long), pivot() (long → wide)
- Sorting: sort_values() with multi-column and mixed ascending/descending
- Split-apply-combine: groupby() → .mean(), .describe(), .agg()
- Joins: pd.merge() with inner, left, right, outer
- Visualization: Altair integrates directly with DataFrames
A more realistic example
The Sequence Read Archive (SRA) is the largest public repository of sequencing data, mirrored by the European Nucleotide Archive (ENA). Here we analyze SARS-CoV-2 metadata to understand how sequencing platforms and library protocols were used during the pandemic.
Setup
import pandas as pd

We use a pre-compiled ENA metadata snapshot hosted on Zenodo. The file contains ~800k records; we load the first 100k for speed.
sra = pd.read_csv(
"https://zenodo.org/records/10680776/files/ena.tsv.gz",
compression='gzip',
sep="\t",
low_memory=False,
nrows=100000
)

Explore the data
len(sra)

100000
sra.sample(5)

| | study_accession | base_count | accession | collection_date | country | culture_collection | description | sample_collection | sample_title | sequencing_method | ... | library_name | library_construction_protocol | library_layout | instrument_model | instrument_platform | isolation_source | isolate | investigation_type | collection_date_submitted | center_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 98517 | PRJEB37886 | 1.021151e+09 | SAMEA10938693 | 2021-10-29 | United Kingdom | NaN | Illumina NovaSeq 6000 sequencing; Illumina Nov... | NaN | COG-UK/NEWC-28D0557 | NaN | ... | NT1705557N / HT-122795:H9 | NaN | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | NaN | NaN | NaN | 2021-10-29 | SC |
| 30031 | PRJEB43060 | 2.903531e+08 | SAMEA12516416 | 2021-08-04 | Norway | NaN | Illumina NovaSeq 6000 sequencing; Raw reads: S... | NaN | SARS-CoV-2/human/Norway/16523/2021/1 | NaN | ... | NaN | NaN | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | upper respiratory secretion | SARS-CoV-2/human/Norway/16523/2021 | NaN | 2021-08-04 | Norwegian Institute of Public Health (NIPH) |
| 60679 | PRJEB37886 | 8.712552e+08 | SAMEA8237191 | 2021-02-24 | United Kingdom | NaN | Illumina NovaSeq 6000 sequencing; Illumina Nov... | NaN | COG-UK/CAMC-1359F50 | NaN | ... | NT1665242V / DN785905B:A10 | NaN | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | NaN | NaN | NaN | 2021-02-24 | SC |
| 8597 | PRJNA736036 | 7.197057e+07 | SAMN27058926 | 2020-11-13 | USA: Iowa | NaN | Illumina MiSeq sequencing; Amplicon sequencing... | NaN | This sample has been submitted by pda|wesley-h... | NaN | ... | 1425959-IA-M05216 | NaN | PAIRED | Illumina MiSeq | ILLUMINA | clinical | IA-SHL-1425959 | NaN | 2020-11-13 | SUB11247375 |
| 52307 | PRJNA716984 | 2.460988e+06 | SAMN23448018 | 2021-11-11 | USA: New Jersey | NaN | Sequel II sequencing | NaN | CDC Sars CoV2 Sequencing Baseline Constellation | NaN | ... | Unknown | Freed primers | PAIRED | Sequel II | PACBIO_SMRT | Nasal Swabs | SARS-CoV-2/Human/USA/NJ-CDC-LC0374318/2021 | NaN | 2021-11-11 | NaN |
5 rows × 32 columns
sra.columns.tolist()

['study_accession',
'base_count',
'accession',
'collection_date',
'country',
'culture_collection',
'description',
'sample_collection',
'sample_title',
'sequencing_method',
'sample_material',
'sample_description',
'sample_accession',
'sample_capture_status',
'sample_alias',
'library_selection',
'location',
'run_accession',
'read_count',
'project_name',
'library_source',
'library_strategy',
'library_name',
'library_construction_protocol',
'library_layout',
'instrument_model',
'instrument_platform',
'isolation_source',
'isolate',
'investigation_type',
'collection_date_submitted',
'center_name']
sra['instrument_platform'].value_counts()

instrument_platform
ILLUMINA 81692
PACBIO_SMRT 8431
OXFORD_NANOPORE 8066
ION_TORRENT 1790
BGISEQ 18
DNBSEQ 3
Name: count, dtype: int64
Clean dates
# Convert collection_date to datetime
# errors='coerce' turns unparseable dates into NaT (Not a Time)
sra = sra.assign(collection_date=pd.to_datetime(sra['collection_date'], errors='coerce'))

print('Earliest entry:', sra['collection_date'].min())
print('Latest entry:', sra['collection_date'].max())

Earliest entry: 2019-12-30 00:00:00
Latest entry: 2023-01-20 00:00:00
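One way to spot suspicious dates is to count records per collection year with the .dt accessor. A minimal sketch on a hypothetical four-row stand-in for sra:

```python
import pandas as pd

# Hypothetical four-row stand-in for the parsed sra table
sra_mini = pd.DataFrame({
    'collection_date': pd.to_datetime(
        ['2020-03-01', '2020-11-13', '2021-08-04', '2021-10-29']
    )
})

# .dt exposes datetime components (year, month, day, ...) on a Series
per_year = sra_mini['collection_date'].dt.year.value_counts().sort_index()
print(per_year)
```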
⚠️ Data Quality: Don’t be surprised — the metadata is only as good as the person who entered it. When you submit metadata for your own sequencing data, pay attention!
# Filter to valid date range
sra = sra[
(sra['collection_date'] >= pd.Timestamp('2020-01-01'))
&
(sra['collection_date'] <= pd.Timestamp('2023-12-31'))
]

Aggregate for visualization
heatmap_2d = sra.groupby(
['instrument_platform', 'library_strategy']
).agg(
{'run_accession': 'nunique'}
).reset_index()
heatmap_2d

| | instrument_platform | library_strategy | run_accession |
|---|---|---|---|
| 0 | BGISEQ | AMPLICON | 1 |
| 1 | BGISEQ | OTHER | 13 |
| 2 | BGISEQ | RNA-Seq | 2 |
| 3 | BGISEQ | Targeted-Capture | 2 |
| 4 | DNBSEQ | AMPLICON | 3 |
| 5 | ILLUMINA | AMPLICON | 78262 |
| 6 | ILLUMINA | OTHER | 2 |
| 7 | ILLUMINA | RNA-Seq | 524 |
| 8 | ILLUMINA | Targeted-Capture | 272 |
| 9 | ILLUMINA | WCS | 2 |
| 10 | ILLUMINA | WGA | 1389 |
| 11 | ILLUMINA | WGS | 977 |
| 12 | ILLUMINA | miRNA-Seq | 3 |
| 13 | ION_TORRENT | AMPLICON | 1752 |
| 14 | ION_TORRENT | RNA-Seq | 3 |
| 15 | ION_TORRENT | WGA | 11 |
| 16 | ION_TORRENT | WGS | 24 |
| 17 | OXFORD_NANOPORE | AMPLICON | 7305 |
| 18 | OXFORD_NANOPORE | RNA-Seq | 292 |
| 19 | OXFORD_NANOPORE | WGA | 134 |
| 20 | OXFORD_NANOPORE | WGS | 227 |
| 21 | PACBIO_SMRT | AMPLICON | 8379 |
| 22 | PACBIO_SMRT | RNA-Seq | 13 |
| 23 | PACBIO_SMRT | Targeted-Capture | 6 |
| 24 | PACBIO_SMRT | WGS | 1 |
Visualize with Altair
import altair as alt

back = alt.Chart(heatmap_2d).mark_rect(opacity=1).encode(
x=alt.X(
"instrument_platform:N",
title="Instrument"
),
y=alt.Y(
"library_strategy:N",
title="Strategy",
axis=alt.Axis(orient='right')
),
color=alt.Color(
"run_accession:Q",
title="# Samples",
scale=alt.Scale(
scheme="goldred",
type="log"
),
),
tooltip=[
alt.Tooltip("instrument_platform:N", title="Machine"),
alt.Tooltip("run_accession:Q", title="Number of runs"),
alt.Tooltip("library_strategy:N", title="Protocol")
]
).properties(
width=500,
height=150,
title={
"text": ["Breakdown of datasets from ENA",
"by Platform and Library Strategy"],
"subtitle": "(Sample of 100k records)"
}
)
back

# Add text labels
front = back.mark_text(
align="center",
baseline="middle",
fontSize=12,
fontWeight="bold",
).encode(
text=alt.Text("run_accession:Q", format=",.0f"),
color=alt.condition(
alt.datum.run_accession > 200,
alt.value("white"),
alt.value("black")
)
)
# Combine layers
back + front