Lecture 8: Introduction to Altair

This notebook introduces Altair, a Python library for creating statistical visualizations. We start with the basics and progressively build toward analyzing real-world genomic metadata.

By the end, you will be able to:

- Create basic charts (scatter, bar, line)
- Encode data fields to visual properties
- Aggregate and transform data within charts
- Customize colors, scales, and labels
- Layer multiple chart elements
- Build publication-quality heatmaps

Part 1: What is Declarative Visualization?

Think of ordering food at a restaurant. You don’t walk into the kitchen and say “heat the pan to 375°F, dice the onions, sauté for 3 minutes…” — you just say “I’d like the pasta.” That’s the difference between imperative and declarative.

Imperative (matplotlib) — you specify how to draw, step by step. You manage coordinates, colors, labels, legends, and layout yourself:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
cities = ['Seattle', 'New York', 'Chicago']
temps = [53.7, 54.0, 48.7]  # per-city averages of `temp`, precomputed by hand from the Part 2 table
colors = ['#4c78a8', '#f58518', '#e45756']
bars = ax.barh(cities, temps, color=colors)
ax.set_xlabel('Average Temperature (°F)')
ax.set_title('Average Temperature by City')
ax.bar_label(bars, fmt='%.1f')
ax.set_xlim(0, 65)
plt.tight_layout()
plt.show()

Declarative (Altair) — you describe what you want to see. You state the relationships between your data and visual properties. Altair handles scales, axes, labels, and layout automatically:

# `alt` and `weather` are defined in Part 2 below
alt.Chart(weather).mark_bar().encode(
    x='average(temp):Q',
    y='city:N',
    color='city:N'
)

The key difference: with matplotlib you compute the averages yourself, position each bar, pick colors, format labels, and manage layout. With Altair you declare “show average temperature by city, color by city” and the library does the rest — including aggregation, axis scaling, and a legend.

Part 2: Setup

import pandas as pd
import altair as alt

Let’s create a simple dataset to work with, covering monthly precipitation and temperature for three cities:

weather = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle',
             'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
             'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago'],
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'precip': [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
               3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
               2.0, 1.9, 2.6, 3.7, 4.1, 4.0],
    'temp': [42, 45, 50, 55, 62, 68,
             35, 38, 48, 58, 68, 77,
             28, 32, 42, 52, 64, 74]
})

weather

	city	month	precip	temp
0	Seattle	Jan	5.2	42
1	Seattle	Feb	3.9	45
2	Seattle	Mar	4.1	50
3	Seattle	Apr	2.8	55
4	Seattle	May	2.1	62
5	Seattle	Jun	1.6	68
6	New York	Jan	3.6	35
7	New York	Feb	3.1	38
8	New York	Mar	4.2	48
9	New York	Apr	4.0	58
10	New York	May	4.5	68
11	New York	Jun	4.2	77
12	Chicago	Jan	2.0	28
13	Chicago	Feb	1.9	32
14	Chicago	Mar	2.6	42
15	Chicago	Apr	3.7	52
16	Chicago	May	4.1	64
17	Chicago	Jun	4.0	74

This is tidy data: each row is one observation, each column is one variable. Altair works best with tidy data.
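
Data often arrives in a "wide" layout instead, with one column per month. The sketch below shows one way to reshape such a table into tidy form with pandas `melt()`. The `wide` table is a made-up two-city excerpt used only for illustration.

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per month.
wide = pd.DataFrame({
    'city': ['Seattle', 'Chicago'],
    'Jan': [5.2, 2.0],
    'Feb': [3.9, 1.9],
})

# melt() turns the month columns into (month, precip) pairs,
# so each row becomes a single city-month observation.
tidy = wide.melt(id_vars='city', var_name='month', value_name='precip')
print(tidy)
```

The result has one row per observation and the columns `city`, `month`, and `precip`, which is the shape the charts in this notebook expect.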

Part 3: Your First Chart

The Three Building Blocks

Every Altair chart has three components:

Data — a pandas DataFrame
Mark — the visual shape (point, bar, line, etc.)
Encoding — which data fields map to which visual properties

Creating a Chart Object

Start by wrapping your DataFrame in alt.Chart():

# This creates a chart object - it stores data but can't render without a mark
chart = alt.Chart(weather)
print(type(chart))  # It's an Altair Chart object

<class 'altair.vegalite.v6.api.Chart'>

The chart object exists but can’t display—Altair requires a mark to render. Let’s add one.

Adding a Mark

# mark_point() draws circles
alt.Chart(weather).mark_point()

We see points, but they’re all stacked on top of each other. We need encodings to spread them out.

Adding Encodings

Encodings map data columns to visual channels like position (x, y), color, size, etc.

alt.Chart(weather).mark_point().encode(
    x='precip',
    y='city'
)

Now each point is positioned:

- Horizontally by precipitation value
- Vertically by city name

Notice how Altair automatically:

- Created axis labels from column names
- Scaled the x-axis to fit the data
- Separated cities on the y-axis

Different Mark Types

Altair provides many mark types. Here are the most common:

# Bar chart
alt.Chart(weather).mark_bar().encode(
    x='precip',
    y='city'
)

# Line chart
alt.Chart(weather).mark_line().encode(
    x='month',
    y='precip'
)

The line chart connects all points. We’ll learn how to separate by city later using color encoding.

📝 Exercise 1: Create a scatter plot with temp on x-axis and precip on y-axis.

Part 4: Data Types

Altair needs to know the type of each data field to choose appropriate scales and displays:

Type	Code	Description	Example
Quantitative	`:Q`	Numerical values	Temperature, price
Nominal	`:N`	Categories (no order)	City names, colors
Ordinal	`:O`	Ordered categories	Small/Medium/Large
Temporal	`:T`	Date/time	2024-01-15

You specify types by adding them after the field name with a colon:

# Explicit type annotations
alt.Chart(weather).mark_bar().encode(
    x='precip:Q',  # Quantitative
    y='city:N'     # Nominal
)

Altair usually guesses correctly, but explicit types prevent surprises.

Controlling Sort Order

By default, Altair sorts discrete axis values (nominal or ordinal strings) alphabetically. To get chronological month order, specify the sort order explicitly:

# Default: sorted alphabetically (Apr, Feb, Jan, Jun, Mar, May)
alt.Chart(weather).mark_bar().encode(
    x='month:O',
    y='average(precip):Q'
)

# Explicit sort: chronological order
alt.Chart(weather).mark_bar().encode(
    x=alt.X('month:O', sort=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']),
    y='average(precip):Q'
)

📝 Exercise 2: Create a bar chart showing average temperature per month with proper chronological order.

Part 5: Aggregation

In Altair, you can aggregate directly in the encoding string:

alt.Chart(weather).mark_bar().encode(
    x='average(precip):Q',  # Average of precip column
    y='city:N'
)

Altair automatically grouped by city and calculated the average for each.
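
For comparison, this is roughly the computation you would do yourself in pandas to get the same numbers. It is a sketch on a compact copy of the `city` and `precip` columns from the weather table.

```python
import pandas as pd

# Compact copy of the city and precip columns from the weather table.
weather = pd.DataFrame({
    'city': ['Seattle'] * 6 + ['New York'] * 6 + ['Chicago'] * 6,
    'precip': [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
               3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
               2.0, 1.9, 2.6, 3.7, 4.1, 4.0],
})

# Group rows by city, then average precip within each group.
avg_precip = weather.groupby('city')['precip'].mean()
print(avg_precip.round(2))
```

The string `'average(precip):Q'` asks for this same group-and-average step, with the grouping taken from the other fields in the encoding.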

Available Aggregation Functions

count() — number of rows
sum(field) — total
average(field) or mean(field) — average
median(field) — median
min(field) / max(field) — extremes
stdev(field) — standard deviation

# Count observations per city
alt.Chart(weather).mark_bar().encode(
    x='count():Q',
    y='city:N'
)

# Max temperature per city
alt.Chart(weather).mark_bar().encode(
    x='max(temp):Q',
    y='city:N'
)

📝 Exercise 3: Create a bar chart showing total precipitation per month (across all cities).

Part 6: Color Encoding

alt.Chart(weather).mark_line().encode(
    x='month:O',
    y='precip:Q',
    color='city:N'  # Different color for each city
)

Each city now has its own line with a distinct color. Altair added a legend automatically.

Color for Quantitative Data

You can also map numeric values to color intensity. See Vega Color Schemes for all available palettes.

alt.Chart(weather).mark_circle(size=100).encode(
    x='month:O',
    y='city:N',
    color='precip:Q'  # Color intensity shows precipitation
)

# Try different color schemes from Vega
alt.Chart(weather).mark_circle(size=100).encode(
    x='month:O',
    y='city:N',
    # Other schemes to try: 'plasma', 'inferno', 'magma', 'turbo', 'blues', 'greens',
    # 'oranges', 'reds', 'purples', 'goldred', 'redyellowblue'
    color=alt.Color('precip:Q', scale=alt.Scale(scheme='viridis'))
)

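With the default scheme in the first chart, darker colors indicate higher precipitation; 'viridis' runs the other way, from dark purple for low values to bright yellow for high ones. This is the foundation of a heatmap!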

📝 Exercise 4: Create a scatter plot of temp vs precip with color encoding for city.

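Part 7: Customization

Arguments passed to the mark method set fixed values for every mark, independent of the data:

alt.Chart(weather).mark_point(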
    color='firebrick',  # Fixed color for all points
    size=100,           # Fixed size
    opacity=0.7         # Transparency
).encode(
    x='temp:Q',
    y='precip:Q'
)

alt.Chart(weather).mark_bar(color='steelblue').encode(
    x='average(precip):Q',
    y='city:N'
).properties(
    width=400,
    height=150,
    title='Average Precipitation by City'
)

Axis and Scale Customization

For more control, use alt.X() and alt.Y() objects instead of strings:

alt.Chart(weather).mark_bar(color='teal').encode(
    x=alt.X(
        'average(precip):Q',
        title='Average Precipitation (inches)',  # Custom axis title
        scale=alt.Scale(domain=[0, 6])           # Fixed axis range
    ),
    y=alt.Y(
        'city:N',
        title='City',
        axis=alt.Axis(labelFontSize=12)          # Larger labels
    )
).properties(
    width=400,
    height=150
)

Color Schemes

Altair includes many built-in color schemes:

alt.Chart(weather).mark_circle(size=200).encode(
    x='month:O',
    y='city:N',
    color=alt.Color(
        'precip:Q',
        scale=alt.Scale(scheme='blues')  # Blue color gradient
    )
).properties(width=300, height=150)

Popular schemes: 'blues', 'greens', 'oranges', 'viridis', 'goldred', 'redyellowblue'

📝 Exercise 5: Create a bar chart of average temperature per city with orange bars and a title.

# Heatmap base
heatmap = alt.Chart(weather).mark_rect().encode(
    x='month:O',
    y='city:N',
    color=alt.Color('precip:Q', scale=alt.Scale(scheme='goldred'))
)

# Text with conditional color
text = alt.Chart(weather).mark_text(
    fontSize=12,
    fontWeight='bold'
).encode(
    x='month:O',
    y='city:N',
    text=alt.Text('precip:Q', format='.1f'),
    color=alt.condition(
        alt.datum.precip > 3.5,    # If precip > 3.5
        alt.value('white'),         # Use white text
        alt.value('black')          # Otherwise black
    )
)

(heatmap + text).properties(width=300, height=150)

Now high values have white text (readable on dark red) and low values have black text.

# Data with huge range
wide_range = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D'],
    'value': [10, 100, 1000, 50000]
})

# Linear scale - small values barely visible
alt.Chart(wide_range).mark_bar().encode(
    x='category:N',
    y='value:Q'
).properties(title='Linear Scale')

📝 Exercise 6: Create a bar chart of average temperature per city with text labels on the bars.

# Log color scale - shows variation across all magnitudes
alt.Chart(wide_range).mark_rect().encode(
    x='category:N',
    y=alt.value(1),  # Single row
    color=alt.Color(
        'value:Q',
        scale=alt.Scale(scheme='goldred', type='log')
    )
).properties(width=300, height=50, title='Log Color Scale')
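
To see why the log scale helps, it is worth working out where each value lands along the scale, as a fraction between the minimum and the maximum. This is plain arithmetic on the four values above, not Altair's internal code.

```python
import math

values = [10, 100, 1000, 50000]
lo, hi = min(values), max(values)

linear_pos = {}
log_pos = {}
for v in values:
    # Fraction of the way from lo to hi on a linear scale.
    linear_pos[v] = (v - lo) / (hi - lo)
    # Same fraction after taking log10 of everything.
    log_pos[v] = (math.log10(v) - math.log10(lo)) / (math.log10(hi) - math.log10(lo))
    print(f"{v:>6}: linear {linear_pos[v]:.3f}, log {log_pos[v]:.3f}")
```

On the linear scale, 10, 100, and 1000 all fall within the first 2% of the range, so their colors or bar heights are nearly indistinguishable. On the log scale, 1000 sits more than halfway along.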

Part 8: Real-World Example — SRA Metadata

The Sequence Read Archive (SRA) is the largest public repository of sequencing data. Here we analyze SARS-CoV-2 metadata to understand how sequencing platforms and library protocols were used during the pandemic.

# Load SRA metadata snapshot from Zenodo (first 100k records for speed)
sra = pd.read_csv(
    "https://zenodo.org/records/10680776/files/ena.tsv.gz",
    compression='gzip',
    sep="\t",
    low_memory=False,
    nrows=100000
)

sra.sample(3)

	study_accession	base_count	accession	collection_date	country	culture_collection	description	sample_collection	sample_title	sequencing_method	...	library_name	library_construction_protocol	library_layout	instrument_model	instrument_platform	isolation_source	isolate	investigation_type	collection_date_submitted	center_name
35094	PRJEB43060	242603897.0	SAMEA13362767	2020-10-03	Norway	NaN	Illumina MiSeq sequencing; Raw reads: SARS-CoV...	NaN	SARS-CoV-2/human/Norway/4099/2020/1	NaN	...	NaN	NaN	PAIRED	Illumina MiSeq	ILLUMINA	not provided	SARS-CoV-2/human/Norway/4099/2020	NaN	2020-10-03	Norwegian Institute of Public Health (NIPH)
27221	PRJEB52934	82336750.0	SAMEA14459592	2022-04-16	Estonia	NaN	Illumina NovaSeq 6000 sequencing; Illumina Nov...	NaN	PCR tiled amplicon WGS of SARS-Cov-2, pre-sele...	NaN	...	EstECDC0112	Produced for Workflow v.1.8.9 (Eurofins Genom...	PAIRED	Illumina NovaSeq 6000	ILLUMINA	NaN	NaN	NaN	2022-04-16	Health Board of Estonia
82022	PRJEB37886	855160155.0	SAMEA10170806	2021-09-13	United Kingdom	NaN	Illumina NovaSeq 6000 sequencing; Illumina Nov...	NaN	COG-UK/ALDP-1E70310	NaN	...	NT1696099F / HT-119742:F1	NaN	PAIRED	Illumina NovaSeq 6000	ILLUMINA	NaN	NaN	NaN	2021-09-13	SC

3 rows × 32 columns

⚠️ Data Quality: Metadata is only as reliable as the people who entered it. Always validate date ranges!
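
A date check might look like the sketch below. The four records are made up, and both cutoff dates are placeholder assumptions that you should adjust to your own snapshot; only the `collection_date` column name comes from the table above.

```python
import pandas as pd

# Hypothetical records mimicking the collection_date column.
records = pd.DataFrame({
    'accession': ['S1', 'S2', 'S3', 'S4'],
    'collection_date': ['2020-10-03', '1900-01-01', 'not provided', '2022-04-16'],
})

# Unparseable entries become NaT instead of raising an error.
dates = pd.to_datetime(records['collection_date'], errors='coerce')

# Placeholder bounds (assumptions): around the start of the pandemic
# and an arbitrary upper limit for the snapshot.
earliest = pd.Timestamp('2019-12-01')
latest = pd.Timestamp('2024-02-29')

# NaT compares as False, so missing dates are flagged too.
records['date_ok'] = dates.between(earliest, latest)
print(records)
```

Here the 1900 date and the free-text entry are flagged, while the two plausible dates pass.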

Aggregate for Visualization

# Group by platform and library strategy, count unique runs
heatmap_data = sra.groupby(
    ['instrument_platform', 'library_strategy']
).agg(
    {'run_accession': 'nunique'}
).reset_index()

heatmap_data

	instrument_platform	library_strategy	run_accession
0	BGISEQ	AMPLICON	1
1	BGISEQ	OTHER	13
2	BGISEQ	RNA-Seq	2
3	BGISEQ	Targeted-Capture	2
4	DNBSEQ	AMPLICON	3
5	ILLUMINA	AMPLICON	78448
6	ILLUMINA	OTHER	3
7	ILLUMINA	RNA-Seq	554
8	ILLUMINA	Targeted-Capture	273
9	ILLUMINA	WCS	2
10	ILLUMINA	WGA	1389
11	ILLUMINA	WGS	1020
12	ILLUMINA	miRNA-Seq	3
13	ION_TORRENT	AMPLICON	1752
14	ION_TORRENT	RNA-Seq	3
15	ION_TORRENT	WGA	11
16	ION_TORRENT	WGS	24
17	OXFORD_NANOPORE	AMPLICON	7407
18	OXFORD_NANOPORE	OTHER	2
19	OXFORD_NANOPORE	RNA-Seq	296
20	OXFORD_NANOPORE	WGA	134
21	OXFORD_NANOPORE	WGS	227
22	PACBIO_SMRT	AMPLICON	8411
23	PACBIO_SMRT	RNA-Seq	13
24	PACBIO_SMRT	Targeted-Capture	6
25	PACBIO_SMRT	WGS	1
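
The aggregation uses `nunique` rather than a plain row count. The toy example below, with invented accessions, shows the difference when a run appears in more than one row.

```python
import pandas as pd

# Hypothetical rows; run R1 is listed twice.
runs = pd.DataFrame({
    'instrument_platform': ['ILLUMINA', 'ILLUMINA', 'ILLUMINA', 'OXFORD_NANOPORE'],
    'library_strategy': ['AMPLICON', 'AMPLICON', 'AMPLICON', 'WGS'],
    'run_accession': ['R1', 'R1', 'R2', 'R3'],
})

summary = runs.groupby(['instrument_platform', 'library_strategy']).agg(
    n_rows=('run_accession', 'size'),            # counts every row
    n_unique_runs=('run_accession', 'nunique'),  # counts each run once
).reset_index()
print(summary)
```

For ILLUMINA + AMPLICON the row count is 3 but the number of unique runs is 2. In the table above the counts add up to exactly 100,000, the number of rows loaded, so every row in this snapshot slice is a distinct run.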

Create the Heatmap

# Basic heatmap
alt.Chart(heatmap_data).mark_rect().encode(
    x='instrument_platform:N',
    y='library_strategy:N',
    color='run_accession:Q'
)

Final Polished Heatmap

# Background: colored rectangles
background = alt.Chart(heatmap_data).mark_rect(opacity=1).encode(
    x=alt.X(
        'instrument_platform:N',
        title='Sequencing Platform'
    ),
    y=alt.Y(
        'library_strategy:N',
        title='Library Strategy',
        axis=alt.Axis(orient='right')
    ),
    color=alt.Color(
        'run_accession:Q',
        title='# Runs',
        scale=alt.Scale(
            scheme='goldred',
            type='log'  # Log scale for color!
        )
    ),
    tooltip=[
        alt.Tooltip('instrument_platform:N', title='Platform'),
        alt.Tooltip('library_strategy:N', title='Strategy'),
        alt.Tooltip('run_accession:Q', title='Number of runs', format=',')
    ]
).properties(
    width=500,
    height=200,
    title={
        'text': 'SARS-CoV-2 Sequencing in ENA',
        'subtitle': 'By Platform and Library Strategy (first 100k records)'
    }
)

background

Add Text Labels

# Text layer with conditional coloring
text_labels = background.mark_text(
    align='center',
    baseline='middle',
    fontSize=11,
    fontWeight='bold'
).encode(
    text=alt.Text('run_accession:Q', format=','),  # Comma-formatted numbers
    color=alt.condition(
        alt.datum.run_accession > 200,  # If value > 200
        alt.value('white'),              # White text (on dark background)
        alt.value('black')               # Black text (on light background)
    )
)

# Combine layers
background + text_labels

This visualization reveals:

- ILLUMINA + AMPLICON dominates (78k+ runs): Illumina short reads with PCR amplification
- PACBIO_SMRT also heavily uses the AMPLICON protocol
- RNA-Seq is relatively rare compared to AMPLICON
- Some platform/strategy combinations have very few runs
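
These observations can be put into numbers. The sketch below copies the counts from the aggregated table into a plain Python list and computes two shares.

```python
# (platform, strategy, runs) rows copied from the aggregated table above.
rows = [
    ('BGISEQ', 'AMPLICON', 1), ('BGISEQ', 'OTHER', 13),
    ('BGISEQ', 'RNA-Seq', 2), ('BGISEQ', 'Targeted-Capture', 2),
    ('DNBSEQ', 'AMPLICON', 3),
    ('ILLUMINA', 'AMPLICON', 78448), ('ILLUMINA', 'OTHER', 3),
    ('ILLUMINA', 'RNA-Seq', 554), ('ILLUMINA', 'Targeted-Capture', 273),
    ('ILLUMINA', 'WCS', 2), ('ILLUMINA', 'WGA', 1389),
    ('ILLUMINA', 'WGS', 1020), ('ILLUMINA', 'miRNA-Seq', 3),
    ('ION_TORRENT', 'AMPLICON', 1752), ('ION_TORRENT', 'RNA-Seq', 3),
    ('ION_TORRENT', 'WGA', 11), ('ION_TORRENT', 'WGS', 24),
    ('OXFORD_NANOPORE', 'AMPLICON', 7407), ('OXFORD_NANOPORE', 'OTHER', 2),
    ('OXFORD_NANOPORE', 'RNA-Seq', 296), ('OXFORD_NANOPORE', 'WGA', 134),
    ('OXFORD_NANOPORE', 'WGS', 227),
    ('PACBIO_SMRT', 'AMPLICON', 8411), ('PACBIO_SMRT', 'RNA-Seq', 13),
    ('PACBIO_SMRT', 'Targeted-Capture', 6), ('PACBIO_SMRT', 'WGS', 1),
]

total = sum(n for _, _, n in rows)
amplicon = sum(n for _, strategy, n in rows if strategy == 'AMPLICON')
illumina_amplicon = next(
    n for platform, strategy, n in rows
    if (platform, strategy) == ('ILLUMINA', 'AMPLICON')
)

print(f"total runs: {total}")
print(f"AMPLICON share: {amplicon / total:.1%}")
print(f"ILLUMINA + AMPLICON share: {illumina_amplicon / total:.1%}")
```

AMPLICON accounts for about 96.0% of the runs in this slice, and ILLUMINA + AMPLICON alone for about 78.4%, which is why the log color scale is needed to see anything else.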

📝 Exercise 7: Create a bar chart showing the top 5 countries by number of SRA submissions.

Summary

Concept	Syntax
Create chart	`alt.Chart(df)`
Add marks	`.mark_point()`, `.mark_bar()`, `.mark_line()`, `.mark_rect()`
Encode data	`.encode(x='col', y='col')`
Data types	`:Q` (quantitative), `:N` (nominal), `:O` (ordinal), `:T` (temporal)
Aggregation	`'average(col):Q'`, `'sum(col):Q'`, `'count():Q'`
Color encoding	`color='col:N'` or `color=alt.Color('col:Q', scale=alt.Scale(scheme='blues'))`
Customization	`alt.X('col', title='Label', scale=alt.Scale(...))`
Properties	`.properties(width=400, height=200, title='Title')`
Layer charts	`chart1 + chart2`
Conditional	`alt.condition(predicate, if_true, if_false)`
Log scale	`scale=alt.Scale(type='log')`

Further Resources

Altair Documentation — Official docs with tutorials
Altair Example Gallery — Hundreds of examples to copy
Vega-Lite — The underlying grammar Altair uses
Vega Color Schemes — All available color palettes