Lecture 8: Introduction to Altair

This notebook introduces Altair, a Python library for creating statistical visualizations. We start with the basics and progressively build toward analyzing real-world genomic metadata.

By the end, you will be able to:

  • Create basic charts (scatter, bar, line)
  • Encode data fields to visual properties
  • Aggregate and transform data within charts
  • Customize colors, scales, and labels
  • Layer multiple chart elements
  • Build publication-quality heatmaps

Part 1: What is Declarative Visualization?

Think of ordering food at a restaurant. You don’t walk into the kitchen and say “heat the pan to 375°F, dice the onions, sauté for 3 minutes…” — you just say “I’d like the pasta.” That’s the difference between imperative and declarative.

Imperative (matplotlib) — you specify how to draw, step by step. You manage coordinates, colors, labels, legends, and layout yourself:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
cities = ['Seattle', 'New York', 'Chicago']
temps = [53.7, 54.0, 48.7]  # per-city averages of the temp column defined in Part 2
colors = ['#4c78a8', '#f58518', '#e45756']
bars = ax.barh(cities, temps, color=colors)
ax.set_xlabel('Average Temperature (°F)')
ax.set_title('Average Temperature by City')
ax.bar_label(bars, fmt='%.1f')
ax.set_xlim(0, 65)
plt.tight_layout()
plt.show()

Declarative (Altair) — you describe what you want to see. You state the relationships between your data and visual properties. Altair handles scales, axes, labels, and layout automatically:

alt.Chart(weather).mark_bar().encode(
    x='average(temp):Q',
    y='city:N',
    color='city:N'
)

The key difference: with matplotlib you compute the averages yourself, position each bar, pick colors, format labels, and manage layout. With Altair you declare “show average temperature by city, color by city” and the library does the rest — including aggregation, axis scaling, and a legend.

Part 2: Setup

import pandas as pd
import altair as alt

Let’s create a simple dataset to work with: monthly precipitation and temperature for three cities, January through June:

weather = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle',
             'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
             'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago'],
    'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
              'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
    'precip': [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
               3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
               2.0, 1.9, 2.6, 3.7, 4.1, 4.0],
    'temp': [42, 45, 50, 55, 62, 68,
             35, 38, 48, 58, 68, 77,
             28, 32, 42, 52, 64, 74]
})

weather
city month precip temp
0 Seattle Jan 5.2 42
1 Seattle Feb 3.9 45
2 Seattle Mar 4.1 50
3 Seattle Apr 2.8 55
4 Seattle May 2.1 62
5 Seattle Jun 1.6 68
6 New York Jan 3.6 35
7 New York Feb 3.1 38
8 New York Mar 4.2 48
9 New York Apr 4.0 58
10 New York May 4.5 68
11 New York Jun 4.2 77
12 Chicago Jan 2.0 28
13 Chicago Feb 1.9 32
14 Chicago Mar 2.6 42
15 Chicago Apr 3.7 52
16 Chicago May 4.1 64
17 Chicago Jun 4.0 74

This is tidy data: each row is one observation, each column is one variable. Altair works best with tidy data.
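
If your data arrives in a "wide" layout instead, pandas can reshape it before charting. The sketch below assumes a hypothetical table with one column per month (the name `wide` and its values are illustrative, not part of the dataset above) and uses `melt()` to produce one row per city-month observation:

```python
import pandas as pd

# Hypothetical "wide" table: one row per city, one column per month
wide = pd.DataFrame({
    'city': ['Seattle', 'Chicago'],
    'Jan': [5.2, 2.0],
    'Feb': [3.9, 1.9],
})

# melt() turns the month columns into (month, precip) pairs,
# giving one row per city-month observation
tidy = wide.melt(id_vars='city', var_name='month', value_name='precip')
print(tidy)
```

The result has the same three-column shape (city, month, precip) as the weather table, so it can be passed straight to alt.Chart().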

Part 3: Your First Chart

The Three Building Blocks

Every Altair chart has three components:

  1. Data — a pandas DataFrame
  2. Mark — the visual shape (point, bar, line, etc.)
  3. Encoding — which data fields map to which visual properties

Creating a Chart Object

Start by wrapping your DataFrame in alt.Chart():

# This creates a chart object - it stores data but can't render without a mark
chart = alt.Chart(weather)
print(type(chart))  # It's an Altair Chart object
<class 'altair.vegalite.v6.api.Chart'>

The chart object exists but can’t display—Altair requires a mark to render. Let’s add one.

Adding a Mark

# mark_point() draws circles
alt.Chart(weather).mark_point()

We see points, but they’re all stacked on top of each other. We need encodings to spread them out.

Adding Encodings

Encodings map data columns to visual channels like position (x, y), color, size, etc.

alt.Chart(weather).mark_point().encode(
    x='precip',
    y='city'
)

Now each point is positioned:

  • Horizontally by precipitation value
  • Vertically by city name

Notice how Altair automatically:

  • Created axis labels from column names
  • Scaled the x-axis to fit the data
  • Separated cities on the y-axis

Different Mark Types

Altair provides many mark types. Here are the most common:

# Bar chart
alt.Chart(weather).mark_bar().encode(
    x='precip',
    y='city'
)
# Line chart
alt.Chart(weather).mark_line().encode(
    x='month',
    y='precip'
)

The line chart connects all points. We’ll learn how to separate by city later using color encoding.

📝 Exercise 1: Create a scatter plot with temp on x-axis and precip on y-axis.

Part 4: Data Types

Altair needs to know the type of each data field to choose appropriate scales and displays:

Type          Code  Description            Example
Quantitative  :Q    Numerical values       Temperature, price
Nominal       :N    Categories (no order)  City names, colors
Ordinal       :O    Ordered categories     Small/Medium/Large
Temporal      :T    Date/time              2024-01-15

You specify types by adding them after the field name with a colon:

# Explicit type annotations
alt.Chart(weather).mark_bar().encode(
    x='precip:Q',  # Quantitative
    y='city:N'     # Nominal
)

Altair usually guesses correctly, but explicit types prevent surprises.

Controlling Sort Order

By default, Altair sorts discrete (nominal and ordinal) axis values alphabetically. Month abbreviations are plain strings, so chronological order has to be specified explicitly:

# Default: sorted alphabetically (Apr, Feb, Jan, Jun, Mar, May)
alt.Chart(weather).mark_bar().encode(
    x='month:O',
    y='average(precip):Q'
)
# Explicit sort: chronological order
alt.Chart(weather).mark_bar().encode(
    x=alt.X('month:O', sort=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']),
    y='average(precip):Q'
)

📝 Exercise 2: Create a bar chart showing average temperature per month with proper chronological order.

Part 5: Aggregation

In Altair, you can aggregate directly in the encoding string:

alt.Chart(weather).mark_bar().encode(
    x='average(precip):Q',  # Average of precip column
    y='city:N'
)

Altair automatically grouped by city and calculated the average for each.
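
To check what the chart is computing, the same grouping can be reproduced in pandas. This is only a cross-check under the assumption that the precip values match the weather table above, not how Altair performs the aggregation internally:

```python
import pandas as pd

# city and precip columns from the weather table above
weather = pd.DataFrame({
    'city': ['Seattle'] * 6 + ['New York'] * 6 + ['Chicago'] * 6,
    'precip': [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
               3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
               2.0, 1.9, 2.6, 3.7, 4.1, 4.0],
})

# Group by city and average precip: the same summary that
# x='average(precip):Q' with y='city:N' asks the chart to draw
avg_precip = weather.groupby('city')['precip'].mean().round(2)
print(avg_precip)
```

The three values should match the bar lengths in the chart: roughly 3.05 for Chicago, 3.93 for New York, and 3.28 for Seattle.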

Available Aggregation Functions

  • count() — number of rows
  • sum(field) — total
  • average(field) or mean(field) — average
  • median(field) — median
  • min(field) / max(field) — extremes
  • stdev(field) — standard deviation
# Count observations per city
alt.Chart(weather).mark_bar().encode(
    x='count():Q',
    y='city:N'
)
# Max temperature per city
alt.Chart(weather).mark_bar().encode(
    x='max(temp):Q',
    y='city:N'
)

📝 Exercise 3: Create a bar chart showing total precipitation per month (across all cities).

Part 6: Color Encoding

alt.Chart(weather).mark_line().encode(
    x='month:O',
    y='precip:Q',
    color='city:N'  # Different color for each city
)

Each city now has its own line with a distinct color. Altair added a legend automatically.

Color for Quantitative Data

You can also map numeric values to color intensity. See Vega Color Schemes for all available palettes.

alt.Chart(weather).mark_circle(size=100).encode(
    x='month:O',
    y='city:N',
    color='precip:Q'  # Color intensity shows precipitation
)
# Try different color schemes from Vega
alt.Chart(weather).mark_circle(size=100).encode(
    x='month:O',
    y='city:N',
    # Try: 'plasma', 'inferno', 'magma', 'turbo', 'blues', 'greens', 'oranges',
    # 'reds', 'purples', 'goldred', 'redyellowblue'
    color=alt.Color('precip:Q', scale=alt.Scale(scheme='viridis'))
)

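In a single-hue scheme such as 'blues', darker colors indicate higher precipitation; 'viridis' runs the other way, from dark purple (low) to yellow (high). This is the foundation of a heatmap!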

📝 Exercise 4: Create a scatter plot of temp vs precip with color encoding for city.

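Part 7: Customization

Arguments passed to the mark method set a fixed appearance for every mark, independent of the data. Chart-level settings such as width, height, and title go in .properties():

alt.Chart(weather).mark_point(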
    color='firebrick',  # Fixed color for all points
    size=100,           # Fixed size
    opacity=0.7         # Transparency
).encode(
    x='temp:Q',
    y='precip:Q'
)
alt.Chart(weather).mark_bar(color='steelblue').encode(
    x='average(precip):Q',
    y='city:N'
).properties(
    width=400,
    height=150,
    title='Average Precipitation by City'
)

Axis and Scale Customization

For more control, use alt.X() and alt.Y() objects instead of strings:

alt.Chart(weather).mark_bar(color='teal').encode(
    x=alt.X(
        'average(precip):Q',
        title='Average Precipitation (inches)',  # Custom axis title
        scale=alt.Scale(domain=[0, 6])           # Fixed axis range
    ),
    y=alt.Y(
        'city:N',
        title='City',
        axis=alt.Axis(labelFontSize=12)          # Larger labels
    )
).properties(
    width=400,
    height=150
)

Color Schemes

Altair includes many built-in color schemes:

alt.Chart(weather).mark_circle(size=200).encode(
    x='month:O',
    y='city:N',
    color=alt.Color(
        'precip:Q',
        scale=alt.Scale(scheme='blues')  # Blue color gradient
    )
).properties(width=300, height=150)

Popular schemes: 'blues', 'greens', 'oranges', 'viridis', 'goldred', 'redyellowblue'

📝 Exercise 5: Create a bar chart of average temperature per city with orange bars and a title.

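The + operator layers one chart on top of another. Here a text layer is drawn over a heatmap, with the text color chosen by a condition so it stays readable:

# Heatmap base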
heatmap = alt.Chart(weather).mark_rect().encode(
    x='month:O',
    y='city:N',
    color=alt.Color('precip:Q', scale=alt.Scale(scheme='goldred'))
)

# Text with conditional color
text = alt.Chart(weather).mark_text(
    fontSize=12,
    fontWeight='bold'
).encode(
    x='month:O',
    y='city:N',
    text=alt.Text('precip:Q', format='.1f'),
    color=alt.condition(
        alt.datum.precip > 3.5,    # If precip > 3.5
        alt.value('white'),         # Use white text
        alt.value('black')          # Otherwise black
    )
)

(heatmap + text).properties(width=300, height=150)

Now high values have white text (readable on dark red) and low values have black text.

📝 Exercise 6: Create a bar chart of average temperature per city with text labels on the bars.

When values span several orders of magnitude, a linear scale makes the small ones nearly invisible:

# Data with huge range
wide_range = pd.DataFrame({
    'category': ['A', 'B', 'C', 'D'],
    'value': [10, 100, 1000, 50000]
})

# Linear scale - small values barely visible
alt.Chart(wide_range).mark_bar().encode(
    x='category:N',
    y='value:Q'
).properties(title='Linear Scale')

A log scale on the color channel keeps the variation visible at every magnitude:

# Log color scale - shows variation across all magnitudes
alt.Chart(wide_range).mark_rect().encode(
    x='category:N',
    y=alt.value(1),  # Single row
    color=alt.Color(
        'value:Q',
        scale=alt.Scale(scheme='goldred', type='log')
    )
).properties(width=300, height=50, title='Log Color Scale')
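
The arithmetic behind this is easy to check by hand. The sketch below takes the four values from wide_range and compares each value's share of the maximum (what a linear scale shows) with its base-10 logarithm (what a log scale spaces by):

```python
import math

# The four values from the wide_range table
values = {'A': 10, 'B': 100, 'C': 1000, 'D': 50000}

# Linear scale: each value as a fraction of the largest one
largest = max(values.values())
linear_share = {k: v / largest for k, v in values.items()}

# Log scale: values are spaced by their base-10 logarithm
log_position = {k: math.log10(v) for k, v in values.items()}

for k in values:
    print(f"{k}: linear share {linear_share[k]:.4f}, log10 {log_position[k]:.2f}")
```

On the linear scale A, B, and C all fall within the bottom 2% of the range, while their logarithms (1, 2, 3, and about 4.7 for D) are spread almost evenly.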

Part 8: Real-World Example — SRA Metadata

The Sequence Read Archive (SRA) is the largest public repository of sequencing data. Here we analyze SARS-CoV-2 metadata to understand how sequencing platforms and library protocols were used during the pandemic.

# Load SRA metadata snapshot from Zenodo (first 100k records for speed)
sra = pd.read_csv(
    "https://zenodo.org/records/10680776/files/ena.tsv.gz",
    compression='gzip',
    sep="\t",
    low_memory=False,
    nrows=100000
)

sra.sample(3)
study_accession base_count accession collection_date country culture_collection description sample_collection sample_title sequencing_method ... library_name library_construction_protocol library_layout instrument_model instrument_platform isolation_source isolate investigation_type collection_date_submitted center_name
35094 PRJEB43060 242603897.0 SAMEA13362767 2020-10-03 Norway NaN Illumina MiSeq sequencing; Raw reads: SARS-CoV... NaN SARS-CoV-2/human/Norway/4099/2020/1 NaN ... NaN NaN PAIRED Illumina MiSeq ILLUMINA not provided SARS-CoV-2/human/Norway/4099/2020 NaN 2020-10-03 Norwegian Institute of Public Health (NIPH)
27221 PRJEB52934 82336750.0 SAMEA14459592 2022-04-16 Estonia NaN Illumina NovaSeq 6000 sequencing; Illumina Nov... NaN PCR tiled amplicon WGS of SARS-Cov-2, pre-sele... NaN ... EstECDC0112 Produced for Workflow v.1.8.9 (Eurofins Genom... PAIRED Illumina NovaSeq 6000 ILLUMINA NaN NaN NaN 2022-04-16 Health Board of Estonia
82022 PRJEB37886 855160155.0 SAMEA10170806 2021-09-13 United Kingdom NaN Illumina NovaSeq 6000 sequencing; Illumina Nov... NaN COG-UK/ALDP-1E70310 NaN ... NT1696099F / HT-119742:F1 NaN PAIRED Illumina NovaSeq 6000 ILLUMINA NaN NaN NaN 2021-09-13 SC

3 rows × 32 columns

⚠️ Data Quality: The metadata is only as good as who entered it. Always validate date ranges!
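
A minimal sketch of such a check, assuming a handful of hypothetical rows that imitate the collection_date column (the frame `meta`, its values, and the plausibility window are illustrative assumptions, not taken from the dataset):

```python
import pandas as pd

# Hypothetical rows imitating the collection_date column
meta = pd.DataFrame({
    'accession': ['S1', 'S2', 'S3', 'S4'],
    'collection_date': ['2020-10-03', '2002-04-16', 'not provided', '2021-09-13'],
})

# Unparseable entries become NaT instead of raising an error
meta['parsed_date'] = pd.to_datetime(meta['collection_date'], errors='coerce')

# Assumed plausibility window; adjust it to the study at hand
earliest = pd.Timestamp('2019-12-01')
latest = pd.Timestamp('2024-01-01')

# between() is False for NaT, so missing dates are flagged as well
meta['date_ok'] = meta['parsed_date'].between(earliest, latest)

print(meta[['accession', 'collection_date', 'date_ok']])
```

Rows where date_ok is False can then be inspected or dropped before plotting anything over time.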

Aggregate for Visualization

# Group by platform and library strategy, count unique runs
heatmap_data = sra.groupby(
    ['instrument_platform', 'library_strategy']
).agg(
    {'run_accession': 'nunique'}
).reset_index()

heatmap_data
instrument_platform library_strategy run_accession
0 BGISEQ AMPLICON 1
1 BGISEQ OTHER 13
2 BGISEQ RNA-Seq 2
3 BGISEQ Targeted-Capture 2
4 DNBSEQ AMPLICON 3
5 ILLUMINA AMPLICON 78448
6 ILLUMINA OTHER 3
7 ILLUMINA RNA-Seq 554
8 ILLUMINA Targeted-Capture 273
9 ILLUMINA WCS 2
10 ILLUMINA WGA 1389
11 ILLUMINA WGS 1020
12 ILLUMINA miRNA-Seq 3
13 ION_TORRENT AMPLICON 1752
14 ION_TORRENT RNA-Seq 3
15 ION_TORRENT WGA 11
16 ION_TORRENT WGS 24
17 OXFORD_NANOPORE AMPLICON 7407
18 OXFORD_NANOPORE OTHER 2
19 OXFORD_NANOPORE RNA-Seq 296
20 OXFORD_NANOPORE WGA 134
21 OXFORD_NANOPORE WGS 227
22 PACBIO_SMRT AMPLICON 8411
23 PACBIO_SMRT RNA-Seq 13
24 PACBIO_SMRT Targeted-Capture 6
25 PACBIO_SMRT WGS 1

Create the Heatmap

# Basic heatmap
alt.Chart(heatmap_data).mark_rect().encode(
    x='instrument_platform:N',
    y='library_strategy:N',
    color='run_accession:Q'
)
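
Combinations that never occur have no row in heatmap_data, so the heatmap leaves those cells blank. Pivoting the counts into a grid in pandas makes the gaps explicit. The sketch below copies a few rows from the table above into a hypothetical frame named `counts`:

```python
import pandas as pd

# A few rows copied from the heatmap_data table above
counts = pd.DataFrame({
    'instrument_platform': ['BGISEQ', 'ILLUMINA', 'ILLUMINA', 'PACBIO_SMRT'],
    'library_strategy': ['AMPLICON', 'AMPLICON', 'WGS', 'AMPLICON'],
    'run_accession': [1, 78448, 1020, 8411],
})

# Rows = strategies, columns = platforms; absent combinations become NaN
grid = counts.pivot(index='library_strategy',
                    columns='instrument_platform',
                    values='run_accession')
print(grid)
```

The NaN entries correspond to the empty cells in the chart.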

Final Polished Heatmap

# Background: colored rectangles
background = alt.Chart(heatmap_data).mark_rect(opacity=1).encode(
    x=alt.X(
        'instrument_platform:N',
        title='Sequencing Platform'
    ),
    y=alt.Y(
        'library_strategy:N',
        title='Library Strategy',
        axis=alt.Axis(orient='right')
    ),
    color=alt.Color(
        'run_accession:Q',
        title='# Runs',
        scale=alt.Scale(
            scheme='goldred',
            type='log'  # Log scale for color!
        )
    ),
    tooltip=[
        alt.Tooltip('instrument_platform:N', title='Platform'),
        alt.Tooltip('library_strategy:N', title='Strategy'),
        alt.Tooltip('run_accession:Q', title='Number of runs', format=',')
    ]
).properties(
    width=500,
    height=200,
    title={
        'text': 'SARS-CoV-2 Sequencing in ENA',
        'subtitle': 'By Platform and Library Strategy (first 100k records)'
    }
)

background

Add Text Labels

# Text layer with conditional coloring
text_labels = background.mark_text(
    align='center',
    baseline='middle',
    fontSize=11,
    fontWeight='bold'
).encode(
    text=alt.Text('run_accession:Q', format=','),  # Comma-formatted numbers
    color=alt.condition(
        alt.datum.run_accession > 200,  # If value > 200
        alt.value('white'),              # White text (on dark background)
        alt.value('black')               # Black text (on light background)
    )
)

# Combine layers
background + text_labels

This visualization reveals:

  • ILLUMINA + AMPLICON dominates (78k+ runs): Illumina short reads with PCR amplification
  • PACBIO_SMRT also relies heavily on the AMPLICON protocol
  • RNA-Seq is relatively rare compared to AMPLICON
  • Some platform/strategy combinations have very few runs

📝 Exercise 7: Create a bar chart showing the top 5 countries by number of SRA submissions.

Summary

Concept         Syntax
Create chart    alt.Chart(df)
Add marks       .mark_point(), .mark_bar(), .mark_line(), .mark_rect()
Encode data     .encode(x='col', y='col')
Data types      :Q (quantitative), :N (nominal), :O (ordinal), :T (temporal)
Aggregation     'average(col):Q', 'sum(col):Q', 'count():Q'
Color encoding  color='col:N' or color=alt.Color('col:Q', scale=alt.Scale(scheme='blues'))
Customization   alt.X('col', title='Label', scale=alt.Scale(...))
Properties      .properties(width=400, height=200, title='Title')
Layer charts    chart1 + chart2
Conditional     alt.condition(predicate, if_true, if_false)
Log scale       scale=alt.Scale(type='log')

Further Resources

Now apply what you’ve learned: Take-home project 1