import pandas as pd
import altair as alt
Lecture 8: Introduction to Altair
This notebook introduces Altair, a Python library for creating statistical visualizations. We start with the basics and progressively build toward analyzing real-world genomic metadata.
By the end, you will be able to:
- Create basic charts (scatter, bar, line)
- Encode data fields to visual properties
- Aggregate and transform data within charts
- Customize colors, scales, and labels
- Layer multiple chart elements
- Build publication-quality heatmaps
Part 1: What is Declarative Visualization?
Think of ordering food at a restaurant. You don’t walk into the kitchen and say “heat the pan to 375°F, dice the onions, sauté for 3 minutes…” — you just say “I’d like the pasta.” That’s the difference between imperative and declarative.
Imperative (matplotlib) — you specify how to draw, step by step. You manage coordinates, colors, labels, legends, and layout yourself:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 3))
cities = ['Seattle', 'New York', 'Chicago']
temps = [53.7, 54.0, 48.7]  # per-city averages of the temp column, computed by hand
colors = ['#4c78a8', '#f58518', '#e45756']
bars = ax.barh(cities, temps, color=colors)
ax.set_xlabel('Average Temperature (°F)')
ax.set_title('Average Temperature by City')
ax.bar_label(bars, fmt='%.1f')
ax.set_xlim(0, 65)
plt.tight_layout()
plt.show()
Declarative (Altair): you describe what you want to see. You state the relationships between your data and visual properties. Altair handles scales, axes, labels, and layout automatically:
alt.Chart(weather).mark_bar().encode(
x='average(temp):Q',
y='city:N',
color='city:N'
)
The key difference: with matplotlib you compute the averages yourself, position each bar, pick colors, format labels, and manage layout. With Altair you declare “show average temperature by city, color by city” and the library does the rest, including aggregation, axis scaling, and a legend.
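To make "compute the averages yourself" concrete, here is a minimal sketch of the manual aggregation step that the matplotlib route needs before any drawing happens. It reuses the temperature readings from the weather table defined in Part 2; the names `temps_by_month` and `avg_temp` are illustrative and not part of the lecture.

```python
import pandas as pd

# Temperature readings (°F) copied from the lecture's weather table.
temps_by_month = pd.DataFrame({
    'city': ['Seattle'] * 6 + ['New York'] * 6 + ['Chicago'] * 6,
    'temp': [42, 45, 50, 55, 62, 68,
             35, 38, 48, 58, 68, 77,
             28, 32, 42, 52, 64, 74],
})

# Imperative route: group and average by hand, then pass the numbers to the plot.
# sort=False keeps the cities in their order of first appearance.
avg_temp = temps_by_month.groupby('city', sort=False)['temp'].mean().round(1)
print(avg_temp)
```

In the Altair version this step is expressed by the `average(temp)` part of the encoding string instead.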
Part 2: Setup
Let’s create a simple dataset to work with, holding monthly precipitation and temperature for three cities:
weather = pd.DataFrame({
'city': ['Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle', 'Seattle',
'New York', 'New York', 'New York', 'New York', 'New York', 'New York',
'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago', 'Chicago'],
'month': ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'],
'precip': [5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
2.0, 1.9, 2.6, 3.7, 4.1, 4.0],
'temp': [42, 45, 50, 55, 62, 68,
35, 38, 48, 58, 68, 77,
28, 32, 42, 52, 64, 74]
})
weather
| | city | month | precip | temp |
|---|---|---|---|---|
| 0 | Seattle | Jan | 5.2 | 42 |
| 1 | Seattle | Feb | 3.9 | 45 |
| 2 | Seattle | Mar | 4.1 | 50 |
| 3 | Seattle | Apr | 2.8 | 55 |
| 4 | Seattle | May | 2.1 | 62 |
| 5 | Seattle | Jun | 1.6 | 68 |
| 6 | New York | Jan | 3.6 | 35 |
| 7 | New York | Feb | 3.1 | 38 |
| 8 | New York | Mar | 4.2 | 48 |
| 9 | New York | Apr | 4.0 | 58 |
| 10 | New York | May | 4.5 | 68 |
| 11 | New York | Jun | 4.2 | 77 |
| 12 | Chicago | Jan | 2.0 | 28 |
| 13 | Chicago | Feb | 1.9 | 32 |
| 14 | Chicago | Mar | 2.6 | 42 |
| 15 | Chicago | Apr | 3.7 | 52 |
| 16 | Chicago | May | 4.1 | 64 |
| 17 | Chicago | Jun | 4.0 | 74 |
This is tidy data: each row is one observation, each column is one variable. Altair works best with tidy data.
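Data often arrives in a wide layout instead, with one column per month. As a rough sketch, pandas `melt` can reshape such a table into the tidy form described above. The `wide` DataFrame here is a hypothetical example built from a few of the lecture's precipitation values.

```python
import pandas as pd

# Hypothetical wide layout: one row per city, one column per month.
wide = pd.DataFrame({
    'city': ['Seattle', 'Chicago'],
    'Jan': [5.2, 2.0],
    'Feb': [3.9, 1.9],
})

# melt() turns the month columns into rows: one observation per row.
tidy = wide.melt(id_vars='city', var_name='month', value_name='precip')
print(tidy)
```

After reshaping, each row holds one (city, month, precip) observation, which is the shape the encodings in this notebook expect.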
Part 3: Your First Chart
The Three Building Blocks
Every Altair chart has three components:
- Data — a pandas DataFrame
- Mark — the visual shape (point, bar, line, etc.)
- Encoding — which data fields map to which visual properties
Creating a Chart Object
Start by wrapping your DataFrame in alt.Chart():
# This creates a chart object - it stores data but can't render without a mark
chart = alt.Chart(weather)
print(type(chart)) # It's an Altair Chart object
<class 'altair.vegalite.v6.api.Chart'>
The chart object exists but can’t display—Altair requires a mark to render. Let’s add one.
Adding a Mark
# mark_point() draws circles
alt.Chart(weather).mark_point()
We see points, but they’re all stacked on top of each other. We need encodings to spread them out.
Adding Encodings
Encodings map data columns to visual channels like position (x, y), color, size, etc.
alt.Chart(weather).mark_point().encode(
x='precip',
y='city'
)
Now each point is positioned:
- Horizontally by precipitation value
- Vertically by city name
Notice how Altair automatically:
- Created axis labels from column names
- Scaled the x-axis to fit the data
- Separated cities on the y-axis
Different Mark Types
Altair provides many mark types. Here are the most common:
# Bar chart
alt.Chart(weather).mark_bar().encode(
x='precip',
y='city'
)
# Line chart
alt.Chart(weather).mark_line().encode(
x='month',
y='precip'
)
The line chart connects all points. We’ll learn how to separate by city later using color encoding.
📝 Exercise 1: Create a scatter plot with temp on the x-axis and precip on the y-axis.
Part 4: Data Types
Altair needs to know the type of each data field to choose appropriate scales and displays:
| Type | Code | Description | Example |
|---|---|---|---|
| Quantitative | :Q | Numerical values | Temperature, price |
| Nominal | :N | Categories (no order) | City names, colors |
| Ordinal | :O | Ordered categories | Small/Medium/Large |
| Temporal | :T | Date/time | 2024-01-15 |
You specify types by adding them after the field name with a colon:
# Explicit type annotations
alt.Chart(weather).mark_bar().encode(
x='precip:Q', # Quantitative
y='city:N' # Nominal
)
Altair usually guesses correctly, but explicit types prevent surprises.
Controlling Sort Order
By default, Altair sorts axis values alphabetically. To get chronological order, you must explicitly specify the sort order:
# Default: sorted alphabetically (Apr, Feb, Jan, Jun, Mar, May)
alt.Chart(weather).mark_bar().encode(
x='month:O',
y='average(precip):Q'
)
# Explicit sort: chronological order
alt.Chart(weather).mark_bar().encode(
x=alt.X('month:O', sort=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun']),
y='average(precip):Q'
)
📝 Exercise 2: Create a bar chart showing average temperature per month with proper chronological order.
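The explicit sort list in the chronological example can also be derived from the data instead of typed by hand. This sketch assumes the rows already appear in chronological order within the table, as they do in the lecture's weather data; `months` and `month_order` are illustrative names.

```python
import pandas as pd

# The month column of the weather table: Jan..Jun repeated once per city.
months = pd.Series(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun'] * 3, name='month')

# drop_duplicates() keeps the first occurrence of each value, so the
# result preserves the order in which months first appear in the data.
month_order = months.drop_duplicates().tolist()
print(month_order)
```

The resulting list can then be passed as `sort=month_order` inside `alt.X('month:O', ...)`. If the rows were shuffled, this approach would not recover calendar order.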
In Altair, you can aggregate directly in the encoding string:
alt.Chart(weather).mark_bar().encode(
x='average(precip):Q', # Average of precip column
y='city:N'
)
Altair automatically grouped by city and calculated the average for each.
Available Aggregation Functions
- count(): number of rows
- sum(field): total
- average(field) or mean(field): average
- median(field): median
- min(field) / max(field): extremes
- stdev(field): standard deviation
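For readers who know pandas, several of these aggregations correspond roughly to a `groupby(...).agg(...)` call. The sketch below uses a six-row subset of the weather table (`weather_small` is an illustrative name) to show the kind of per-city numbers that the encoding strings ask Altair to compute.

```python
import pandas as pd

# Jan and Feb rows of the lecture's weather table.
weather_small = pd.DataFrame({
    'city': ['Seattle', 'Seattle', 'New York', 'New York', 'Chicago', 'Chicago'],
    'precip': [5.2, 3.9, 3.6, 3.1, 2.0, 1.9],
    'temp': [42, 45, 35, 38, 28, 32],
})

# Named aggregations: output column = (input column, function).
summary = weather_small.groupby('city').agg(
    n_rows=('precip', 'size'),         # like count()
    total_precip=('precip', 'sum'),    # like sum(precip)
    mean_precip=('precip', 'mean'),    # like average(precip)
    max_temp=('temp', 'max'),          # like max(temp)
)
print(summary)
```

Doing the aggregation inside the chart keeps the original DataFrame untouched, while the pandas route is handy when you want to inspect or reuse the numbers.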
# Count observations per city
alt.Chart(weather).mark_bar().encode(
x='count():Q',
y='city:N'
)
# Max temperature per city
alt.Chart(weather).mark_bar().encode(
x='max(temp):Q',
y='city:N'
)
📝 Exercise 3: Create a bar chart showing total precipitation per month (across all cities).
Part 6: Color Encoding
alt.Chart(weather).mark_line().encode(
x='month:O',
y='precip:Q',
color='city:N' # Different color for each city
)
Each city now has its own line with a distinct color. Altair added a legend automatically.
Color for Quantitative Data
You can also map numeric values to color intensity. See Vega Color Schemes for all available palettes.
alt.Chart(weather).mark_circle(size=100).encode(
x='month:O',
y='city:N',
color='precip:Q' # Color intensity shows precipitation
)
# Try different color schemes from Vega
alt.Chart(weather).mark_circle(size=100).encode(
x='month:O',
y='city:N',
color=alt.Color('precip:Q', scale=alt.Scale(scheme='viridis')) # Try: 'plasma', 'inferno', 'magma', 'turbo', 'blues', 'greens', 'oranges', 'reds', 'purples', 'goldred', 'redyellowblue'
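)
In the first chart, darker colors indicate higher precipitation. With viridis the direction flips: dark purple marks low values and yellow marks high values. This is the foundation of a heatmap!
📝 Exercise 4: Create a scatter plot of temp vs precip with color encoding for city.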
alt.Chart(weather).mark_point(
color='firebrick', # Fixed color for all points
size=100, # Fixed size
opacity=0.7 # Transparency
).encode(
x='temp:Q',
y='precip:Q'
)
alt.Chart(weather).mark_bar(color='steelblue').encode(
x='average(precip):Q',
y='city:N'
).properties(
width=400,
height=150,
title='Average Precipitation by City'
)
Axis and Scale Customization
For more control, use alt.X() and alt.Y() objects instead of strings:
alt.Chart(weather).mark_bar(color='teal').encode(
x=alt.X(
'average(precip):Q',
title='Average Precipitation (inches)', # Custom axis title
scale=alt.Scale(domain=[0, 6]) # Fixed axis range
),
y=alt.Y(
'city:N',
title='City',
axis=alt.Axis(labelFontSize=12) # Larger labels
)
).properties(
width=400,
height=150
)
Color Schemes
Altair includes many built-in color schemes:
alt.Chart(weather).mark_circle(size=200).encode(
x='month:O',
y='city:N',
color=alt.Color(
'precip:Q',
scale=alt.Scale(scheme='blues') # Blue color gradient
)
).properties(width=300, height=150)
Popular schemes: 'blues', 'greens', 'oranges', 'viridis', 'goldred', 'redyellowblue'
📝 Exercise 5: Create a bar chart of average temperature per city with orange bars and a title.
# Heatmap base
heatmap = alt.Chart(weather).mark_rect().encode(
x='month:O',
y='city:N',
color=alt.Color('precip:Q', scale=alt.Scale(scheme='goldred'))
)
# Text with conditional color
text = alt.Chart(weather).mark_text(
fontSize=12,
fontWeight='bold'
).encode(
x='month:O',
y='city:N',
text=alt.Text('precip:Q', format='.1f'),
color=alt.condition(
alt.datum.precip > 3.5, # If precip > 3.5
alt.value('white'), # Use white text
alt.value('black') # Otherwise black
)
)
(heatmap + text).properties(width=300, height=150)
Now high values have white text (readable on dark red) and low values have black text.
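The 3.5 cutoff in the condition above is hardcoded. One possible heuristic, shown here as a sketch rather than a rule, is to take the midpoint of the data range as the switch point. The midpoint choice and the names `threshold` and `text_color` are assumptions for illustration; the best cutoff depends on the color scheme in use.

```python
import pandas as pd

# The precip column of the lecture's weather table.
precip = pd.Series([5.2, 3.9, 4.1, 2.8, 2.1, 1.6,
                    3.6, 3.1, 4.2, 4.0, 4.5, 4.2,
                    2.0, 1.9, 2.6, 3.7, 4.1, 4.0])

# Midpoint of the value range as a data-driven cutoff (a heuristic).
threshold = (precip.min() + precip.max()) / 2

# Mirrors what alt.condition does per cell: white above the cutoff, black otherwise.
text_color = precip.gt(threshold).map({True: 'white', False: 'black'})
print(threshold)
print(text_color.value_counts())
```

The computed value can then replace the literal in `alt.datum.precip > 3.5`, so the labels adapt if the data changes.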
# Data with huge range
wide_range = pd.DataFrame({
'category': ['A', 'B', 'C', 'D'],
'value': [10, 100, 1000, 50000]
})
# Linear scale - small values barely visible
alt.Chart(wide_range).mark_bar().encode(
x='category:N',
y='value:Q'
).properties(title='Linear Scale')
📝 Exercise 6: Create a bar chart of average temperature per city with text labels on the bars.
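A little arithmetic shows why the linear chart hides the small values and why a log scale helps. This sketch takes the four numbers from `wide_range` and compares each value's share of the maximum on a linear scale with its base-10 logarithm; the variable names are illustrative.

```python
import math

# Same values as the wide_range DataFrame.
values = {'A': 10, 'B': 100, 'C': 1000, 'D': 50000}
largest = max(values.values())

# Linear scale: bar height is proportional to value / largest.
linear_share = {name: v / largest for name, v in values.items()}

# Log scale: position is proportional to log10(value).
log_position = {name: math.log10(v) for name, v in values.items()}

for name in values:
    print(name, f"{linear_share[name]:.4f}", f"{log_position[name]:.2f}")
```

On the linear scale, category A occupies a tiny fraction of the axis, while on the log scale the four categories are spread far more evenly.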
# Log color scale - shows variation across all magnitudes
alt.Chart(wide_range).mark_rect().encode(
x='category:N',
y=alt.value(1), # Single row
color=alt.Color(
'value:Q',
scale=alt.Scale(scheme='goldred', type='log')
)
).properties(width=300, height=50, title='Log Color Scale')
Part 8: Real-World Example (SRA Metadata)
The Sequence Read Archive (SRA) is the largest public repository of sequencing data. Here we analyze SARS-CoV-2 metadata to understand how sequencing platforms and library protocols were used during the pandemic.
# Load SRA metadata snapshot from Zenodo (first 100k records for speed)
sra = pd.read_csv(
"https://zenodo.org/records/10680776/files/ena.tsv.gz",
compression='gzip',
sep="\t",
low_memory=False,
nrows=100000
)
sra.sample(3)
| | study_accession | base_count | accession | collection_date | country | culture_collection | description | sample_collection | sample_title | sequencing_method | ... | library_name | library_construction_protocol | library_layout | instrument_model | instrument_platform | isolation_source | isolate | investigation_type | collection_date_submitted | center_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35094 | PRJEB43060 | 242603897.0 | SAMEA13362767 | 2020-10-03 | Norway | NaN | Illumina MiSeq sequencing; Raw reads: SARS-CoV... | NaN | SARS-CoV-2/human/Norway/4099/2020/1 | NaN | ... | NaN | NaN | PAIRED | Illumina MiSeq | ILLUMINA | not provided | SARS-CoV-2/human/Norway/4099/2020 | NaN | 2020-10-03 | Norwegian Institute of Public Health (NIPH) |
| 27221 | PRJEB52934 | 82336750.0 | SAMEA14459592 | 2022-04-16 | Estonia | NaN | Illumina NovaSeq 6000 sequencing; Illumina Nov... | NaN | PCR tiled amplicon WGS of SARS-Cov-2, pre-sele... | NaN | ... | EstECDC0112 | Produced for Workflow v.1.8.9 (Eurofins Genom... | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | NaN | NaN | NaN | 2022-04-16 | Health Board of Estonia |
| 82022 | PRJEB37886 | 855160155.0 | SAMEA10170806 | 2021-09-13 | United Kingdom | NaN | Illumina NovaSeq 6000 sequencing; Illumina Nov... | NaN | COG-UK/ALDP-1E70310 | NaN | ... | NT1696099F / HT-119742:F1 | NaN | PAIRED | Illumina NovaSeq 6000 | ILLUMINA | NaN | NaN | NaN | 2021-09-13 | SC |
3 rows × 32 columns
⚠️ Data Quality: The metadata is only as good as who entered it. Always validate date ranges!
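A date-range check might look like the sketch below. It runs on a small hypothetical frame with a `collection_date` column (the same column name as in the metadata); the invalid entries and the two bounds `earliest_plausible` and `latest_plausible` are illustrative assumptions, not values taken from the real snapshot.

```python
import pandas as pd

# Hypothetical records: two valid dates, one implausible date, two unusable entries.
records = pd.DataFrame({
    'collection_date': ['2020-10-03', '2022-04-16', '1900-01-01', 'not provided', None],
})

# Unparseable entries become NaT instead of raising an error.
dates = pd.to_datetime(records['collection_date'], format='%Y-%m-%d', errors='coerce')

# Assumed bounds: first reported cases to an arbitrary upper limit.
earliest_plausible = pd.Timestamp('2019-12-01')
latest_plausible = pd.Timestamp('2024-12-31')

# between() is False for NaT, so missing and malformed dates are dropped too.
plausible = dates.between(earliest_plausible, latest_plausible)
clean = records[plausible]
print(f"kept {len(clean)} of {len(records)} records")
```

Reporting how many rows were dropped is worth doing before plotting, since a large loss may point to a parsing problem and not to bad data.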
Aggregate for Visualization
# Group by platform and library strategy, count unique runs
heatmap_data = sra.groupby(
['instrument_platform', 'library_strategy']
).agg(
{'run_accession': 'nunique'}
).reset_index()
heatmap_data
| | instrument_platform | library_strategy | run_accession |
|---|---|---|---|
| 0 | BGISEQ | AMPLICON | 1 |
| 1 | BGISEQ | OTHER | 13 |
| 2 | BGISEQ | RNA-Seq | 2 |
| 3 | BGISEQ | Targeted-Capture | 2 |
| 4 | DNBSEQ | AMPLICON | 3 |
| 5 | ILLUMINA | AMPLICON | 78448 |
| 6 | ILLUMINA | OTHER | 3 |
| 7 | ILLUMINA | RNA-Seq | 554 |
| 8 | ILLUMINA | Targeted-Capture | 273 |
| 9 | ILLUMINA | WCS | 2 |
| 10 | ILLUMINA | WGA | 1389 |
| 11 | ILLUMINA | WGS | 1020 |
| 12 | ILLUMINA | miRNA-Seq | 3 |
| 13 | ION_TORRENT | AMPLICON | 1752 |
| 14 | ION_TORRENT | RNA-Seq | 3 |
| 15 | ION_TORRENT | WGA | 11 |
| 16 | ION_TORRENT | WGS | 24 |
| 17 | OXFORD_NANOPORE | AMPLICON | 7407 |
| 18 | OXFORD_NANOPORE | OTHER | 2 |
| 19 | OXFORD_NANOPORE | RNA-Seq | 296 |
| 20 | OXFORD_NANOPORE | WGA | 134 |
| 21 | OXFORD_NANOPORE | WGS | 227 |
| 22 | PACBIO_SMRT | AMPLICON | 8411 |
| 23 | PACBIO_SMRT | RNA-Seq | 13 |
| 24 | PACBIO_SMRT | Targeted-Capture | 6 |
| 25 | PACBIO_SMRT | WGS | 1 |
Create the Heatmap
# Basic heatmap
alt.Chart(heatmap_data).mark_rect().encode(
x='instrument_platform:N',
y='library_strategy:N',
color='run_accession:Q'
)
Final Polished Heatmap
# Background: colored rectangles
background = alt.Chart(heatmap_data).mark_rect(opacity=1).encode(
x=alt.X(
'instrument_platform:N',
title='Sequencing Platform'
),
y=alt.Y(
'library_strategy:N',
title='Library Strategy',
axis=alt.Axis(orient='right')
),
color=alt.Color(
'run_accession:Q',
title='# Samples',
scale=alt.Scale(
scheme='goldred',
type='log' # Log scale for color!
)
),
tooltip=[
alt.Tooltip('instrument_platform:N', title='Platform'),
alt.Tooltip('library_strategy:N', title='Strategy'),
alt.Tooltip('run_accession:Q', title='Number of runs', format=',')
]
).properties(
width=500,
height=200,
title={
'text': 'SARS-CoV-2 Sequencing in ENA',
'subtitle': 'By Platform and Library Strategy (100k sample)'
}
)
background
Add Text Labels
# Text layer with conditional coloring
text_labels = background.mark_text(
align='center',
baseline='middle',
fontSize=11,
fontWeight='bold'
).encode(
text=alt.Text('run_accession:Q', format=','), # Comma-formatted numbers
color=alt.condition(
alt.datum.run_accession > 200, # If value > 200
alt.value('white'), # White text (on dark background)
alt.value('black') # Black text (on light background)
)
)
# Combine layers
background + text_labels
This visualization reveals:
- ILLUMINA + AMPLICON dominates (78k+ samples): Illumina short-reads with PCR amplification
- PACBIO_SMRT also heavily uses AMPLICON protocol
- RNA-Seq is relatively rare compared to AMPLICON
- Some platform/strategy combinations have very few samples
📝 Exercise 7: Create a bar chart showing the top 5 countries by number of SRA submissions.
Summary
| Concept | Syntax |
|---|---|
| Create chart | alt.Chart(df) |
| Add marks | .mark_point(), .mark_bar(), .mark_line(), .mark_rect() |
| Encode data | .encode(x='col', y='col') |
| Data types | :Q (quantitative), :N (nominal), :O (ordinal), :T (temporal) |
| Aggregation | 'average(col):Q', 'sum(col):Q', 'count():Q' |
| Color encoding | color='col:N' or color=alt.Color('col:Q', scale=alt.Scale(scheme='blues')) |
| Customization | alt.X('col', title='Label', scale=alt.Scale(...)) |
| Properties | .properties(width=400, height=200, title='Title') |
| Layer charts | chart1 + chart2 |
| Conditional | alt.condition(predicate, if_true, if_false) |
| Log scale | scale=alt.Scale(type='log') |
Further Resources
- Altair Documentation — Official docs with tutorials
- Altair Example Gallery — Hundreds of examples to copy
- Vega-Lite — The underlying grammar Altair uses
- Vega Color Schemes — All available color palettes