Jupyter Notebook

Guidelines for creating and managing Jupyter notebooks in OpenHEXA.

Where to Save Notebooks

Default Location

Save notebooks in the notebooks/ folder unless the user specifies an alternative location.

code

workspace/
├── notebooks/           <- Default location for notebooks
│   ├── analysis.ipynb
│   └── exploration.ipynb
├── pipelines/
└── dashboards/

Naming Conventions

code

# Good names
data_exploration.ipynb
dhis2_indicator_analysis.ipynb
monthly_report_2024.ipynb
tracker_data_quality.ipynb

# Avoid
Untitled.ipynb
test.ipynb
final_final_v2.ipynb

Notebook Structure

Recommended Cell Order

code

1. Title & Description (Markdown)
2. Imports & Setup (Code)
3. Configuration/Parameters (Code)
4. Data Loading (Code)
5. Analysis Sections (Markdown + Code alternating)
6. Results/Summary (Markdown)

Cell 1: Title & Description

markdown

# Notebook Title

**Purpose:** Brief description of what this notebook does

**Data Sources:**
- Table: `my_table` from workspace database
- DHIS2: Connection `dhis2_connection_name`

**Author:** Name
**Date:** YYYY-MM-DD

Cell 2: Imports & Setup

python

# Standard imports
import os
import pandas as pd
import numpy as np
from datetime import datetime

# Database
from sqlalchemy import create_engine

# Visualization
import matplotlib.pyplot as plt

# OpenHEXA (if using toolbox)
from openhexa.sdk import workspace
from openhexa.toolbox.dhis2 import DHIS2

# Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)

print("Setup complete")

Cell 3: Configuration

python

# Parameters - easy to modify
START_DATE = "2024-01-01"
END_DATE = "2024-12-31"
ORG_UNIT_LEVEL = 2

# Database connection
DB_URL = os.environ.get("WORKSPACE_DATABASE_URL")
engine = create_engine(DB_URL)

# DHIS2 connection (if needed)
dhis2_conn = workspace.dhis2_connection("connection_name")
dhis = DHIS2(dhis2_conn, cache_dir=f"{workspace.files_path}/.cache")

print(f"Analysis period: {START_DATE} to {END_DATE}")

Cell 4: Data Loading

python

# Load data with clear variable names
df_raw = pd.read_sql("""
    SELECT * FROM source_table
    WHERE date >= %(start)s AND date <= %(end)s
""", engine, params={"start": START_DATE, "end": END_DATE})

print(f"Loaded {len(df_raw):,} rows")
df_raw.head()

Best Practices

1. One Concept Per Cell

python

# GOOD: One operation per cell
# Cell 1: Filter data
df_filtered = df_raw[df_raw["status"] == "active"]

# Cell 2: Aggregate
df_summary = df_filtered.groupby("category").agg({"value": "sum"})

python

# BAD: Too much in one cell
df_filtered = df_raw[df_raw["status"] == "active"]
df_summary = df_filtered.groupby("category").agg({"value": "sum"})
df_summary.plot(kind="bar")
plt.savefig("chart.png")

2. Use Markdown Headers

markdown

## 1. Data Exploration

### 1.1 Missing Values

### 1.2 Distribution Analysis

## 2. Data Transformation

## 3. Results

3. Show Progress & Shapes

python

# After each transformation, show the result
df_clean = df_raw.dropna()
print(f"After cleaning: {len(df_clean):,} rows (dropped {len(df_raw) - len(df_clean):,})")
df_clean.shape

4. Document Decisions

python

# Remove outliers using IQR method
# Rationale: Values > 3x IQR are likely data entry errors
Q1 = df["value"].quantile(0.25)
Q3 = df["value"].quantile(0.75)
IQR = Q3 - Q1

df_no_outliers = df[
    (df["value"] >= Q1 - 1.5 * IQR) &
    (df["value"] <= Q3 + 1.5 * IQR)
]
print(f"Removed {len(df) - len(df_no_outliers)} outliers")

5. Avoid Hardcoded Values

python

# BAD
df = df[df["org_unit"] == "ImspTQPwCqd"]

# GOOD
TARGET_ORG_UNIT = "ImspTQPwCqd"  # Define in config cell
df = df[df["org_unit"] == TARGET_ORG_UNIT]

Writing to Database

When writing results to the database, follow the sql-tables skill guidelines:

python

from sqlalchemy import String, Float, Integer, DateTime

# 1. Check if table exists
from sqlalchemy import inspect
inspector = inspect(engine)
if "my_results" in inspector.get_table_names():
    print("WARNING: Table 'my_results' exists. Will use 'my_results_new'")
    table_name = "my_results_new"
else:
    table_name = "my_results"

# 2. Define explicit types
dtype_mapping = {
    "org_unit": String(11),
    "period": String(10),
    "value": Float,
    "created_at": DateTime,
}

# 3. Show proposed types for confirmation
print(f"Will write to table: {table_name}")
print("Column types:")
for col, dtype in dtype_mapping.items():
    print(f"  - {col}: {dtype}")

python

# 4. Write after user confirmation
df_results.to_sql(
    name=table_name,
    con=engine,
    if_exists="replace",
    index=False,
    dtype=dtype_mapping
)
print(f"Written {len(df_results):,} rows to {table_name}")

Visualizations

Basic Chart Template

python

fig, ax = plt.subplots(figsize=(10, 6))

df_summary.plot(kind="bar", ax=ax)

ax.set_title("Chart Title", fontsize=14, fontweight="bold")
ax.set_xlabel("X Label")
ax.set_ylabel("Y Label")
ax.legend(title="Legend")

plt.tight_layout()
plt.show()

Save Figures

python

# Save to workspace files
output_path = f"{workspace.files_path}/figures/chart.png"
plt.savefig(output_path, dpi=150, bbox_inches="tight")
print(f"Saved to: {output_path}")

Cleanup Cell (Optional)

python

# Close connections at the end
engine.dispose()
print("Connections closed")

Common Patterns

Load Multiple Tables

python

tables_to_load = ["table_a", "table_b", "table_c"]
dataframes = {}

for table in tables_to_load:
    dataframes[table] = pd.read_sql(f"SELECT * FROM {table}", engine)
    print(f"Loaded {table}: {len(dataframes[table]):,} rows")

Parameterized Analysis

python

def analyze_org_unit(org_unit_id: str) -> pd.DataFrame:
    """Run standard analysis for an org unit."""
    df = pd.read_sql(f"""
        SELECT * FROM data
        WHERE org_unit = %(ou)s
    """, engine, params={"ou": org_unit_id})

    return df.groupby("indicator").agg({
        "value": ["sum", "mean", "count"]
    })

# Run for multiple org units
results = {}
for ou in ["OU1", "OU2", "OU3"]:
    results[ou] = analyze_org_unit(ou)

Checklist

• Title and description in first cell
• Imports organized at top
• Configuration in dedicated cell
• Markdown headers for sections
• Progress/shape shown after transformations
• No hardcoded values in analysis cells
• Database writes follow sql-tables guidelines
• Saved in notebooks/ folder (or user-specified)