notebook-checkpoint-pattern

A pattern for connecting Jupyter notebooks in a pipeline via checkpoint files. Triggers: connecting notebooks, saving outputs, building multi-notebook pipelines.

SKILL.md
--- frontmatter
name: notebook-checkpoint-pattern
description: "Pattern for connecting Jupyter notebooks in a pipeline via checkpoint files. Trigger: connecting notebooks, saving outputs, multi-notebook pipelines"
author: Claude Code
date: 2026-01-06

Notebook Checkpoint Pattern

Experiment Overview

| Item | Details |
|------|---------|
| Date | 2026-01-06 |
| Goal | Connect multiple Jupyter notebooks in a pipeline via saved checkpoint files |
| Environment | Python, Jupyter, scanpy/anndata for single-cell data |
| Status | Success |

Context

Scientific analysis pipelines often span multiple notebooks:

  • 1_preprocessing.ipynb - Data loading and QC filtering
  • 2_integration.ipynb - Cross-modal integration
  • 3_visualization.ipynb - Results visualization

Without checkpoints:

  • Must run all notebooks in sequence in one session
  • Memory issues with large datasets
  • Can't restart from intermediate steps
  • Difficult to share intermediate results

Verified Workflow

Directory Structure

code
project/
├── notebooks/
│   ├── 1_preprocessing.ipynb
│   ├── 2_integration.ipynb
│   └── 3_visualization.ipynb
├── results/
│   ├── 1_preprocessing/
│   │   ├── protein_adata.h5ad
│   │   ├── rna_adata.h5ad
│   │   └── params.json
│   ├── 2_integration/
│   │   ├── matching.pkl
│   │   ├── matching.csv
│   │   ├── arrays.npy
│   │   └── params.json
│   └── 3_visualization/
│       └── figures/
└── data/
    └── (raw input files)
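
Jupyter usually starts a kernel in the notebook's own folder, so a relative path such as `results/1_preprocessing` may resolve under `notebooks/` rather than next to it. The sketch below is one way to anchor paths to the layout above. The helper names `find_project_root` and `results_dir_for` are hypothetical, and it assumes the project root is the nearest ancestor that contains a `notebooks/` directory.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "notebooks") -> Path:
    """Walk upward from `start` until a directory containing `marker/` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}/' directory found at or above {start}")


def results_dir_for(notebook_name: str, root: Path) -> Path:
    """Return results/<notebook_name>/ under the project root, creating it if needed."""
    path = root / "results" / notebook_name
    path.mkdir(parents=True, exist_ok=True)
    return path
```

In a notebook this might be used as `results_dir = results_dir_for("1_preprocessing", find_project_root(Path.cwd()))`, which gives the same directory whether the kernel starts in `project/` or `project/notebooks/`.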

Save Cell Pattern (End of Notebook)

python
# Save outputs to the results directory.
# The path is relative to the kernel's working directory (assumed to be project/).
# protein_adata, rna_adata, MIN_COUNTS and MAX_COUNTS are defined earlier in the notebook.
import os
import json
from datetime import datetime

results_dir = 'results/1_preprocessing'
os.makedirs(results_dir, exist_ok=True)

# Save AnnData objects
protein_adata.write_h5ad(f'{results_dir}/protein_adata.h5ad')
rna_adata.write_h5ad(f'{results_dir}/rna_adata.h5ad')

# Save parameters as JSON (human-readable, version-controllable)
params = {
    'timestamp': datetime.now().isoformat(),
    'filtering_params': {
        'min_counts': MIN_COUNTS,
        'max_counts': MAX_COUNTS,
    },
    'data_shapes': {
        'protein': {'n_cells': protein_adata.n_obs, 'n_features': protein_adata.n_vars},
        'rna': {'n_cells': rna_adata.n_obs, 'n_features': rna_adata.n_vars},
    },
}
with open(f'{results_dir}/params.json', 'w') as f:
    json.dump(params, f, indent=2)

print(f"Saved to {results_dir}/")
print("Run 2_integration.ipynb next.")
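
If a kernel is interrupted while the parameter file is being written, the next notebook may find a truncated JSON file. A minimal sketch of one way to guard against that is shown below: write to a temporary file in the same directory, then rename it into place. The function name `write_params` and its `filename` default are assumptions for illustration, not part of the original workflow.

```python
import json
import os
import tempfile
from datetime import datetime
from pathlib import Path


def write_params(results_dir, params, filename="params.json"):
    """Write params plus a timestamp to results_dir/filename via a temp file."""
    results_dir = Path(results_dir)
    results_dir.mkdir(parents=True, exist_ok=True)
    # A 'timestamp' key supplied by the caller takes precedence over this one.
    record = {"timestamp": datetime.now().isoformat(), **params}

    # Write to a temporary file in the same directory, then rename it into
    # place, so a crash mid-write does not leave a half-written JSON file.
    fd, tmp_name = tempfile.mkstemp(dir=results_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(record, f, indent=2)
        os.replace(tmp_name, results_dir / filename)
    except BaseException:
        os.unlink(tmp_name)
        raise
    return results_dir / filename
```

The same idea could be applied to the larger checkpoint files, at the cost of temporarily needing disk space for two copies.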

Load Cell Pattern (Start of Notebook)

python
# Load results from the previous notebook.
# The path is relative to the kernel's working directory (assumed to be project/).
import os
import json

import scanpy as sc

results_dir = 'results/1_preprocessing'

# Fail fast with an actionable message if the checkpoint is missing
if not os.path.exists(results_dir):
    raise FileNotFoundError(
        f"Results directory '{results_dir}' not found. "
        f"Run 1_preprocessing.ipynb first."
    )

# Load AnnData objects
protein_adata = sc.read_h5ad(f'{results_dir}/protein_adata.h5ad')
rna_adata = sc.read_h5ad(f'{results_dir}/rna_adata.h5ad')

# Load parameters
with open(f'{results_dir}/params.json', 'r') as f:
    prev_params = json.load(f)

print(f"Loaded from {results_dir}/")
print(f"Previous run: {prev_params['timestamp']}")
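
The directory check above passes even when the previous notebook wrote only some of its outputs. A possible extension, sketched here with a hypothetical helper named `require_checkpoint`, checks every expected file and reports all missing ones at once.

```python
from pathlib import Path


def require_checkpoint(results_dir, expected_files, producer):
    """Return results_dir as a Path, or raise listing every missing file."""
    results_dir = Path(results_dir)
    # A missing directory simply makes every expected file count as missing.
    missing = [name for name in expected_files if not (results_dir / name).is_file()]
    if missing:
        raise FileNotFoundError(
            f"Missing in '{results_dir}': {', '.join(missing)}. Run {producer} first."
        )
    return results_dir
```

For the first checkpoint this might be called as `require_checkpoint('results/1_preprocessing', ['protein_adata.h5ad', 'rna_adata.h5ad', 'params.json'], '1_preprocessing.ipynb')` before any file is read.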

Format Recommendations

| Data Type | Format | Why |
|-----------|--------|-----|
| AnnData objects | .h5ad | Native scanpy, preserves obs/var/uns |
| NumPy arrays | .npy | Fast, compact, preserves dtype |
| Sparse matrices | .npz | Efficient for sparse data |
| DataFrames | .parquet | Fast, compressed, schema preserved |
| Small DataFrames | .csv | Human-readable for inspection |
| Python objects | .pkl | Arbitrary objects (dicts, lists) |
| Parameters | .json | Human-readable, git-friendly |
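
The table can be turned into a small dispatcher that picks a writer from the file extension. This is only a sketch: `save_checkpoint` is a hypothetical name, the optional libraries (NumPy, SciPy) are imported lazily so that the JSON and pickle branches work without them, and each branch assumes the object type noted in its comment.

```python
import json
import pickle
from pathlib import Path


def save_checkpoint(obj, path):
    """Write obj to path, choosing the writer from the file extension."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    suffix = path.suffix

    if suffix == ".json":
        path.write_text(json.dumps(obj, indent=2))
    elif suffix == ".pkl":
        with open(path, "wb") as f:
            pickle.dump(obj, f)
    elif suffix == ".npy":
        import numpy as np  # assumes obj is a NumPy array
        np.save(str(path), obj)
    elif suffix == ".npz":
        from scipy import sparse  # assumes obj is a SciPy sparse matrix
        sparse.save_npz(str(path), obj)
    elif suffix == ".parquet":
        obj.to_parquet(str(path))  # assumes obj is a pandas DataFrame
    elif suffix == ".csv":
        obj.to_csv(str(path))  # assumes obj is a pandas DataFrame
    elif suffix == ".h5ad":
        obj.write_h5ad(str(path))  # assumes obj is an AnnData object
    else:
        raise ValueError(f"No writer registered for '{suffix}' files")
    return path
```

Raising on an unknown extension, rather than falling back to pickle, keeps the choice of format explicit.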

Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned |
|---------|---------------|----------------|
| Save to data/ directory | Mixed raw and processed data | Use separate results/ directory |
| Single flat results/ folder | Files from different notebooks overlap | Use subdirectories per notebook |
| Only save final outputs | Can't restart from intermediate steps | Save after each major notebook |
| Save everything to pickle | Can't inspect without loading | Use JSON for params, h5ad for AnnData |
| Hardcode absolute paths | Breaks on different machines | Use relative paths from notebook |
| No timestamps | Can't tell which run produced outputs | Include timestamp in params JSON |

Key Insights

  • Clear directory structure: results/{notebook_name}/ keeps outputs organized
  • Fail fast on missing inputs: Check directory exists before loading
  • Human-readable params: JSON for parameters enables inspection and version control
  • Include timestamps: Know when outputs were generated
  • Print next steps: Tell user which notebook to run next
  • Separate raw from processed: Never overwrite raw data in data/
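
The timestamps can also be used to detect a stale checkpoint, for example when preprocessing was rerun after integration. The sketch below compares the `timestamp` field of two parameter files. The helper name `is_stale` is hypothetical, and it assumes both files were written with `datetime.now().isoformat()` on the same machine.

```python
import json
from datetime import datetime
from pathlib import Path


def is_stale(upstream_params, downstream_params):
    """Return True if the upstream checkpoint is newer than the downstream one."""

    def read_timestamp(path):
        record = json.loads(Path(path).read_text())
        return datetime.fromisoformat(record["timestamp"])

    return read_timestamp(upstream_params) > read_timestamp(downstream_params)
```

A visualization notebook might call `is_stale('results/1_preprocessing/params.json', 'results/2_integration/params.json')` and warn the user to rerun integration when it returns `True`.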

Git Integration

Add to .gitignore:

gitignore
# Large result files
results/**/*.h5ad
results/**/*.npy
results/**/*.pkl
results/**/*.parquet

# Keep param files for reproducibility
!results/**/*.json
!results/**/*.csv
