Add Step (p2cs_project)

When to Use

Use this skill when:

•A new pipeline step needs to be added under code/step_{N}_{name}/.
•A step should be inserted between existing steps, requiring renumbering of later steps.
•You need a checklist for creating the step class, substeps, README, and explore.ipynb, and for wiring dependencies and config.

This skill is project-specific to p2cs_project and assumes the architecture described in the root README.md.

Overview: Step Pattern

Each numbered step follows the same high-level pattern:

•
Directory: code/step_{N}_{snake_name}/
- •step.py: main Step{N}{CamelName} class inheriting from base.Step.
- •data_classes.py: data classes inheriting from DataFile subclasses (when the step has new artifacts).
- •__init__.py: re-exports the main step class (and sometimes helper symbols).
- •README.md: step-specific documentation (inputs, outputs, substeps, external tools).
- •explore.ipynb: exploration notebook following the standard notebook structure from the root README.md.
- •Numbered substeps: 1_*.py, 2_*.py, etc., each implementing focused logic.
•Outputs are written to data/step_{N}_{snake_name}/ and figures to figures/step_{N}_{snake_name}/; both locations are resolved through base.paths.
•
The step is registered in:
- •pipeline_config.yaml under steps: step_{N}_{snake_name}: ...
- •Tests in code/tests/test_step_{N}.py.
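As a quick reference, the layout above can be sketched as a helper that lists the paths a step is expected to own. This is only an illustration: the function name expected_step_paths is hypothetical and not part of the project, and numbered substep files are left out because their names vary per step.

```python
from pathlib import PurePosixPath
from typing import Dict, List

# Files named in the step pattern above (substeps such as 1_*.py vary per step).
STEP_FILES = ("__init__.py", "step.py", "data_classes.py", "README.md", "explore.ipynb")


def expected_step_paths(index: int, snake_name: str) -> Dict[str, List[str]]:
    """Return the repo-relative paths associated with one numbered step."""
    step_id = f"step_{index}_{snake_name}"
    code_dir = PurePosixPath("code") / step_id
    return {
        "code": [str(code_dir / filename) for filename in STEP_FILES],
        "data": [f"data/{step_id}"],
        "figures": [f"figures/{step_id}"],
        "tests": [f"code/tests/test_step_{index}.py"],
    }
```

Calling expected_step_paths(4, "prepare_pairs") lists the files under code/step_4_prepare_pairs/ together with the matching data, figures, and test locations.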

Good templates to copy from:

•Simple single-substep data step: step_4_prepare_pairs/
•Multi-substep data/tooling step: step_2_organism_distance/
•Modeling/evaluation step: step_6_train_model/, step_7_crosstalk_estimation/

Workflow A: Append a New Step at the End

•
Determine the new step index and name
- •Inspect existing numbered steps under code/step_*.
- •Let N_max be the largest index (currently 8 for step_8_generate_paper).
- •
  Choose:
  - •New index: N_new = N_max + 1
  - •Snake name: step_{N_new}_{snake_name}
  - •Class name: Step{N_new}{CamelName}
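The naming rule above can be sketched as a small helper. It assumes step directories are named step_{N}_{snake_name} and that the class name is formed by capitalizing each underscore-separated word; the function name next_step_names and the example name summarize_results are hypothetical, and the project may capitalize some names differently.

```python
import re
from typing import Iterable, Tuple

STEP_DIR_PATTERN = re.compile(r"^step_(\d+)_([a-z0-9_]+)$")


def next_step_names(existing_dirs: Iterable[str], snake_name: str) -> Tuple[str, str]:
    """Return (directory name, class name) for a step appended after the last one."""
    # Ignore anything that is not a numbered step directory (e.g. base/, tests/).
    indices = [int(m.group(1)) for d in existing_dirs if (m := STEP_DIR_PATTERN.match(d))]
    if not indices:
        raise ValueError("no step_{N}_{name} directories found")
    n_new = max(indices) + 1
    # Assumed CamelCase rule: capitalize each underscore-separated word.
    camel_name = "".join(word.capitalize() for word in snake_name.split("_"))
    return f"step_{n_new}_{snake_name}", f"Step{n_new}{camel_name}"
```

With the current maximum index of 8, next_step_names(["step_0_draw_theoretical", "step_8_generate_paper"], "summarize_results") yields the directory step_9_summarize_results and the class Step9SummarizeResults.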
•
Create the step directory
- •
  Create code/step_{N_new}_{snake_name}/ with at least:
  - •__init__.py (re-export the main step class).
  - •step.py (main step implementation).
  - •README.md (step documentation).
  - •explore.ipynb (exploration notebook).
  - •One or more numbered substeps 1_*.py, 2_*.py, etc.
  - •data_classes.py and/or config.json if this step defines new data types or config.
- •Recommended pattern: Copy the closest existing step directory (e.g., step_4_prepare_pairs/) and rename/trim to match the new step’s responsibilities.
•
Implement the main step class
- •
  In step.py:
  - •Import Step and path helpers from code/base/step.py and code/base/paths.py.
  - •Define a class like:
    
    •class Step{N_new}{CamelName}(Step):
  - •Implement:
    
    •name and description properties (or class attributes).
    
    •dependencies property returning a List[str] of upstream steps, using canonical IDs like "step_1_get_p2cs_data".
    
    •get_input_paths() and get_output_paths() using data classes and get_step_input_path / get_step_output_path.
    
    •run() orchestrating any substeps via self.run_substeps(...).
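A minimal sketch of such a class is shown below. To keep it runnable on its own it defines a stand-in Step base class; the real one lives in code/base/step.py and its exact interface may differ. The step name, description, file names, and the use of plain Path objects in place of get_step_input_path / get_step_output_path are all illustrative assumptions.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List


class Step(ABC):
    """Stand-in for base.Step so the sketch runs alone (hypothetical interface)."""

    @property
    @abstractmethod
    def name(self) -> str: ...

    @property
    @abstractmethod
    def dependencies(self) -> List[str]: ...

    @abstractmethod
    def run(self) -> None: ...


class Step9SummarizeResults(Step):
    """Hypothetical new step appended after step 8."""

    @property
    def name(self) -> str:
        return "step_9_summarize_results"

    @property
    def description(self) -> str:
        return "Summarize results (illustrative description)"

    @property
    def dependencies(self) -> List[str]:
        # Upstream steps are referenced by their canonical IDs.
        return ["step_1_get_p2cs_data"]

    def get_input_paths(self) -> Dict[str, Path]:
        # The real step would build these with get_step_input_path and data classes.
        return {"upstream": Path("data") / "step_1_get_p2cs_data" / "input.pkl"}

    def get_output_paths(self) -> Dict[str, Path]:
        # The real step would build these with get_step_output_path and data classes.
        return {"summary": Path("data") / self.name / "summary.csv"}

    def run(self) -> None:
        # The real step would call self.run_substeps(...) here.
        self.completed = True
```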
•
Define data classes (if needed)
- •
  In data_classes.py:
  - •Inherit from appropriate DataFile subclasses (e.g., PickleDataFile, CSVDataFile, NumpyDataFile).
  - •Define schemas, descriptions, and default loaders/savers as in existing steps.
- •Use these data classes in get_input_paths() / get_output_paths() and in substeps.
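The sketch below illustrates the idea of a data class that bundles a path, a description, a schema, and load/save logic. The CSVDataFile defined here is a simplified stand-in written for this example, not the project's class, and SummaryTable with its columns is a hypothetical artifact; copy the real field names and methods from an existing data_classes.py.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Tuple


@dataclass
class CSVDataFile:
    """Simplified stand-in for the project's CSVDataFile (hypothetical interface)."""

    path: Path
    description: str = ""
    columns: Tuple[str, ...] = ()

    def save(self, rows: List[Dict[str, str]]) -> None:
        self.path.parent.mkdir(parents=True, exist_ok=True)
        with self.path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(self.columns))
            writer.writeheader()
            writer.writerows(rows)

    def load(self) -> List[Dict[str, str]]:
        with self.path.open(newline="") as handle:
            return list(csv.DictReader(handle))


@dataclass
class SummaryTable(CSVDataFile):
    """Hypothetical artifact produced by a new step."""

    description: str = "Per-organism summary (illustrative)"
    columns: Tuple[str, ...] = ("organism", "score")


def roundtrip_demo() -> List[Dict[str, str]]:
    """Save and reload one row in a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        table = SummaryTable(Path(tmp) / "step_9_summarize_results" / "summary.csv")
        table.save([{"organism": "E. coli", "score": "0.9"}])
        return table.load()
```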
•
Create numbered substeps
- •Add scripts 1_*.py, 2_*.py, etc. inside the new step directory.
- •
  Follow existing substep patterns:
  - •Each substep is a small class/function using the step’s data classes and paths helpers.
  - •The main step’s run() calls self.run_substeps(...) with:
    
    •Substep objects
    
    •step_numbers=[1, 2, ...]
    
    •descriptions=[...]
    
    •Appropriate on_failure mode ("strict" or "warning").
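To make the two on_failure modes concrete, here is a standalone sketch of how such an orchestration helper could behave. It is not the project's run_substeps implementation: the signature, the callable-based substeps, and the return value are assumptions chosen for illustration.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)


def run_substeps(
    substeps: Sequence[Callable[[], None]],
    step_numbers: Sequence[int],
    descriptions: Sequence[str],
    on_failure: str = "strict",
) -> List[int]:
    """Run substeps in order and return the numbers of those that succeeded."""
    if on_failure not in ("strict", "warning"):
        raise ValueError(f"unknown on_failure mode: {on_failure!r}")
    completed: List[int] = []
    for substep, number, description in zip(substeps, step_numbers, descriptions):
        try:
            substep()
        except Exception:
            if on_failure == "strict":
                raise  # abort the whole step on the first failure
            logger.warning("Substep %s (%s) failed; continuing", number, description)
            continue
        completed.append(number)
    return completed


def succeeding_substep() -> None:
    """Example substep that does nothing."""


def failing_substep() -> None:
    """Example substep that always fails."""
    raise RuntimeError("simulated failure")
```

In "warning" mode a failing substep is logged and skipped, while in "strict" mode the exception propagates and later substeps do not run.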
•
Create the step README
- •
  In README.md, mirror the structure used in other steps:
  - •Short description.
  - •Inputs (data classes, upstream steps).
  - •Outputs and their data classes.
  - •Substeps and what they do.
  - •Any external tools / configs required.
•
Create the explore.ipynb notebook
- •
  Follow the standard structure from the root README.md:
  - •# Imports (path setup + step/data class imports).
  - •# Load Data
    
    •## Load Inputs (using step.get_input_paths() + data classes).
    
    •## Load Outputs.
  - •# Plot
    
    •Display saved figures from visualization substeps first.
    
    •Put any extra exploratory plots after those.
  - •# Notes (short list of exploration ideas).
- •Respect the collapsible headings rule (heading-only markdown cells).
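The notebook structure above can be generated as a skeleton, which helps keep every heading in its own markdown cell. The sketch builds a plain notebook dictionary in the Jupyter nbformat 4 layout using only the standard library; the function names and the placeholder cell contents are hypothetical.

```python
import json
from typing import Any, Dict, List

HEADINGS = ("# Imports", "# Load Data", "## Load Inputs", "## Load Outputs", "# Plot", "# Notes")


def markdown_cell(text: str) -> Dict[str, Any]:
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(source: str = "") -> Dict[str, Any]:
    return {"cell_type": "code", "execution_count": None,
            "metadata": {}, "outputs": [], "source": source}


def build_explore_notebook() -> Dict[str, Any]:
    """Build an explore.ipynb skeleton with heading-only markdown cells."""
    cells: List[Dict[str, Any]] = []
    for heading in HEADINGS:
        cells.append(markdown_cell(heading))  # heading alone in its cell
        if heading == "# Load Data":
            continue  # parent heading; content lives under its subsections
        if heading == "# Notes":
            cells.append(markdown_cell("- (exploration ideas go here)"))
        else:
            cells.append(code_cell())  # empty placeholder for code
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


def to_ipynb_text(notebook: Dict[str, Any]) -> str:
    """Serialize the notebook dictionary to .ipynb JSON text."""
    return json.dumps(notebook, indent=1)
```

Writing to_ipynb_text(build_explore_notebook()) to explore.ipynb gives an empty notebook with the standard sections, ready to fill in.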
•
Wire the step into pipeline_config.yaml
- •
  Under steps:, add a new entry:
  - •Key: step_{N_new}_{snake_name}:
  - •Fields: enabled, description, overwrite_outputs, optional fast_plots, and substeps:.
- •Key each entry under substeps: by the substep filename without the .py extension, matching the pattern used in other steps.
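The shape of such an entry can be sketched as a Python dictionary that mirrors the YAML structure. The field names come from the list above, but the default values, the per-substep {"enabled": True} payload, and the function name build_step_config are assumptions; check an existing entry in pipeline_config.yaml for the real values.

```python
from pathlib import PurePath
from typing import Any, Dict, Optional, Sequence


def build_step_config(
    description: str,
    substep_files: Sequence[str],
    fast_plots: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build the mapping stored under steps: step_{N_new}_{snake_name}: ..."""
    entry: Dict[str, Any] = {
        "enabled": True,             # assumed default
        "description": description,
        "overwrite_outputs": False,  # assumed default
    }
    if fast_plots is not None:
        entry["fast_plots"] = fast_plots  # optional field
    # Substeps are keyed by filename without the .py extension.
    entry["substeps"] = {
        PurePath(filename).stem: {"enabled": True}  # assumed per-substep payload
        for filename in substep_files
    }
    return entry
```

For example, build_step_config("Summarize results", ["1_compute_summary.py", "2_plot_summary.py"]) produces substep keys 1_compute_summary and 2_plot_summary.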
•
Add tests
- •
  Create code/tests/test_step_{N_new}.py by copying a nearby test (e.g., test_step_4.py) and adjusting:
  - •Imports to the new step and data classes.
  - •Test names and assertions to cover the new step’s behavior.
•
Run tests / pipeline checks
- •From the repository root, run pytest code/tests/test_step_{N_new}.py.
- •
  Optionally run the step via:
  - •cd code && python run_pipeline.py --step step_{N_new}_{snake_name}

Workflow B: Insert a Step in the Middle (with Renumbering)

Use this when inserting a new step between existing steps (e.g., between step_3_embed_proteins and step_4_prepare_pairs).

B1. Plan the new ordering

•
Identify current step order
- •List existing code/step_* directories and their indices (including step_0_draw_theoretical).
•
Choose insertion point
- •
  Let:
  - •N_insert_after = index of the step before the new one.
  - •N_new = N_insert_after + 1.
- •
  All steps with index > N_insert_after must be shifted up by 1:
  - •Old k → new k + 1 for all k > N_insert_after.
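The index arithmetic above is small enough to check with a helper. This is a sketch with a hypothetical function name; it only computes the mapping and does not touch any files.

```python
from typing import Dict, Iterable, Tuple


def renumbering_map(existing_indices: Iterable[int], n_insert_after: int) -> Tuple[int, Dict[int, int]]:
    """Return (N_new, {old k: new k + 1}) for every k > N_insert_after.

    The mapping is ordered from the highest old index to the lowest,
    which is the order in which the renames should be performed.
    """
    shifted = {
        k: k + 1
        for k in sorted(existing_indices, reverse=True)
        if k > n_insert_after
    }
    return n_insert_after + 1, shifted
```

With steps 0 through 8 and an insertion after step 3, the new step gets index 4 and steps 8, 7, 6, 5, 4 move to 9, 8, 7, 6, 5 in that order.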
•
Decide the new step’s ID
- •
  Choose:
  - •New directory name: step_{N_new}_{snake_name}.
  - •New class name: Step{N_new}{CamelName}.

B2. Renumber existing steps (highest → lowest)

Rename from the highest index down to N_insert_after + 1 so that each target name (for example test_step_{k+1}.py) is already free when it is needed.

For each step index k in descending order where k > N_insert_after:

•
Compute new index
- •k_new = k + 1.
•
Rename step directories
- •Code: code/step_{k}_{name}/ → code/step_{k_new}_{name}/.
- •Data: data/step_{k}_{name}/ → data/step_{k_new}_{name}/ (if exists).
- •Figures: figures/step_{k}_{name}/ → figures/step_{k_new}_{name}/ (if exists).
•
Rename tests
- •code/tests/test_step_{k}.py → code/tests/test_step_{k_new}.py.
•
Update configuration keys
- •
  In pipeline_config.yaml, change:
  - •step_{k}_{name}: → step_{k_new}_{name}:.
•
Update string references and imports
- •
  Use text search for step_{k}_{name} and test_step_{k} across the repo and update to the new IDs:
  - •Imports like from step_{k}_{name}....
  - •Dependency lists in dependencies properties (e.g., return ["step_{k}_{name}", ...]).
  - •Any key strings that reference step_{k}_{name}.
•
Update doc references
- •In README.md files and notebooks, update any textual references to the old step name or number, if present.
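The directory and test renames in this procedure can be planned before anything is moved. The sketch below only builds a list of (old path, new path) pairs in highest-to-lowest order; the function name plan_renames is hypothetical, and applying the plan (for example with a version-control move command), skipping paths that do not exist, is left to the caller.

```python
import re
from typing import Iterable, List, Tuple

STEP_DIR = re.compile(r"^step_(\d+)_(.+)$")


def plan_renames(step_dirs: Iterable[str], n_insert_after: int) -> List[Tuple[str, str]]:
    """List (old, new) repo-relative paths for all steps with index > n_insert_after."""
    parsed = sorted(
        ((int(m.group(1)), m.group(2)) for d in step_dirs if (m := STEP_DIR.match(d))),
        reverse=True,  # highest index first so targets are free
    )
    plan: List[Tuple[str, str]] = []
    for k, name in parsed:
        if k <= n_insert_after:
            continue
        old_id, new_id = f"step_{k}_{name}", f"step_{k + 1}_{name}"
        for root in ("code", "data", "figures"):
            plan.append((f"{root}/{old_id}", f"{root}/{new_id}"))
        plan.append((f"code/tests/test_step_{k}.py", f"code/tests/test_step_{k + 1}.py"))
    return plan
```

The same (old ID, new ID) pairs can then drive the updates to pipeline_config.yaml keys, imports, and dependency strings.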

B3. Add the new step

After all affected steps k > N_insert_after have been shifted to k + 1:

•
Create code/step_{N_new}_{snake_name}/
- •
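  Follow Workflow A, from "Create the step directory" through "Create the explore.ipynb notebook" (its second through seventh items), to: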
  - •Implement step.py and data_classes.py.
  - •Add numbered substeps.
  - •Add README.md.
  - •Add explore.ipynb.
•
Wire into pipeline_config.yaml
- •
  Under steps: add:
  - •step_{N_new}_{snake_name}: with its configuration and substeps.
•
Update dependencies
- •
  For the new step:
  - •Set dependencies to the upstream steps, using the renumbered IDs.
- •
  For downstream steps:
  - •Review their dependencies properties:
    
    •Replace any old IDs that were shifted, and add the new step as a dependency where appropriate.
•
Add test file
- •Create code/tests/test_step_{N_new}.py following neighboring step tests.
•
Sanity check references
- •
  Run a repo-wide search for any old step IDs (step_{k}_{name} where k was renumbered) and ensure:
  - •All references are either removed or updated to the new IDs.
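This sanity check can be sketched as a function that scans file contents for old step IDs. It works on an in-memory mapping of file name to text so it stays independent of the repository layout; the function name find_stale_references is hypothetical. It searches only for full IDs such as step_4_prepare_pairs, because a bare test_step_4 becomes valid again once the new step takes that index.

```python
import re
from typing import Dict, Iterable, List


def find_stale_references(text_by_file: Dict[str, str], old_ids: Iterable[str]) -> Dict[str, List[int]]:
    """Map each file to the 1-based line numbers that still mention an old step ID."""
    ids = sorted(set(old_ids), key=len, reverse=True)
    if not ids:
        return {}
    # Word boundaries keep step_4_x from matching inside step_14_x.
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(i) for i in ids) + r")\b")
    hits: Dict[str, List[int]] = {}
    for filename, text in text_by_file.items():
        lines = [n for n, line in enumerate(text.splitlines(), start=1) if pattern.search(line)]
        if lines:
            hits[filename] = lines
    return hits
```

An empty result means no file in the scanned set still refers to a renumbered step by its old ID.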

B4. Validate after renumbering

•
Run targeted tests
- •
  Run:
  - •cd code && pytest tests/test_step_{N_new}.py
  - •Plus tests for all renumbered steps: test_step_{k_new}.py.
•
Run a dry pipeline
- •
  Optionally run:
  - •cd code && python run_pipeline.py --list-steps to confirm updated IDs and ordering.
  - •cd code && python run_pipeline.py --step step_{N_new}_{snake_name} to test the new step in context.

Notebook Guidelines (Quick Reference)

When creating or editing explore.ipynb for a step:

•
Follow the standard sections:
- •# Imports
- •
  # Load Data
  - •## Load Inputs
  - •## Load Outputs
- •# Plot
- •# Notes
•Ensure each heading is in its own markdown cell to enable collapsible sections.
•Use the data classes for loading inputs/outputs, not raw paths.
•Display visualization substep figures first under # Plot; additional exploratory plots come after.

Usage Summary

When asked to add a new step:

•Decide whether it is an append (Workflow A) or insert with renumbering (Workflow B).
•
Follow the appropriate workflow, paying particular attention to:
- •Directory and file naming: step_{N}_{snake_name}, test_step_{N}.py.
- •Dependency updates and imports.
- •pipeline_config.yaml step and substep entries.
- •explore.ipynb structure and data class usage.
•Always finish by running the relevant tests and, if feasible, a pipeline run of the new step.