Add Step (p2cs_project)
When to Use
Use this skill when:
- •A new pipeline step needs to be added under code/step_{N}_{name}/.
- •A step should be inserted between existing steps, requiring renumbering of later steps.
- •You need a checklist for creating the step class, substeps, README, explore.ipynb, and wiring dependencies/config.
This skill is project-specific to p2cs_project and assumes the architecture described in the root README.md.
Overview: Step Pattern
Each numbered step follows the same high-level pattern:
- •Directory: code/step_{N}_{snake_name}/
  - •step.py: main Step{N}{CamelName} class inheriting from base.Step.
  - •data_classes.py: data classes inheriting from DataFile subclasses (when the step has new artifacts).
  - •__init__.py: re-exports the main step class (and sometimes helper symbols).
  - •README.md: step-specific documentation (inputs, outputs, substeps, external tools).
  - •explore.ipynb: exploration notebook following the standard notebook structure from the root README.md.
  - •Numbered substeps: 1_*.py, 2_*.py, etc., each implementing focused logic.
- •Outputs in data/step_{N}_{snake_name}/ and figures in figures/step_{N}_{snake_name}/ are routed via base.paths.
- •The step is registered in:
  - •pipeline_config.yaml under steps: step_{N}_{snake_name}: ...
  - •Tests in code/tests/test_step_{N}.py.
Good templates to copy from:
- •Simple single-substep data step: step_4_prepare_pairs/
- •Multi-substep data/tooling step: step_2_organism_distance/
- •Modeling/evaluation step: step_6_train_model/, step_7_crosstalk_estimation/
Workflow A: Append a New Step at the End
1. Determine the new step index and name
   - •Inspect existing numbered steps under code/step_*.
   - •Let N_max be the largest index (currently 8 for step_8_generate_paper).
   - •Choose:
     - •New index: N_new = N_max + 1
     - •Snake name: step_{N_new}_{snake_name}
     - •Class name: Step{N_new}{CamelName}
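The index and naming rules above can be sketched as a small helper. This is only an illustration: the function name next_step_names is hypothetical, and the CamelName conversion (capitalizing each underscore-separated part) is an assumption that may not match how existing class names were chosen.

```python
import re
from pathlib import Path
from typing import Tuple

# Matches directory names such as step_8_generate_paper.
STEP_DIR_PATTERN = re.compile(r"^step_(\d+)_([a-z0-9_]+)$")


def next_step_names(code_dir: Path, snake_name: str) -> Tuple[int, str, str]:
    """Return (N_new, directory name, class name) for a step appended at the end."""
    indices = [
        int(m.group(1))
        for p in code_dir.iterdir()
        if p.is_dir() and (m := STEP_DIR_PATTERN.match(p.name))
    ]
    # N_new = N_max + 1; an empty code directory yields index 0.
    n_new = max(indices, default=-1) + 1
    # Assumed convention: my_new_step -> MyNewStep.
    camel = "".join(part.capitalize() for part in snake_name.split("_"))
    return n_new, f"step_{n_new}_{snake_name}", f"Step{n_new}{camel}"
```

Calling next_step_names(Path("code"), "my_new_step") from the repository root would then suggest the directory and class names to use in the remaining steps.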
2. Create the step directory
   - •Create code/step_{N_new}_{snake_name}/ with at least:
     - •__init__.py (re-export the main step class).
     - •step.py (main step implementation).
     - •README.md (step documentation).
     - •explore.ipynb (exploration notebook).
     - •One or more numbered substeps 1_*.py, 2_*.py, etc.
     - •data_classes.py and/or config.json if this step defines new data types or config.
   - •Recommended pattern: Copy the closest existing step directory (e.g., step_4_prepare_pairs/) and rename/trim to match the new step’s responsibilities.
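When copying an existing step is not convenient, the file list above could be scaffolded with a short script. The sketch below is an assumption-laden illustration: scaffold_step is a hypothetical helper, the files it creates are empty placeholders, and the notebook is written as a minimal nbformat 4 document so that Jupyter can open it.

```python
import json
from pathlib import Path
from typing import List

# Minimal valid notebook (nbformat 4) with no cells.
EMPTY_NOTEBOOK = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}


def scaffold_step(code_dir: Path, step_id: str, substeps: List[str]) -> Path:
    """Create an empty skeleton for a new step directory and return its path."""
    step_dir = code_dir / step_id
    # exist_ok=False: refuse to overwrite a step that already exists.
    step_dir.mkdir(parents=True, exist_ok=False)
    for name in ["__init__.py", "step.py", "README.md", "data_classes.py"]:
        (step_dir / name).touch()
    (step_dir / "explore.ipynb").write_text(
        json.dumps(EMPTY_NOTEBOOK, indent=1), encoding="utf-8"
    )
    # Numbered substeps: 1_<name>.py, 2_<name>.py, ...
    for number, name in enumerate(substeps, start=1):
        (step_dir / f"{number}_{name}.py").touch()
    return step_dir
```

Copying a real step remains the recommended route, since it brings working imports and conventions with it; a scaffold like this only guarantees that the expected files exist.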
3. Implement the main step class
   - •In step.py:
     - •Import Step and path helpers from code/base/step.py and code/base/paths.py.
     - •Define a class like:
       class Step{N_new}{CamelName}(Step):
     - •Implement:
       - •name and description properties (or class attributes).
       - •dependencies property returning a List[str] of upstream steps, using canonical IDs like "step_1_get_p2cs_data".
       - •get_input_paths() and get_output_paths() using data classes and get_step_input_path / get_step_output_path.
       - •run() orchestrating any substeps via self.run_substeps(...).
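A minimal sketch of such a class is shown below. It is not the project's implementation: the Step base class here is a stand-in so the example runs on its own, the real base.Step and the get_step_input_path / get_step_output_path helpers may have different signatures, and the step name, file names, and DATA_ROOT constant are all hypothetical placeholders.

```python
from pathlib import Path
from typing import Dict, List


class Step:
    """Stand-in for base.Step (assumption: the real class offers more than this)."""

    def run_substeps(self, substeps, step_numbers, descriptions, on_failure="strict"):
        # Assumed behaviour: call each substep in order.
        for substep in substeps:
            substep()


# Assumption: the real project resolves data directories via base.paths.
DATA_ROOT = Path("data")


class Step9MyNewStep(Step):
    """Hypothetical step appended after step_8_generate_paper."""

    name = "step_9_my_new_step"
    description = "Example step used to illustrate the pattern"

    @property
    def dependencies(self) -> List[str]:
        # Canonical IDs of upstream steps.
        return ["step_1_get_p2cs_data"]

    def get_input_paths(self) -> Dict[str, Path]:
        # Placeholder file name; real steps derive this from data classes.
        return {"upstream": DATA_ROOT / "step_1_get_p2cs_data" / "example_input.csv"}

    def get_output_paths(self) -> Dict[str, Path]:
        return {"result": DATA_ROOT / self.name / "example_output.csv"}

    def run(self) -> None:
        self.run_substeps(
            [lambda: None],  # placeholder substep
            step_numbers=[1],
            descriptions=["placeholder substep"],
            on_failure="strict",
        )
```

In the real step, the stand-in base class and DATA_ROOT would be replaced by imports from code/base/step.py and code/base/paths.py.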
4. Define data classes (if needed)
   - •In data_classes.py:
     - •Inherit from appropriate DataFile subclasses (e.g., PickleDataFile, CSVDataFile, NumpyDataFile).
     - •Define schemas, descriptions, and default loaders/savers as in existing steps.
   - •Use these data classes in get_input_paths() / get_output_paths() and in substeps.
5. Create numbered substeps
   - •Add scripts 1_*.py, 2_*.py, etc. inside the new step directory.
   - •Follow existing substep patterns:
     - •Each substep is a small class/function using the step’s data classes and paths helpers.
     - •The main step’s run() calls self.run_substeps(...) with:
       - •Substep objects
       - •step_numbers=[1, 2, ...]
       - •descriptions=[...]
       - •Appropriate on_failure mode ("strict" or "warning").
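To make the two on_failure modes concrete, here is a stand-in for run_substeps written as a free function. The semantics are an assumption inferred from the mode names ("strict" re-raises the first failure, "warning" logs it and continues); the project's actual method on base.Step may behave differently.

```python
import logging
from typing import Callable, List, Sequence

logger = logging.getLogger(__name__)


def run_substeps(
    substeps: Sequence[Callable[[], None]],
    step_numbers: Sequence[int],
    descriptions: Sequence[str],
    on_failure: str = "strict",
) -> List[int]:
    """Run substeps in order and return the numbers of those that succeeded."""
    if on_failure not in ("strict", "warning"):
        raise ValueError(f"unknown on_failure mode: {on_failure!r}")
    completed = []
    for substep, number, description in zip(substeps, step_numbers, descriptions):
        try:
            substep()
        except Exception:
            if on_failure == "strict":
                raise  # abort the whole step on the first failure
            # "warning": record the failure and move on to the next substep.
            logger.warning("Substep %s (%s) failed", number, description, exc_info=True)
            continue
        completed.append(number)
    return completed
```

Under these assumed semantics, "warning" suits optional substeps such as plotting, while "strict" suits substeps whose outputs later substeps depend on.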
6. Create the step README
   - •In README.md, mirror the structure used in other steps:
     - •Short description.
     - •Inputs (data classes, upstream steps).
     - •Outputs and their data classes.
     - •Substeps and what they do.
     - •Any external tools / configs required.
7. Create the explore.ipynb notebook
   - •Follow the standard structure from the root README.md:
     - •# Imports (path setup + step/data class imports).
     - •# Load Data
       - •## Load Inputs (using step.get_input_paths() + data classes).
       - •## Load Outputs.
     - •# Plot
       - •Display saved figures from visualization substeps first.
       - •Put any extra exploratory plots after those.
     - •# Notes (short list of exploration ideas).
   - •Respect the collapsible headings rule (heading-only markdown cells).
8. Wire the step into pipeline_config.yaml
   - •Under steps:, add a new entry:
     - •Key: step_{N_new}_{snake_name}:
     - •Fields: enabled, description, overwrite_outputs, optional fast_plots, and substeps:.
   - •Add a substeps: section keyed by the filenames (without .py), matching patterns in other steps.
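The shape of such an entry can be illustrated by rendering it from Python. This is a sketch under assumptions: render_step_entry is a hypothetical helper, the default values (enabled: true, overwrite_outputs: false) are placeholders, and the per-substep enabled field is a guess at what substep entries contain; existing entries in pipeline_config.yaml are the authoritative pattern.

```python
import json
from typing import List, Optional


def render_step_entry(
    step_id: str,
    description: str,
    substeps: List[str],
    fast_plots: Optional[bool] = None,
) -> str:
    """Render a YAML fragment to paste under the top-level steps: key."""
    lines = [
        f"  {step_id}:",
        "    enabled: true",
        # json.dumps yields a double-quoted string, which YAML also accepts.
        f"    description: {json.dumps(description)}",
        "    overwrite_outputs: false",
    ]
    if fast_plots is not None:
        lines.append(f"    fast_plots: {str(fast_plots).lower()}")
    lines.append("    substeps:")
    for name in substeps:
        # Keys are substep filenames without the .py suffix.
        lines.append(f"      {name}:")
        lines.append("        enabled: true")  # assumed field
    return "\n".join(lines)
```

For example, render_step_entry("step_9_my_new_step", "Example step", ["1_compute", "2_plot"]) produces a block keyed by the step ID with one nested key per substep file.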
9. Add tests
   - •Create code/tests/test_step_{N_new}.py by copying a nearby test (e.g., test_step_4.py) and adjusting:
     - •Imports to the new step and data classes.
     - •Test names and assertions to cover the new step’s behavior.
10. Run tests / pipeline checks
    - •Run pytest code/tests/test_step_{N_new}.py.
    - •Optionally run the step via:
      - •cd code && python run_pipeline.py --step step_{N_new}_{snake_name}
Workflow B: Insert a Step in the Middle (with Renumbering)
Use this when inserting a new step between existing steps (e.g., between step_3_embed_proteins and step_4_prepare_pairs).
B1. Plan the new ordering
1. Identify current step order
   - •List existing code/step_* directories and their indices (including step_0_draw_theoretical).
2. Choose insertion point
   - •Let:
     - •N_insert_after = index of the step before the new one.
     - •N_new = N_insert_after + 1.
   - •All steps with index > N_insert_after must be shifted up by 1:
     - •Old k → new k + 1 for all k > N_insert_after.
3. Decide the new step’s ID
   - •Choose:
     - •New directory name: step_{N_new}_{snake_name}.
     - •New class name: Step{N_new}{CamelName}.
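The shift rule (old k becomes k + 1 for every k > N_insert_after) can be written down as a pure function that maps old step IDs to new ones before anything is renamed. The helper name plan_renumbering is hypothetical.

```python
import re
from typing import Dict, List

STEP_ID_PATTERN = re.compile(r"^step_(\d+)_(.+)$")


def plan_renumbering(step_ids: List[str], n_insert_after: int) -> Dict[str, str]:
    """Map each step ID that must shift to its new ID; other IDs are omitted."""
    plan = {}
    for step_id in step_ids:
        match = STEP_ID_PATTERN.match(step_id)
        if match is None:
            continue  # not a numbered step
        k, name = int(match.group(1)), match.group(2)
        if k > n_insert_after:
            plan[step_id] = f"step_{k + 1}_{name}"
    return plan
```

Inserting after step_3_embed_proteins, for instance, maps step_4_prepare_pairs to step_5_prepare_pairs and leaves steps 0 through 3 untouched.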
B2. Renumber existing steps (highest → lowest)
Perform renaming from highest index down to N_insert_after + 1 to avoid collisions.
For each step index k in descending order where k > N_insert_after:
1. Compute new index
   - •k_new = k + 1.
2. Rename step directories
   - •Code: code/step_{k}_{name}/ → code/step_{k_new}_{name}/.
   - •Data: data/step_{k}_{name}/ → data/step_{k_new}_{name}/ (if exists).
   - •Figures: figures/step_{k}_{name}/ → figures/step_{k_new}_{name}/ (if exists).
3. Rename tests
   - •code/tests/test_step_{k}.py → code/tests/test_step_{k_new}.py.
4. Update configuration keys
   - •In pipeline_config.yaml, change:
     - •step_{k}_{name}: → step_{k_new}_{name}:.
5. Update string references and imports
   - •Use text search for step_{k}_{name} and test_step_{k} across the repo and update to the new IDs:
     - •Imports like from step_{k}_{name}....
     - •Dependency lists in dependencies properties (e.g., return ["step_{k}_{name}", ...]).
     - •Any key strings that reference step_{k}_{name}.
6. Update doc references
   - •In README.md files and notebooks, update any textual references to the old step name or number, if present.
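The directory and test renames can be sketched as a script that walks the affected steps from the highest index down. It assumes that code/, data/, and figures/ sit side by side under one repository root, and shift_steps is a hypothetical helper; it only covers the filesystem renames, not the config keys, imports, or documentation updates.

```python
import re
from pathlib import Path
from typing import List, Tuple

STEP_DIR_PATTERN = re.compile(r"^step_(\d+)_(.+)$")


def shift_steps(repo_root: Path, n_insert_after: int, dry_run: bool = True) -> List[Tuple[Path, Path]]:
    """Rename steps with index > n_insert_after to index + 1, highest first."""
    steps = []
    for path in (repo_root / "code").iterdir():
        match = STEP_DIR_PATTERN.match(path.name)
        if path.is_dir() and match and int(match.group(1)) > n_insert_after:
            steps.append((int(match.group(1)), match.group(2)))

    renames = []
    # Descending order so test_step_{k+1}.py is already out of the way.
    for k, name in sorted(steps, reverse=True):
        old_id, new_id = f"step_{k}_{name}", f"step_{k + 1}_{name}"
        candidates = [
            (repo_root / "code" / old_id, repo_root / "code" / new_id),
            (repo_root / "data" / old_id, repo_root / "data" / new_id),
            (repo_root / "figures" / old_id, repo_root / "figures" / new_id),
            (
                repo_root / "code" / "tests" / f"test_step_{k}.py",
                repo_root / "code" / "tests" / f"test_step_{k + 1}.py",
            ),
        ]
        for src, dst in candidates:
            if not src.exists():
                continue  # data/figures/tests are optional
            renames.append((src, dst))
            if not dry_run:
                if dst.exists():
                    raise FileExistsError(dst)
                src.rename(dst)
    return renames
```

With dry_run=True the function only reports the planned renames, which is a convenient way to review them before touching the working tree. In a git repository, git mv may be preferable to a plain rename so that history follows the files.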
B3. Add the new step
After all affected steps k > N_insert_after have been shifted to k + 1:
1. Create code/step_{N_new}_{snake_name}/
   - •Follow Workflow A, steps 2–7 to:
     - •Implement step.py and data_classes.py.
     - •Add numbered substeps.
     - •Add README.md.
     - •Add explore.ipynb.
2. Wire into pipeline_config.yaml
   - •Under steps: add:
     - •step_{N_new}_{snake_name}: with its configuration and substeps.
3. Update dependencies
   - •For the new step:
     - •Set dependencies to the upstream steps, using the renumbered IDs.
   - •For downstream steps:
     - •Review their dependencies properties:
       - •Replace any old IDs that were shifted, and add the new step as a dependency where appropriate.
4. Add test file
   - •Create code/tests/test_step_{N_new}.py following neighboring step tests.
5. Sanity check references
   - •Run a repo-wide search for any old step IDs (step_{k}_{name} where k was renumbered) and ensure:
     - •All references are either removed or updated to the new IDs.
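The repo-wide search in the last item could be done with any text search tool; the sketch below does it in Python so the result can be asserted on. The function name find_stale_references and the set of file suffixes scanned are assumptions.

```python
from pathlib import Path
from typing import Iterable, List, Tuple

# Assumed set of text file types worth scanning.
TEXT_SUFFIXES = {".py", ".yaml", ".yml", ".md", ".ipynb", ".json"}


def find_stale_references(repo_root: Path, old_ids: Iterable[str]) -> List[Tuple[Path, int, str]]:
    """Return (file, line number, old ID) for every remaining mention of an old step ID."""
    old_ids = list(old_ids)  # allow generators to be passed in
    hits = []
    for path in sorted(repo_root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            for old_id in old_ids:
                if old_id in line:
                    hits.append((path, lineno, old_id))
    return hits
```

An empty result for the old IDs is the goal. Two caveats: test_step_{k} names cannot be checked this way once the new step reuses an index, and class names of the form Step{k}{CamelName} also embed the index, so they may be worth searching for separately.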
B4. Validate after renumbering
1. Run targeted tests
   - •Run:
     - •cd code && pytest tests/test_step_{N_new}.py
     - •Plus tests for all renumbered steps: test_step_{k_new}.py.
2. Run a dry pipeline
   - •Optionally run:
     - •cd code && python run_pipeline.py --list-steps to confirm updated IDs and ordering.
     - •cd code && python run_pipeline.py --step step_{N_new}_{snake_name} to test the new step in context.
Notebook Guidelines (Quick Reference)
When creating or editing explore.ipynb for a step:
- •Follow the standard sections:
  - •# Imports
  - •# Load Data
    - •## Load Inputs
    - •## Load Outputs
  - •# Plot
  - •# Notes
- •Ensure each heading is in its own markdown cell to enable collapsible sections.
- •Use the data classes for loading inputs/outputs, not raw paths.
- •Display visualization substep figures first under # Plot; additional exploratory plots come after.
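The heading-only rule can be checked mechanically, because an .ipynb file is plain JSON. The sketch below is one possible check, with a hypothetical function name; it flags markdown cells that contain a heading line together with other non-blank lines.

```python
import json
from pathlib import Path
from typing import List


def cells_mixing_heading_and_text(notebook_path: Path) -> List[int]:
    """Return indices of markdown cells that hold a heading plus other content."""
    notebook = json.loads(notebook_path.read_text(encoding="utf-8"))
    offenders = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue  # code cells may contain '#' comments freely
        source = cell.get("source", "")
        # nbformat allows source to be a string or a list of strings.
        text = "".join(source) if isinstance(source, list) else source
        lines = [line for line in text.splitlines() if line.strip()]
        has_heading = any(line.lstrip().startswith("#") for line in lines)
        if has_heading and len(lines) > 1:
            offenders.append(index)
    return offenders
```

An empty list means every heading sits in its own markdown cell, which is what collapsible sections rely on.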
Usage Summary
When asked to add a new step:
- •Decide whether it is an append (Workflow A) or insert with renumbering (Workflow B).
- •Follow the appropriate workflow carefully, especially:
  - •Directory and file naming: step_{N}_{snake_name}, test_step_{N}.py.
  - •Dependency updates and imports.
  - •pipeline_config.yaml step and substep entries.
  - •explore.ipynb structure and data class usage.
- •Always finish by running the relevant tests and, if feasible, a pipeline run of the new step.