Research Plan Skill
Take a research question (from /hypothesis, user text, or docs/research_ideas.md), refine it through literature review and data feasibility checks, and produce a structured research plan document.
Workflow
Step 1: Accept Input
Identify the research question source:
- •If invoked after /hypothesis: use the generated hypothesis
- •If the user provides a question directly: use that
- •If neither: read docs/research_ideas.md and present PROPOSED ideas for the user to choose
Confirm you have:
- •A research question
- •A tentative hypothesis (H0 and H1)
- •A target organism, pathway, or data type
If any of these are missing, ask the user.
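The completeness check above can be sketched as a small data structure. This is only an illustration under stated assumptions: the `ResearchInput` class, its field names, and the sample question are hypothetical and not part of the skill.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ResearchInput:
    """Hypothetical container for the inputs Step 1 asks for."""
    question: Optional[str] = None
    h0: Optional[str] = None      # null hypothesis
    h1: Optional[str] = None      # alternative hypothesis
    target: Optional[str] = None  # organism, pathway, or data type

    def missing(self) -> List[str]:
        """Return the names of fields that are unset or whitespace-only."""
        return [
            f.name
            for f in fields(self)
            if not (getattr(self, f.name) or "").strip()
        ]


# Illustrative input: only the question has been supplied so far.
draft = ResearchInput(question="Does pangenome openness vary by phylum?")
print(draft.missing())  # prints ['h0', 'h1', 'target']
```

Anything returned by `missing()` is what to ask the user for before moving on to Step 2.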
Step 2: Literature Check
Invoke /literature-review internally to search for existing work:
- •Search for the specific research question / hypothesis
- •Identify: prior results, methods used, organisms studied, gaps
- •Present findings to the user: "Here's what's already known about this topic..."
- •Store references in the project's references.md (created by /literature-review)
Step 3: Interactive Refinement Loop
This is the key differentiator — iterate with the user based on what the literature reveals:
- •Present the literature context and ask: "Given what's already known, do you want to refine the hypothesis?"
- •Offer concrete options:
  - •Narrow scope: Focus on a specific organism, phylum, or gene category
  - •Change organism: Switch to a species with better data coverage in BERDL
  - •Adjust approach: Use a different statistical method or comparison
  - •Pivot question: The literature reveals a more interesting gap to address
  - •Proceed as-is: The original hypothesis is still novel and testable
- •If the user refines, run additional targeted literature searches as needed
- •Allow 1-3 iterations until the user is satisfied
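The control flow of this loop can be sketched as follows. It is a rough outline, not the skill's implementation: the `refine` function, the `ask_user` callback, and the short option labels are hypothetical names chosen for illustration, and the cap of 3 comes from the "1-3 iterations" guidance above.

```python
from typing import Callable, Tuple

MAX_ITERATIONS = 3  # upper bound taken from "Allow 1-3 iterations"


def refine(hypothesis: str,
           ask_user: Callable[[str], Tuple[str, str]]) -> str:
    """Iterate until the user proceeds as-is or the iteration cap is hit.

    ask_user is a hypothetical callback that shows the current hypothesis
    and returns (choice, revised_hypothesis). choice is one of "narrow",
    "change_organism", "adjust", "pivot", or "proceed".
    """
    for _ in range(MAX_ITERATIONS):
        choice, revised = ask_user(hypothesis)
        if choice == "proceed":
            break  # the user is satisfied with the current hypothesis
        # Any other choice replaces the hypothesis; a targeted
        # literature search would be re-run at this point.
        hypothesis = revised
    return hypothesis
```

A scripted callback such as `lambda h: ("proceed", h)` returns the hypothesis unchanged after a single pass.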
Step 4: Data Feasibility Check
Verify the hypothesis can actually be tested with BERDL data:
- •Table verification: Use the /berdl REST API (read-only) to confirm:
  - •The required tables exist and have the expected columns
  - •Use the schema endpoint to check column names and types
- •Coverage check: Query row counts for the relevant tables:
  - •How many species/genomes are available?
  - •What fraction have the needed annotations? (e.g., "28% of genomes have environmental embeddings")
- •Pitfall scan: Read docs/pitfalls.md and docs/performance.md for known issues with the target tables
- •Performance tier: Estimate whether the analysis can be done via REST API or requires JupyterHub:
| Expected Scale | Tier | Recommendation |
|---|---|---|
| < 100K rows | REST API | Direct queries, .toPandas() OK |
| 100K – 10M rows | Mixed | Filter/aggregate in SQL, small results via REST |
| > 10M rows | JupyterHub only | PySpark DataFrames, no .toPandas() |
Present the feasibility summary to the user. If the data doesn't support the hypothesis, suggest alternatives.
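The tier table and the coverage fraction can be expressed as two small helpers. This is a minimal sketch: the function names are hypothetical, the counts in the usage lines are made up to mirror the 28% example, and treating exactly 100K and exactly 10M rows as "Mixed" is an assumption, since the table does not say which side the boundaries fall on.

```python
def performance_tier(expected_rows: int) -> str:
    """Map an estimated row count to a tier from the table above."""
    if expected_rows < 100_000:
        return "REST API"
    if expected_rows <= 10_000_000:  # assumption: both boundaries are "Mixed"
        return "Mixed"
    return "JupyterHub only"


def coverage_fraction(annotated: int, total: int) -> float:
    """Fraction of genomes (or species) that carry the needed annotation."""
    if total <= 0:
        raise ValueError("total must be positive")
    return annotated / total


# Illustrative numbers only, not real BERDL counts.
print(performance_tier(50_000))             # prints REST API
print(f"{coverage_fraction(28, 100):.0%}")  # prints 28%
```

Both values belong in the feasibility summary presented to the user.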
Step 5: Produce Research Plan Document
Generate projects/{project_id}/research_plan.md:
# Research Plan: {Title}
## Research Question
{Refined question after literature review}
## Hypothesis
- **H0**: {Null hypothesis}
- **H1**: {Alternative hypothesis}
## Literature Context
{Summary of what's known, key references, identified gaps}
{Full references stored in projects/{project_id}/references.md}
## Query Strategy
### Tables Required
| Table | Purpose | Estimated Rows | Filter Strategy |
|---|---|---|---|
| {table} | {why needed} | {count} | {how to filter} |
### Key Queries
1. **{Description}**:
   ```sql
   {query}
   ```
2. ...
## Performance Plan
- **Tier**: {REST API / JupyterHub}
- **Estimated complexity**: {simple / moderate / complex}
- **Known pitfalls**: {list from pitfalls.md}
## Analysis Plan
### Notebook 1: Data Exploration
- **Goal**: {what to verify/explore}
- **Expected output**: {CSV/figures}
### Notebook 2: Main Analysis
- **Goal**: {core analysis}
- **Expected output**: {CSV/figures}
### Notebook 3: Visualization (if needed)
- **Goal**: {figures for findings}
## Expected Outcomes
- **If H1 supported**: {interpretation}
- **If H0 not rejected**: {interpretation}
- **Potential confounders**: {list}
## Authors
{from user or carried forward from /hypothesis}
Step 6: Create Project Directory Structure
Create the project directory with initial files:
projects/{project_id}/
├── research_plan.md   # The plan document from Step 5
├── references.md      # Created by /literature-review in Step 2
├── README.md          # Skeleton with question/hypothesis filled in
├── notebooks/         # Empty, populated by /notebook
├── data/              # Empty, populated during analysis
└── figures/           # Empty, populated during analysis
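The layout above could be created with a short script. This is a sketch under assumptions: the `scaffold_project` function and its `root` argument are hypothetical, and files are only touched, so a `research_plan.md` or `references.md` written in earlier steps keeps its content.

```python
from pathlib import Path


def scaffold_project(root: Path, project_id: str) -> Path:
    """Create projects/{project_id}/ with the directories and files listed above."""
    project = root / "projects" / project_id
    # Empty directories, populated later by /notebook and the analysis.
    for sub in ("notebooks", "data", "figures"):
        (project / sub).mkdir(parents=True, exist_ok=True)
    # touch() creates a file only if it is absent; existing content is kept.
    for name in ("research_plan.md", "references.md", "README.md"):
        (project / name).touch(exist_ok=True)
    return project
```

Calling `scaffold_project(Path("."), "my_project")` from the repository root is safe to repeat, because neither `mkdir` nor `touch` overwrites what is already there.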
Generate a skeleton `README.md` following the structure of existing projects (see `projects/pangenome_openness/README.md` for reference):
```markdown
# {Title}
## Research Question
{Refined question}
## Hypothesis
{H0 and H1}
## Approach
{High-level approach from the research plan}
## Data Sources
- **Database**: {database name} on BERDL Delta Lakehouse
- **Tables**: {list of tables with brief descriptions}
## Key Findings
*TBD — run notebooks and use `/synthesize` to complete this section.*
## Notebooks
| Notebook | Purpose |
|----------|---------|
| *To be generated by `/notebook`* | |
## Visualizations
| Figure | Description |
|--------|-------------|
| *TBD* | |
## Data Files
| File | Description |
|------|-------------|
| *TBD* | |
## Related Projects
{Any prior projects this builds on}
## Authors
{Authors}
## Future Directions
*TBD — use `/synthesize` to complete this section.*
```
Step 7: Suggest Next Steps
After creating the plan, tell the user:
"Research plan created at projects/{project_id}/research_plan.md. Next steps:
- •Use /notebook to generate analysis notebooks from this plan
- •Upload notebooks to BERDL JupyterHub and run them
- •Use /synthesize to interpret results and draft findings"
Integration
- •Reads from: /hypothesis output, docs/research_ideas.md, docs/pitfalls.md, docs/performance.md
- •Calls: /literature-review (for literature search), /berdl (read-only schema/count checks)
- •Produces: research_plan.md, skeleton README.md, project directory structure
- •Consumed by: /notebook (reads research_plan.md to generate notebooks)
Pitfall Detection
When you encounter errors, unexpected results, retry cycles, performance issues, or data surprises during this task, follow the pitfall-capture protocol. Read .claude/skills/pitfall-capture/SKILL.md and follow its instructions to determine whether the issue should be added to docs/pitfalls.md.