Bioinformatician Skill
Purpose
Implement computational analyses of biological data, including:
- •Data loading and quality control
- •Statistical analysis
- •Bioinformatics pipelines
- •Visualization
- •Integration with domain-specific tools
When to Use This Skill
Use this skill when you need to:
- •Implement an analysis plan in code (from PI)
- •Process genomics/transcriptomics/proteomics data
- •Perform statistical tests on biological data
- •Create publication-quality visualizations
- •Build reproducible analysis pipelines
- •Integrate multiple bioinformatics tools
Workflow Integration
Primary Pattern: Receive Plan → Implement → Deliver Notebook
Receive analysis_plan.md from PI
↓
Implement in Jupyter notebook
↓ (copilot reviews continuously)
Deliver completed notebook to PI for interpretation
Integration Points:
- •RECEIVES: Analysis plan from principal-investigator
- •WORKS WITH: copilot (adversarial code review during implementation)
- •CALLS: Domain-specific skills (scanpy, pydeseq2, biopython, etc.)
- •OUTPUTS: Jupyter notebooks with analysis code + results
Core Capabilities
1. Data Loading and Validation
- •Read common formats (CSV, TSV, HDF5, Parquet, FASTQ, BAM, VCF)
- •Validate data integrity and format
- •Handle compressed files
- •Memory-efficient loading for large datasets
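As an illustration of the loading and validation points above, here is a minimal sketch that streams a delimited table row by row, handles gzip compression, and fails early on malformed input. The function names and the example column names are hypothetical, and real projects would more often rely on pandas or format-specific readers; this version uses only the standard library.

```python
import csv
import gzip
from pathlib import Path
from typing import Iterator


def open_text(path: Path):
    """Open a plain or gzip-compressed text file based on its suffix."""
    if path.suffix == ".gz":
        return gzip.open(path, "rt", newline="")
    return open(path, "r", newline="")


def iter_validated_rows(path: Path, required_columns: set[str]) -> Iterator[dict]:
    """Stream rows one at a time, failing early on missing columns or ragged rows."""
    # Assumption: files named *.tsv or *.tsv.gz are tab-delimited, everything else is CSV
    delimiter = "\t" if ".tsv" in path.suffixes else ","
    with open_text(path) as handle:
        reader = csv.DictReader(handle, delimiter=delimiter)
        missing = required_columns - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"{path.name}: missing columns {sorted(missing)}")
        for line_number, row in enumerate(reader, start=2):
            # DictReader files surplus fields under the None key and pads short rows with None
            if None in row or None in row.values():
                raise ValueError(f"{path.name}: malformed row near line {line_number}")
            yield row
```

Because rows are yielded lazily, memory use stays flat regardless of file size; the caller decides whether to aggregate, filter, or write each row onward.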
2. Quality Control
- •Sample quality metrics
- •Outlier detection
- •Batch effect assessment
- •Positive/negative control validation
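One common way to approach the outlier detection item is a robust rule based on the median absolute deviation (MAD). The sketch below is an assumption about how this could be done, not a prescribed method; the function name, the sample names, and the cutoff of 3 MADs are illustrative choices.

```python
from statistics import median


def mad_outliers(values: dict[str, float], n_mads: float = 3.0) -> list[str]:
    """Return sample names whose value lies more than n_mads MADs from the median."""
    center = median(values.values())
    mad = median(abs(v - center) for v in values.values())
    if mad == 0:
        # At least half the samples share one value, so there is no robust scale to compare against
        return []
    return [name for name, v in values.items() if abs(v - center) > n_mads * mad]


# Hypothetical library sizes (total reads per sample)
library_sizes = {
    "s1": 1_000_000,
    "s2": 1_100_000,
    "s3": 950_000,
    "s4": 1_050_000,
    "s5": 100_000,
}
flagged = mad_outliers(library_sizes)  # flags only "s5"
```

Flagged samples should be inspected rather than dropped automatically, since a low-depth sample can reflect biology as well as a failed library.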
3. Statistical Analysis
- •Differential expression/abundance
- •Enrichment analysis
- •Clustering and dimensionality reduction
- •Correlation and regression
- •Multiple testing correction
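To make the multiple testing correction item concrete, the following sketch implements the Benjamini-Hochberg adjustment in plain Python. It is meant to show the mechanics only; in an actual analysis an established implementation (for example `multipletests` with `method="fdr_bh"` in statsmodels) is the safer choice. The function name is hypothetical.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices from smallest to largest p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values never increase with rank
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
# Approximately [0.02, 0.04, 0.04, 0.02]
```

The backward pass is what enforces monotonicity: a gene can never receive a larger adjusted p-value than a gene with a larger raw p-value.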
4. Visualization
- •Publication-quality plots (matplotlib, seaborn, plotly)
- •Interactive visualizations
- •Consistent styling
- •Proper labeling and legends
5. Pipeline Development
- •Modular, reusable code
- •Parameter documentation
- •Progress logging
- •Error handling
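The pipeline items above (modular code, documented parameters, progress logging, error handling) can be combined in a small wrapper. This is a sketch under assumptions: `run_step` and `filter_low_counts` are hypothetical helpers, and the count threshold is illustrative.

```python
import logging
import time
from typing import Callable, TypeVar

T = TypeVar("T")

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("analysis")


def run_step(name: str, func: Callable[..., T], *args, **kwargs) -> T:
    """Run one pipeline step, logging its keyword parameters, duration, and any failure."""
    logger.info("START %s kwargs=%s", name, kwargs)
    started = time.perf_counter()
    try:
        result = func(*args, **kwargs)
    except Exception:
        # Log the traceback, then re-raise so the notebook stops at the failing step
        logger.exception("FAILED %s", name)
        raise
    logger.info("DONE %s in %.2fs", name, time.perf_counter() - started)
    return result


def filter_low_counts(counts: dict[str, int], min_count: int = 10) -> dict[str, int]:
    """Keep genes whose total count meets the threshold."""
    return {gene: n for gene, n in counts.items() if n >= min_count}


kept = run_step("filter_low_counts", filter_low_counts, {"GENE1": 50, "GENE2": 3}, min_count=10)
```

Logging keyword arguments at the start of each step doubles as parameter documentation in the notebook output.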
Standard Notebook Structure
Use the template in assets/notebook_structure_template.ipynb:
1. Title and Description
- Research question
- Date, author
- Reference to analysis plan
2. Setup
- Imports
- Configuration parameters
- Random seeds for reproducibility
3. Data Loading
- Read data files
- Initial inspection
- Data structure validation
4. Quality Control
- Sample metrics
- Filtering criteria
- QC visualizations
5. Analysis
- Statistical tests
- Transformations
- Model fitting
6. Visualization
- Main figures
- Supplementary plots
7. Export Results
- Save processed data
- Export figures
- Summary statistics
8. Session Info
- Package versions
- Execution time
Biological Literacy Framework
Writing Style for Biological Context
All biological context in notebooks should follow concise scientific prose:
Principles:
- •✅ Brief: 1-3 sentences per section, not paragraphs
- •✅ Clear: Use precise biological terminology
- •✅ Factual: State what/why without excessive detail
- •✅ Publication-ready: Like Methods/Results sections in papers
Example - Good (Concise):
## Biological Context
Differential expression analysis comparing wild-type and mutant neurons identifies genes affected by loss of transcription factor X. Expected upregulation of target genes based on ChIP-seq data (Smith et al. 2020).
Example - Avoid (Too Verbose):
## Biological Context
In this analysis, we will perform differential expression analysis to compare gene expression between wild-type neurons and neurons with a mutation in transcription factor X. Previous research has shown that transcription factor X plays a critical role in neuronal development by binding to the promoters of many developmentally important genes...
When to Provide Interpretation vs Handoff
Bioinformatician Handles (routine interpretation):
- •Standard results following known biology
- •Positive/negative controls behaving as expected
- •Results matching literature precedents
- •Technical QC assessments with biological implications
- •Magnitude/direction sanity checks
Handoff to Biologist-Commentator (expert needed):
- •Novel or unexpected findings
- •Results contradicting established biology
- •Unclear biological mechanisms
- •Publication-critical interpretations
- •Proposing new hypotheses or models
Enhanced Notebook Structure
Use this structure for biologically-literate notebooks:
1. Title and Scientific Context
- Research question (biological, not just technical)
- Biological hypothesis
- Expected outcome and why it matters
- Relevant background (1-2 sentences)
2. Setup (code)
- Imports, parameters, seeds
3. Data Loading
- Code: Load data
- Biological description of dataset (markdown):
* What organism/tissue/condition
* What genes/features measured
* What biological question dataset addresses
4. Quality Control
- Code: QC metrics, filtering
- Biological interpretation of QC (markdown):
* Are pass rates expected for this data type?
* Do failed samples have biological meaning?
* Red flags from biological perspective?
5. Analysis
- Code: Statistical tests, transformations
- Biological reasoning for each step (markdown):
* Why this method for this question?
* What biological assumption being tested?
* Positive/negative controls?
6. Results
- Code: Generate results
- Biological sanity checks (markdown):
* Do magnitudes make sense?
* Do directions align with biology?
* Any known biology violated?
7. Visualization
- Code: Plots
- Biological interpretation scaffolding (markdown):
* What biological pattern does this show?
* Is this expected or surprising?
* What follow-up questions does this raise?
8. Preliminary Interpretation
- Bioinformatician's biological assessment (markdown):
* Main findings in biological terms
* Caveats and limitations
* Questions for biologist-commentator
9. Handoff to Expert (if needed)
- Structured questions for biologist-commentator (markdown):
* Specific results needing interpretation
* Unexpected findings to validate
* Biological mechanisms to explore
10. Export (code)
- Save data, figures, session info
Biological Sanity Check Framework
Run these checks before accepting results:
Expression/Abundance Checks
- • Order of magnitude reasonable? (|log2FC| > 10 is suspicious)
- • Direction matches known biology? (check a few known genes)
- • Positive controls behave as expected?
- • Negative controls show no signal?
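The expression checks above lend themselves to a small automated pass before manual review. The sketch below assumes results are available as a mapping from gene name to log2 fold change; the function name, the gene names, and the default cutoff of 10 are hypothetical.

```python
def check_expression_results(
    log2fc: dict[str, float],
    expected_direction: dict[str, str],
    max_abs_log2fc: float = 10.0,
) -> list[str]:
    """Return human-readable warnings; an empty list means every check passed."""
    warnings = []
    # Magnitude check: very large fold changes often indicate near-zero counts in one group
    for gene, value in log2fc.items():
        if abs(value) > max_abs_log2fc:
            warnings.append(f"{gene}: |log2FC| = {abs(value):.1f} exceeds {max_abs_log2fc}")
    # Direction check against genes with known behaviour ("up" or "down")
    for gene, direction in expected_direction.items():
        if gene not in log2fc:
            warnings.append(f"{gene}: control gene missing from results")
            continue
        value = log2fc[gene]
        observed = "up" if value > 0 else "down" if value < 0 else "unchanged"
        if observed != direction:
            warnings.append(f"{gene}: expected {direction}, observed {observed}")
    return warnings


issues = check_expression_results(
    {"GENE_A": 2.1, "GENE_B": -1.4, "GENE_C": 14.2},
    {"GENE_A": "up", "GENE_B": "up", "GENE_D": "down"},
)
```

Each warning can be pasted into the notebook's sanity check cell so that failures are documented alongside the results.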
Statistical Checks with Biological Lens
- • Top hits include known biology? (literature validation)
- • Results robust to threshold changes?
- • Batch effects vs real biology separated?
- • Multiple testing appropriate for biology? (discovery vs validation)
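One way to probe robustness to threshold changes is to tabulate how many genes pass across a grid of cutoffs. This is a sketch with hypothetical names and toy values, assuming results are stored as gene to (adjusted p-value, log2 fold change).

```python
def count_hits(
    results: dict[str, tuple[float, float]],
    padj_cutoffs: tuple[float, ...] = (0.01, 0.05, 0.1),
    lfc_cutoffs: tuple[float, ...] = (0.5, 1.0, 2.0),
) -> dict[tuple[float, float], int]:
    """Count genes passing each (padj cutoff, |log2FC| cutoff) pair."""
    table = {}
    for padj_cut in padj_cutoffs:
        for lfc_cut in lfc_cutoffs:
            table[(padj_cut, lfc_cut)] = sum(
                1
                for padj, lfc in results.values()
                if padj < padj_cut and abs(lfc) >= lfc_cut
            )
    return table


# Toy values for illustration only
toy_results = {
    "A": (0.001, 3.0),
    "B": (0.03, 1.2),
    "C": (0.08, 0.6),
    "D": (0.5, 2.5),
}
hit_table = count_hits(toy_results)
```

If the hit count collapses or explodes between neighbouring cutoffs, conclusions that depend on the exact gene list deserve extra caution.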
Genomics-Specific
- • Chromosome names consistent? (chr1 vs 1)
- • Coordinates sensible? (within chromosome bounds)
- • Strand orientation correct for gene features?
- • Genome build consistent throughout?
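The chromosome naming and coordinate checks can be sketched as two small helpers. The function names are hypothetical, and the conversion only covers primary chromosomes plus the mitochondrial genome (UCSC `chrM` vs Ensembl `MT`); unplaced contigs and alternate scaffolds follow different naming rules and would need a proper mapping table.

```python
def normalize_chrom(name: str, style: str = "ucsc") -> str:
    """Convert between 'chr1' (UCSC-style) and '1' (Ensembl-style) chromosome names."""
    bare = name[3:] if name.startswith("chr") else name
    if style == "ucsc":
        return "chrM" if bare in {"M", "MT"} else f"chr{bare}"
    if style == "ensembl":
        return "MT" if bare == "M" else bare
    raise ValueError(f"Unknown style: {style!r}")


def interval_in_bounds(chrom: str, start: int, end: int, chrom_sizes: dict[str, int]) -> bool:
    """True if 0 <= start < end <= chromosome length (0-based, half-open, as in BED)."""
    return chrom in chrom_sizes and 0 <= start < end <= chrom_sizes[chrom]


# Chromosome lengths would normally come from the genome's .fai or chrom.sizes file
grch38_sizes = {"chr1": 248_956_422, "chrM": 16_569}

ok = interval_in_bounds(normalize_chrom("1"), 1_000, 2_000, grch38_sizes)
past_end = interval_in_bounds("chr1", 248_956_400, 248_956_500, grch38_sizes)
```

Running every interval file through the same normalizer before joins or overlaps avoids silent empty intersections caused by mixed naming styles.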
Experimental Design
- • Sample size adequate for this effect size?
- • Replicates biological or technical?
- • Confounders identified and addressed?
- • Controls appropriate for this experiment type?
If any check fails: Document in notebook, flag for biologist-commentator review
Biological Context Templates
Template: Differential Expression Analysis
## Biological Context
Comparing [condition A] vs [condition B] to identify genes involved in [biological process]. Expected upregulation of [pathway X] genes based on [mechanism/literature]. Positive controls: [gene1, gene2]. Expected log2FC range: [X-Y] based on [citation].
## Biological Sanity Checks
- [ ] Known pathway genes show expected direction (e.g., gene1 ↑, gene2 ↓)
- [ ] Housekeepers unchanged (actb, gapdh)
- [ ] Magnitudes reasonable (|log2FC| < 10 for transcriptional regulation)
## Preliminary Interpretation
Top hits include [gene X, Y, Z] involved in [biological process], consistent with [hypothesis/literature]. [Gene W] unexpected - requires expert validation.
**Handoff**: Unexpected downregulation of [gene W] contradicts known role in [process]. Biologist-commentator needed for mechanism assessment.
Template: Single-Cell Clustering
## Biological Context
Clustering [tissue] cells to identify cell types. Expected populations: [celltype1 (markers: a,b,c), celltype2 (markers: d,e,f)]. Reference atlas: [citation if available].
## Cluster Validation
- Cluster 1: [celltype] - markers: [genes] ✓
- Cluster 2: [celltype] - markers: [genes] ✓
- Cluster 3: Novel population - markers: [genes] - needs expert review
**Handoff**: Cluster 3 shows unexpected marker combination [X+Y+Z-]. Biologist-commentator needed for cell type identification and biological significance.
Template: Expert Handoff Format
Use this concise format when escalating to biologist-commentator:
## Expert Interpretation Needed
**Finding**: [Specific result with statistics]
**Context**: [1-2 sentence background]
**Issue**: [What's unexpected/unclear and why]
**Question**: [Specific question for expert]
**Validation Done**: [Positive controls: ✓/✗, Literature: consistent/contradicts]
Example:
## Expert Interpretation Needed
**Finding**: Gene X shows 8-fold upregulation (padj<0.001) in mutant vs WT
**Context**: Gene X is a transcriptional repressor; its targets are expected to be downregulated
**Issue**: Target genes are also upregulated (contradicts repressor function)
**Question**: Alternative mechanism? Post-transcriptional regulation? Data artifact?
**Validation Done**: Positive controls ✓, replicates consistent ✓, literature shows conflicting results
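When handoff sections are generated programmatically, the format above can be filled from structured fields. The following is a minimal sketch; the `Handoff` class and its method are hypothetical and not part of any skill described here.

```python
from dataclasses import dataclass


@dataclass
class Handoff:
    """Fields of the expert handoff format."""

    finding: str
    context: str
    issue: str
    question: str
    validation_done: str

    def to_markdown(self) -> str:
        """Render the handoff as a markdown cell body."""
        return "\n".join(
            [
                "## Expert Interpretation Needed",
                f"**Finding**: {self.finding}",
                f"**Context**: {self.context}",
                f"**Issue**: {self.issue}",
                f"**Question**: {self.question}",
                f"**Validation Done**: {self.validation_done}",
            ]
        )


cell_text = Handoff(
    finding="Gene X shows 8-fold upregulation (padj<0.001) in mutant vs WT",
    context="Gene X is a transcriptional repressor",
    issue="Target genes are also upregulated",
    question="Alternative mechanism? Data artifact?",
    validation_done="Positive controls ✓, replicates consistent ✓",
).to_markdown()
```

Making every field required keeps incomplete handoffs from reaching the biologist-commentator.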
Biologist-Commentator Integration Pattern
When to Invoke Biologist-Commentator
Pre-Analysis (Method Validation):
Skill(skill="biologist-commentator", args="Validate that DESeq2 is appropriate for [specific experiment design]. Confirm that controls are adequate and confounders are addressed.")
During Analysis (Quick Check):
- •Use biological sanity check framework (above)
- •Document any red flags
- •Continue if checks pass, escalate if fail
Post-Analysis (Expert Interpretation):
Skill(skill="biologist-commentator", args="Interpret biological significance of [specific finding]. Results show [X], which is [expected/unexpected]. Known biology suggests [Y]. Please validate interpretation and suggest mechanisms.")
Handoff Workflow
- •Bioinformatician: Run analysis, perform sanity checks, document findings
- •Handoff: Create structured handoff section in notebook (see template above)
- •Biologist-Commentator: Provides expert interpretation, mechanism insights, validation
- •Bioinformatician: Incorporate interpretation into notebook, flag needed validations
Pre-Flight Checklist
Before starting implementation, verify:
- • Analysis plan clearly defines objectives
- • Data files exist and paths are correct
- • Required packages installed
- • Expected output format understood
- • Random seeds set for reproducibility
Use assets/analysis_checklist.md for complete list.
Reproducibility Standards
Critical: Every bioinformatics analysis must be fully reproducible. Another researcher should be able to recreate your computational environment and obtain identical results.
Environment Documentation (Mandatory)
Start every notebook with environment documentation:
# %%
# Computational Environment
import sys
import numpy as np
import pandas as pd
import scanpy as sc # or relevant packages
print("=" * 60)
print("COMPUTATIONAL ENVIRONMENT")
print("=" * 60)
print(f"Python: {sys.version}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"Scanpy: {sc.__version__}") # Replace with your key packages
print("=" * 60)
print("\nFor full environment, see requirements.txt")
Create environment files before starting analysis:
# For conda users (recommended for bioinformatics):
conda env export > environment.yml
# For pip users:
pip freeze > requirements.txt
# Document which file to use in notebook
In notebook markdown cell:
## Computational Environment
- **Kernel**: Python 3.11 (bio-analysis-env)
- **Environment file**: `environment.yml` (recreate with `conda env create -f environment.yml`)
- **Key packages**: scanpy==1.10.0, numpy==1.26.3, pandas==2.2.0, scipy==1.12.0
- **Execution date**: 2026-01-29
Random Seed Setting (Mandatory for Stochastic Processes)
Set seeds in setup cell:
# %%
# Random seeds for reproducibility
import random
import numpy as np
RANDOM_SEED = 42  # Document choice (convention, replicating published analysis, etc.)
# Core Python/NumPy
random.seed(RANDOM_SEED)
np.random.seed(RANDOM_SEED)
# Scanpy (single-cell analysis): stochastic functions take a random_state argument, e.g.
# sc.pp.pca(adata, random_state=RANDOM_SEED)
# sc.tl.umap(adata, random_state=RANDOM_SEED)
# sc.tl.leiden(adata, random_state=RANDOM_SEED)
# PyTorch (if using deep learning)
try:
    import torch
    torch.manual_seed(RANDOM_SEED)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(RANDOM_SEED)
except ImportError:
    pass  # PyTorch not installed; nothing to seed
# TensorFlow (if using)
try:
    import tensorflow as tf
    tf.random.set_seed(RANDOM_SEED)
except ImportError:
    pass  # TensorFlow not installed; nothing to seed
print(f"Random seed set to {RANDOM_SEED} for reproducibility")
Bioinformatics operations requiring seeds:
- •Dimensionality reduction: UMAP, t-SNE, PCA with randomized SVD
- •Clustering: Leiden, Louvain (graph-based)
- •Sampling: Random subsampling, bootstrap, cross-validation
- •Imputation: Stochastic imputation methods
- •Simulation: Monte Carlo, permutation tests
- •Machine learning: Random forests, neural networks, k-means initialization
Document in notebook:
## Stochastic Operations
This analysis uses:
- UMAP (random initialization, seed=42)
- Leiden clustering (stochastic optimization, seed=42)
- 1000-iteration permutation test (seed=42)
All seeds set to 42 for reproducibility.
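To illustrate why seeding matters for the sampling-based operations listed above, here is a sketch of a permutation test that uses its own seeded generator. The function name and input values are hypothetical; the same pattern applies with NumPy via `np.random.default_rng(seed)`.

```python
import random
from statistics import mean


def permutation_test(
    group_a: list[float],
    group_b: list[float],
    n_permutations: int = 1000,
    seed: int = 42,
) -> float:
    """Two-sided permutation test on the difference in means, using a local seeded RNG."""
    rng = random.Random(seed)  # local generator, so global random state is left untouched
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        # Small tolerance so floating-point summation order does not hide exact ties
        if diff >= observed - 1e-12:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return (extreme + 1) / (n_permutations + 1)


control = [5.1, 4.9, 5.3, 5.0]
treated = [6.2, 6.0, 6.4, 6.1]
p_first = permutation_test(control, treated, seed=42)
p_second = permutation_test(control, treated, seed=42)  # identical to p_first
```

Passing the seed as an explicit argument, rather than relying on global state, makes each stochastic step reproducible even when cells are run out of order.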
Session Info Output (Mandatory)
End every notebook with comprehensive session info:
# %%
# Session Information for Reproducibility
import session_info
session_info.show(
    dependencies=True,
    html=False,
)
# Alternative for single-cell workflows:
# import scanpy as sc
# sc.logging.print_versions()
# Alternative for base Python (standard library only; pkg_resources is deprecated):
# import sys
# from importlib.metadata import version
# print(f"Python: {sys.version}")
# for pkg in ['numpy', 'pandas', 'scipy', 'matplotlib', 'seaborn']:
#     print(f"{pkg}: {version(pkg)}")
What this captures:
- •Python version
- •Operating system
- •All package versions (including dependencies)
- •Execution timestamp
Why this matters:
- •API changes between package versions
- •Statistical method implementations evolve
- •Bugs get fixed (results may change)
- •Reviewers need to verify methods
File Path Best Practices
Use relative paths and variables:
# %%
from pathlib import Path
# Define all paths at top of notebook
DATA_DIR = Path("data/raw")
PROCESSED_DIR = Path("data/processed")
RESULTS_DIR = Path("results/analysis_2026-01-29")
FIGURES_DIR = RESULTS_DIR / "figures"
# Create output directories
for directory in [PROCESSED_DIR, RESULTS_DIR, FIGURES_DIR]:
    directory.mkdir(parents=True, exist_ok=True)
# Use variables throughout
counts_file = DATA_DIR / "counts_matrix.h5ad"
metadata_file = DATA_DIR / "sample_metadata.csv"
output_file = PROCESSED_DIR / "normalized_counts.h5ad"
figure_file = FIGURES_DIR / "umap_clusters.pdf"
print(f"Data directory: {DATA_DIR.resolve()}")
print(f"Results directory: {RESULTS_DIR.resolve()}")
Never use hardcoded absolute paths:
# ❌ BAD (non-reproducible):
adata = sc.read_h5ad("/Users/yourname/project/data/counts.h5ad")
plt.savefig("/Users/yourname/Desktop/figure.pdf")
# ✅ GOOD (reproducible):
adata = sc.read_h5ad(DATA_DIR / "counts.h5ad")
plt.savefig(FIGURES_DIR / "umap_clusters.pdf")
Data Provenance Documentation
Document data sources in notebook:
## Data Sources
### Input Data
- **File**: `data/raw/GSE123456_counts.h5ad`
- **Source**: GEO accession GSE123456
- **Download date**: 2026-01-15
- **Download command**: `wget "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE123456"`
- **Original publication**: Smith et al. (2025) Nature 600:123-130
- **Organism**: Homo sapiens
- **Tissue**: Primary cortical neurons
- **n samples**: 50 (25 control, 25 treatment)
- **n features**: 20,000 genes
### Reference Data
- **Genome build**: GRCh38 (hg38)
- **Gene annotations**: GENCODE v42
- **Downloaded**: 2026-01-10 from https://www.gencodegenes.org/
Why this matters:
- •Data can be updated or removed from repositories
- •Genome builds affect coordinate-based analyses
- •Sample metadata clarifies experimental design
- •Enables others to download identical data
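One way to make "identical data" verifiable is to record a checksum next to each input file in the provenance cell. This goes slightly beyond what the list above requires and is offered as a sketch; the helper name is hypothetical.

```python
import hashlib
from pathlib import Path


def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes at end of file
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

A typical use is `print(f"{counts_file.name}: {sha256sum(counts_file)}")` in the data loading cell, so that a later re-download can be compared against the recorded digest.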
Reproducibility Pre-Flight Checklist
Before starting analysis, verify:
- • Environment documented (environment.yml or requirements.txt exists)
- • Environment creation documented in notebook
- • Random seeds will be set for all stochastic operations
- • File paths use variables (no hardcoded absolute paths)
- • Data sources documented (where to download, version, date)
- • Genome build / reference database versions specified
- • Session info cell will be added at end
Before handoff to PI, verify:
- • Notebook runs end-to-end without errors (Restart Kernel & Run All)
- • Results reproducible (run twice, identical outputs)
- • All figures saved to FIGURES_DIR with descriptive names
- • All processed data saved to PROCESSED_DIR
- • Session info cell executed and output visible
- • Execution time reasonable (< 2 hours for routine analyses)
Integration with notebook-writer Skill
When creating notebooks programmatically, use notebook-writer skill with reproducibility standards:
from pathlib import Path
# Use notebook-writer to create template
cells = [
    {'type': 'markdown', 'content': '## Computational Environment\n...'},
    {'type': 'code', 'content': 'import sys\nprint(f"Python: {sys.version}")'},
    {'type': 'markdown', 'content': '## Data Loading\n...'},
    # ... analysis cells ...
    {'type': 'markdown', 'content': '## Session Info'},
    {'type': 'code', 'content': 'import session_info\nsession_info.show()'},
]
# Create reproducible notebook
notebook_path = create_notebook_markdown(
    title="Reproducible RNA-seq Analysis",
    cells=cells,
    output_path=Path("analysis/rnaseq_analysis.md"),
)
Common Reproducibility Failures and Fixes
| Issue | Problem | Fix |
|---|---|---|
| Different results on rerun | No random seed set | Set seeds for numpy, random, scanpy, torch |
| Import errors | Missing package versions | Create requirements.txt or environment.yml |
| File not found | Hardcoded paths | Use Path variables defined at top |
| Old package behavior | Package version mismatch | Document versions with session_info.show() |
| Data source vanished | URL changed or removed | Document download date, accession, mirror sites |
| Genome coordinate mismatch | Different genome build | Specify build (GRCh38 vs GRCh37) in notebook |
Bioinformatics-Specific Reproducibility Considerations
Organism and Reference Versions:
# Document in code cell
ORGANISM = "Homo sapiens"
GENOME_BUILD = "GRCh38" # or "mm39" for mouse, "dm6" for fly, etc.
ANNOTATION_VERSION = "GENCODE v42" # or "Ensembl 110"
ANNOTATION_DATE = "2026-01-10"
print("Analysis configuration:")
print(f" Organism: {ORGANISM}")
print(f" Genome: {GENOME_BUILD}")
print(f" Annotations: {ANNOTATION_VERSION} ({ANNOTATION_DATE})")
Bioinformatics Tools (if used):
## External Tools
- **STAR aligner**: v2.7.11a (for read mapping)
- **MACS2**: v2.2.9.1 (for peak calling)
- **bedtools**: v2.31.0 (for interval operations)
All tools available in conda environment (see environment.yml).
Data Processing Parameters:
# Document all filtering/QC thresholds
QC_PARAMS = {
    'min_genes_per_cell': 200,
    'min_cells_per_gene': 3,
    'max_pct_mt': 15,  # percent mitochondrial reads
    'min_counts': 1000,
    'highly_variable_genes': 2000,
    'n_pcs': 50,  # principal components
    'umap_neighbors': 15,
    'leiden_resolution': 0.8,
}
print("Quality control parameters:")
for param, value in QC_PARAMS.items():
    print(f"  {param}: {value}")
Code Quality Standards
During Implementation
- •Copilot reviews continuously - expect adversarial feedback
- •Write clear comments explaining biological context
- •Use descriptive variable names
- •Modularize repeated operations into functions
- •Log progress for long-running analyses
Testing
- •Validate on small test data first
- •Check edge cases (empty data, single sample, all zeros)
- •Compare to expected results (positive controls)
- •Verify reproducibility (run twice, same results)
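The edge cases named above can be written down as executable assertions. The sketch below uses a hypothetical counts-per-million helper that operates on one sample's gene counts; the point is the testing pattern, not the normalization itself.

```python
def counts_per_million(counts: list[int]) -> list[float]:
    """Scale one sample's gene counts to counts per million (CPM).

    Empty input returns an empty list; an all-zero sample returns zeros
    instead of dividing by zero.
    """
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c * 1_000_000 / total for c in counts]


# Edge cases: empty data, all zeros, a single-element input
assert counts_per_million([]) == []
assert counts_per_million([0, 0, 0]) == [0.0, 0.0, 0.0]
assert counts_per_million([7]) == [1_000_000.0]

# Expected result on a small input with a known answer
assert counts_per_million([1, 3]) == [250_000.0, 750_000.0]
```

Keeping such assertions in a test cell near the function definition means they rerun on every Restart Kernel & Run All.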
Common Analysis Patterns
Pattern 1: Differential Expression (RNA-seq)
# 1. Load counts
# 2. Filter low-abundance genes
# 3. Normalize (DESeq2, TMM, or library size)
# 4. Statistical test (DESeq2, edgeR, limma)
# 5. Multiple testing correction
# 6. Volcano plot + heatmap
→ Use pydeseq2 skill for implementation details
Pattern 2: Single-Cell Analysis
# 1. Load AnnData object
# 2. QC filtering (cells and genes)
# 3. Normalization and log-transform
# 4. Feature selection (highly variable genes)
# 5. Dimensionality reduction (PCA, UMAP)
# 6. Clustering
# 7. Marker gene identification
# 8. Visualization
→ Use scanpy skill for implementation details
Pattern 3: Sequence Analysis
# 1. Read FASTA/FASTQ
# 2. Quality filtering
# 3. Alignment or motif search
# 4. Feature extraction
# 5. Statistical summary
→ Use biopython skill for implementation details
References
For detailed guidance:
- •references/analysis_workflows.md - Step-by-step workflows for common analyses
- •references/data_structures.md - When to use pandas/anndata/Bioconductor
- •references/statistical_methods.md - Which test for which data
- •references/visualization_best_practices.md - Plot selection and styling
Helper Scripts
Available in scripts/:
- •qc_pipeline.py - Automated QC for RNA-seq data
- •differential_expression_template.py - Complete DESeq2 pipeline
- •data_loader_helpers.py - Functions for common file formats
Usage: Read these scripts as reference implementations, copy/adapt for your specific analysis, or call directly via Bash if appropriate.
Integration with Domain Skills
When analysis requires specialized knowledge:
| Data Type | Primary Skill | When to Use |
|---|---|---|
| Single-cell RNA-seq | scanpy | Cell type identification, clustering, trajectory |
| Bulk RNA-seq | pydeseq2 | Differential gene expression |
| Sequences | biopython | Alignment, motif search, format conversion |
| Statistical modeling | statsmodels | Regression, time series, GLMs |
| Pathway analysis | gseapy or manual | Gene set enrichment |
Pattern:
- •Use bioinformatician for overall workflow
- •Invoke specialized skill for domain-specific steps
- •Integrate results back into main analysis
Copilot Review Integration
During implementation, copilot skill reviews your code:
- •Expect critical feedback (adversarial but constructive)
- •Fix issues immediately before proceeding
- •Iterate until code is robust
- •Don't take criticism personally - it catches bugs early
Deliverables
Complete notebook should include:
Technical Components (existing):
1. Code cells: Well-commented, modular analysis
2. Visualizations: Publication-ready figures
3. Statistics: Complete reporting (test, p-value, effect size, n)
4. Exports: Processed data files, figure files
5. Session info: Package versions for reproducibility
Biological Components (new):
6. Biological Context Cells (markdown):
- •Research question in biological terms
- •Hypothesis and expected outcomes
- •Biological description of each analysis step
- •Relevance to biological question
7. Sanity Check Documentation (markdown):
- •Results of biological plausibility checks
- •Positive/negative control validation
- •Known biology comparison
- •Red flags or concerns
8. Preliminary Interpretation (markdown):
- •Main findings in biological language
- •Consistency with expectations
- •Novel or surprising results
- •Biological implications
9. Expert Handoff Section (markdown, if needed):
- •Structured questions for biologist-commentator
- •Specific findings needing interpretation
- •Recommended follow-up analyses
- •Caveats and limitations
Quality Indicator: The notebook should be readable by a biologist who doesn't code
Quality Indicators
Your notebook is ready when:
Technical Quality:
- • All code executes without errors
- • Random seed set, results reproducible
- • QC checks passed (positive controls work)
- • Visualizations properly labeled
- • Statistics completely reported
- • Copilot approved code (no outstanding critical issues)
Biological Quality:
- • Biological context provided for all major sections (concise, 1-3 sentences)
- • Biological sanity checks completed and documented
- • Positive/negative controls validated against biological expectations
- • Preliminary interpretation written in biological terms
- • Handoff to biologist-commentator structured (if unexpected findings)
- • Notebook readable by non-coding biologist
Integration Ready:
- • Ready for PI to expand interpretations for publication
- • Clear which findings are routine vs need expert review