systems-architect

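Use when designing software architecture for bioinformatics pipelines, defining data structures, planning scalability, or making technical design decisions for complex systems.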

SKILL.md
--- frontmatter
name: systems-architect
version: 1.1
last_updated: 2026-01-29
description: Use when designing software architecture for bioinformatics pipelines, defining data structures, planning scalability, or making technical design decisions for complex systems.
success_criteria:
  - Architecture addresses all functional requirements
  - Scalability considerations documented and planned
  - Technology choices justified with trade-offs explained
  - Data structures appropriate for use case
  - APIs and interfaces clearly defined
  - Technical specification complete for implementation
  - Non-functional requirements addressed (performance, maintainability)
extended_thinking_budget: 8192-12288
metadata:
    skill-author: David Angeles Albores
    category: bioinformatics-workflow
    workflow: software-development
    integrates-with: [bioinformatician, biologist-commentator, software-developer]
    use_extended_thinking_for:
      - Complex architectural decisions with multiple trade-offs
      - Scalability planning for large-scale data processing
      - Technology stack selection with competing options
      - Design pattern selection for novel problem domains
allowed-tools: [Read, Write, Bash]

Systems Architect Skill

Purpose

Design robust, scalable architectures for bioinformatics software and pipelines.

When to Use This Skill

Use this skill when you need to:

  • Design software architecture for complex bioinformatics systems
  • Choose appropriate data structures (pandas, anndata, HDF5, databases)
  • Plan for scalability (memory, compute, storage)
  • Define APIs and interfaces between components
  • Design pipeline orchestration (Snakemake, Nextflow, custom)
  • Make technology stack decisions

Workflow Integration

Pattern: Requirements → Architecture Design → Implementation Spec

code
Biologist Commentator validates requirements
    ↓
Systems Architect designs architecture
    ↓
Produces technical specification
    ↓
Software Developer implements from spec

Core Responsibilities

1. System Design

  • Component architecture (modular, extensible)
  • Data flow design
  • Error handling strategy
  • Scalability planning

2. Technology Selection

  • Data structures (when to use what)
  • Storage formats (CSV, HDF5, Parquet, databases)
  • Execution environments (local, HPC, cloud)
  • Pipeline orchestration tools

3. Performance Planning

  • Memory requirements estimation
  • Compute resource allocation
  • I/O optimization strategies
  • Parallelization approach

4. Integration Strategy

  • How to wrap existing tools
  • Container strategy (Docker/Singularity)
  • Dependency management
  • Version pinning

Standard Architecture Template

Use assets/architecture_template.md:

code
# System Architecture: [Project Name]

## Overview
[1-2 sentence system description]

## Components
1. [Component Name]: [Purpose]
2. [Component Name]: [Purpose]

## Data Flow
[Input] → [Processing] → [Output]

## Technology Stack
- Language: Python 3.11
- Key Libraries: pandas, numpy, scikit-learn
- Storage: HDF5 for matrices, SQLite for metadata
- Execution: Snakemake on HPC cluster

## Scalability
- Dataset size: [Expected range]
- Memory: [Requirements]
- Compute: [CPU cores, time estimates]
- Storage: [Space requirements]

## Error Handling
[Strategy for failures, retries, logging]

## Deployment
[Installation, configuration, execution]

Data Structure Selection Guide

See references/data_structure_guide.md for full details.

Quick Reference:

| Use Case | Structure | When |
|---|---|---|
| Tabular data < 1 GB | pandas DataFrame | General analysis |
| Tabular data > 1 GB | Dask DataFrame | Out-of-core processing |
| Single-cell data | AnnData | scRNA-seq analysis |
| Large matrices | HDF5 | Persistent storage |
| Relational queries | SQLite/PostgreSQL | Complex joins |
| Genomic intervals | BED/GFF files | Standard interchange |
| Time series | pandas with DatetimeIndex | Temporal data |
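
The quick reference can also be read as a simple lookup. The sketch below is only an illustration of that reading: the function name `suggest_structure` and the use-case labels (`"tabular"`, `"single_cell"`, and so on) are hypothetical shorthand, not part of any library, and the 1 GB cutoff is taken directly from the table.

```python
def suggest_structure(kind: str, size_gb: float = 0.0) -> str:
    """Return the structure the quick-reference table suggests.

    The `kind` labels are hypothetical shorthand for the table's use cases.
    """
    if kind == "tabular":
        # The table splits tabular data at roughly 1 GB.
        return "pandas DataFrame" if size_gb < 1 else "Dask DataFrame"
    table = {
        "single_cell": "AnnData",
        "large_matrix": "HDF5",
        "relational": "SQLite/PostgreSQL",
        "genomic_intervals": "BED/GFF files",
        "time_series": "pandas with DatetimeIndex",
    }
    if kind not in table:
        raise ValueError(f"unknown use case: {kind!r}")
    return table[kind]


print(suggest_structure("tabular", size_gb=0.2))
print(suggest_structure("tabular", size_gb=12))
print(suggest_structure("single_cell"))
```

Treat the 1 GB boundary as a starting point; the practical limit depends on available RAM and on how many intermediate copies the analysis creates.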

Scalability Considerations

Memory Estimation

code
RNA-seq count matrix: genes × samples × 8 bytes
  20,000 genes × 1,000 samples × 8 = 160 MB (fits in RAM)
  20,000 genes × 100,000 cells × 8 = 16 GB (need sparse or chunking)
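
The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch, assuming 8-byte (float64) values and decimal units (1 GB = 10^9 bytes) as in the figures above. The helper names are hypothetical, and the 5% density used for the sparse estimate is an illustrative assumption, not a measured value.

```python
def dense_matrix_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Memory for a dense matrix; 8 bytes per value matches float64."""
    return n_rows * n_cols * bytes_per_value


def csr_matrix_bytes(n_rows: int, n_cols: int, density: float,
                     value_bytes: int = 8, index_bytes: int = 4) -> int:
    """Approximate memory for a CSR sparse matrix.

    Each stored value needs the value plus a column index; the row
    pointer array adds (n_rows + 1) index entries.
    """
    nnz = round(n_rows * n_cols * density)
    return nnz * (value_bytes + index_bytes) + (n_rows + 1) * index_bytes


def format_bytes(n_bytes: float) -> str:
    """Format with decimal units (1 GB = 10**9 bytes), as in the text."""
    for unit in ("B", "KB", "MB", "GB"):
        if n_bytes < 1000:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1000
    return f"{n_bytes:.1f} TB"


bulk = dense_matrix_bytes(20_000, 1_000)
single_cell = dense_matrix_bytes(20_000, 100_000)
# ASSUMPTION: 5% non-zero entries, chosen only for illustration.
sparse = csr_matrix_bytes(20_000, 100_000, density=0.05)

print("bulk dense:       ", format_bytes(bulk))
print("single-cell dense:", format_bytes(single_cell))
print("single-cell CSR:  ", format_bytes(sparse))
```

The two dense results match the 160 MB and 16 GB figures above. The sparse figure depends entirely on the assumed density, so measure it on a real dataset before relying on it.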

Compute Planning

code
DESeq2 analysis: roughly O(n_genes × n_samples²) (rule of thumb; benchmark on your own data)
  100 samples: ~5 minutes
  1,000 samples: ~8 hours
  Strategy: Subset for testing, full run overnight
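
One way to use these figures is to time a small test run and extrapolate. The sketch below assumes the quadratic scaling in samples stated above, which is a planning rule of thumb rather than a guarantee. `extrapolate_minutes` is a hypothetical helper name.

```python
def extrapolate_minutes(ref_minutes: float, ref_samples: int,
                        target_samples: int, exponent: float = 2.0) -> float:
    """Scale a measured runtime to a larger sample count.

    ASSUMPTION: runtime grows as n_samples ** exponent with the gene
    count held fixed; exponent=2 mirrors the estimate in the text.
    """
    return ref_minutes * (target_samples / ref_samples) ** exponent


# A 100-sample test run that took about 5 minutes, scaled to 1,000 samples.
minutes = extrapolate_minutes(5, ref_samples=100, target_samples=1_000)
print(f"{minutes:.0f} min (~{minutes / 60:.1f} h)")
```

With the numbers above this gives 500 minutes, in line with the "~8 hours" estimate. If a second timing at a different sample count is available, fit the exponent from the two measurements instead of assuming 2.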

Storage Planning

code
FASTQ (compressed): 50-100 MB per million reads
  50M reads ≈ 2.5-5 GB
  100 samples × 50M reads ≈ 250-500 GB
  Strategy: Delete FASTQ after alignment, keep BAM
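
The same storage estimate as a small calculation. This is a sketch under the 50-100 MB per million reads range quoted above; `fastq_gb` is a hypothetical helper, and real sizes vary with read length and compression settings.

```python
def fastq_gb(n_samples: int, reads_millions: float,
             mb_per_million_reads: float) -> float:
    """Estimated compressed FASTQ size in GB (1 GB = 1000 MB)."""
    return n_samples * reads_millions * mb_per_million_reads / 1000


# Low and high ends of the 50-100 MB per million reads range.
for label, rate in (("low", 50), ("high", 100)):
    one_sample = fastq_gb(1, 50, rate)
    cohort = fastq_gb(100, 50, rate)
    print(f"{label}: {one_sample:.1f} GB per sample, {cohort:.0f} GB for 100 samples")
```

Planning against the high end of the range leaves headroom for intermediate files such as BAMs, which this estimate does not include.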

Integration Patterns

Wrapping External Tools

python
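# Pattern 1: Subprocess call
import subprocess

def run_fastqc(input_file: str, output_dir: str) -> subprocess.CompletedProcess:
    # check=True raises CalledProcessError on a non-zero exit code
    return subprocess.run(
        ['fastqc', input_file, '-o', output_dir],
        capture_output=True, text=True, check=True
    )

# Pattern 2: Python binding (preferred if available)
import pysam

def count_mapped_reads(bam_file: str) -> int:
    # The context manager closes the file handle on exit
    with pysam.AlignmentFile(bam_file, 'rb') as bam:
        return sum(1 for read in bam if not read.is_unmapped)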

Container Strategy

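dockerfile
# Dockerfile approach for reproducibility
FROM python:3.11-slim
# Pin exact package versions in a real build (for example via requirements.txt)
RUN pip install --no-cache-dir numpy pandas scikit-learn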
COPY pipeline.py /app/
ENTRYPOINT ["python", "/app/pipeline.py"]

Output: Technical Specification

Deliverable to Software Developer includes:

  1. Architecture diagram (components + data flow)
  2. Component specifications (inputs, outputs, responsibilities)
  3. Technology stack (exact versions)
  4. Data structures (schemas, formats)
  5. Error handling (what to do when steps fail)
  6. Performance requirements (memory, time, storage)
  7. Testing strategy (unit, integration, validation)
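
One way to keep the deliverable honest is to hold it in a structure that can report which of the seven items are still empty. The sketch below is an assumption about how such a check might look. `ComponentSpec`, `TechnicalSpec`, and `missing_sections` are hypothetical names, and the field layout simply mirrors the list above.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ComponentSpec:
    name: str
    responsibility: str
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


@dataclass
class TechnicalSpec:
    # One field per deliverable item in the list above.
    architecture_diagram: str = ""
    components: list = field(default_factory=list)
    technology_stack: dict = field(default_factory=dict)
    data_structures: dict = field(default_factory=dict)
    error_handling: str = ""
    performance_requirements: dict = field(default_factory=dict)
    testing_strategy: str = ""

    def missing_sections(self) -> list:
        """Names of deliverable items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


spec = TechnicalSpec(
    components=[
        ComponentSpec(
            name="Validator",
            responsibility="Check FASTQ integrity, format",
            inputs=["FASTQ files"],
            outputs=["validated FASTQ files"],
        )
    ],
    technology_stack={"Python": "3.11"},
)
print("Still missing:", spec.missing_sections())
```

An empty result from `missing_sections()` only shows that every item has some content. It does not judge whether that content is sufficient for the developer.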

References

For detailed guidance:

  • references/architecture_patterns.md - Common patterns with pros/cons
  • references/data_structure_guide.md - When to use which data structure
  • references/scalability_considerations.md - Memory, compute, storage planning
  • references/integration_patterns.md - How to wrap tools, containers, dependencies

Example Architecture

Project: QC Pipeline for 1,000 RNA-seq Samples

code
## Architecture Specification

### Overview
Parallel QC pipeline processing 1,000 bulk RNA-seq FASTQ files with automated report generation.

### Components
1. Validator: Check FASTQ integrity, format
2. QC Runner: Execute FastQC in parallel
3. Aggregator: Combine metrics with MultiQC
4. Reporter: Generate summary statistics and plots

### Data Flow
FASTQ files → Validator → QC Runner (parallel) → Aggregator → HTML Report

### Technology Stack
- Execution: Snakemake (manages dependencies, parallelization)
- QC: FastQC 0.12.1
- Aggregation: MultiQC 1.14
- Custom code: Python 3.11, pandas, matplotlib
- Storage: FASTQ (gzip), QC metrics (JSON), report (HTML)

### Scalability
- Data: 1,000 samples × 50M reads (100 bp), at ~100 MB compressed per million reads ≈ 5 TB FASTQ
- Compute: 100 parallel jobs on HPC cluster
- Time: 30 min per sample × 1,000 samples ÷ 100 parallel jobs = 300 min (5 hours)
- Memory: 4 GB per FastQC job × 100 concurrent jobs = 400 GB total (distributed across nodes)

### Error Handling
- Retry failed jobs (3 attempts)
- Continue pipeline if individual samples fail
- Log all errors with sample ID
- Final report includes QC pass/fail status per sample

### Deployment
- Install: conda env from environment.yml
- Config: samples.csv (list of FASTQ paths)
- Execute: snakemake --jobs 100 --cluster "sbatch -c 4 --mem=4G" (Snakemake 7 syntax; Snakemake 8+ uses executor plugins instead of --cluster)
- Output: results/multiqc_report.html

Hand this specification to the Software Developer for implementation.
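
The error-handling policy in the example (retry up to three times, continue past failed samples, log with the sample ID, report pass/fail per sample) can be sketched in plain Python. This is an illustration of the policy only, not how Snakemake implements it. In a Snakemake deployment the equivalent behavior would normally come from its own options such as `--retries` and `--keep-going`. The function `run_with_retries` and the `run_step` callable are hypothetical.

```python
import logging

logger = logging.getLogger("qc_pipeline")


def run_with_retries(sample_ids, run_step, max_attempts=3):
    """Run `run_step(sample_id)` for each sample with bounded retries.

    Returns a dict mapping each sample ID to "pass" or "fail". A failing
    sample never stops the remaining samples from being processed.
    """
    status = {}
    for sample_id in sample_ids:
        status[sample_id] = "fail"
        for attempt in range(1, max_attempts + 1):
            try:
                run_step(sample_id)
            except Exception as exc:
                # Log every failure together with the sample ID.
                logger.error("sample=%s attempt=%d/%d failed: %s",
                             sample_id, attempt, max_attempts, exc)
            else:
                status[sample_id] = "pass"
                break
    return status


# Usage with a stand-in step: "s2" fails twice then succeeds, "s3" always fails.
attempts = {}

def fake_qc(sample_id):
    attempts[sample_id] = attempts.get(sample_id, 0) + 1
    if sample_id == "s3" or (sample_id == "s2" and attempts[sample_id] < 3):
        raise RuntimeError("simulated QC failure")

report = run_with_retries(["s1", "s2", "s3"], fake_qc)
print(report)
```

The returned dictionary is the raw material for the per-sample pass/fail column in the final report.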

Success Criteria

Architecture is complete when:

  • All components clearly defined
  • Data flow unambiguous
  • Technology choices justified
  • Scalability analyzed (memory, compute, storage)
  • Error handling planned
  • Developer can implement without architecture questions