daa-diagnose

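Diagnoses microbiome/virome count data and recommends differential abundance analysis methods. Use this skill when the user wants to analyze their own data, asks which method to choose for a specific dataset, or provides count data files for analysis.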

SKILL.md
---
name: daa-diagnose
description: Diagnose microbiome/virome count data and recommend differential abundance analysis methods. Use when user wants to analyze their data, asks which method to use for their specific dataset, or provides count data files for analysis.
argument-hint: "[counts.tsv] [metadata.tsv] [group_column]"
allowed-tools: Bash, Read, Glob
---

# DAA Data Diagnosis Workflow

This skill diagnoses user data and recommends appropriate differential abundance analysis methods based on empirical benchmarks.

## Step 1: Locate Data Files

If the user provided file paths, use those. Otherwise, search for common patterns:

```bash
# Look for count/abundance files
ls -la *.tsv *.csv 2>/dev/null | head -20

# Common naming patterns
ls -la *count* *abundance* *otu* *asv* *feature* 2>/dev/null | head -10
```

Ask the user to confirm which file is the count matrix and which is metadata if unclear.
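The same search can be sketched in Python when a structured result is more convenient than `ls` output. This is only an illustration: `split_candidates` and `NAME_HINTS` are hypothetical names, not part of the `daa` CLI, and unlike the shell globs above the sketch only considers `.tsv` and `.csv` files.

```python
from pathlib import PurePath

# Hypothetical helper mirroring the `ls` patterns above; not part of the daa CLI.
NAME_HINTS = ("count", "abundance", "otu", "asv", "feature")
TABLE_SUFFIXES = {".tsv", ".csv"}


def split_candidates(filenames):
    """Split tabular files into likely count matrices and everything else."""
    tables = sorted(
        f for f in filenames if PurePath(f).suffix.lower() in TABLE_SUFFIXES
    )
    likely_counts = [
        f for f in tables
        if any(hint in PurePath(f).name.lower() for hint in NAME_HINTS)
    ]
    # Remaining tables are metadata candidates, to be confirmed with the user.
    others = [f for f in tables if f not in likely_counts]
    return likely_counts, others
```

A typical call would be `split_candidates(os.listdir("."))` after `import os`. The name match is only a hint, so the user should still confirm both files.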

## Step 2: Profile the Data

Run the LLM-friendly profiler to get structured diagnostics:

```bash
daa profile-llm -c {COUNTS_FILE} -m {METADATA_FILE} -g {GROUP_COLUMN}
```

If the group column is unknown, first inspect the metadata:

```bash
head -5 {METADATA_FILE}
```
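When the header alone does not make the group column obvious, a short script can list columns whose values repeat across samples. The sketch below assumes a tab-separated metadata file with a header row; `candidate_group_columns` and its `max_levels` cutoff are illustrative choices, not something the `daa` tool defines.

```python
import csv


def candidate_group_columns(lines, delimiter="\t", max_levels=5):
    """Return {column: sorted levels} for columns that look like grouping variables.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    reader = csv.DictReader(lines, delimiter=delimiter)
    rows = list(reader)
    names = reader.fieldnames or []
    candidates = {}
    for name in names:
        levels = sorted({row[name] or "" for row in rows})
        # Heuristic (assumption): a grouping column has at least two levels,
        # few levels overall, and fewer levels than rows, so values repeat.
        if 2 <= len(levels) <= max_levels and len(levels) < len(rows):
            candidates[name] = levels
    return candidates
```

For a file on disk, `with open("metadata.tsv", newline="") as fh: print(candidate_group_columns(fh))` prints the candidates. Sample identifiers and continuous covariates are excluded because each of their values usually occurs once.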

## Step 3: Parse Profile Output

The profile-llm command outputs structured sections. Extract key metrics:

### Critical Metrics for Method Selection

| Metric | Where to Find | Decision Impact |
|--------|---------------|-----------------|
| Overall sparsity | `sparsity.overall_zero_fraction` | >70% → Hurdle; 30-70% → ZINB |
| Samples per group | `samples` section | <20/group → warn about power |
| Library size CV | `library_size.cv` | >1.0 → check for confounding |
| Features | `dimensions` | >500 → more stringent correction |
| Group-specific prevalence | `prevalence` section | Varies by group → consider groupwise filtering |
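To make the first rows of the table concrete, the sketch below computes the overall zero fraction and the library size CV directly from a small count matrix. The definitions are assumptions (zero entries divided by all entries, and sample standard deviation of per-sample totals divided by their mean); the exact formulas used by `daa profile-llm` are not stated here and may differ.

```python
import statistics


def profile_counts(counts):
    """Summarize a features x samples matrix given as a list of rows."""
    n_features = len(counts)
    n_samples = len(counts[0])
    zeros = sum(1 for row in counts for value in row if value == 0)
    # Library size = total count per sample (column sum).
    library_sizes = [sum(row[j] for row in counts) for j in range(n_samples)]
    mean_size = statistics.mean(library_sizes)
    if n_samples > 1 and mean_size > 0:
        # Assumption: CV uses the sample standard deviation.
        cv = statistics.stdev(library_sizes) / mean_size
    else:
        cv = 0.0
    return {
        "n_features": n_features,
        "n_samples": n_samples,
        "overall_zero_fraction": zeros / (n_features * n_samples),
        "library_size_cv": cv,
    }
```

For example, `profile_counts([[0, 10, 0, 5], [3, 0, 0, 1], [0, 0, 7, 0]])` has 7 zeros among 12 entries, so its zero fraction is about 0.58.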

## Step 4: Apply Decision Rules

Based on the profile, apply these evidence-based rules:

### Primary Method Selection

```
IF sparsity > 70%:
    → Recommend HURDLE (best for structural zeros)

ELIF sparsity 50-70%:
    → Recommend ZINB (handles excess zeros well)

ELIF sparsity 30-50%:
    → Recommend ZINB or NB

ELSE (sparsity < 30%):
    → Recommend LinDA or NB (standard methods work)
```
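The sparsity rules can be sketched as a small function. The thresholds come from the rules above; how the exact boundary values (30%, 50%, 70%) are assigned is an assumption, since the rules leave them ambiguous, and `recommend_primary_method` is a hypothetical name.

```python
def recommend_primary_method(sparsity_pct):
    """Map overall sparsity (percent of zero entries) to candidate methods."""
    if not 0 <= sparsity_pct <= 100:
        raise ValueError("sparsity_pct must be between 0 and 100")
    if sparsity_pct > 70:
        return ["Hurdle"]          # best for structural zeros
    if sparsity_pct >= 50:         # assumption: 50 and 70 belong to this band
        return ["ZINB"]            # handles excess zeros well
    if sparsity_pct >= 30:         # assumption: 30 belongs to this band
        return ["ZINB", "NB"]
    return ["LinDA", "NB"]         # standard methods work
```

The function returns a list because two of the bands name more than one acceptable method; the final choice still depends on sample size and study design.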

### Sample Size Considerations

```
IF n_per_group < 10:
    → WARN: Very low power, only huge effects detectable
    → Consider pooling or different study design

ELIF n_per_group 10-20:
    → WARN: Limited power, need >4x fold changes
    → Recommend ZINB/Hurdle for discovery

ELIF n_per_group 20-50:
    → Moderate power, >2x fold changes detectable
    → Full method suite available

ELSE (n_per_group > 50):
    → Good power, even moderate effects detectable
```
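A minimal sketch of the sample size tiers follows. It assumes `n_per_group` is the size of the smallest group and that exactly 20 samples falls in the moderate tier, which matches the "<20/group → warn about power" rule; `power_assessment` is a hypothetical helper.

```python
def power_assessment(n_per_group):
    """Return (tier, note) for the smallest group size."""
    if n_per_group < 10:
        return ("very low", "Only huge effects detectable; consider pooling "
                            "or a different study design.")
    if n_per_group < 20:  # assumption: 20 itself no longer triggers a warning
        return ("limited", "Need >4x fold changes; ZINB/Hurdle for discovery.")
    if n_per_group <= 50:
        return ("moderate", ">2x fold changes detectable; full method suite "
                            "available.")
    return ("good", "Even moderate effects detectable.")
```

With unbalanced groups, one would pass the minimum, for example `power_assessment(min(group_sizes.values()))`, where `group_sizes` is a hypothetical mapping of group name to sample count.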

### Library Size Variation

```
IF library_size_cv > 1.5:
    → WARN: High library size variation
    → Check for batch effects or technical issues
    → Ensure proper normalization

IF library_size differs by group (>2x):
    → WARN: Potential confounding
    → Consider library size as covariate
```
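The two library size checks are independent, so both can fire for the same dataset. The sketch below assumes "differs by group (>2x)" means the ratio between the largest and smallest group mean library size; `library_size_warnings` and its argument names are illustrative.

```python
def library_size_warnings(cv, mean_size_by_group):
    """Collect warnings from the overall CV and per-group mean library sizes."""
    warnings = []
    if cv > 1.5:
        warnings.append(
            "High library size variation: check for batch effects or "
            "technical issues and ensure proper normalization."
        )
    means = list(mean_size_by_group.values())
    # Assumption: imbalance = largest group mean / smallest group mean.
    if len(means) >= 2 and min(means) > 0 and max(means) / min(means) > 2:
        warnings.append(
            "Potential confounding: library size differs >2x between groups; "
            "consider library size as a covariate."
        )
    return warnings
```

For instance, a CV of 1.8 with group means of 50,000 and 20,000 reads (ratio 2.5) yields both warnings.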

### Study Design Detection

Inspect metadata for study design indicators:

```bash
head -5 {METADATA_FILE}
```

Look for these columns:

| Column Pattern | Study Design | Recommended Model |
|----------------|--------------|-------------------|
| `subject`, `patient`, `individual`, `id` | Repeated measures | LMM with random intercept |
| `time`, `timepoint`, `visit`, `day` | Longitudinal | LMM with time + random intercept |
| `batch`, `run`, `plate`, `lane` | Batch effects | Include as covariate or random effect |
| `age`, `bmi`, `sex` | Covariates | Include in formula |
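The column patterns can be applied mechanically to a metadata header. The sketch below matches whole name tokens (split on non-alphanumeric characters) so that `subject_id` matches but `treatment` does not; the token-matching strategy and the names `DESIGN_PATTERNS` and `classify_columns` are assumptions made for illustration.

```python
import re

# Keywords taken from the table above, grouped by the role they suggest.
DESIGN_PATTERNS = {
    "subject": ("subject", "patient", "individual", "id"),
    "time": ("time", "timepoint", "visit", "day"),
    "batch": ("batch", "run", "plate", "lane"),
    "covariate": ("age", "bmi", "sex"),
}


def classify_columns(columns):
    """Assign each metadata column to the first role whose keyword it contains."""
    roles = {role: [] for role in DESIGN_PATTERNS}
    for column in columns:
        tokens = set(re.split(r"[^a-z0-9]+", column.lower()))
        for role, keywords in DESIGN_PATTERNS.items():
            if tokens & set(keywords):
                roles[role].append(column)
                break  # one role per column
    return roles
```

The `id` keyword is loose: a per-sample identifier such as `sample_id` would also be flagged as a subject column, so it is worth checking that the values actually repeat before treating the design as repeated measures.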

### Longitudinal Data Rules

```
IF has_subject_column AND has_time_column:
    → Study type: Longitudinal
    → Recommend: LMM with random intercept per subject
    → Formula: "~ group * time + (1 | subject)"

IF has_subject_column AND NOT has_time_column:
    → Study type: Repeated measures (paired)
    → Recommend: LMM with random intercept per subject
    → Formula: "~ group + (1 | subject)"

IF multiple_samples_per_subject:
    → MUST use LMM to account for non-independence
    → Standard models will have inflated Type I error
```
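These rules reduce to a choice among three formula shapes plus a check for repeated subjects. The sketch below is a hedged illustration: `design_formula` and `has_repeated_subjects` are hypothetical helpers, and the plain `~ group` formula for the cross-sectional case is an assumption, since the rules above only cover designs with a subject column.

```python
def has_repeated_subjects(subject_ids):
    """True if any subject contributes more than one sample."""
    ids = list(subject_ids)
    return len(set(ids)) < len(ids)


def design_formula(group, subject=None, time=None):
    """Return (study_type, formula) from the detected column names."""
    if subject and time:
        return ("longitudinal", f"~ {group} * {time} + (1 | {subject})")
    if subject:
        return ("repeated-measures", f"~ {group} + (1 | {subject})")
    # Assumption: no subject column means independent samples.
    return ("cross-sectional", f"~ {group}")
```

For example, `design_formula("treatment", subject="subject_id", time="timepoint")` yields the interaction formula with a random intercept per subject. `has_repeated_subjects` covers the third rule: if it returns true, a model without a subject term should not be used.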

### Covariate Recommendations

```
IF has_batch_column:
    → Always include batch in model
    → Consider as random effect if many batches: (1 | batch)
    → Or fixed effect if few batches: ~ group + batch

IF has_continuous_covariates (age, bmi, etc.):
    → Include if potentially confounding
    → Formula: "~ group + age + bmi"

IF library_size_imbalance > 2x AND correlated_with_group:
    → Include log(library_size) as covariate
    → Formula: "~ group + log_library_size"
```
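The covariate rules can be combined into one formula builder. In this sketch the cutoff between "few" and "many" batches (`random_batch_min_levels=6`) is a made-up default, because the rules above give no number; `build_formula` and its parameters are likewise hypothetical.

```python
def build_formula(group, covariates=(), batch=None, n_batches=0,
                  library_size_confounded=False, random_batch_min_levels=6):
    """Assemble a model formula string from the covariate rules."""
    fixed = [group, *covariates]
    random = []
    if batch:
        if n_batches >= random_batch_min_levels:
            # Many batches: treat batch as a random effect.
            random.append(f"(1 | {batch})")
        else:
            # Few batches: treat batch as a fixed effect.
            fixed.append(batch)
    if library_size_confounded:
        # Assumes a precomputed log_library_size column in the metadata.
        fixed.append("log_library_size")
    return "~ " + " + ".join(fixed + random)
```

With four batches, `build_formula("group", batch="batch", n_batches=4)` gives `~ group + batch`, while twelve batches gives `~ group + (1 | batch)`.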

## Step 5: Generate Recommendations

Output a structured recommendation:

### Template Output

````
## Data Diagnosis Summary

**Dataset**: {n_features} features × {n_samples} samples
**Sparsity**: {sparsity}%
**Study design**: {cross-sectional | longitudinal | repeated-measures}
**Samples per group**: {group_sizes}
**Library size CV**: {cv}
**Covariates detected**: {list of potential covariates}

## Recommended Analysis

### Primary Method: {METHOD}
Rationale: {why this method based on data characteristics}

### Formula: {formula}
{If longitudinal: explain random effects}
{If covariates: explain why included}

```bash
daa recommend -c {counts} -m {metadata} -g {group_column} -t {target_level} --run -o results.tsv
```

### Threshold: {q_threshold}
- For {METHOD}: use q < {threshold} {If LinDA: explain q < 0.10 requirement}

### Secondary Validation (Optional)
Run spike-in validation to empirically test the method on your data:

```bash
daa validate -c {counts} -m {metadata} -g {group_column} -t {target_level} -f "{formula}" --test-coef {coefficient}
```

## Warnings
{Any warnings about power, confounding, non-independence, etc.}

## Expected Performance
Based on benchmarks with similar data:
- Sensitivity: ~{X}% at {effect_size} fold change
- FDR: ~{Y}% at q < {threshold}
````

## Step 6: Offer to Run Analysis

After presenting recommendations, offer:

1. "Should I run the recommended analysis now?"
2. "Would you like me to run multiple methods for comparison?"
3. "Do you want to adjust the prevalence filter threshold?"

## Decision Rules Reference

See [decision-rules.md](decision-rules.md) for complete decision tree and benchmark data.

## Example Sessions

### Example 1: Cross-sectional Study

**User**: "Analyze my microbiome data in data/counts.tsv"

**Claude**:
1. Runs `daa profile-llm -c data/counts.tsv -m data/metadata.tsv -g treatment`
2. Sees: 68% sparsity, n=25/group, CV=0.8
3. Recommends: "Based on 68% sparsity and n=25/group, I recommend the **ZINB model** at q < 0.05" (68% falls in the 50-70% band of the decision rules)
4. Provides ready-to-run command
5. Offers to execute

### Example 2: Longitudinal Study

**User**: "Analyze my longitudinal microbiome study"

**Claude**:
1. Inspects metadata: sees `subject_id`, `timepoint`, `treatment` columns
2. Detects: Longitudinal design with repeated measures
3. Runs `daa profile-llm -c counts.tsv -m metadata.tsv -g treatment`
4. Sees: 55% sparsity, n=30 subjects, 3 timepoints each
5. Recommends:

   ```
   Study design: Longitudinal (3 timepoints per subject)

   Primary Method: LMM (Linear Mixed Model)
   Rationale: Repeated measures require random effects to account for
   within-subject correlation. Moderate sparsity allows CLR transformation.

   Formula: "~ treatment + timepoint + (1 | subject_id)"
   - treatment: main effect of interest
   - timepoint: time effect
   - (1 | subject_id): random intercept per subject

   # The recommend command auto-detects longitudinal designs and runs LMM:
   daa recommend -c counts.tsv -m metadata.tsv -g treatment -t case --run -o results.tsv

   # For custom formulas (interactions, random slopes), use --yaml:
   daa recommend -c counts.tsv -m metadata.tsv -g treatment -t case --yaml -o pipeline.yaml
   # Edit to add: ~ treatment * timepoint + (1 + timepoint | subject_id)
   daa run -c counts.tsv -m metadata.tsv --config pipeline.yaml -o results.tsv
   ```
6. Warns: "LMM accounts for within-subject correlation to avoid inflated Type I error"

### Example 3: Study with Batch Effects

**User**: "My data has batch effects from different sequencing runs"

**Claude**:
1. Inspects metadata: sees `batch`, `group` columns
2. Detects: 4 batches, potential confounding
3. Recommends:

   ```
   Covariates detected: batch (4 levels)

   Formula: "~ group + batch"
   - Include batch as fixed effect (few levels)

   Alternative if many batches:
   Formula: "~ group + (1 | batch)"
   - Include batch as random effect
   ```