Differential Abundance Analysis Guide

This skill provides evidence-based guidance for method selection, threshold choices, and result interpretation based on comprehensive benchmarking.

Quick Method Selection

Sparsity	Method	Threshold	Sensitivity	FDR	Use Case
>70%	Hurdle	q < 0.05	83%	25%	Sparse data with structural zeros
50-70%	ZINB	q < 0.05	83%	29%	Moderate sparsity, excess zeros
<30%	LinDA	q < 0.10	39%	12.5%	Low sparsity, high-confidence findings
Longitudinal	LMM	q < 0.05	-	-	Repeated measures

CRITICAL: LinDA Threshold

LinDA requires q < 0.10, NOT q < 0.05

LinDA uses CLR (Centered Log-Ratio) transformation which attenuates effect sizes by ~75%:

•True 4.0 log2FC (16x fold change) becomes ~1.0 observed
•At q < 0.05: 0% sensitivity (nothing detected)
•At q < 0.10: 39% sensitivity with excellent 12.5% FDR

This is by design, not a bug. CLR centers by geometric mean to handle compositional data.

When User Gets 0 Significant Results with LinDA

•First, check if they used q < 0.05 → suggest q < 0.10
•If still nothing at q < 0.10, the effects may be too small
•LinDA needs >8x fold changes to detect anything reliably
•Suggest trying Hurdle if FDR control is less critical

Effect Size Requirements

Method	Minimum Detectable Effect	Notes
LinDA	>8x fold change (3 log2FC)	Due to CLR attenuation
ZINB	>2x fold change (1 log2FC)	Good sensitivity
Hurdle	>2x fold change (1 log2FC)	Best for sparse data
NB	>16x fold change (4 log2FC)	Conservative

Method Details

LinDA (CLR + Linear Model)

•Best for: High-confidence findings, FDR-controlled discovery
•Threshold: q < 0.10 (NOT 0.05)
•Pros: Excellent FDR control (12.5%), handles compositionality
•Cons: Low sensitivity, only detects very large effects
•Effect sizes: NOT directly interpretable as fold changes (attenuated by ~75%)

ZINB (Zero-Inflated Negative Binomial)

•Best for: Discovery, count data with excess zeros
•Threshold: q < 0.05
•Pros: High sensitivity (83%), models zero-inflation
•Cons: Moderate FDR (29%), assumes NB distribution

Hurdle Model

•Best for: Sparse data with structural zeros, two-part analysis
•Threshold: q < 0.05
•Pros: High sensitivity (83%), good FDR (25%), separates presence/abundance
•Cons: More complex interpretation (binary + count components)

NB (Negative Binomial)

•Best for: Low-sparsity data, simple overdispersion
•Threshold: q < 0.05
•Pros: Simple, well-understood
•Cons: Low sensitivity (6%), doesn't handle excess zeros

Permutation Test

•Best for: Unknown distributions, non-parametric inference
•Threshold: p < 0.05
•Pros: Distribution-free, robust
•Cons: Computationally intensive, may be conservative

LMM (Linear Mixed Model)

•Best for: Longitudinal data, repeated measures
•Threshold: q < 0.05
•Pros: Handles within-subject correlation, auto-detected by recommend
•Cons: Requires CLR transformation, same attenuation as LinDA

Interpreting Results

LinDA Results

•Effect sizes are CLR-transformed, NOT fold changes
•Observed estimate of 1.0 may represent true 4x fold change
•Focus on significance (q-value), not effect magnitude
•Use q < 0.10 threshold

ZINB/Hurdle Results

•Effect sizes are on log scale (interpretable as log fold change)
•fold_change = exp(estimate)
•Higher sensitivity means more discoveries but also more false positives
•Consider biological plausibility of findings

Sample Size Considerations

•n=10 per group: Only huge effects (>8x) reliably detectable
•n=20 per group: Large effects (>4x) detectable
•n=50 per group: Moderate effects (>2x) become detectable
•Power analysis recommended before study

Decision Tree

code

Is your data longitudinal/repeated measures?
├── YES → recommend auto-detects and runs LMM
└── NO → Continue...

What is your sparsity level?
├── >70% zeros → Hurdle (q < 0.05)
├── 50-70% zeros → ZINB (q < 0.05)
├── 30-50% zeros → ZINB or LinDA
└── <30% zeros → LinDA (q < 0.10)

Unsure about distributional assumptions?
└── Permutation test (p < 0.05)

CLI Quick Reference

The unified workflow handles method selection automatically:

bash

# Let the tool choose the best method for your data
daa recommend -c counts.tsv -m metadata.tsv -g group -t treatment --run -o results.tsv

# Just see the recommendation (no execution)
daa recommend -c counts.tsv -m metadata.tsv -g group -t treatment

# Generate editable YAML for custom configurations
daa recommend -c counts.tsv -m metadata.tsv -g group -t treatment --yaml -o pipeline.yaml

# Non-parametric alternative
daa permutation -c counts.tsv -m metadata.tsv -f "~ group" -t grouptreatment -o results.tsv

Longitudinal/Repeated Measures

The recommend command auto-detects these designs:

bash

# Metadata with subject + timepoint columns → LMM automatically
daa recommend -c counts.tsv -m metadata.tsv -g group -t treatment --run -o results.tsv

# For custom formulas (interactions, random slopes):
daa recommend -c counts.tsv -m metadata.tsv -g group -t treatment --yaml -o pipeline.yaml
# Edit pipeline.yaml, then:
daa run -c counts.tsv -m metadata.tsv --config pipeline.yaml -o results.tsv

Troubleshooting: 0 Significant Features

•Check threshold: Did you use LinDA with q < 0.05? Try q < 0.10
•Check sample size: n < 20/group has very limited power
•Check sparsity: >70% sparsity with LinDA? Try Hurdle instead
•Run validation: daa validate to see if method works on your data structure
•Check raw p-values: Are any features close (p < 0.1)?

For More Details

•Method comparison benchmarks: see method-comparison.md
•Q-value threshold analysis: see thresholds.md
•Result interpretation guide: see interpretation.md