CDR3Clustering Process Configuration

Purpose

Clusters TCR/BCR clones by CDR3 sequence using GIANA or ClusTCR (both Faiss-based) and adds a CDR3_Cluster column to the metadata for clonotype analysis.

When to Use

•To identify groups of similar TCR/BCR clonotypes
•For analyzing TCR sequence convergence
•After ScRepCombiningExpression, when TCR/BCR data are integrated with RNA
•For investigating public clonotypes across samples
•Before TESSA analysis for epitope specificity

Important: This process runs only when VDJ input is present (TCRData/BCRData columns in SampleInfo).
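
The VDJ requirement can be checked before launching the pipeline. The sketch below assumes the sample information is a tab-delimited text file with a header row, and that a VDJ column only counts if at least one row has a value in it. The helper name `has_vdj_input` and the example rows are hypothetical; only the `TCRData`/`BCRData` column names come from the note above.

```python
import csv
import io

# Column names taken from the note above
VDJ_COLUMNS = {"TCRData", "BCRData"}


def has_vdj_input(sample_info_text: str, delimiter: str = "\t") -> bool:
    """Return True if a VDJ column exists and holds at least one non-empty value.

    Assumption: the sample information is delimited text with a header row.
    """
    reader = csv.DictReader(io.StringIO(sample_info_text), delimiter=delimiter)
    present = VDJ_COLUMNS.intersection(reader.fieldnames or [])
    if not present:
        return False
    return any((row.get(col) or "").strip() for row in reader for col in present)


# Hypothetical sample information, for illustration only
example = "Sample\tRNAData\tTCRData\nS1\tdata/S1/rna\tdata/S1/tcr\n"
print(has_vdj_input(example))
```

If this returns False for your sample file, CDR3Clustering would be expected to be skipped.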

Configuration Structure

Process Enablement

toml

[CDR3Clustering]
cache = true

Input Specification

toml

[CDR3Clustering.in]
screpfile = "path/to/combined_object.qs"

Environment Variables

toml

[CDR3Clustering.envs]
type = "auto"      # TCR, BCR, or auto
tool = "GIANA"     # GIANA or ClusTCR
python = "python"   # Path to python
within_sample = true  # Cluster per sample
args = {}          # Tool-specific arguments
chain = "both"     # TRA, TRB, IGH, IGL, IGK, both, heavy, light

GIANA Arguments (via `args`)

toml

[CDR3Clustering.envs.args]
method = "hierarchical"    # hierarchical, kmeans
dist = "hamming"          # hamming, levenshtein
threshold = 0.15           # Distance threshold

ClusTCR Arguments (via `args`)

toml

[CDR3Clustering.envs.args]
method = "two-step"       # mcl, faiss, two-step
n_cpus = 4                # CPUs for MCL
faiss_cluster_size = 5000  # Supercluster size
mcl_params = [1.2, 2]    # [inflation, expansion]

Configuration Examples

Minimal Configuration

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

GIANA with Custom Distance Threshold

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
method = "hierarchical"
dist = "hamming"
threshold = 0.15

ClusTCR Two-Step (Large Datasets)

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

ClusTCR MCL (Small Datasets)

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "mcl"
n_cpus = 4

TRB Chain Only

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
chain = "TRB"

Cross-Sample Clustering

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
within_sample = false

Common Patterns

Pattern 1: Standard TCR Beta Chain

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
type = "TCR"
tool = "GIANA"
chain = "TRB"

Pattern 2: Large Dataset (>100K sequences)

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "ClusTCR"

[CDR3Clustering.envs.args]
method = "two-step"
faiss_cluster_size = 5000
n_cpus = 8

Pattern 3: Custom Threshold

toml

[CDR3Clustering]
[CDR3Clustering.in]
screpfile = "intermediate/screpcombiningexpression/combined.qs"

[CDR3Clustering.envs]
tool = "GIANA"

[CDR3Clustering.envs.args]
threshold = 0.15  # Higher=fewer clusters, Lower=more clusters

Dependencies

Upstream

•ScRepCombiningExpression (required): Combined scRepertoire object with TCR/BCR data

Downstream

•TESSA: TCR epitope specificity prediction
•ClonalStats: Clonality statistics (uses CDR3_Cluster metadata)

Validation Rules

•Tool must be "GIANA" or "ClusTCR"
•Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
•GIANA requires: biopython, faiss, scikit-learn
•ClusTCR requires: clustcr package
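
The first two rules can be expressed as a small pre-flight check. This is a minimal sketch, not the pipeline's own validation: the function `validate_envs` is hypothetical, the defaults mirror the Environment Variables block, and the placement of `both`, `heavy`, and `light` (allowed for either type, BCR only, BCR only) is an assumption.

```python
VALID_TOOLS = {"GIANA", "ClusTCR"}

# TRA/TRB and IGH/IGL/IGK follow the rules above; where "both", "heavy"
# and "light" belong is an assumption made for this sketch.
VALID_CHAINS = {
    "TCR": {"TRA", "TRB", "both"},
    "BCR": {"IGH", "IGL", "IGK", "both", "heavy", "light"},
}


def validate_envs(envs: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors = []

    tool = envs.get("tool", "GIANA")
    if tool not in VALID_TOOLS:
        errors.append(f"tool must be one of {sorted(VALID_TOOLS)}, got {tool!r}")

    data_type = envs.get("type", "auto")
    chain = envs.get("chain", "both")
    if data_type == "auto":
        # The data type is unknown here, so accept any listed chain value.
        allowed = set().union(*VALID_CHAINS.values())
    elif data_type in VALID_CHAINS:
        allowed = VALID_CHAINS[data_type]
    else:
        errors.append(f"type must be 'TCR', 'BCR' or 'auto', got {data_type!r}")
        allowed = set()

    if allowed and chain not in allowed:
        errors.append(f"chain {chain!r} is not valid for type {data_type!r}")
    return errors


print(validate_envs({"type": "TCR", "tool": "GIANA", "chain": "TRB"}))
print(validate_envs({"type": "TCR", "tool": "giana", "chain": "IGH"}))
```

The first call reports no problems; the second reports both the misspelled tool and the BCR chain used with TCR data.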

Computational Considerations

•<50K sequences: ClusTCR method = "mcl" (highest quality)
•50K-500K sequences: ClusTCR method = "two-step" (balanced)
•>500K sequences: GIANA or ClusTCR method = "two-step" (fastest)
•Memory: GIANA ~2-4 GB/100K, ClusTCR ~4-8 GB/100K
•Runtime: GIANA 1-5 min/100K, ClusTCR two-step 2-10 min/100K
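
The size tiers and memory figures above can be turned into a quick planning helper. The sketch below is illustrative only: `suggest_method` and `estimate_memory_gb` are hypothetical names, the top tier is treated as anything above 500K sequences (with GIANA picked from the two options listed), and memory is assumed to scale linearly with the number of sequences.

```python
def suggest_method(n_sequences: int) -> dict:
    """Map a sequence count to the tool/method tiers listed above."""
    if n_sequences < 50_000:
        return {"tool": "ClusTCR", "args": {"method": "mcl"}}
    if n_sequences <= 500_000:
        return {"tool": "ClusTCR", "args": {"method": "two-step"}}
    # Above 500K the list allows GIANA or ClusTCR two-step; GIANA is chosen here.
    return {"tool": "GIANA", "args": {}}


def estimate_memory_gb(tool: str, n_sequences: int) -> tuple[float, float]:
    """Return a (low, high) memory estimate in GB.

    Uses the per-100K ranges quoted above and assumes linear scaling,
    which is a simplification.
    """
    per_100k = {"GIANA": (2.0, 4.0), "ClusTCR": (4.0, 8.0)}
    low, high = per_100k[tool]
    scale = n_sequences / 100_000
    return (low * scale, high * scale)


choice = suggest_method(200_000)
print(choice, estimate_memory_gb(choice["tool"], 200_000))
```

The returned dictionary uses the same keys as `[CDR3Clustering.envs]`, so it can be copied into the configuration by hand.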

Troubleshooting

Process not running

Cause: No VDJ data available
Solution: Verify that the ScRepCombiningExpression output contains TCR/BCR data

ModuleNotFoundError

Cause: Missing dependencies
Solution:

•GIANA: pip install biopython faiss-cpu scikit-learn
•ClusTCR: conda install -c conda-forge clustcr
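
To see which packages are missing before rerunning, the import names can be probed directly. This sketch uses the standard `importlib.util.find_spec`; the helper `missing_modules` is hypothetical. It inspects the interpreter that runs the snippet, so run it with the same interpreter configured in the `python` setting.

```python
import importlib.util

# Import names (not pip names): biopython -> Bio, scikit-learn -> sklearn
REQUIRED_MODULES = {
    "GIANA": ["Bio", "faiss", "sklearn"],
    "ClusTCR": ["clustcr"],
}


def missing_modules(tool: str) -> list[str]:
    """Return the import names for `tool` that cannot be found."""
    return [
        name
        for name in REQUIRED_MODULES[tool]
        if importlib.util.find_spec(name) is None
    ]


for tool in REQUIRED_MODULES:
    print(tool, "missing:", missing_modules(tool))
```

An empty list means every required module for that tool is importable.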

Too many/few clusters

Cause: Inappropriate threshold
Solution: Adjust threshold (higher = fewer clusters, lower = more clusters)

Out of memory

Cause: Dataset too large for RAM
Solution: Use within_sample = true, reduce n_cpus, or use GIANA

Slow clustering

Cause: Suboptimal method for dataset size
Solution:

•>50K: ClusTCR method = "two-step" with increased n_cpus
•Very large (>500K): Use GIANA

Notes on Output Format

Metadata column: CDR3_Cluster

Cluster naming:

•S_1, S_2: Single unique CDR3 sequence (may have multiple cells)
•M_1, M_2: Multiple unique CDR3 sequences (similar but different)

Interpretation:

•S_ prefix: Cells share identical CDR3 sequence
•M_ prefix: Cells have similar but different CDR3 sequences
•Use CDR3_Cluster as grouping factor in Seurat plots
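
The prefixes make it easy to summarize cluster composition once the metadata has been exported. The sketch below works on a plain list of CDR3_Cluster values; the function `cluster_kind`, the example labels, and the handling of missing values (cells without a cluster) are assumptions for illustration.

```python
from collections import Counter


def cluster_kind(label) -> str:
    """Classify a CDR3_Cluster label by its prefix."""
    if not label:
        return "unclustered"  # assumption: cells without VDJ data have no label
    if label.startswith("S_"):
        return "single"       # one unique CDR3 sequence
    if label.startswith("M_"):
        return "multiple"     # several similar CDR3 sequences
    return "unknown"


# Hypothetical per-cell labels, for illustration only
labels = ["S_1", "S_1", "M_1", "M_1", "M_2", None]
counts = Counter(cluster_kind(label) for label in labels)
print(counts["single"], counts["multiple"], counts["unclustered"])
```

Counting cells per kind gives a quick sense of how much of the repertoire falls into convergent (M_) clusters.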

Performance Tips:

•Small (<10K): GIANA defaults (quality over speed)
•Medium (10K-100K): ClusTCR two-step with n_cpus=4
•Large (100K-1M): ClusTCR two-step with n_cpus=8+ or GIANA
•Very large (>1M): GIANA, or ClusTCR two-step with increased faiss_cluster_size
