CDR3Clustering Process Configuration
Purpose
Cluster TCR/BCR clones by CDR3 sequences using GIANA or ClusTCR (both Faiss-based). Adds CDR3_Cluster column to metadata for clonotype analysis.
When to Use
- •To identify groups of similar TCR/BCR clonotypes
- •For analyzing TCR sequence convergence
- •After ScRepCombiningExpression when TCR/BCR integrated with RNA
- •For investigating public clonotypes across samples
- •Before TESSA analysis for epitope specificity
Important: Only runs when VDJ input present (TCRData/BCRData columns in SampleInfo).
Configuration Structure
Process Enablement
[CDR3Clustering] cache = true
Input Specification
[CDR3Clustering.in] screpfile = "path/to/combined_object.qs"
Environment Variables
[CDR3Clustering.envs]
type = "auto" # TCR, BCR, or auto
tool = "GIANA" # GIANA or ClusTCR
python = "python" # Path to python
within_sample = true # Cluster per sample
args = {} # Tool-specific arguments
chain = "both" # TRA, TRB, IGH, IGL, IGK, both, heavy, light
GIANA Arguments (via args)
[CDR3Clustering.envs.args] method = "hierarchical" # hierarchical, kmeans dist = "hamming" # hamming, levenshtein threshold = 0.15 # Distance threshold
ClusTCR Arguments (via args)
[CDR3Clustering.envs.args] method = "two-step" # mcl, faiss, two-step n_cpus = 4 # CPUs for MCL faiss_cluster_size = 5000 # Supercluster size mcl_params = [1.2, 2] # [inflation, expansion]
Configuration Examples
Minimal Configuration
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs"
GIANA with Custom Distance Threshold
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] tool = "GIANA" [CDR3Clustering.envs.args] method = "hierarchical" dist = "hamming" threshold = 0.15
ClusTCR Two-Step (Large Datasets)
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] tool = "ClusTCR" [CDR3Clustering.envs.args] method = "two-step" faiss_cluster_size = 5000 n_cpus = 8
ClusTCR MCL (Small Datasets)
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] tool = "ClusTCR" [CDR3Clustering.envs.args] method = "mcl" n_cpus = 4
TRB Chain Only
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] chain = "TRB"
Cross-Sample Clustering
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] within_sample = false
Common Patterns
Pattern 1: Standard TCR Beta Chain
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] type = "TCR" tool = "GIANA" chain = "TRB"
Pattern 2: Large Dataset (>100K sequences)
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] tool = "ClusTCR" [CDR3Clustering.envs.args] method = "two-step" faiss_cluster_size = 5000 n_cpus = 8
Pattern 3: Custom Threshold
[CDR3Clustering] [CDR3Clustering.in] screpfile = "intermediate/screpcombiningexpression/combined.qs" [CDR3Clustering.envs] tool = "GIANA" [CDR3Clustering.envs.args] threshold = 0.15 # Higher=fewer clusters, Lower=more clusters
Dependencies
Upstream
- •ScRepCombiningExpression (required): Combined scRepertoire object with TCR/BCR data
Downstream
- •TESSA: TCR epitope specificity prediction
- •ClonalStats: Clonality statistics (uses
CDR3_Clustermetadata)
Validation Rules
- •Tool must be
"GIANA"or"ClusTCR" - •Chain must be valid for data type (TCR: TRA/TRB, BCR: IGH/IGL/IGK)
- •GIANA requires: biopython, faiss, scikit-learn
- •ClusTCR requires: clustcr package
Computational Considerations
- •<50K sequences: ClusTCR
method = "mcl"(highest quality) - •50K-500K sequences: ClusTCR
method = "two-step"(balanced) - •
500K sequences: GIANA or ClusTCR
method = "two-step"(fastest) - •Memory: GIANA ~2-4 GB/100K, ClusTCR ~4-8 GB/100K
- •Runtime: GIANA 1-5 min/100K, ClusTCR two-step 2-10 min/100K
Troubleshooting
Process not running
Cause: No VDJ data available Solution: Verify ScRepCombiningExpression output contains TCR/BCR data
ModuleNotFoundError
Cause: Missing dependencies Solution:
- •GIANA:
pip install biopython faiss-cpu scikit-learn - •ClusTCR:
conda install -c conda-forge clustcr
Too many/few clusters
Cause: Threshold inappropriate Solution: Adjust threshold (higher = fewer clusters, lower = more clusters)
Out of memory
Cause: Dataset too large for RAM
Solution: Use within_sample = true, reduce n_cpus, or use GIANA
Slow clustering
Cause: Suboptimal method for dataset size Solution:
- •
50K: ClusTCR
method = "two-step"with increased n_cpus - •Very large (>500K): Use GIANA
Notes on Output Format
Metadata column: CDR3_Cluster
Cluster naming:
- •
S_1,S_2: Single unique CDR3 sequence (may have multiple cells) - •
M_1,M_2: Multiple unique CDR3 sequences (similar but different)
Interpretation:
- •
S_prefix: Cells share identical CDR3 sequence - •
M_prefix: Cells have similar but different CDR3 sequences - •Use
CDR3_Clusteras grouping factor in Seurat plots
Performance Tips:
- •Small (<10K): GIANA defaults (quality over speed)
- •Medium (10K-100K): ClusTCR two-step with n_cpus=4
- •Large (100K-1M): ClusTCR two-step with n_cpus=8+ or GIANA
- •Very large (>1M): GIANA with increased faiss_cluster_size