CellTypeAnnotation Process Configuration
Purpose
Annotates cell clusters with biological cell type labels using multiple methods: direct assignment, ScType, scCATCH, hitype, or CellTypist. This process is essential for interpreting clustering results by assigning meaningful biological identities to each cluster.
When to Use
- •After clustering: When you have cluster assignments but need biological cell type labels
- •Automated annotation: When manual annotation is too time-consuming or subjective
- •Consistent nomenclature: When you need standardized cell type names across multiple samples
- •Reference-based annotation: When you have well-characterized reference datasets or marker databases
- •Cross-sample comparison: When analyzing multiple samples with the same cell type definitions
- •Alternative to SeuratMap2Ref: When you prefer database-based annotation over reference dataset mapping
Configuration Structure
Process Enablement
[CellTypeAnnotation]
cache = true  # Cache results for faster re-runs
Input Specification
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]  # Path or reference to Seurat object
Environment Variables
Core Parameters
[CellTypeAnnotation.envs]
# Annotation method selection
tool = "direct"  # Options: "direct", "sctype", "hitype", "sccatch", "celltypist"
# Cluster identity column (required for h5ad input, optional for Seurat objects)
ident = "seurat_clusters"  # Column name in metadata representing clusters
# Backup column name (stores original cluster labels)
backup_col = "seurat_clusters_id"  # Default: "seurat_clusters_id"
# New column name for annotated cell types
# If specified, original identity is kept; otherwise, it's replaced
newcol = ""  # Default: empty (overwrite identity)
# Merge clusters with same predicted cell types
merge = false  # Default: false; suffixes (.1, .2) added for duplicate labels
# Output file type
outtype = "input"  # Options: "input", "rds", "qs", "qs2", "h5ad"
Direct Annotation Parameters
[CellTypeAnnotation.envs]
tool = "direct"
# Cell type assignments (one per cluster, in order)
# Use "-" or "" to keep original cluster name
# Use "NA" to remove cluster from downstream analysis (only without newcol)
cell_types = ["CD4+ T cells", "CD8+ T cells", "-", "B cells"]  # Default: []
# Additional annotations (multiple cell type columns): one key per new column
[CellTypeAnnotation.envs.more_cell_types]
cell_type_broad = ["T cells", "T cells", "NK cells", "B cells"]
cell_type_detailed = ["CD4+ naive", "CD8+ effector", "NK", "B naive"]
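The placeholder rules for cell_types described in the comments above can be sketched in a few lines of Python. This is an illustration of the documented behavior only, not the pipeline's implementation: the function name resolve_direct_labels is hypothetical, and representing a removed cluster as None is an assumption made for the example.

```python
def resolve_direct_labels(clusters, cell_types):
    """Map each cluster name to its final label under the direct-annotation rules.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    if len(cell_types) > len(clusters):
        raise ValueError(
            f"{len(cell_types)} cell types given for {len(clusters)} clusters"
        )

    mapping = {}
    for index, cluster in enumerate(clusters):
        # Clusters beyond the end of a shorter list keep their original name.
        label = cell_types[index] if index < len(cell_types) else "-"
        if label in ("-", ""):
            mapping[cluster] = cluster  # placeholder: keep the original name
        elif label == "NA":
            mapping[cluster] = None  # assumption: None marks a removed cluster
        else:
            mapping[cluster] = label
    return mapping


# Five clusters, four entries: cluster "4" falls back to its original name.
example = resolve_direct_labels(
    ["0", "1", "2", "3", "4"],
    ["CD4+ T cells", "-", "NA", "B cells"],
)
```

Here cluster "1" keeps its name because of the "-" placeholder, cluster "2" is marked for removal, and a list longer than the number of clusters is rejected.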
ScType Annotation Parameters
[CellTypeAnnotation.envs]
tool = "sctype"
# Tissue type (must match tissueType column in database)
sctype_tissue = "Immune system"  # Required for sctype
# Database file path (Excel format compatible with ScType)
sctype_db = "/path/to/ScTypeDB_full.xlsx"  # Optional: uses default if not specified
hitype Annotation Parameters
[CellTypeAnnotation.envs]
tool = "hitype"
# Tissue type (must match tissueType column in database)
hitype_tissue = "Immune system"  # Required for hitype
# Database file path or built-in database name
# Built-in options: "hitypedb_short", "hitypedb_full", "hitypedb_pbmc3k"
hitype_db = "hitypedb_full"  # Default: built-in database
scCATCH Annotation Parameters
[CellTypeAnnotation.envs]
tool = "sccatch"
[CellTypeAnnotation.envs.sccatch_args]
# Species (Human or Mouse)
species = "Human"  # Required
# Tissue origin
tissue = "Blood"  # Required
# Cancer type (if cancer tissue)
cancer = "Normal"  # Default: "Normal"
# Custom marker genes (RDS file or list)
marker = ""  # Optional
# Use custom marker instead of database
if_use_custom_marker = false  # Default: false
# Additional scCATCH::findmarkergene() arguments
# See: https://rdrr.io/cran/scCATCH/man/findmarkergene.html
CellTypist Annotation Parameters
[CellTypeAnnotation.envs]
tool = "celltypist"
[CellTypeAnnotation.envs.celltypist_args]
# Model file path (download from https://celltypist.cog.sanger.ac.uk/models/models.json)
model = "Immune_All_Low.pkl"  # Required
# Python interpreter where celltypist is installed
python = "python"  # Default: "python"
# Majority voting refinement for local subclusters
majority_voting = true  # Default: true
# Over-clustering column (for majority voting)
# Set to false to disable over-clustering
over_clustering = "seurat_clusters"  # Auto: identity for Seurat, false for h5ad
# Assay for Seurat-to-AnnData conversion
assay = ""  # Auto: RNA for h5seurat, default assay for Seurat
Annotation Methods
1. Direct Annotation
Assigns cell types manually to each cluster. Best when you have well-defined marker genes or want complete control over annotations.
Pros:
- •Full control over annotations
- •Fast and deterministic
- •Works with any clustering result
Cons:
- •Requires domain knowledge
- •Time-consuming for many clusters
- •Subjective
Use cases:
- •Small number of well-separated clusters
- •Known marker genes
- •Reproducible annotation needed
2. ScType
Uses pre-defined cell type markers from ScType database. Annotates based on enrichment of known marker genes in each cluster.
Databases:
- •ScTypeDB_short.xlsx: Compact database (~70 cell types)
- •ScTypeDB_full.xlsx: Full database (~200+ cell types)
- •Custom database: Provide your own Excel file
Pros:
- •Automated annotation
- •Tissue-specific filtering available
- •Well-curated marker database
Cons:
- •Limited to predefined cell types
- •Requires tissue specification
- •May miss rare cell types
Reference: https://github.com/IanevskiAleksandr/sc-type
Use cases:
- •Immune tissue datasets
- •When tissue type is well-defined
- •Need for comprehensive annotation
3. hitype
Flexible annotation tool compatible with ScType database format. Supports both file-based and built-in databases.
Built-in databases:
- •hitypedb_short: Compact marker set
- •hitypedb_full: Comprehensive marker set
- •hitypedb_pbmc3k: PBMC-specific markers (from 10X PBMC3k dataset)
Pros:
- •Faster than ScType (Python-based)
- •Multiple built-in databases
- •Tissue-specific filtering
Cons:
- •Limited to database cell types
- •Requires tissue specification
Reference: https://github.com/pwwang/hitype
Use cases:
- •PBMC datasets (use hitypedb_pbmc3k)
- •General immune annotation
- •When speed matters
4. scCATCH
Identifies cell types by matching cluster marker genes to cell type-specific marker database.
Workflow:
- •Finds marker genes for each cluster
- •Matches markers to cell type database
- •Assigns best matching cell type
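The matching step of this workflow can be illustrated with a deliberately simplified sketch. It scores each cell type by the fraction of its database markers found among a cluster's markers; this is not scCATCH's actual scoring scheme, and the function name and the toy marker database are assumptions made for the example.

```python
def best_matching_cell_type(cluster_markers, marker_db):
    """Return (cell_type, score) for the largest marker overlap, or (None, 0.0).

    Simplified overlap score for illustration; not scCATCH's own algorithm.
    """
    cluster_set = set(cluster_markers)
    best_type, best_score = None, 0.0
    for cell_type, genes in marker_db.items():
        gene_set = set(genes)
        if not gene_set:
            continue
        # Fraction of this cell type's markers that the cluster expresses.
        score = len(cluster_set & gene_set) / len(gene_set)
        if score > best_score:
            best_type, best_score = cell_type, score
    return best_type, best_score


# Toy database with a handful of well-known markers (not a real scCATCH table).
toy_db = {
    "T cells": ["CD3D", "CD3E", "CD2"],
    "B cells": ["MS4A1", "CD79A", "CD79B"],
}
match = best_matching_cell_type(["CD3D", "CD3E", "IL7R"], toy_db)
```

A cluster whose markers match nothing in the database returns (None, 0.0), which is one way a limited database shows up as an unannotated cluster.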
Parameters:
- •species: Human or Mouse
- •tissue: Tissue origin (required)
- •cancer: Cancer type (if applicable)
Pros:
- •Automated marker identification
- •Species-specific databases
- •Cancer type support
Cons:
- •Requires tissue specification
- •Slower (finds markers first)
- •Limited database
Reference: https://github.com/ZJUFanLab/scCATCH
Use cases:
- •When you want marker discovery + annotation
- •Cancer tissue datasets
- •Species-specific annotation
5. CellTypist
Machine learning-based annotation using pre-trained models. Requires Python environment and celltypist2 package.
Models:
- •Download from: https://celltypist.cog.sanger.ac.uk/models/models.json
- •Common models: Immune_All_Low.pkl, Immune_All_High.pkl, Tissue-specific models
Key features:
- •majority_voting: Refines annotations within local subclusters
- •over_clustering: Over-cluster first, then merge by majority vote
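The idea behind majority voting can be sketched as follows: each cell receives the most common predicted label within its over-clustering group. This is a minimal illustration under that assumption, with a hypothetical function name; CellTypist's own procedure may apply additional refinements that are not modeled here.

```python
from collections import Counter, defaultdict


def majority_vote(cell_labels, cell_groups):
    """Replace each cell's label with the most common label in its group.

    cell_labels: per-cell predicted labels.
    cell_groups: per-cell over-clustering group, same length and order.
    """
    if len(cell_labels) != len(cell_groups):
        raise ValueError("cell_labels and cell_groups must have the same length")

    # Count how often each label occurs inside each group.
    votes = defaultdict(Counter)
    for label, group in zip(cell_labels, cell_groups):
        votes[group][label] += 1

    winners = {group: counts.most_common(1)[0][0] for group, counts in votes.items()}
    return [winners[group] for group in cell_groups]


labels = ["T", "T", "B", "B", "B", "T"]
groups = ["0", "0", "0", "1", "1", "1"]
voted = majority_vote(labels, groups)
```

In this toy input the single "B" call in group "0" and the single "T" call in group "1" are overruled by their neighbors, which is the smoothing effect the option is meant to provide.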
Pros:
- •State-of-the-art ML models
- •Handles complex datasets well
- •Majority voting improves accuracy
Cons:
- •Requires Python environment
- •Model files need download
- •Longer runtime with majority voting
Reference: https://celltypist.org/
Use cases:
- •Large complex datasets
- •When ScType/hitype annotation is insufficient
- •High-throughput annotation
Configuration Examples
Example 1: Minimal Configuration (No Annotation)
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
Result: Tool defaults to "direct" with empty cell_types. Original cluster names are preserved.
Example 2: Direct Annotation for T Cell Subsets
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ naive", "CD4+ memory", "CD8+ naive", "CD8+ effector", "-", "Regulatory T"]
Result: Clusters 0-3 and 5 get specified labels. Cluster 4 keeps original name (placeholder "-").
Example 3: ScType for Immune Tissue
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
sctype_db = "/data/databases/ScTypeDB_full.xlsx"
merge = true  # Merge clusters with same annotation
Result: Uses full ScType database for immune tissue. Merges clusters with identical annotations.
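The effect of the merge option on cluster labels can be sketched as below. The exact suffixing scheme is an assumption: this example keeps the first occurrence of a label bare and appends .1, .2 to later duplicates, in the manner of R's make.unique. The function name is hypothetical and the pipeline may number duplicates differently.

```python
from collections import Counter


def label_clusters(predicted, merge=False):
    """Turn per-cluster predictions into final cluster labels.

    predicted: dict of cluster -> predicted cell type, in cluster order.
    merge=True:  clusters with the same prediction share one label.
    merge=False: duplicates get .1, .2 suffixes (assumed scheme) so that
                 every cluster stays distinguishable.
    """
    if merge:
        return dict(predicted)

    seen = Counter()
    labels = {}
    for cluster, cell_type in predicted.items():
        count = seen[cell_type]
        labels[cluster] = cell_type if count == 0 else f"{cell_type}.{count}"
        seen[cell_type] += 1
    return labels


predicted = {"0": "T cells", "1": "B cells", "2": "T cells", "3": "T cells"}
kept_apart = label_clusters(predicted, merge=False)
merged = label_clusters(predicted, merge=True)
```

With merge = true the three T cell clusters collapse into a single group, which is convenient for downstream statistics but discards the cluster-level distinction.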
Example 4: hitype with Built-in PBMC Database
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k"  # Built-in PBMC database
merge = true
Result: Fast PBMC annotation using built-in database optimized for 10X PBMC data.
Example 5: scCATCH for Cancer Tissue
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sccatch"
[CellTypeAnnotation.envs.sccatch_args]
species = "Human"
tissue = "Lung"
cancer = "Lung adenocarcinoma"
Result: Annotates lung adenocarcinoma dataset with cancer-specific cell types.
Example 6: CellTypist with Majority Voting
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "celltypist"
[CellTypeAnnotation.envs.celltypist_args]
model = "/data/models/Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters"  # Use clusters for majority voting
python = "/usr/bin/python3"  # Specify Python interpreter
Result: Uses ML model with majority voting refinement for robust annotation.
Example 7: Multiple Annotation Methods (Keep Original)
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype"  # Create new column, keep original
Result: Annotated cell types saved in celltype_sctype column. Original seurat_clusters unchanged.
Example 8: Multiple Annotation Columns
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NK", "B", "Monocyte"]
# Additional annotations: one key per new metadata column
[CellTypeAnnotation.envs.more_cell_types]
celltype_broad = ["T cells", "T cells", "NK cells", "B cells", "Monocytes"]
celltype_subset = ["CD4+ naive", "CD8+ effector", "NK", "B naive", "CD14+ Mono"]
Result: Produces three sets of labels. The cell_types labels replace the cluster identity (newcol is not set), and two additional metadata columns are created: celltype_broad and celltype_subset.
Example 9: Exclude Clusters with NA
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["CD4+ T", "CD8+ T", "NA", "B cells"]
Result: Cluster 2 is removed from downstream analysis (NA excludes cluster). Note: Only works without newcol.
Example 10: H5AD Input with CellTypist
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["seurat_clustering.h5ad"]  # H5AD file
[CellTypeAnnotation.envs]
tool = "celltypist"
ident = "clusters"  # Required for H5AD: cluster column name
[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true
Result: Annotates H5AD file. ident specifies which metadata column contains clusters.
Common Patterns
Pattern 1: Standard T Cell Annotation Workflow
# Step 1: Cluster T cells
[SeuratClusteringOfAllCells]
[TOrBCellSelection]
[SeuratClustering]  # Clustering on T cells only
# Step 2: Annotate T cell subsets
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "direct"
cell_types = ["Naive CD4+", "Memory CD4+", "Effector CD8+", "Tregs", "Progenitor"]
Pattern 2: Automated Immune Annotation with Backup
# Use hitype for annotation, keep original clusters
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "hitype"
hitype_tissue = "Blood"
hitype_db = "hitypedb_pbmc3k"
newcol = "celltype_hitype"  # Keep original seurat_clusters
merge = true
Pattern 3: Combine Multiple Annotation Methods
# First annotation: ScType
[CellTypeAnnotation]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Immune system"
newcol = "celltype_sctype"
# Second annotation: CellTypist for comparison
[CellTypeAnnotation2]
# Note: Must define separate process for second annotation
# See immunopipe-config.md for multi-process setup
Pattern 4: Refine Annotation with CellTypist
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "celltypist"
[CellTypeAnnotation.envs.celltypist_args]
model = "Immune_All_Low.pkl"
majority_voting = true
over_clustering = "seurat_clusters"  # Use clustering result
python = "python"
Pattern 5: Tissue-Specific ScType Annotation
[CellTypeAnnotation]
[CellTypeAnnotation.in]
sobjfile = ["SeuratClustering"]
[CellTypeAnnotation.envs]
tool = "sctype"
sctype_tissue = "Brain"  # Brain-specific annotation
sctype_db = "/data/brain_markers.xlsx"  # Custom brain marker database
merge = true
Dependencies
Upstream Processes
- •Required: SeuratClustering (or a process that produces a Seurat object with clusters)
- •Optional: SeuratClusteringOfAllCells (if using T/B cell selection)
- •Optional: SeuratMap2Ref (can combine multiple annotation methods)
- •Optional: TOrBCellSelection (T/B-specific annotation)
Downstream Processes
- •SeuratClusterStats: Uses annotated cell types for visualization
- •ClusterMarkers: Finds markers for each cell type
- •TopExpressingGenes: Top genes per cell type
- •MarkersFinder: Flexible marker finding by cell type
- •CellCellCommunication: Uses cell types for ligand-receptor analysis
- •ScFGSEA: GSEA by cell type
- •PseudoBulkDEG: DE analysis by cell type
- •ScrnaMetabolicLandscape: Metabolic analysis by cell type
- •ScRepCombiningExpression: Integrates with TCR/BCR data
External Dependencies
- •ScType: Requires sctype R package
- •hitype: Requires hitype Python package
- •scCATCH: Requires scCATCH R package
- •CellTypist: Requires celltypist2 Python package and Python interpreter
Validation Rules
Tool-Specific Validation
- •ScType:
  - •sctype_tissue must be specified (or empty string to use all tissues)
  - •sctype_db must be a valid Excel file path (or empty for default)
  - •Database must follow the ScType format, with tissueType, cellName, geneSymbolmore1, geneSymbolmore2, and shortName columns
- •hitype:
  - •hitype_tissue must be specified (or empty string to use all tissues)
  - •hitype_db must be valid file path or built-in name
  - •Built-in names: hitypedb_short, hitypedb_full, hitypedb_pbmc3k
- •scCATCH:
  - •species must be "Human" or "Mouse"
  - •tissue must be specified
  - •At least 2 clusters required (scCATCH limitation)
- •CellTypist:
  - •model must be a valid .pkl file path
  - •python must be valid Python interpreter path
  - •CellTypist must be installed in specified Python environment
- •Direct:
  - •cell_types list length should match number of clusters (shorter OK, longer not)
  - •Placeholders "-" or "" keep original names
  - •"NA" removes cluster (only without newcol)
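Some of these rules can be checked before a run with a small pre-flight script. The sketch below is hypothetical and covers only the rules that need no file access; the function name check_envs and its error messages are assumptions, and the pipeline performs its own validation independently of this.

```python
VALID_TOOLS = ("direct", "sctype", "hitype", "sccatch", "celltypist")


def check_envs(envs, n_clusters=None):
    """Return a list of problems found in a [CellTypeAnnotation.envs] dict.

    Hypothetical pre-flight check; an empty list means nothing was flagged.
    """
    errors = []
    tool = envs.get("tool", "direct")
    if tool not in VALID_TOOLS:  # comparison is case-sensitive
        return [f"unknown tool {tool!r}; expected one of {', '.join(VALID_TOOLS)}"]

    if tool == "sccatch":
        args = envs.get("sccatch_args", {})
        if args.get("species") not in ("Human", "Mouse"):
            errors.append("sccatch_args.species must be 'Human' or 'Mouse'")
        if not args.get("tissue"):
            errors.append("sccatch_args.tissue must be specified")
        if n_clusters is not None and n_clusters < 2:
            errors.append("scCATCH requires at least 2 clusters")
    elif tool == "celltypist":
        model = str(envs.get("celltypist_args", {}).get("model", ""))
        if not model.endswith(".pkl"):
            errors.append("celltypist_args.model must point to a .pkl file")
    elif tool == "direct":
        cell_types = envs.get("cell_types", [])
        if n_clusters is not None and len(cell_types) > n_clusters:
            errors.append("cell_types has more entries than there are clusters")
    return errors


problems = check_envs(
    {"tool": "sccatch", "sccatch_args": {"species": "human"}}, n_clusters=1
)
```

For the input shown, all three scCATCH rules are reported at once (wrong species spelling, missing tissue, too few clusters), which is usually more convenient than discovering them one run at a time.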
Input Validation
- •Seurat object must have valid identity/clustering column
- •H5AD input requires ident parameter (cluster column name)
- •Output directory must be writable
Output Validation
- •cluster2celltype.tsv generated for ScType/hitype/scCATCH/CellTypist
- •Output file format matches outtype specification
- •Metadata contains annotated cell types
Troubleshooting
Common Issues and Solutions
Issue: "No tissues found in database" (ScType/hitype)
Cause: sctype_tissue or hitype_tissue doesn't match tissueType column in database.
Solutions:
- •Check available tissues: Open database Excel file, read tissueType column
- •Use exact match (case-sensitive)
- •Set tissue to empty string "" to use all rows in database
- •Verify database file path is correct
Issue: "Not enough clusters for scCATCH"
Cause: scCATCH requires at least 2 clusters.
Solutions:
- •Ensure clustering result has ≥2 clusters
- •Increase clustering resolution in SeuratClustering
- •Use alternative tool (ScType, hitype, CellTypist)
Issue: CellTypist Python not found
Cause: CellTypist requires Python environment with celltypist2 installed.
Solutions:
- •Specify correct Python path: celltypist_args.python = "/usr/bin/python3"
- •Install celltypist2: pip install celltypist2
- •Verify Python environment: python -c "import celltypist; print(celltypist.__version__)"
Issue: CellTypist model file not found
Cause: Model path is incorrect or model not downloaded.
Solutions:
- •Download model from: https://celltypist.cog.sanger.ac.uk/models/models.json
- •Use absolute path for celltypist_args.model
- •Verify model file exists and is readable
Issue: "Unknown tool" error
Cause: Invalid tool value specified.
Solutions:
- •Check valid options: direct, sctype, hitype, sccatch, celltypist
- •Verify spelling is correct (case-sensitive)
- •Check tool is installed in environment
Issue: Annotations overwritten by multiple annotation processes
Cause: Multiple annotation processes write to same metadata column.
Solutions:
- •Use newcol parameter to create separate columns:
  [CellTypeAnnotation.envs]
  newcol = "celltype_method1"
- •Or use backup_col to preserve original:
  backup_col = "original_clusters_id"
Issue: Ambiguous cell type assignments
Cause: Clusters have similar marker expression patterns.
Solutions:
- •Increase clustering resolution for finer separation
- •Use merge = false to keep cluster-specific labels
- •Compare multiple annotation methods for consensus
- •Manually inspect the top marker genes of each ambiguous cluster
Issue: Missing cell types in results
Cause: Clusters removed by "NA" placeholder or filtering.
Solutions:
- •Check cell_types list for "NA" entries
- •Verify newcol is not set (NA removal only works without newcol)
- •Check downstream processes for filtering
Issue: H5AD input annotation fails
Cause: ident parameter not specified for H5AD files.
Solutions:
- •Specify cluster column: ident = "clusters" (or your cluster column name)
- •Check H5AD metadata for cluster column name
- •Or convert H5AD to RDS format first
Issue: Wrong number of cell types assigned
Cause: cell_types list length doesn't match cluster count.
Solutions:
- •Check number of clusters in Seurat object
- •Ensure cell_types list has correct number of entries
- •Use placeholders "-" or "" for clusters to keep original names
- •Shorter lists OK (extra clusters keep original names)
Verification Steps
After annotation, verify:
- •Check output file:
  # View cluster to cell type mapping
  cat .pipen/Immunopipe/CellTypeAnnotation/0/output/cluster2celltype.tsv
- •Check Seurat object metadata:
  library(Seurat)
  obj <- readRDS(".pipen/Immunopipe/CellTypeAnnotation/0/output/annotated.rds")
  head(obj@meta.data)  # Look for cell type column (seurat_clusters or newcol name)
- •Validate annotation quality:
  # Check distribution of cell types
  table(Idents(obj))
  # Visualize UMAP with cell types
  DimPlot(obj, group.by = "celltype_hitype", label = TRUE, repel = TRUE)
- •Compare multiple methods:
  # Compare ScType vs hitype annotations
  table(obj$celltype_sctype, obj$celltype_hitype)
Best Practices
Method Selection
- •Start with hitype: Fast, good for PBMC/immune datasets
- •Compare with ScType: Alternative database-based method
- •Use CellTypist for complex datasets: ML-based and handles heterogeneous data well
- •Manual refinement: Use direct annotation for corrections
Multi-Method Workflow
- •Run multiple annotation methods in parallel
- •Compare results for consensus
- •Manually refine discrepancies using direct annotation
- •Keep original cluster names for traceability
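The comparison step of this workflow can be sketched as a small helper that separates clusters on which all methods agree from those needing manual review. The function name and the input shape (one cluster-to-label dict per method) are assumptions for illustration; they do not correspond to any pipeline output format.

```python
def split_consensus(annotations):
    """Split clusters into agreed labels and discrepancies.

    annotations: dict of method name -> dict of cluster -> label.
    Returns (agreed, discrepancies), where agreed maps cluster -> label and
    discrepancies maps cluster -> {method: label or None if missing}.
    """
    clusters = sorted({c for mapping in annotations.values() for c in mapping})
    agreed, discrepancies = {}, {}
    for cluster in clusters:
        labels = {
            method: mapping.get(cluster) for method, mapping in annotations.items()
        }
        distinct = set(labels.values())
        if len(distinct) == 1 and None not in distinct:
            agreed[cluster] = distinct.pop()
        else:
            discrepancies[cluster] = labels  # candidates for direct annotation
    return agreed, discrepancies


agreed, discrepancies = split_consensus(
    {
        "sctype": {"0": "T cells", "1": "B cells", "2": "NK cells"},
        "hitype": {"0": "T cells", "1": "B cells", "2": "T cells"},
    }
)
```

Clusters that end up in discrepancies are the ones worth inspecting by hand and then fixing with tool = "direct". Note that the comparison is a plain string match, so differently spelled names for the same cell type count as disagreement.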
Tissue-Specific Annotation
- •Always specify tissue when using ScType/hitype
- •Use custom databases for non-standard tissues
- •Verify database contains relevant cell types
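Because tissue names must match the tissueType column exactly, it can help to check a tissue name against the database rows before running. The sketch below assumes the database has already been loaded into a list of dicts (loading the Excel file is left out), and both the function name and the label key in the toy rows are hypothetical.

```python
def rows_for_tissue(rows, tissue):
    """Select database rows for a tissue using exact, case-sensitive matching.

    rows: list of dicts, each with a 'tissueType' key.
    tissue: tissue name, or "" to use every row.
    """
    if tissue == "":
        return list(rows)

    selected = [row for row in rows if row["tissueType"] == tissue]
    if not selected:
        available = sorted({row["tissueType"] for row in rows})
        raise ValueError(f"No rows for tissue {tissue!r}; available: {available}")
    return selected


# Toy rows; the 'label' key is a placeholder, not a real database column.
toy_rows = [
    {"tissueType": "Immune system", "label": "Naive B cells"},
    {"tissueType": "Immune system", "label": "Memory CD4+ T cells"},
    {"tissueType": "Brain", "label": "Astrocytes"},
]
immune_rows = rows_for_tissue(toy_rows, "Immune system")
```

Passing "immune system" (lowercase) raises an error that lists the tissue names actually present, which is the quickest way to spot a spelling or capitalization mismatch.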
Reproducibility
- •Save cluster-to-celltype mapping (cluster2celltype.tsv)
- •Document which tool/database was used
- •Keep original cluster names using newcol or backup_col
External References
Tool Documentation
- •ScType: https://github.com/IanevskiAleksandr/sc-type
- •hitype: https://github.com/pwwang/hitype
- •scCATCH: https://github.com/ZJUFanLab/scCATCH
- •CellTypist: https://celltypist.org/
Database Downloads
- •ScType databases: available in the ScType repository (https://github.com/IanevskiAleksandr/sc-type)
- •CellTypist models: https://celltypist.cog.sanger.ac.uk/models/models.json
Related Processes
- •SeuratClustering: Clustering before annotation
- •SeuratMap2Ref: Reference-based annotation (alternative)
- •ClusterMarkers: Find markers for each cell type
- •SeuratClusterStats: Visualize annotated clusters