Scarches-Complete Skill
Comprehensive assistance with scArches (single-cell architecture surgery) development, generated from official documentation. scArches enables integration of newly produced single-cell datasets into integrated reference atlases through decentralized training and model surgery.
When to Use This Skill
This skill should be triggered when:
- •Building reference atlases using scVI, trVAE, scANVI, totalVI, or expiMap models
- •Mapping query datasets to existing reference atlases for cell type annotation
- •Performing cell type label transfer from reference to query datasets
- •Integrating multi-modal data (CITE-seq, scRNA-seq + ATAC, TCR + transcriptome)
- •Analyzing spatial transcriptomics data with SageNet
- •Working with gene programs and pathway analysis using expiMap
- •Training deep generative models for single-cell data integration
- •Debugging scArches models or optimization issues
- •Learning best practices for single-cell reference mapping
Quick Reference
Essential Code Patterns
Import and Setup
import warnings
warnings.simplefilter(action='ignore')

import scanpy as sc
import torch
import scarches as sca
import numpy as np
import gdown
Reference Model Training (expiMap)
# Prepare data with gene annotations
sca.utils.add_annotations(adata, 'reactome.gmt', min_genes=12, clean=True)
adata._inplace_subset_var(adata.varm['I'].sum(1)>0)
# Initialize and train model
intr_cvae = sca.models.EXPIMAP(
    adata=adata,
    condition_key='study',
    hidden_layer_sizes=[256, 256, 256],
    recon_loss='nb'
)
# Train with early stopping
early_stopping_kwargs = {
    "early_stopping_metric": "val_unweighted_loss",
    "threshold": 0,
    "patience": 50,
    "reduce_lr": True,
    "lr_patience": 13,
    "lr_factor": 0.1,
}
intr_cvae.train(
    n_epochs=400,
    alpha_epoch_anneal=100,
    alpha=0.7,
    alpha_kl=0.5,
    early_stopping_kwargs=early_stopping_kwargs,
    use_early_stopping=True
)
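How the early-stopping settings interact can be easier to see in isolation. The sketch below is a generic, library-independent illustration, not scArches' internal implementation. The function name `simulate_early_stopping` and its exact rules are assumptions made for the example: an epoch counts as an improvement when the validation loss drops by more than `threshold`, the learning rate is multiplied by `lr_factor` after every `lr_patience` epochs without improvement, and training stops after `patience` such epochs.

```python
def simulate_early_stopping(val_losses, patience, threshold=0.0,
                            lr=1e-3, lr_patience=None, lr_factor=0.1):
    """Replay a list of validation losses and report when training would stop.

    Returns (stop_epoch, final_lr). stop_epoch is the 0-based index of the
    epoch at which training stops, or None if patience is never exhausted.
    """
    best = float("inf")
    epochs_since_best = 0
    for epoch, loss in enumerate(val_losses):
        if best - loss > threshold:
            # Improvement: remember the new best loss and reset the counter
            best = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            # Shrink the learning rate every lr_patience stale epochs
            if lr_patience and epochs_since_best % lr_patience == 0:
                lr *= lr_factor
            if epochs_since_best >= patience:
                return epoch, lr
    return None, lr


# Made-up validation losses, purely for illustration
losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.7, 0.72, 0.69, 0.7, 0.7, 0.7, 0.7]
stop_epoch, final_lr = simulate_early_stopping(
    losses, patience=4, threshold=0.0, lr=1e-3, lr_patience=2, lr_factor=0.1
)
print(stop_epoch, final_lr)
```

With this trace the best loss is reached at epoch index 2, the learning rate is reduced twice, and training stops at epoch index 6. A larger `patience` (such as the 50 used above) tolerates longer plateaus before stopping.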
Query Dataset Mapping
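# Adapt the pretrained reference model to the query data (model surgery)
model = sca.models.SCANVI.load_query_data(adata_query, reference_model)
# Fine-tune on query data; SCANVI follows the scvi-tools training API,
# so epochs are set with max_epochs and the learning rate via plan_kwargs
model.train(
    max_epochs=100,
    train_size=1.0,
    plan_kwargs=dict(lr=1e-4)
)
# Get latent representation
latent = model.get_latent_representation()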
Cell Type Label Transfer
# Train weighted KNN classifier
knn_model = sca.utils.weighted_knn_trainer(
    train_adata,
    train_adata_emb='X_emb',
    n_neighbors=50
)
# Transfer labels to query; returns predicted labels and their uncertainties
labels, uncert = sca.utils.weighted_knn_transfer(
    query_adata,
    query_adata_emb='X_emb',
    ref_adata_obs=train_adata.obs,
    label_keys='cell_type',
    knn_model=knn_model,
    threshold=0.5
)
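The idea behind weighted KNN label transfer can be sketched without any single-cell libraries. The example below is a simplified illustration under stated assumptions, not the scArches implementation: it uses plain inverse-distance weights, defines uncertainty as one minus the winning label's share of the total vote, and the helper name `transfer_label` is hypothetical. A query cell whose uncertainty exceeds `threshold` is labeled "Unknown".

```python
import math
from collections import defaultdict


def transfer_label(query_point, ref_points, ref_labels,
                   n_neighbors=3, threshold=0.5):
    """Label one query embedding by distance-weighted voting among its
    nearest reference embeddings; return (label, uncertainty)."""
    # Euclidean distance from the query to every reference cell
    dists = [math.dist(query_point, p) for p in ref_points]
    nearest = sorted(range(len(ref_points)), key=dists.__getitem__)[:n_neighbors]

    # Closer neighbours cast larger votes (inverse-distance weights)
    votes = defaultdict(float)
    for i in nearest:
        votes[ref_labels[i]] += 1.0 / (dists[i] + 1e-8)

    best_label = max(votes, key=votes.get)
    uncertainty = 1.0 - votes[best_label] / sum(votes.values())
    if uncertainty > threshold:
        return "Unknown", uncertainty
    return best_label, uncertainty


# Toy 2-D embeddings: two well-separated reference populations
ref_points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
              (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
ref_labels = ["T cell", "T cell", "T cell", "B cell", "B cell", "B cell"]

# A query inside the T cell cluster, and one halfway between both clusters
confident = transfer_label((0.05, 0.05), ref_points, ref_labels)
ambiguous = transfer_label((2.5, 2.5), ref_points, ref_labels,
                           n_neighbors=6, threshold=0.4)
print(confident, ambiguous)
```

The first query sits among T cells only, so its uncertainty is zero. The second sits between both populations, the vote is nearly split, and the cell is left as "Unknown" rather than forced into either label.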
Multi-modal Integration (mvTCR)
# Initialize mvTCR model for TCR + transcriptome
model = sca.models.mvTCR.models.mixture_modules.moe.MoEModel(
    adata_train,
    params_architecture,
    balanced_sampling='clonotype',
    metadata=['clonotype', 'Sample', 'Type'],
    conditional='Cohort'
)
# Train model
model.train(n_epochs=200, early_stop=5)
Model Sharing with Zenodo
# Upload trained model to Zenodo
download_link = sca.utils.zenodo.upload_model(
    model=trained_model,
    deposition_id='your_deposition_id',
    access_token='your_token',
    model_name='my_scarches_model'
)
# Download model from Zenodo (pass the link returned by upload_model)
extract_dir = sca.utils.zenodo.download_model(
    link=download_link,
    save_path='models/',
    make_dir=True
)
Reference Files
This skill includes comprehensive documentation in references/:
Core Documentation
- •api_reference.md - Complete API reference for all scArches functions and classes
  - •Zenodo integration utilities for model sharing
  - •Utility functions for annotations and KNN classification
  - •Model training and inference methods
- •getting_started.md - Installation and introduction guide
  - •Installation via pip, conda, or from source
  - •Overview of scArches capabilities and model types
  - •Quick start examples and basic workflow
- •training_tips.md - Best practices for model training
  - •Loss function selection (nb, zinb, mse)
  - •Hyperparameter recommendations
  - •Architecture guidance for different data complexities
Advanced Tutorials
- •tutorials_advanced.md - Specialized model tutorials
  - •mvTCR: Multi-modal TCR + transcriptome integration
  - •Human Lung Cell Atlas mapping and classification
  - •Advanced cell type label transfer techniques
- •tutorials_surgery_pipeline.md - Complete surgery workflow
  - •Reference model construction
  - •Query dataset preparation and mapping
  - •Joint analysis and visualization
- •tutorials_treearches.md - Hierarchical cell type analysis
  - •Tree-based cell type discovery
  - •Novel cell state identification
  - •Hierarchical annotation transfer
Working with This Skill
For Beginners
- •Start with getting_started.md to understand scArches concepts and installation
- •Follow the basic surgery pipeline for your first reference mapping project
- •Use the Quick Reference examples as templates for common tasks
- •Consult training_tips.md before training your first models
For Intermediate Users
- •Explore api_reference.md for detailed function documentation
- •Try advanced tutorials for specialized applications (multi-modal, spatial)
- •Use Zenodo integration for model sharing and collaboration
- •Experiment with different models (scVI, scANVI, expiMap, totalVI) based on your data
For Advanced Users
- •Dive into model-specific tutorials for complex use cases
- •Optimize hyperparameters using training tips and experimentation
- •Implement custom workflows using the comprehensive API
- •Contribute models to community atlases using Zenodo sharing
Key Concepts
Model Types
- •scVI: Count-based integration using raw counts, assumes NB/ZINB distribution
- •trVAE: Supports normalized or count data with MMD loss for better integration
- •scANVI: Requires cell type labels for reference, enables classification
- •expiMap: Incorporates gene programs for interpretable representation learning
- •totalVI: Multi-modal CITE-seq reference construction
- •treeArches: Hierarchical cell type discovery and novel state identification
- •SageNet: Spatial transcriptomics mapping to coordinate frameworks
- •mvTCR: T-cell receptor + transcriptome joint analysis
Core Workflow
- •Reference Construction: Train model on integrated reference dataset
- •Model Surgery: Adapt pretrained model for query datasets
- •Query Mapping: Project query data into reference latent space
- •Downstream Analysis: Clustering, classification, trajectory analysis
Data Requirements
- •Raw counts preferred for scVI, scANVI, totalVI
- •Normalized data acceptable for trVAE (set recon_loss='mse')
- •Highly variable genes: Minimum 2000, increase to 5000 for complex datasets
- •Cell type labels: Required for scANVI reference, optional for query
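The distinction between raw counts and normalized data can be checked before choosing a model or loss. The snippet below is a minimal heuristic sketch, not part of scArches: the helper names `looks_like_raw_counts` and `suggest_recon_loss` are hypothetical, and it works on a small list-of-lists matrix for illustration, whereas in practice the values would come from `adata.X` or a counts layer.

```python
def looks_like_raw_counts(matrix):
    """Return True if every value is a non-negative whole number."""
    return all(
        value >= 0 and float(value).is_integer()
        for row in matrix
        for value in row
    )


def suggest_recon_loss(matrix):
    """Pick a reconstruction loss name following the guidance above:
    a count-based loss for raw counts, 'mse' for normalized values."""
    return "nb" if looks_like_raw_counts(matrix) else "mse"


# Toy expression matrices (cells x genes), made up for illustration
raw = [[0, 3, 12], [1, 0, 7]]
log_normalized = [[0.0, 1.39, 2.56], [0.69, 0.0, 2.08]]

print(suggest_recon_loss(raw), suggest_recon_loss(log_normalized))
```

A check like this only flags obviously normalized input; integer-valued data that has been otherwise transformed would still pass, so it complements rather than replaces knowing how the data was processed.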
Resources
references/
Comprehensive documentation extracted from official sources containing:
- •Detailed API documentation with parameter descriptions
- •Step-by-step tutorials with real datasets
- •Code examples with proper syntax highlighting
- •Links to original documentation for further reading
scripts/
Add helper scripts for:
- •Data preprocessing pipelines
- •Model training automation
- •Batch effect evaluation
- •Visualization utilities
assets/
Store:
- •Example datasets and preprocessing results
- •Trained model checkpoints
- •Configuration templates
- •Visualization templates
Notes
- •This skill was generated from official scArches documentation (http://127.0.0.1:9180)
- •Reference files preserve original structure and examples
- •All code examples extracted from actual tutorials and API docs
- •Training recommendations based on empirical best practices
Updating
To refresh this skill with updated documentation:
- •Re-run the documentation scraper with current scArches version
- •Update reference files with latest API changes and tutorials
- •Verify code examples against newest scArches release
- •Test training workflows with updated hyperparameters
Common Use Cases
Cell Type Annotation
# Map query to reference and transfer labels
query_adata = sc.read_h5ad('query_data.h5ad')  # scanpy reader; sc is imported in Import and Setup
model = sca.models.SCANVI.load_query_data(query_adata, ref_model)
model.train(max_epochs=400)
predictions = model.predict(query_adata)
Multi-modal Integration
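# CITE-seq data integration
# Assumes protein counts are stored in adata.obsm['protein_expression']
sca.models.TOTALVI.setup_anndata(adata, protein_expression_obsm_key='protein_expression')
model = sca.models.TOTALVI(adata)
model.train()
# totalVI returns a single joint RNA + protein latent representation
latent = model.get_latent_representation()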
Spatial Mapping
# Map scRNA-seq to spatial reference
# (schematic: class and method names are not verified against the scArches SageNet API)
sage_model = sca.models.SageNet(spatial_ref, query_sc)
spatial_predictions = sage_model.predict_locations(query_sc)
Gene Program Analysis
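# Analyze query in context of known pathways
# (schematic: constructor arguments and method name are not verified against the scArches expiMap API)
expimap_model = sca.models.EXPIMAP(reference, gene_sets='reactome')
gp_activities = expimap_model.get_gene_program_scores(query_data)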