cartapa-dataset-checker

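Prepare and quality-check spatial proteomics datasets for CartaPA analysis. Use this skill when a request includes trigger phrases such as "check dataset", "validate h5ad", "verify data quality", "dataset QC", "check celltype annotations", "verify embeddings", or "data validation". The skill checks cell counts, obs metadata (treatment status, patient ID, cell types), coordinate validity, embedding dimensions, and known dataset-specific issues. It supports the CODEX-HCC, CODEX-TNBC, IMC-TNBC, and SAFE-HNSCC datasets.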

SKILL.md
---
name: cartapa-dataset-checker
description: >-
  Validate and quality-check spatial proteomics datasets for CartaPA analysis.
  Triggers: "check dataset", "validate h5ad", "verify data quality", "dataset QC",
  "check celltype annotations", "verify embeddings", "data validation".
  Checks cell counts, obs metadata (treatment status, patient ID, celltypes),
  coordinate validity, embedding dimensions, and known dataset-specific issues.
  Supports CODEX-HCC, CODEX-TNBC, IMC-TNBC, SAFE-HNSCC datasets.
---

CartaPA Dataset Checker

Validate spatial proteomics datasets before CartaPA model training or embedding extraction.

Quick Validation

bash
# Run comprehensive check on any h5ad
python scripts/check_dataset.py --input data.h5ad --dataset-type auto

Core Checks

1. Basic Structure

  • Cell count matches expected range per dataset
  • Required obs columns present: slice_id, patient_id, cell_type/celltype
  • Spatial coordinates valid (no NaN, reasonable range)
  • Expression matrix shape matches var count
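The required-column check above can be sketched in a few lines. This is a minimal illustration, not the logic of scripts/check_dataset.py; the helper name `find_missing_obs_columns` and the constant `REQUIRED_OBS_COLUMNS` are hypothetical, and the only facts taken from this document are the column names themselves.

```python
# Hypothetical helper: illustrates the required-column check only.
# Each tuple lists acceptable spellings for one required obs column.
REQUIRED_OBS_COLUMNS = [
    ("slice_id",),
    ("patient_id",),
    ("cell_type", "celltype"),  # either spelling is accepted
]


def find_missing_obs_columns(columns):
    """Return the required column groups that have no match in `columns`."""
    present = set(columns)
    return [
        "/".join(group)
        for group in REQUIRED_OBS_COLUMNS
        if not present.intersection(group)
    ]
```

With an AnnData object loaded as `adata`, the call would be `find_missing_obs_columns(adata.obs.columns)`; an empty list means all required columns are present.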

2. Treatment/Response Labels

  • state or treatment column: check for pre/post values
  • response or label column: check for 0/1 or R/NR values
  • Verify consistency within each patient (all cells from the same patient must share the same response)
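One way to run the per-patient consistency check is to collect the set of response labels seen for each patient and report any patient with more than one. The sketch below assumes the two columns can be passed as plain iterables (for example `adata.obs["patient_id"]` and `adata.obs["response"]`); the function name is hypothetical and is not taken from the checker script.

```python
from collections import defaultdict


def find_inconsistent_patients(patient_ids, responses):
    """Map each patient with more than one response label to its sorted labels."""
    labels = defaultdict(set)
    for patient, response in zip(patient_ids, responses):
        labels[patient].add(response)
    # Sort by string form so mixed label types (0/1 vs. "R"/"NR") do not raise.
    return {
        patient: sorted(seen, key=str)
        for patient, seen in labels.items()
        if len(seen) > 1
    }
```

An empty dictionary means every patient carries a single response label.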

3. Celltype Annotations

Check for known issues (see references/celltype-issues.md):

| Dataset | Issue | Solution |
|---|---|---|
| IMC-TNBC | Epithelial cells not labeled "epi", may look like immune | Check paper for actual marker names |
| SAFE-HNSCC | Missing stromal cell category | Verify with original annotation |
| CODEX-TNBC | Unclear pre/post labels in raw data | Cross-reference metadata carefully |

4. Coordinate Validation

  • Check for tile-level vs slice-level coordinates (T4 issue)
  • Flag if all cells share the same small coordinate range (~0-1500), which suggests tile-level rather than slice-level coordinates
  • Verify no coordinate overlap between slices
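The three coordinate checks can be approximated with per-slice bounding boxes. The sketch below is an illustration under stated assumptions, not the checker's implementation: the function name and the `tile_extent` parameter are hypothetical, the 1500 default simply mirrors the range quoted above, and "overlap" is interpreted as intersecting axis-aligned bounding boxes.

```python
import numpy as np


def check_coordinates(coords, slice_ids, tile_extent=1500.0):
    """Summarise coordinate problems for an (N, 2) array of x/y positions."""
    coords = np.asarray(coords, dtype=float)
    slice_ids = np.asarray(slice_ids)
    finite = np.isfinite(coords).all(axis=1)  # False for rows with NaN or Inf

    # Per-slice bounding boxes, computed from finite coordinates only.
    boxes = {}
    for sid in np.unique(slice_ids).tolist():
        pts = coords[(slice_ids == sid) & finite]
        if len(pts):
            boxes[sid] = (pts.min(axis=0), pts.max(axis=0))

    # Tile-level symptom: every slice fits inside one small window near the origin.
    looks_tiled = bool(boxes) and all(
        lo.min() >= 0 and hi.max() <= tile_extent for lo, hi in boxes.values()
    )

    # Pairs of slices whose bounding boxes intersect on both axes.
    ids = list(boxes)
    overlaps = [
        (a, b)
        for i, a in enumerate(ids)
        for b in ids[i + 1:]
        if np.all(boxes[a][0] <= boxes[b][1]) and np.all(boxes[b][0] <= boxes[a][1])
    ]
    return {
        "n_invalid": int((~finite).sum()),
        "looks_tiled": looks_tiled,
        "overlaps": overlaps,
    }
```

Where the coordinates live in the h5ad file is not specified here; `adata.obsm["spatial"]` is a common convention, but that is an assumption. Bounding-box overlap is a coarse test, so a flagged pair is a prompt to look at the visualization rather than proof of a problem.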

5. Embedding Quality (if present)

  • obsm['X_cartapa'] shape should be (N, 128)
  • Check for NaN or Inf values
  • Verify that response probabilities fall within [0, 1]
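The embedding checks reduce to a shape comparison and a finiteness test. A minimal sketch follows, assuming the embedding has been read into a NumPy array (for example from `adata.obsm['X_cartapa']`); both function names are hypothetical, and where the response probabilities are stored is not stated in this document, so they are passed in as a plain array.

```python
import numpy as np


def check_embedding(embedding, n_cells, expected_dim=128):
    """Return a list of problems; an empty list means the checks passed."""
    emb = np.asarray(embedding, dtype=float)
    problems = []
    if emb.shape != (n_cells, expected_dim):
        problems.append(f"shape {emb.shape}, expected {(n_cells, expected_dim)}")
    n_bad = int((~np.isfinite(emb)).sum())  # counts NaN, +Inf and -Inf
    if n_bad:
        problems.append(f"{n_bad} NaN/Inf values")
    return problems


def probabilities_in_unit_interval(probs):
    """True only if every value lies in [0, 1]; NaN counts as a failure."""
    probs = np.asarray(probs, dtype=float)
    return bool(np.all((probs >= 0.0) & (probs <= 1.0)))
```

Typical use would be `check_embedding(adata.obsm["X_cartapa"], adata.n_obs)`.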

Dataset-Specific Expectations

| Dataset | Cells | Slices | Proteins | Key obs columns |
|---|---|---|---|---|
| CODEX-HCC Pre | ~490K | 24 | 51 | state=Pre, patient_id, celltype |
| CODEX-TNBC | ~1.9M | 28 | 56 | patient_id, pre_or_post, celltype |
| IMC-TNBC | ~1M | 243 | 41 | patient_id, treatment, celltype |
| SAFE-HNSCC | ~2.1M | 41 | 27 | slice_id, patient_id, celltype |
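The approximate cell counts in the table can be turned into a simple range check. The sketch below is illustrative only: the 10% tolerance is an assumption (the document does not say how wide the "expected range" is), and the names `EXPECTED_CELLS` and `cell_count_ok` are hypothetical.

```python
# Approximate cell counts copied from the table above.
EXPECTED_CELLS = {
    "CODEX-HCC Pre": 490_000,
    "CODEX-TNBC": 1_900_000,
    "IMC-TNBC": 1_000_000,
    "SAFE-HNSCC": 2_100_000,
}


def cell_count_ok(dataset, n_cells, tolerance=0.10):
    """True if n_cells is within +/- tolerance (assumed 10%) of the expected count."""
    expected = EXPECTED_CELLS[dataset]
    return abs(n_cells - expected) <= tolerance * expected
```

For a loaded AnnData object this would be called as `cell_count_ok("IMC-TNBC", adata.n_obs)`.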

Validation Commands

bash
# Full validation with report
python scripts/check_dataset.py --input data.h5ad --report results/validation_report.md

# Quick celltype summary
python scripts/check_dataset.py --input data.h5ad --quick --celltypes-only

# Check coordinates for tile issues
python scripts/check_dataset.py --input data.h5ad --check-coords --visualize

Output

Validation produces:

  • Console summary with pass/fail indicators
  • Optional markdown report with detailed findings
  • Optional visualization of coordinate distribution

When to Use

Run validation:

  1. Before model training: Ensure data quality
  2. After embedding extraction: Verify embeddings attached correctly
  3. When integrating new datasets: Check format compatibility
  4. After coordinate fixes: Verify spatial structure