TCGA bulk data preprocessing with omicverse
Overview
Follow this skill to recreate the preprocessing routine from t_tcga.ipynb. It automates loading TCGA downloads, generating raw/normalised matrices, initialising metadata, and running survival analyses through ov.bulk.pyTCGA.
Instructions
- •Gather required downloads
- •Confirm the user has:
- •
gdc_sample_sheet.<date>.tsvfrom the TCGA Sample Sheet export. - •The decompressed
gdc_download_xxxxxdirectory containing expression archives. - •The
clinical.cart.<date>directory with clinical XML/JSON files.
- •
- •Mention that sample data are available under
omicverse_guide/docs/Tutorials-bulk/data/TCGA_OV/.
- •Confirm the user has:
- •Initialise the TCGA helper
- •Import
omicverse as ov(andscanpy as scif plotting) then callov.plot_set(). - •Instantiate
aml_tcga = ov.bulk.pyTCGA(sample_sheet_path, download_dir, clinical_dir). - •Run
aml_tcga.adata_init()to build the AnnData object with raw counts, FPKM, and TPM layers.
- •Import
- •Persist the dataset
- •Encourage saving the initial AnnData:
aml_tcga.adata.write_h5ad('data/TCGA_OV/ov_tcga_raw.h5ad', compression='gzip'). - •When reloading, reconstruct the class with the same paths and call
aml_tcga.adata_read(<path>).
- •Encourage saving the initial AnnData:
- •Initialise metadata and clinical information
- •Populate sample metadata using
aml_tcga.adata_meta_init()to convert gene IDs to symbols and attach patient info. - •Add survival attributes via
aml_tcga.survial_init()(note the intentional spelling in the API).
- •Populate sample metadata using
- •Perform survival analyses
- •Plot gene-level survival curves with
aml_tcga.survival_analysis('GENE', layer='deseq_normalize', plot=True). - •To process all genes, call
aml_tcga.survial_analysis_all(); warn that it may take time.
- •Plot gene-level survival curves with
- •Export results
- •Save enriched metadata to a new AnnData file (
aml_tcga.adata.write_h5ad('.../ov_tcga_survial_all.h5ad', compression='gzip')). - •Suggest exporting summary tables (e.g., survival statistics) if users need to share outputs outside Python.
- •Save enriched metadata to a new AnnData file (
- •Troubleshooting tips
- •Ensure TCGA archives are fully extracted; missing XML/TSV files trigger parsing errors.
- •The helper expects matching case IDs between the sample sheet and expression files—direct users to re-download if IDs do not align.
- •Survival plots require clinical dates; if absent, instruct users to check the
clinical_cartcontents.
Examples
- •"Read my TCGA OV download, initialise metadata, and plot MYC survival curves using DESeq-normalised counts."
- •"Reload a saved AnnData file, attach survival annotations, and export the updated
.h5ad." - •"Run survival analysis for all genes and store the enriched dataset."
References
- •Tutorial notebook:
t_tcga.ipynb - •Sample dataset:
data/TCGA_OV/ - •Quick copy/paste commands:
reference.md