AgentSkillsCN

bio-workflows-metagenomics-pipeline

从 FASTQ 文件到分类学与功能特征图谱的全流程宏基因组工作流。涵盖 Kraken2 分类、Bracken 宏量估计,以及 HUMAnN 功能特征分析。适用于对宏基因组样本进行特征分析时使用。

SKILL.md
--- frontmatter
name: bio-workflows-metagenomics-pipeline
description: End-to-end metagenomics workflow from FASTQ to taxonomic and functional profiles. Covers Kraken2 classification, Bracken abundance estimation, and HUMAnN functional profiling. Use when profiling metagenomic samples.
tool_type: cli
primary_tool: Kraken2
workflow: true
depends_on:
  - read-qc/fastp-workflow
  - metagenomics/kraken-classification
  - metagenomics/metaphlan-profiling
  - metagenomics/abundance-estimation
  - metagenomics/functional-profiling
  - metagenomics/metagenome-visualization
qc_checkpoints:
  - after_qc: "Q30 >80%, host reads removed"
  - after_classification: "Classification rate >60%, known taxa dominant"
  - after_functional: "Pathway coverage reasonable, unmapped <50%"

Metagenomics Pipeline

Complete workflow from metagenomic FASTQ to taxonomic and functional profiles.

Workflow Overview

code
FASTQ files
    |
    v
[1. QC & Host Removal] --> fastp + Bowtie2
    |
    v
[2. Taxonomic Classification]
    |
    +---> Kraken2 + Bracken (fast, database-dependent)
    |
    +---> MetaPhlAn (marker-based, standardized)
    |
    v
[3. Functional Profiling] --> HUMAnN
    |
    v
Taxonomic profiles + Pathway abundances

Primary Path: Kraken2 + Bracken + HUMAnN

Step 1: Quality Control and Host Removal

bash
# QC with fastp
for sample in sample1 sample2 sample3; do
    fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
        -o trimmed/${sample}_R1.fq.gz -O trimmed/${sample}_R2.fq.gz \
        --detect_adapter_for_pe \
        --qualified_quality_phred 20 \
        --length_required 50 \
        --html qc/${sample}_fastp.html
done

# Remove host reads (human example)
for sample in sample1 sample2 sample3; do
    bowtie2 -p 8 -x human_index \
        -1 trimmed/${sample}_R1.fq.gz \
        -2 trimmed/${sample}_R2.fq.gz \
        --un-conc-gz host_removed/${sample}_R%.fq.gz \
        > /dev/null 2> qc/${sample}_host_removal.log
done

Step 2A: Kraken2 Classification

bash
# Classify reads
for sample in sample1 sample2 sample3; do
    kraken2 --db kraken2_db \
        --threads 8 \
        --paired \
        --report kraken/${sample}.report \
        --output kraken/${sample}.output \
        host_removed/${sample}_R1.fq.gz \
        host_removed/${sample}_R2.fq.gz
done

Step 2B: Bracken Abundance Estimation

bash
# Estimate species abundance
for sample in sample1 sample2 sample3; do
    bracken -d kraken2_db \
        -i kraken/${sample}.report \
        -o bracken/${sample}.species.txt \
        -r 150 \
        -l S \
        -t 10
done

# Combine samples into abundance matrix
combine_bracken_outputs.py \
    --files bracken/*.species.txt \
    -o bracken/combined_species.txt

Step 2C: Alternative - MetaPhlAn Profiling

bash
# Profile with MetaPhlAn 4
for sample in sample1 sample2 sample3; do
    metaphlan host_removed/${sample}_R1.fq.gz,host_removed/${sample}_R2.fq.gz \
        --bowtie2out metaphlan/${sample}.bowtie2.bz2 \
        --input_type fastq \
        --nproc 8 \
        -o metaphlan/${sample}_profile.txt
done

# Merge profiles
merge_metaphlan_tables.py metaphlan/*_profile.txt > metaphlan/merged_abundance.txt

Step 3: Functional Profiling with HUMAnN

bash
# Run HUMAnN
for sample in sample1 sample2 sample3; do
    # Concatenate paired reads
    cat host_removed/${sample}_R1.fq.gz host_removed/${sample}_R2.fq.gz > \
        host_removed/${sample}_concat.fq.gz

    humann --input host_removed/${sample}_concat.fq.gz \
        --output humann/${sample} \
        --threads 8 \
        --metaphlan-options "--bowtie2db metaphlan_db"
done

# Normalize and join tables
humann_renorm_table --input humann/sample1/sample1_pathabundance.tsv \
    --output humann/sample1/sample1_pathabundance_cpm.tsv \
    --units cpm

humann_join_tables --input humann \
    --output humann/merged_pathabundance.tsv \
    --file_name pathabundance

Visualization

python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load Bracken species table
species = pd.read_csv('bracken/combined_species.txt', sep='\t', index_col=0)

# Top 20 species heatmap
top20 = species.sum(axis=1).nlargest(20).index
plt.figure(figsize=(12, 8))
sns.heatmap(species.loc[top20], cmap='viridis', annot=False)
plt.title('Top 20 Species Abundance')
plt.tight_layout()
plt.savefig('top20_species_heatmap.pdf')

# Stacked bar plot
species_norm = species.div(species.sum()) * 100
top10 = species_norm.sum(axis=1).nlargest(10).index
other = species_norm.loc[~species_norm.index.isin(top10)].sum()

plot_data = species_norm.loc[top10].T
plot_data['Other'] = other
plot_data.plot(kind='bar', stacked=True, figsize=(10, 6))
plt.ylabel('Relative Abundance (%)')
plt.legend(bbox_to_anchor=(1.05, 1))
plt.tight_layout()
plt.savefig('species_barplot.pdf')

Parameter Recommendations

StepParameterValue
fastp--length_required50 (metagenomic reads)
Kraken2--confidence0.0 (default) or 0.1
Bracken-rRead length (e.g., 150)
Bracken-lS (species) or G (genus)
Bracken-t10 (min reads threshold)
MetaPhlAn--min_cu_len2000 (default)
HUMAnN--threads8+

Troubleshooting

IssueLikely CauseSolution
Low classification rateDatabase mismatch, novel organismsTry different database, check sample type
High unclassifiedNovel microbes, host contaminationRemove host, use larger database
High host readsIncomplete host removalUse multiple host reference genomes
HUMAnN slowLarge filesIncrease threads, pre-filter reads

Complete Pipeline Script

bash
#!/bin/bash
set -e

THREADS=8
KRAKEN_DB="kraken2_standard_db"
HOST_INDEX="human_bt2_index"
SAMPLES="sample1 sample2 sample3"
OUTDIR="metagenomics_results"

mkdir -p ${OUTDIR}/{trimmed,host_removed,kraken,bracken,metaphlan,humann,qc}

# Step 1: QC
echo "=== QC ==="
for sample in $SAMPLES; do
    fastp -i ${sample}_R1.fastq.gz -I ${sample}_R2.fastq.gz \
        -o ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
        -O ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
        --length_required 50 \
        --html ${OUTDIR}/qc/${sample}_fastp.html -w ${THREADS}
done

# Host removal
echo "=== Host Removal ==="
for sample in $SAMPLES; do
    bowtie2 -p ${THREADS} -x ${HOST_INDEX} \
        -1 ${OUTDIR}/trimmed/${sample}_R1.fq.gz \
        -2 ${OUTDIR}/trimmed/${sample}_R2.fq.gz \
        --un-conc-gz ${OUTDIR}/host_removed/${sample}_R%.fq.gz \
        > /dev/null 2> ${OUTDIR}/qc/${sample}_host.log
done

# Step 2: Kraken2
echo "=== Kraken2 ==="
for sample in $SAMPLES; do
    kraken2 --db ${KRAKEN_DB} --threads ${THREADS} --paired \
        --report ${OUTDIR}/kraken/${sample}.report \
        --output ${OUTDIR}/kraken/${sample}.output \
        ${OUTDIR}/host_removed/${sample}_R1.fq.gz \
        ${OUTDIR}/host_removed/${sample}_R2.fq.gz
done

# Bracken
echo "=== Bracken ==="
for sample in $SAMPLES; do
    bracken -d ${KRAKEN_DB} \
        -i ${OUTDIR}/kraken/${sample}.report \
        -o ${OUTDIR}/bracken/${sample}.species.txt \
        -r 150 -l S -t 10
done

echo "=== Pipeline Complete ==="
echo "Kraken reports: ${OUTDIR}/kraken/"
echo "Bracken abundances: ${OUTDIR}/bracken/"

Related Skills

  • metagenomics/kraken-classification - Kraken2 details
  • metagenomics/metaphlan-profiling - MetaPhlAn parameters
  • metagenomics/abundance-estimation - Bracken options
  • metagenomics/functional-profiling - HUMAnN workflow
  • metagenomics/metagenome-visualization - Plotting functions