
bio-genome-assembly-long-read-assembly

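De novo genome assembly from Oxford Nanopore or PacBio long reads using Flye and Canu. Produces highly contiguous assemblies suitable for building complete bacterial genomes and resolving complex regions. Use when assembling genomes from ONT or PacBio reads.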

SKILL.md
--- frontmatter
name: bio-genome-assembly-long-read-assembly
description: De novo genome assembly from Oxford Nanopore or PacBio long reads using Flye and Canu. Produces highly contiguous assemblies suitable for complete bacterial genomes and resolving complex regions. Use when assembling genomes from ONT or PacBio reads.
tool_type: cli
primary_tool: Flye

Long-Read Assembly

Assemble genomes from Oxford Nanopore (ONT) or PacBio long reads for highly contiguous assemblies.

Tool Comparison

| Tool | Speed | Memory | Best For |
|------|-------|--------|----------|
| Flye | Fast | Moderate | General purpose, bacteria, ONT |
| Canu | Slow | High | High accuracy, complex genomes |
| Wtdbg2 | Very fast | Low | Draft assemblies |

Note: For PacBio HiFi data, see the dedicated hifi-assembly skill, which covers hifiasm.

Flye

Installation

bash
conda install -c bioconda flye

Basic Usage

bash
# Oxford Nanopore
flye --nano-raw reads.fastq.gz --out-dir flye_output --threads 16

# PacBio CLR
flye --pacbio-raw reads.fastq.gz --out-dir flye_output --threads 16

# PacBio HiFi
flye --pacbio-hifi reads.fastq.gz --out-dir flye_output --threads 16

Read Type Options

| Option | Read Type |
|--------|-----------|
| --nano-raw | ONT regular reads |
| --nano-corr | ONT corrected reads |
| --nano-hq | ONT high-quality reads (Guppy 5+ SUP or Q20) |
| --pacbio-raw | PacBio CLR |
| --pacbio-corr | PacBio corrected reads |
| --pacbio-hifi | PacBio HiFi/CCS |
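When a pipeline handles several platforms, the mapping in the table above can be kept in one place. The sketch below is only an illustration: `flye_read_flag` and the short labels (`ont-raw`, `ont-hq`, `clr`, and so on) are hypothetical names, not part of Flye. The final line prints the command instead of running it.

```shell
# Map a short read-type label (hypothetical naming) to the Flye option
# listed in the table above.
flye_read_flag() {
    case "$1" in
        ont-raw)  echo "--nano-raw" ;;
        ont-corr) echo "--nano-corr" ;;
        ont-hq)   echo "--nano-hq" ;;
        clr)      echo "--pacbio-raw" ;;
        pb-corr)  echo "--pacbio-corr" ;;
        hifi)     echo "--pacbio-hifi" ;;
        *)        echo "unknown read type: $1" >&2; return 1 ;;
    esac
}

# Dry run: show the command that would be executed.
echo flye "$(flye_read_flag ont-hq)" reads.fastq.gz --out-dir flye_output --threads 16
```

An unknown label returns a non-zero status, so a script running under `set -e` stops before calling Flye with a missing option.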

Key Options

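| Option | Description |
|--------|-------------|
| --out-dir | Output directory |
| --threads | Number of threads |
| --genome-size | Estimated genome size (e.g., 5m, 100m) |
| --iterations | Polishing iterations (default: 1) |
| --meta | Metagenome mode |
| --plasmids | Recover plasmids (legacy option; Flye 2.9+ no longer lists it and handles short sequences by default, so confirm with flye --help) |
| --keep-haplotypes | Don't collapse haplotypes |
| --scaffold | Enable scaffolding |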

Genome Size Estimation

bash
# Pass an approximate genome size; a rough estimate is enough
flye --nano-raw reads.fq.gz --out-dir output --genome-size 5m

# Accepted size formats: plain bases (1000) or k/m/g suffixes (1k, 1m, 1g)
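Other steps, such as coverage checks, need the same size as a plain base count. A minimal sketch of the suffix convention listed above follows. `size_to_bp` is a hypothetical helper, not part of Flye, and it assumes k, m and g mean 10^3, 10^6 and 10^9 bases.

```shell
# Convert a size string such as 5m, 4.6m, 100k or 1g to a base count.
# Hypothetical helper; assumes k/m/g are decimal multipliers.
size_to_bp() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                              # numeric prefix, e.g. 4.6 from "4.6m"
        u = tolower(substr(s, length(s)))      # last character is the unit, if any
        if (u == "k") n *= 1e3
        else if (u == "m") n *= 1e6
        else if (u == "g") n *= 1e9
        printf "%.0f\n", n                     # round to a whole number of bases
    }'
}

size_to_bp 5m
size_to_bp 1g
```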

Output Files

code
flye_output/
├── assembly.fasta       # Final assembly
├── assembly_graph.gfa   # Assembly graph
├── assembly_info.txt    # Contig statistics
└── flye.log             # Log file
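For a quick look at a finished run, assembly_info.txt can be summarised with awk. The sketch below assumes the file is tab-separated with a `#`-prefixed header, contig length in column 2 and a Y/N circularity flag in column 4. Check the header of your own file, since the layout may differ between Flye versions. `summarise_flye_info` is a hypothetical helper, and the demo rows are made up.

```shell
# Summarise a Flye assembly_info.txt: contig count, total length, circular contigs.
# Assumed columns: 1 = seq_name, 2 = length, 4 = circ. (Y/N).
summarise_flye_info() {
    awk -F'\t' '
        /^#/ { next }                          # skip the header line
        { n++; total += $2 }                   # count contigs and sum lengths
        $4 == "Y" { circ++ }                   # count contigs flagged as circular
        END { printf "contigs=%d total_bp=%.0f circular=%d\n", n, total, circ }
    ' "$1"
}

# Demo on a small synthetic file (values are invented for illustration).
demo_info=$(mktemp)
printf '#seq_name\tlength\tcov.\tcirc.\trepeat\tmult.\talt_group\tgraph_path\n' > "$demo_info"
printf 'contig_1\t4800000\t85\tY\tN\t1\t*\t1\n' >> "$demo_info"
printf 'contig_2\t52000\t210\tY\tN\t2\t*\t2\n' >> "$demo_info"
printf 'contig_3\t9000\t40\tN\tN\t1\t*\t3\n' >> "$demo_info"
summarise_flye_info "$demo_info"
```

For a bacterial isolate, one large circular contig plus a few small circular ones is consistent with a closed chromosome and plasmids, though circularity flags alone do not prove completeness.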

Bacterial Assembly

bash
flye \
    --nano-raw bacteria.fastq.gz \
    --out-dir bacteria_assembly \
    --genome-size 5m \
    --threads 16

Metagenome Assembly

bash
flye \
    --nano-raw metagenome.fastq.gz \
    --out-dir meta_assembly \
    --meta \
    --threads 32

With Plasmid Recovery

bash
# Note: --plasmids is a legacy option; on Flye 2.9+ check flye --help before relying on it
flye \
    --nano-raw isolate.fastq.gz \
    --out-dir assembly \
    --plasmids \
    --threads 16

Canu

Installation

bash
conda install -c bioconda canu

Basic Usage

bash
# ONT reads
canu -p assembly -d canu_output genomeSize=5m -nanopore reads.fastq.gz

# PacBio HiFi
canu -p assembly -d canu_output genomeSize=5m -pacbio-hifi reads.fastq.gz

Key Options

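| Option | Description |
|--------|-------------|
| -p | Assembly prefix |
| -d | Output directory |
| genomeSize= | Estimated genome size (required) |
| maxThreads= | Maximum number of threads |
| maxMemory= | Maximum memory (e.g., 64g) |
| useGrid=false | Disable grid execution and run on the local machine |
| correctedErrorRate= | Allowed difference in overlaps between corrected reads, as a fraction (e.g., 0.01) |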

Read Type Options

| Option | Read Type |
|--------|-----------|
| -nanopore | ONT reads |
| -nanopore-raw | ONT raw (deprecated) |
| -pacbio | PacBio CLR |
| -pacbio-hifi | PacBio HiFi/CCS |

Local Mode (No Grid, Capped Resources)

bash
canu -p asm -d output genomeSize=5m \
    -nanopore reads.fq.gz \
    useGrid=false \
    maxThreads=16 \
    maxMemory=32g

High-Quality Mode (PacBio HiFi)

bash
canu -p asm -d output genomeSize=5m \
    -pacbio-hifi reads.fq.gz \
    correctedErrorRate=0.01

Output Files

code
canu_output/
├── assembly.contigs.fasta   # Contigs
├── assembly.unassembled.fasta
├── assembly.report
└── assembly.seqStore/

Wtdbg2 (Fast Draft)

Installation

bash
conda install -c bioconda wtdbg

Basic Usage

bash
# Assemble
wtdbg2 -x ont -g 5m -t 16 -i reads.fq.gz -o draft

# Consensus
wtpoa-cns -t 16 -i draft.ctg.lay.gz -o draft.ctg.fa

Platform Presets

| Preset | Platform |
|--------|----------|
| -x rs | PacBio RSII |
| -x sq | PacBio Sequel |
| -x ccs | PacBio CCS (HiFi) |
| -x ont | Oxford Nanopore |

Complete Workflows

ONT Bacterial Assembly

bash
#!/bin/bash
set -euo pipefail

READS=$1
OUTDIR=$2
SIZE=${3:-5m}

echo "=== ONT Bacterial Assembly ==="

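# Create the parent directory first; Flye may not create missing parents
mkdir -p "$OUTDIR"

# Flye assembly (variables are quoted so paths with spaces work)
flye \
    --nano-raw "$READS" \
    --out-dir "${OUTDIR}/flye" \
    --genome-size "$SIZE" \
    --threads 16

# Stats
echo "Assembly statistics:"
cat "${OUTDIR}/flye/assembly_info.txt"

echo "Assembly: ${OUTDIR}/flye/assembly.fasta"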

Hybrid Assembly (Long + Short)

bash
#!/bin/bash
set -euo pipefail

LONG=$1
SHORT_R1=$2
SHORT_R2=$3
OUTDIR=$4

# 1. Long-read assembly with Flye (genome size is hard-coded to 5m; adjust for your organism)
mkdir -p "$OUTDIR"
flye --nano-raw "$LONG" --out-dir "${OUTDIR}/flye" --genome-size 5m --threads 16

# 2. Polish with short reads (Pilon)
# See assembly-polishing skill

Quality Expectations

| Metric | Bacterial | Eukaryotic |
|--------|-----------|------------|
| Contigs | 1-10 | 100-1000+ |
| N50 | >1 Mb | Variable |
| Complete chromosomes | Often | Rare |
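The contig count and N50 in the table can be checked without extra tools. The sketch below is one way to compute them from a FASTA file. `fasta_n50` is a hypothetical helper, and N50 is taken here as the length of the contig at which the running total of sorted lengths first reaches half the assembly size. A dedicated QC tool such as QUAST reports these and many more metrics.

```shell
# Report contig count, total length and N50 for a FASTA file.
fasta_n50() {
    awk '
        /^>/ { if (len) print len; len = 0; next }   # new record: emit previous length
        { len += length($0) }                        # sequences may span several lines
        END { if (len) print len }
    ' "$1" | sort -rn | awk '
        { lens[NR] = $1; total += $1 }
        END {
            half = total / 2
            for (i = 1; i <= NR; i++) {
                run += lens[i]
                if (run >= half) {
                    printf "contigs=%d total_bp=%.0f N50=%d\n", NR, total, lens[i]
                    exit
                }
            }
        }'
}

# Demo on a tiny synthetic FASTA with contig lengths 8, 6, 4 and 2.
demo_fa=$(mktemp)
printf '>a\nACGTACGT\n>b\nACGTAC\n>c\nACGT\n>d\nAC\n' > "$demo_fa"
fasta_n50 "$demo_fa"
```

On a real run the call would be `fasta_n50 flye_output/assembly.fasta`.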

Troubleshooting

Low Contiguity

  • Check coverage (need >30x)
  • Try increasing --iterations in Flye (this mainly improves consensus accuracy, so expect limited gains in contiguity)
  • Consider supplementing with short reads
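The coverage check in the list above can be approximated by dividing the total number of read bases by the expected genome size. The sketch below assumes standard 4-line FASTQ records and takes the genome size in bases, for example 5000000 for a 5 Mb genome. `estimate_coverage` is a hypothetical helper name.

```shell
# Estimate sequencing depth: total read bases / expected genome size (in bases).
estimate_coverage() {
    local reads=$1 genome_bp=$2
    # Decompress .gz input; read anything else as plain text.
    case "$reads" in
        *.gz) gzip -cd "$reads" ;;
        *)    cat "$reads" ;;
    esac | awk -v g="$genome_bp" '
        NR % 4 == 2 { bases += length($0) }    # sequence line of each FASTQ record
        END { printf "%.1f\n", bases / g }'
}

# Demo: three 10-base reads against a 10-base "genome" gives 3.0x.
demo_fq=$(mktemp)
printf '@r1\nACGTACGTAC\n+\nIIIIIIIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n@r3\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$demo_fq"
estimate_coverage "$demo_fq" 10
```

A value well below 30 suggests that more data, rather than parameter tuning, is the likelier fix for a fragmented assembly.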

Memory Issues

  • Use Flye (more memory efficient)
  • Reduce threads
  • Filter reads by length/quality
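Length filtering, the last item above, can be sketched with plain awk when no dedicated read-filtering tool is installed. The example assumes 4-line FASTQ records on standard input. `filter_fastq_min_len` is a hypothetical helper, and the 10-base threshold in the demo is only for illustration.

```shell
# Keep only FASTQ records whose sequence is at least min_len bases long.
# Reads FASTQ on stdin and writes the surviving records to stdout.
filter_fastq_min_len() {
    local min_len=$1
    awk -v m="$min_len" '
        NR % 4 == 1 { h = $0 }                 # header line
        NR % 4 == 2 { s = $0 }                 # sequence line
        NR % 4 == 3 { p = $0 }                 # separator line
        NR % 4 == 0 { if (length(s) >= m) print h "\n" s "\n" p "\n" $0 }'
}

# Demo: of a 4-base and a 12-base read, only the 12-base read survives.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGTACGT\n+\nIIIIIIIIIIII\n' | filter_fastq_min_len 10
```

On real data the function would sit in a pipe, for example `gzip -cd reads.fastq.gz | filter_fastq_min_len 1000 | gzip > filtered.fastq.gz`. Dropping short reads lowers memory use but also lowers coverage, so recheck depth afterwards.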

Misassemblies

  • Polish with Pilon/medaka
  • Validate with short reads
  • Check for contamination

Related Skills

  • hifi-assembly - PacBio HiFi assembly with hifiasm
  • assembly-polishing - Polish long-read assemblies
  • assembly-qc - QUAST and BUSCO assessment
  • short-read-assembly - Hybrid with Illumina
  • long-read-sequencing - Read QC and alignment