AgentSkillsCN

bio-alignment-indexing

使用 samtools 和 pysam 为 BAM/CRAM 文件创建 BAI/CSI 索引。在启用比对文件的随机访问功能,或提取特定基因组区域时,可选用此功能。

SKILL.md
--- frontmatter
name: bio-alignment-indexing
description: Create and use BAI/CSI indices for BAM/CRAM files using samtools and pysam. Use when enabling random access to alignment files or fetching specific genomic regions.
tool_type: cli
primary_tool: samtools

Alignment Indexing

Create indices for random access to alignment files using samtools and pysam.

Index Types

IndexExtensionUse Case
BAI.baiStandard BAM index, chromosomes < 512 Mbp
CSI.csiLarge chromosomes, custom bin sizes
CRAI.craiCRAM index

samtools index

Create BAI Index

bash
samtools index input.bam
# Creates input.bam.bai

Create CSI Index

bash
samtools index -c input.bam
# Creates input.bam.csi

Specify Output Name

bash
samtools index input.bam output.bai

Multi-threaded Indexing

bash
samtools index -@ 4 input.bam

Index CRAM

bash
samtools index input.cram
# Creates input.cram.crai

Index Requirements

Indexing requires coordinate-sorted files:

bash
# Check sort order
samtools view -H input.bam | grep "^@HD"
# Should show SO:coordinate

# Sort if needed, then index
samtools sort -o sorted.bam input.bam
samtools index sorted.bam

Using Indices for Region Access

samtools view with Region

bash
# Requires index file present
samtools view input.bam chr1:1000000-2000000

Multiple Regions

bash
samtools view input.bam chr1:1000-2000 chr2:3000-4000

Regions from BED File

bash
samtools view -L regions.bed input.bam

pysam Python Alternative

Create Index

python
import pysam

pysam.index('input.bam')
# Creates input.bam.bai

Create CSI Index

python
pysam.index('input.bam', 'input.bam.csi', csi=True)

Fetch with Index

python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    # fetch() requires index
    for read in bam.fetch('chr1', 1000000, 2000000):
        print(read.query_name)

Check if Indexed

python
import pysam
from pathlib import Path

def is_indexed(bam_path):
    bam_path = Path(bam_path)
    return (bam_path.with_suffix('.bam.bai').exists() or
            Path(str(bam_path) + '.bai').exists() or
            bam_path.with_suffix('.bam.csi').exists())

if not is_indexed('input.bam'):
    pysam.index('input.bam')

Fetch Multiple Regions

python
regions = [('chr1', 1000, 2000), ('chr1', 5000, 6000), ('chr2', 1000, 2000)]

with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for chrom, start, end in regions:
        count = sum(1 for _ in bam.fetch(chrom, start, end))
        print(f'{chrom}:{start}-{end}: {count} reads')

Count Reads in Region

python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    count = bam.count('chr1', 1000000, 2000000)
    print(f'Reads in region: {count}')

Get Reads Covering Position

python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for read in bam.fetch('chr1', 1000000, 1000001):
        if read.reference_start <= 1000000 < read.reference_end:
            print(f'{read.query_name} covers position 1000000')

Index File Locations

samtools looks for indices in two locations:

code
input.bam.bai   # Standard location
input.bai       # Alternative location

For CRAM:

code
input.cram.crai

idxstats - Index Statistics

Get Per-Chromosome Counts

bash
samtools idxstats input.bam

Output format:

code
chr1    248956422    5000000    0
chr2    242193529    4500000    0
*       0            0          10000

Columns: reference name, length, mapped reads, unmapped reads

Sum Total Mapped Reads

bash
samtools idxstats input.bam | awk '{sum += $3} END {print sum}'

pysam idxstats

python
with pysam.AlignmentFile('input.bam', 'rb') as bam:
    for stat in bam.get_index_statistics():
        print(f'{stat.contig}: {stat.mapped} mapped, {stat.unmapped} unmapped')

FASTA Index (faidx)

Related but different - index reference FASTA for random access:

bash
samtools faidx reference.fa
# Creates reference.fa.fai

# Fetch region from indexed FASTA
samtools faidx reference.fa chr1:1000-2000

pysam FastaFile

python
with pysam.FastaFile('reference.fa') as ref:
    seq = ref.fetch('chr1', 1000, 2000)
    print(seq)

Quick Reference

Tasksamtoolspysam
Create BAIsamtools index file.bampysam.index('file.bam')
Create CSIsamtools index -c file.bampysam.index('file.bam', csi=True)
Fetch regionsamtools view file.bam chr1:1-1000bam.fetch('chr1', 0, 1000)
Count in regionsamtools view -c file.bam chr1:1-1000bam.count('chr1', 0, 1000)
Index statssamtools idxstats file.bambam.get_index_statistics()
Index FASTAsamtools faidx ref.faAutomatic with FastaFile

Common Errors

ErrorCauseSolution
random alignment retrieval only works for indexed BAMMissing indexRun samtools index file.bam
file is not sortedUnsorted BAMSort first with samtools sort
chromosome not foundWrong chromosome nameCheck names with samtools view -H

Related Skills

  • sam-bam-basics - View and convert alignment files
  • alignment-sorting - Sort BAM files (required before indexing)
  • alignment-filtering - Filter by regions using index
  • bam-statistics - Use idxstats for quick counts
  • sequence-io/read-sequences - Index FASTA with SeqIO.index_db()