AgentSkillsCN

bio-vcf-basics

利用 bcftools 和 cyvcf2 查看、查询并理解 VCF/BCF 变异文件。在检查变异、提取特定字段,或了解 VCF 格式结构时使用。

SKILL.md
--- frontmatter
name: bio-vcf-basics
description: View, query, and understand VCF/BCF variant files using bcftools and cyvcf2. Use when inspecting variants, extracting specific fields, or understanding VCF format structure.
tool_type: cli
primary_tool: bcftools

VCF/BCF Basics

View and query variant files using bcftools and cyvcf2.

Format Overview

FormatDescriptionUse Case
VCFText format, human-readableDebugging, small files
VCF.gzCompressed VCF (bgzip)Standard distribution
BCFBinary VCFFast processing, large files

VCF Format Structure

code
##fileformat=VCFv4.2
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SAMPLE1
chr1    1000    rs123   A       G       30      PASS    DP=50   GT:DP   0/1:25

Header Lines (##)

  • ##fileformat - VCF version
  • ##INFO - INFO field definitions
  • ##FORMAT - FORMAT field definitions
  • ##FILTER - Filter definitions
  • ##contig - Reference contigs
  • ##reference - Reference genome

Column Header (#CHROM)

Fixed columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT Followed by sample columns

Data Columns

ColumnDescription
CHROMChromosome
POS1-based position
IDVariant identifier (e.g., rs number)
REFReference allele
ALTAlternate allele(s), comma-separated
QUALPhred-scaled quality score
FILTERPASS or filter name
INFOSemicolon-separated key=value pairs
FORMATColon-separated format keys
SAMPLEColon-separated values matching FORMAT

bcftools view

View VCF

bash
bcftools view input.vcf.gz | head

View Header Only

bash
bcftools view -h input.vcf.gz

View Without Header

bash
bcftools view -H input.vcf.gz | head

View Specific Region

bash
bcftools view input.vcf.gz chr1:1000000-2000000

View Specific Samples

bash
bcftools view -s sample1,sample2 input.vcf.gz

Exclude Samples

bash
bcftools view -s ^sample3 input.vcf.gz

bcftools query

Extract specific fields in custom format.

Basic Query

bash
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\n' input.vcf.gz

Query with INFO Fields

bash
bcftools query -f '%CHROM\t%POS\t%INFO/DP\t%INFO/AF\n' input.vcf.gz

Query with Sample Fields

bash
bcftools query -f '%CHROM\t%POS[\t%GT]\n' input.vcf.gz

Query Specific Samples

bash
bcftools query -f '%CHROM\t%POS[\t%SAMPLE=%GT]\n' -s sample1,sample2 input.vcf.gz

Include Header

bash
bcftools query -H -f '%CHROM\t%POS\t%REF\t%ALT\n' input.vcf.gz

Common Format Specifiers

SpecifierDescription
%CHROMChromosome
%POSPosition
%IDVariant ID
%REFReference allele
%ALTAlternate allele
%QUALQuality score
%FILTERFilter status
%INFO/TAGINFO field value
%TYPEVariant type (snp, indel, etc.)
[%GT]Genotype (per sample)
[%DP]Depth (per sample)
[%SAMPLE]Sample name
\nNewline
\tTab

Format Conversion

VCF to BCF

bash
bcftools view -Ob -o output.bcf input.vcf.gz

BCF to VCF

bash
bcftools view -Ov -o output.vcf input.bcf

Compress VCF (bgzip)

bash
bgzip input.vcf
# Creates input.vcf.gz

Index VCF/BCF

bash
bcftools index input.vcf.gz
# Creates input.vcf.gz.csi

bcftools index -t input.vcf.gz
# Creates input.vcf.gz.tbi (tabix index)

Output Format Options

FlagFormat
-OvUncompressed VCF
-OzCompressed VCF (bgzip)
-OuUncompressed BCF
-ObCompressed BCF

Genotype Encoding

GenotypeMeaning
0/0Homozygous reference
0/1Heterozygous
1/1Homozygous alternate
1/2Heterozygous (two different alts)
./.Missing
0|1Phased heterozygous

cyvcf2 Python Alternative

Open and Iterate

python
from cyvcf2 import VCF

vcf = VCF('input.vcf.gz')
for variant in vcf:
    print(f'{variant.CHROM}:{variant.POS} {variant.REF}>{variant.ALT[0]}')

Access Variant Properties

python
from cyvcf2 import VCF

for variant in VCF('input.vcf.gz'):
    print(f'Chrom: {variant.CHROM}')
    print(f'Pos: {variant.POS}')
    print(f'ID: {variant.ID}')
    print(f'Ref: {variant.REF}')
    print(f'Alt: {variant.ALT}')  # List
    print(f'Qual: {variant.QUAL}')
    print(f'Filter: {variant.FILTER}')
    print(f'Type: {variant.var_type}')  # snp, indel, etc.
    break

Access INFO Fields

python
from cyvcf2 import VCF

for variant in VCF('input.vcf.gz'):
    dp = variant.INFO.get('DP')
    af = variant.INFO.get('AF')
    print(f'{variant.CHROM}:{variant.POS} DP={dp} AF={af}')

Access Genotypes

python
from cyvcf2 import VCF

vcf = VCF('input.vcf.gz')
samples = vcf.samples  # List of sample names

for variant in vcf:
    gts = variant.gt_types  # 0=HOM_REF, 1=HET, 2=UNKNOWN, 3=HOM_ALT
    for sample, gt in zip(samples, gts):
        gt_str = ['HOM_REF', 'HET', 'UNKNOWN', 'HOM_ALT'][gt]
        print(f'{sample}: {gt_str}')
    break

Access Sample Fields

python
from cyvcf2 import VCF

for variant in VCF('input.vcf.gz'):
    depths = variant.format('DP')  # numpy array
    gqs = variant.format('GQ')     # Genotype quality
    print(f'Depths: {depths}')

Fetch Region

python
from cyvcf2 import VCF

vcf = VCF('input.vcf.gz')
for variant in vcf('chr1:1000000-2000000'):
    print(f'{variant.CHROM}:{variant.POS}')

Get Header Info

python
from cyvcf2 import VCF

vcf = VCF('input.vcf.gz')
print(f'Samples: {vcf.samples}')
print(f'Contigs: {vcf.seqnames}')

# INFO field definitions
for info in vcf.header_iter():
    if info['HeaderType'] == 'INFO':
        print(f'{info["ID"]}: {info["Description"]}')

Write VCF

python
from cyvcf2 import VCF, Writer

vcf = VCF('input.vcf.gz')
writer = Writer('output.vcf', vcf)

for variant in vcf:
    if variant.QUAL > 30:
        writer.write_record(variant)

writer.close()
vcf.close()

Quick Reference

Taskbcftoolscyvcf2
View VCFbcftools view file.vcf.gzVCF('file.vcf.gz')
View headerbcftools view -h file.vcf.gzvcf.header_iter()
Get regionbcftools view file.vcf.gz chr1:1-1000vcf('chr1:1-1000')
Query fieldsbcftools query -f '%CHROM\t%POS\n'Loop with properties
Count variantsbcftools view -H file.vcf.gz | wc -lsum(1 for _ in vcf)
VCF to BCFbcftools view -Ob -o out.bcf in.vcf.gzUse Writer

Common Errors

ErrorCauseSolution
no BGZF EOF markerNot bgzippedUse bgzip not gzip
index requiredMissing index for region queryRun bcftools index
sample not foundWrong sample nameCheck with bcftools query -l

Related Skills

  • variant-calling - Generate VCF from alignments
  • filtering-best-practices - Filter variants by quality/criteria
  • vcf-manipulation - Merge, concat, intersect VCFs
  • alignment-files/pileup-generation - Generate pileup for calling