Variant Normalization
Left-align indels and split multiallelic sites using bcftools norm.
Why Normalize?
The same variant can be represented multiple ways:
code
# Same 2-bp deletion, assuming the reference reads GCACAT at chr1:100-105
chr1 100 GCACA GCA (untrimmed: extra trailing bases)
chr1 102 ACA A (right-shifted)
chr1 100 GCA G (left-aligned and trimmed: normalized)
Normalization ensures consistent representation for:
- Comparing variants from different callers
- Database lookups (dbSNP, ClinVar)
- Merging VCF files
bcftools norm
Left-Align Indels
bash
bcftools norm -f reference.fa input.vcf.gz -Oz -o normalized.vcf.gz
Requires reference FASTA to determine left-most representation.
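To make the role of the reference concrete, here is a minimal pure-Python sketch of the trim-and-extend procedure that is commonly used to describe left-alignment. It is not bcftools' implementation; the function name `left_align`, the `ref_seq` string, and the `offset` parameter are illustrative assumptions, and the six-base reference is made up.

```python
def left_align(ref_seq, pos, ref, alt, offset=1):
    """Left-align and trim one biallelic variant (illustrative sketch).

    ref_seq: reference sequence whose first base has coordinate `offset`
    pos:     1-based coordinate of the first base of `ref`
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos <= offset:
                raise ValueError("cannot extend past the start of ref_seq")
            pos -= 1
            base = ref_seq[pos - offset]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical reference: GCACAT at coordinates 100-105.
print(left_align("GCACAT", 102, "ACA", "A", offset=100))
print(left_align("GCACAT", 100, "GCACA", "GCA", offset=100))
```

Both calls describe the same 2-bp deletion inside a CA repeat and collapse to the same position and alleles, which is why the reference sequence is needed: the leftward extension step reads bases from it.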
Check for Normalization Issues
bash
# Warn about each REF allele that does not match the reference; discard the output
bcftools norm -f reference.fa -c w input.vcf.gz > /dev/null
Check modes (-c):
- e - Exit with an error on mismatch (default)
- w - Warn on mismatch and continue
- x - Exclude mismatching records
- s - Set the REF allele from the reference
Multiallelic Sites
Split Multiallelic to Biallelic
bash
bcftools norm -m-any input.vcf.gz -Oz -o split.vcf.gz
Before:
code
chr1 100 . A G,T 30 PASS . GT 1/2
After:
code
chr1 100 . A G 30 PASS . GT 1/0
chr1 100 . A T 30 PASS . GT 0/1
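The genotype remapping in the Before/After example can be sketched as follows. This is a simplified illustration, not bcftools' code: `split_multiallelic` is a hypothetical helper, it handles a single sample, and it ignores the INFO and FORMAT fields that bcftools also has to split.

```python
def split_multiallelic(chrom, pos, ref, alts, gt):
    """Split one multiallelic record into biallelic records.

    gt is a list of allele indices for one sample (0 = REF, None = missing).
    """
    records = []
    for i, alt in enumerate(alts, start=1):
        # Allele i becomes 1; every other called allele collapses to 0.
        new_gt = [None if a is None else int(a == i) for a in gt]
        records.append((chrom, pos, ref, alt, new_gt))
    return records


for record in split_multiallelic("chr1", 100, "A", ["G", "T"], [1, 2]):
    print(record)
```

For the 1/2 genotype above, the first record carries 1/0 and the second 0/1, matching the example output.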
Split SNPs Only
bash
bcftools norm -m-snps input.vcf.gz -Oz -o split_snps.vcf.gz
Split Indels Only
bash
bcftools norm -m-indels input.vcf.gz -Oz -o split_indels.vcf.gz
Join Biallelic to Multiallelic
bash
bcftools norm -m+any input.vcf.gz -Oz -o merged.vcf.gz
Split Options
| Option | Description |
|---|---|
| -m-any | Split all multiallelic sites |
| -m-snps | Split multiallelic SNPs only |
| -m-indels | Split multiallelic indels only |
| -m-both | Split SNPs and indels separately |
| -m+any | Join biallelic sites into multiallelic |
| -m+snps | Join biallelic SNPs |
| -m+indels | Join biallelic indels |
| -m+both | Join SNPs and indels separately |
Combined Normalization
Standard Normalization Pipeline
bash
bcftools norm -f reference.fa -m-any input.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
This:
- Left-aligns indels
- Splits multiallelic sites
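When the pipeline is driven from a script, the command can be assembled as an argument list. The sketch below is one possible wrapper; `norm_command` and its parameters are hypothetical names, and it only uses flags that already appear in this document.

```python
import shlex


def norm_command(vcf, reference, out, split=True, rm_dup=None):
    """Build the argument list for a combined `bcftools norm` call."""
    cmd = ["bcftools", "norm", "-f", reference]
    if split:
        cmd.append("-m-any")      # split multiallelic sites
    if rm_dup:
        cmd += ["-d", rm_dup]     # e.g. "exact"
    cmd += [vcf, "-Oz", "-o", out]
    return cmd


cmd = norm_command("input.vcf.gz", "reference.fa", "normalized.vcf.gz")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True), then index the output file.
```

Passing a list rather than a shell string avoids quoting problems with unusual file names.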
Remove Duplicates After Splitting
bash
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz -Oz -o normalized.vcf.gz
Duplicate removal options (-d):
- exact - Remove records that match exactly
- snps - Remove duplicate SNPs
- indels - Remove duplicate indels
- both - Remove duplicate SNPs and indels
- all - Remove all duplicates
- none - Same matching rule as exact (identical REF and ALT); it does not keep duplicates

Duplicates are kept only when -d is omitted.
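Splitting can create duplicates when a variant was present both on its own and inside a multiallelic record. The sketch below shows first-instance-wins deduplication, assuming "exact" means the same CHROM, POS, REF, and ALT; `drop_exact_duplicates` is a hypothetical helper and the records are invented.

```python
def drop_exact_duplicates(records):
    """Keep the first occurrence of each (chrom, pos, ref, alt) tuple."""
    seen = set()
    kept = []
    for record in records:
        if record not in seen:
            seen.add(record)
            kept.append(record)
    return kept


# chr1:100 A>G appears twice: once from a split A>G,T record, once on its own.
records = [
    ("chr1", 100, "A", "G"),
    ("chr1", 100, "A", "T"),
    ("chr1", 100, "A", "G"),
]
print(drop_exact_duplicates(records))
```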
Fixing Reference Alleles
Fix Mismatches from Reference
bash
bcftools norm -f reference.fa -c s input.vcf.gz -Oz -o fixed.vcf.gz
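This sets REF alleles to match the reference genome. It can swap alleles and update GT and AC accordingly, but it does not fix strand issues, so do not use it to repair strand flips.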
Exclude Mismatches
bash
bcftools norm -f reference.fa -c x input.vcf.gz -Oz -o clean.vcf.gz
Removes variants where REF doesn't match reference.
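The REF check itself is a substring comparison against the reference. The following sketch mimics the exclude behaviour on in-memory tuples; `exclude_ref_mismatches`, `ref_seq`, and `offset` are illustrative assumptions rather than anything bcftools exposes.

```python
def exclude_ref_mismatches(records, ref_seq, offset=1):
    """Partition (pos, ref, alt) records by whether REF matches ref_seq.

    ref_seq's first base has coordinate `offset`; pos is 1-based.
    """
    kept, dropped = [], []
    for pos, ref, alt in records:
        start = pos - offset
        # Positions before the start of ref_seq count as mismatches.
        expected = ref_seq[start:start + len(ref)] if start >= 0 else ""
        if expected.upper() == ref.upper():
            kept.append((pos, ref, alt))
        else:
            dropped.append((pos, ref, alt))
    return kept, dropped


# Hypothetical reference GCACAT at coordinates 100-105; position 103 is C, not T.
kept, dropped = exclude_ref_mismatches(
    [(100, "GCA", "G"), (103, "T", "A")], "GCACAT", offset=100
)
print(kept, dropped)
```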
Atomize Complex Variants
Split MNPs to SNPs
bash
bcftools norm --atomize input.vcf.gz -Oz -o atomized.vcf.gz
Before:
code
chr1 100 . ATG GCA 30 PASS
After:
code
chr1 100 . A G 30 PASS
chr1 101 . T C 30 PASS
chr1 102 . G A 30 PASS
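For equal-length substitutions, the decomposition in the Before/After example amounts to walking the two alleles base by base. This sketch covers only that case; `atomize_mnp` is a hypothetical helper, and bcftools additionally handles complex records that mix substitutions and indels.

```python
def atomize_mnp(chrom, pos, ref, alt):
    """Decompose an equal-length substitution into single-base records."""
    if len(ref) != len(alt):
        raise ValueError("this sketch only handles equal-length REF and ALT")
    return [
        (chrom, pos + i, r, a)
        for i, (r, a) in enumerate(zip(ref, alt))
        if r != a  # positions where the bases agree produce no record
    ]


for record in atomize_mnp("chr1", 100, "ATG", "GCA"):
    print(record)
```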
Atomize and Left-Align
bash
bcftools norm -f reference.fa --atomize input.vcf.gz -Oz -o atomized.vcf.gz
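Track Original Records
Tag Records Changed by Normalization
bash
bcftools norm -f reference.fa -m-any --old-rec-tag OLD input.vcf.gz -Oz -o updated.vcf.gz
Adds an INFO/OLD annotation holding the original record to each record that normalization modified. The option only annotates; it must be combined with a normalization step such as -f or -m, and it does not change the VCF version.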
Common Workflows
Before Comparing Callers
bash
# Normalize both VCFs the same way
for vcf in caller1.vcf.gz caller2.vcf.gz; do
    base=$(basename "$vcf" .vcf.gz)
    bcftools norm -f reference.fa -m-any "$vcf" -Oz -o "${base}.norm.vcf.gz"
    bcftools index "${base}.norm.vcf.gz"
done
# Now compare
bcftools isec -p comparison caller1.norm.vcf.gz caller2.norm.vcf.gz
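The reason for normalizing first is that an exact-allele comparison works on (CHROM, POS, REF, ALT) keys, so two representations of the same variant would otherwise count as different. A small sketch with invented records (`compare_callsets` is a hypothetical helper, not part of bcftools):

```python
def compare_callsets(a, b):
    """Compare two call sets given as (chrom, pos, ref, alt) tuples."""
    a, b = set(a), set(b)
    return {"only_a": a - b, "only_b": b - a, "shared": a & b}


# Both callers found the same 2-bp deletion, but caller2 reported it right-shifted.
caller1 = [("chr1", 100, "GCA", "G")]
caller2_raw = [("chr1", 102, "ACA", "A")]
caller2_norm = [("chr1", 100, "GCA", "G")]

print(len(compare_callsets(caller1, caller2_raw)["shared"]))
print(len(compare_callsets(caller1, caller2_norm)["shared"]))
```

Before normalization the two callers share no keys; afterwards the deletion is recognized as common to both.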
Before Database Annotation
bash
bcftools norm -f reference.fa -m-any variants.vcf.gz -Oz -o normalized.vcf.gz
bcftools index normalized.vcf.gz
# Now annotate against dbSNP, ClinVar, etc.
Prepare for GWAS
bash
bcftools norm -f reference.fa -m-any -d exact input.vcf.gz | \
bcftools view -v snps -Oz -o gwas_ready.vcf.gz
bcftools index gwas_ready.vcf.gz
cyvcf2 Normalization Check
Check if Variants Need Normalization
python
from cyvcf2 import VCF

def needs_normalization(variant):
    """Flag multiallelic sites and MNP-like records (unaligned indels are not detected)."""
    # Multiallelic: more than one ALT allele
    if len(variant.ALT) > 1:
        return True
    # No ALT allele (monomorphic site): nothing to normalize
    if not variant.ALT:
        return False
    # Equal-length REF and ALT longer than one base: potential MNP
    ref, alt = variant.REF, variant.ALT[0]
    return len(ref) > 1 and len(ref) == len(alt)

count = 0
for variant in VCF('input.vcf.gz'):
    if needs_normalization(variant):
        count += 1
print(f'Variants needing normalization: {count}')
Count Multiallelic Sites
python
from cyvcf2 import VCF

multiallelic = 0
total = 0
for variant in VCF('input.vcf.gz'):
    total += 1
    if len(variant.ALT) > 1:
        multiallelic += 1

print(f'Total variants: {total}')
print(f'Multiallelic sites: {multiallelic}')
# Skip the percentage for an empty VCF to avoid dividing by zero
if total:
    print(f'Percentage: {multiallelic / total * 100:.1f}%')
Quick Reference
| Task | Command |
|---|---|
| Left-align indels | bcftools norm -f ref.fa in.vcf.gz |
| Split multiallelic | bcftools norm -m-any in.vcf.gz |
| Join to multiallelic | bcftools norm -m+any in.vcf.gz |
| Full normalization | bcftools norm -f ref.fa -m-any in.vcf.gz |
| Fix REF alleles | bcftools norm -f ref.fa -c s in.vcf.gz |
| Remove duplicates | bcftools norm -d exact in.vcf.gz |
| Atomize MNPs | bcftools norm --atomize in.vcf.gz |
Common Errors
| Error | Cause | Solution |
|---|---|---|
| REF does not match | Wrong reference | Use same reference as caller |
| not sorted | Unsorted input | Run bcftools sort first |
| duplicate records | Same position twice | Use -d to remove |
Related Skills
- variant-calling - Generate VCF files
- filtering-best-practices - Filter after normalization
- vcf-manipulation - Compare normalized VCFs
- variant-annotation - Annotate normalized variants