Genotype Imputation

Name: bio-phasing-imputation-genotype-imputation
Rating: 92
Author: GPTomics

Beagle Imputation

bash

# Basic imputation
java -jar beagle.jar \
    gt=study.vcf.gz \
    ref=reference_panel.vcf.gz \
    map=genetic_map.txt \
    out=imputed

# Output: imputed.vcf.gz with imputed genotypes

Beagle with Options

bash

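# gp=true      output genotype probabilities
# ap=true      output allele probabilities
# impute=true  perform imputation (default)
# ne=20000     effective population size
# Comments sit above the command because any text after a trailing
# backslash breaks the shell line continuation.
java -Xmx32g -jar beagle.jar \
    gt=study.vcf.gz \
    ref=reference_panel.vcf.gz \
    map=genetic_map.txt \
    out=imputed \
    nthreads=8 \
    gp=true \
    ap=true \
    impute=true \
    ne=20000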

Impute Per Chromosome

bash

for chr in {1..22}; do
    java -Xmx32g -jar beagle.jar \
        gt=study.chr${chr}.vcf.gz \
        ref=ref.chr${chr}.vcf.gz \
        map=genetic_maps/plink.chr${chr}.GRCh38.map \
        out=imputed.chr${chr} \
        gp=true \
        nthreads=8
done

# Concatenate in chromosome order (a chr* glob would sort chr10 before chr2)
bcftools concat imputed.chr{1..22}.vcf.gz -Oz -o imputed.all.vcf.gz
bcftools index imputed.all.vcf.gz

IMPUTE5 (Alternative)

bash

# IMPUTE5, the newer release in the IMPUTE family (expects pre-phased study data)
impute5 \
    --h reference.bcf \
    --m genetic_map.txt \
    --g study.vcf.gz \
    --r chr22 \
    --o imputed.chr22.vcf.gz \
    --threads 8

Minimac4 (Michigan Imputation Server)

bash

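# Often used via web server, but can run locally.
# Flags below follow the Minimac4 1.x command line; later releases changed
# the interface, so confirm with `minimac4 --help` for your version.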
minimac4 \
    --refHaps reference.m3vcf.gz \
    --haps study.vcf.gz \
    --prefix imputed \
    --format GT,DS,GP \
    --cpus 8

Input Preparation

bash

# 1. Align to reference (strand, allele order)
bcftools +fixref study.vcf.gz -Oz -o fixed.vcf.gz -- \
    -f reference.fa -m flip
bcftools index fixed.vcf.gz

# 2. Keep only sites also present in the reference (isec needs both inputs indexed)
bcftools isec -n=2 -w1 fixed.vcf.gz reference_sites.vcf.gz \
    -Oz -o study_overlap.vcf.gz

# 3. Phase first (if not already phased)
java -jar beagle.jar gt=study_overlap.vcf.gz out=phased

# 4. Then impute
java -jar beagle.jar gt=phased.vcf.gz ref=reference.vcf.gz out=imputed
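
To make step 1 concrete, here is a minimal Python sketch of what "strand, allele order" alignment has to decide for a single biallelic SNP. It is an illustration under stated assumptions, not a description of how `bcftools +fixref` works internally; the function name `classify_alleles` and its category labels are hypothetical.

```python
# Complement of each base, used to simulate reading the opposite strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def classify_alleles(study_ref, study_alt, panel_ref, panel_alt):
    """Compare one biallelic SNP in the study against the reference panel.

    Hypothetical helper: returns one of "ambiguous", "match", "swap",
    "flip", "flip+swap" or "mismatch". Handles single-base alleles only.
    """
    study = (study_ref, study_alt)
    panel = (panel_ref, panel_alt)
    flipped = (COMPLEMENT[study_ref], COMPLEMENT[study_alt])

    # A/T and C/G SNPs look identical on both strands, so the alleles
    # alone cannot tell us whether a flip is needed.
    if COMPLEMENT[study_ref] == study_alt:
        return "ambiguous"
    if study == panel:
        return "match"
    if study[::-1] == panel:
        return "swap"        # REF and ALT are exchanged
    if flipped == panel:
        return "flip"        # reported on the opposite strand
    if flipped[::-1] == panel:
        return "flip+swap"
    return "mismatch"

print(classify_alleles("T", "C", "A", "G"))  # flip
print(classify_alleles("A", "T", "A", "T"))  # ambiguous
```

Sites classified as ambiguous or mismatched are the ones worth inspecting or dropping before phasing.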

Extract Imputation Quality

bash

# Imputation quality is stored in INFO/DR2 (Beagle) or INFO/R2 (Minimac4)
bcftools query -f '%CHROM\t%POS\t%ID\t%INFO/DR2\n' imputed.vcf.gz > info_scores.txt

# Filter by quality
bcftools view -i 'INFO/DR2 > 0.3' imputed.vcf.gz -Oz -o imputed_filtered.vcf.gz
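
Before filtering, it can help to see how many variants a threshold would keep. The sketch below reads rows laid out like `info_scores.txt` (CHROM, POS, ID, score) and reports the fraction above a cutoff. It assumes biallelic records with one score per line; the helper name `fraction_passing` and the demo rows are made up for illustration.

```python
import csv
import io

def fraction_passing(handle, threshold=0.3):
    """Fraction of variants whose quality score exceeds the threshold.

    Assumes four tab-separated columns (CHROM, POS, ID, score) with a
    single score per row, i.e. biallelic records.
    """
    total = passed = 0
    for chrom, pos, var_id, score in csv.reader(handle, delimiter="\t"):
        if score == ".":          # bcftools query writes "." for a missing value
            continue
        total += 1
        if float(score) > threshold:
            passed += 1
    return passed / total if total else 0.0

# Made-up rows in the same layout as info_scores.txt
demo = io.StringIO(
    "chr22\t100\trs1\t0.95\n"
    "chr22\t200\trs2\t0.10\n"
    "chr22\t300\trs3\t0.45\n"
    "chr22\t400\trs4\t0.29\n"
)
print(fraction_passing(demo))  # 2 of the 4 demo rows exceed 0.3
```

For a real file, pass `open("info_scores.txt", newline="")` instead of the in-memory demo.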

Output Formats

Field	Values	Description
GT	0|0, 0|1, 1|1	Hard-called (phased) genotype
DS	0.0-2.0	Dosage (expected ALT allele count)
GP	0.0-1.0,0.0-1.0,0.0-1.0	Genotype probabilities (REF/REF, REF/ALT, ALT/ALT)
DR2/R2	0.0-1.0	Imputation quality score (INFO field)
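
The relationship between GP, DS and GT in the table can be worked through in a few lines. This is a sketch of the standard definitions (dosage as the expected ALT allele count, hard call as the most probable genotype); the function names are hypothetical and the probabilities are made up.

```python
def dosage_from_gp(gp):
    """Expected ALT allele count from (P(REF/REF), P(REF/ALT), P(ALT/ALT))."""
    p_ref_ref, p_ref_alt, p_alt_alt = gp
    return 0.0 * p_ref_ref + 1.0 * p_ref_alt + 2.0 * p_alt_alt

def hard_call_from_gp(gp):
    """ALT allele count (0, 1 or 2) of the most probable genotype."""
    return max(range(3), key=lambda i: gp[i])

gp = (0.05, 0.80, 0.15)            # made-up genotype probabilities
print(dosage_from_gp(gp))          # 0.80 + 2 * 0.15, about 1.1
print(hard_call_from_gp(gp))       # 1 (heterozygous is most probable)
```

The dosage keeps the uncertainty that the hard call throws away, which is why dosages are preferred for association testing.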

Using Dosages for GWAS

python

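import pandas as pd

# Extract dosages (one row per variant, one DS column per sample):
# bcftools query -f '%CHROM\t%POS\t%ID[\t%DS]\n' imputed.vcf.gz > dosages.txt

# bcftools query writes no header row, so header=None keeps the first variant as data
dosages = pd.read_csv('dosages.txt', sep='\t', header=None)
dosages = dosages.rename(columns={0: 'CHROM', 1: 'POS', 2: 'ID'})

# Dosage-based association accounts for imputation uncertainty
# In PLINK2, read dosages with --vcf ... dosage=DS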

bash

# PLINK2 with dosages
plink2 --vcf imputed.vcf.gz dosage=DS \
    --glm \
    --pheno phenotypes.txt \
    --out gwas_results
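
To show what a dosage-based test estimates, here is a minimal single-variant sketch: an ordinary least squares slope of a quantitative phenotype on dosage. It is not PLINK2's implementation (which also handles covariates, test statistics and binary traits); the function name and data are made up for illustration.

```python
def dosage_regression_slope(dosages, phenotypes):
    """Least squares slope of phenotype on dosage for one variant.

    The slope is the estimated change in phenotype per expected ALT allele.
    """
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_y = sum(phenotypes) / n
    sxy = sum((d - mean_d) * (y - mean_y) for d, y in zip(dosages, phenotypes))
    sxx = sum((d - mean_d) ** 2 for d in dosages)
    if sxx == 0:
        raise ValueError("dosage has no variation across samples")
    return sxy / sxx

# Made-up data constructed so that phenotype = 1 + 2 * dosage
dosages = [0.0, 0.1, 0.9, 1.0, 1.8, 2.0]
phenotypes = [1.0, 1.2, 2.8, 3.0, 4.6, 5.0]
print(round(dosage_regression_slope(dosages, phenotypes), 6))
```

Fractional dosages such as 0.1 or 1.8 enter the regression directly, so uncertain genotypes contribute less extreme values than hard calls would.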

Quality Thresholds

Analysis	Minimum INFO/R2
GWAS discovery	0.3
GWAS fine-mapping	0.8
Meta-analysis	0.5
Polygenic scores	0.9
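
The thresholds above trade variant count for confidence. The sketch below counts how many variants would survive each cutoff; the threshold values are copied from the table, while the helper name `retained_counts` and the scores are made up.

```python
# Minimum quality per analysis type, copied from the table above
MIN_QUALITY = {
    "GWAS discovery": 0.3,
    "Meta-analysis": 0.5,
    "GWAS fine-mapping": 0.8,
    "Polygenic scores": 0.9,
}

def retained_counts(scores):
    """Number of variants meeting each analysis threshold."""
    return {
        analysis: sum(score >= cutoff for score in scores)
        for analysis, cutoff in MIN_QUALITY.items()
    }

scores = [0.12, 0.35, 0.55, 0.82, 0.91, 0.99]  # made-up quality scores
for analysis, kept in retained_counts(scores).items():
    print(f"{analysis}: {kept} of {len(scores)} variants retained")
```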

Key Parameters

Parameter	Value	Description
gt	input VCF	Study genotypes
ref	reference VCF	Reference panel
map	genetic map	Recombination map (PLINK format)
gp	true/false	Output genotype probabilities
ne	20000	Effective population size (example value; the default differs between Beagle versions)
nthreads	N	CPU threads
window	40	Window size (cM)
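
When runs are scripted, the `key=value` parameters in the table map naturally onto a small command builder. This is a sketch only: `beagle_command` is a hypothetical helper, the jar path and memory setting are placeholders, and it only formats arguments without validating them against Beagle.

```python
def beagle_command(jar="beagle.jar", memory="32g", **params):
    """Build an argument list for a Beagle run.

    Hypothetical helper: each keyword becomes a key=value argument,
    with Python booleans written as true/false.
    """
    args = ["java", f"-Xmx{memory}", "-jar", jar]
    for key, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"{key}={value}")
    return args

cmd = beagle_command(
    gt="study.vcf.gz",
    ref="reference_panel.vcf.gz",
    map="genetic_map.txt",
    out="imputed",
    gp=True,
    nthreads=8,
)
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.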

Imputation Servers

For large-scale imputation, consider web-based servers:

• Michigan Imputation Server: imputationserver.sph.umich.edu
• TOPMed Imputation Server: imputation.biodatacatalyst.nhlbi.nih.gov
• Sanger Imputation Server: imputation.sanger.ac.uk

bash

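# Prepare input for server
# Most require VCF.GZ per chromosome
# -r needs an indexed input; contig names must match the VCF (chr1 vs 1)
bcftools index study.vcf.gz
for chr in {1..22}; do
    bcftools view -r chr${chr} study.vcf.gz -Oz -o study.chr${chr}.vcf.gz
done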

Related Skills

• phasing-imputation/haplotype-phasing - Pre-phasing step
• phasing-imputation/reference-panels - Reference panel setup
• phasing-imputation/imputation-qc - Quality control
• population-genetics/association-testing - GWAS with imputed data