ChIP-seq Quality Control
Quality metrics for assessing ChIP-seq experiment success and replicate reproducibility.
FRiP (Fraction of Reads in Peaks)
Measures the proportion of reads falling within called peaks. Higher values indicate stronger enrichment.
Calculate FRiP with bedtools
bash
# Count reads in peaks reads_in_peaks=$(bedtools intersect -a chip.bam -b peaks.narrowPeak -u | samtools view -c -) total_reads=$(samtools view -c -F 260 chip.bam) # Calculate FRiP frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc) echo "FRiP: $frip"
Calculate FRiP with featureCounts
bash
# Convert peaks to SAF format
awk 'BEGIN{OFS="\t"} {print $4, $1, $2, $3, "."}' peaks.narrowPeak > peaks.saf
# Count reads in peaks
featureCounts -a peaks.saf -F SAF -o peak_counts.txt chip.bam
# FRiP from summary
grep -v "^#" peak_counts.txt.summary
Calculate FRiP with pysam
python
import pysam
import pybedtools
def calculate_frip(bam_file, peak_file):
bam = pysam.AlignmentFile(bam_file, 'rb')
total_reads = bam.count(read_callback=lambda r: not r.is_unmapped and not r.is_secondary)
peaks = pybedtools.BedTool(peak_file)
reads_in_peaks = 0
for peak in peaks:
reads_in_peaks += bam.count(peak.chrom, peak.start, peak.end)
frip = reads_in_peaks / total_reads
return frip
frip = calculate_frip('chip.bam', 'peaks.narrowPeak')
print(f'FRiP: {frip:.4f}')
FRiP Thresholds
| Target | Minimum FRiP | Good FRiP |
|---|---|---|
| TF (narrow) | 0.01 | > 0.05 |
| Histone (broad) | 0.10 | > 0.20 |
| H3K4me3 | 0.05 | > 0.15 |
| H3K27ac | 0.05 | > 0.10 |
Cross-Correlation Analysis (NSC/RSC)
Strand cross-correlation analysis measures ChIP enrichment by calculating correlation between read coverage on forward and reverse strands.
Run phantompeakqualtools
bash
# Run SPP cross-correlation analysis
Rscript run_spp.R \
-c=chip.bam \
-savp=chip_cc.pdf \
-out=chip_cc.txt \
-odir=qc/
# Output columns:
# 1: filename
# 2: numReads
# 3: estFragLen (estimated fragment length)
# 4: corr_estFragLen
# 5: phantomPeak
# 6: corr_phantomPeak
# 7: argmin_corr (minimum strand shift)
# 8: min_corr
# 9: NSC (Normalized Strand Coefficient)
# 10: RSC (Relative Strand Coefficient)
# 11: QualityTag
Interpret NSC and RSC
bash
# Parse results
awk -F'\t' '{
print "Fragment length:", $3
print "NSC:", $9
print "RSC:", $10
print "Quality:", $11
}' chip_cc.txt
NSC/RSC Thresholds
| Metric | Marginal | Acceptable | Ideal |
|---|---|---|---|
| NSC | < 1.05 | 1.05 - 1.1 | > 1.1 |
| RSC | < 0.8 | 0.8 - 1.0 | > 1.0 |
| QualityTag | -2 | 0 | 1 or 2 |
Plot Cross-Correlation in R
r
library(spp)
chip_data <- read.bam.tags('chip.bam')
binding_characteristics <- get.binding.characteristics(chip_data, srange=c(50, 500), bin=5)
# Cross-correlation plot
pdf('cc_plot.pdf')
plot(binding_characteristics$cross.correlation, type='l',
xlab='Strand shift', ylab='Cross-correlation')
abline(v=binding_characteristics$peak$x, col='red')
dev.off()
# Extract metrics
print(paste('Fragment length:', binding_characteristics$peak$x))
Library Complexity (NRF, PBC1, PBC2)
Measures of library complexity to detect PCR amplification issues.
Calculate with bedtools
bash
# NRF: Non-Redundant Fraction (unique reads / total reads)
total=$(samtools view -c -F 260 chip.bam)
unique=$(samtools view -F 260 chip.bam | cut -f1-4 | sort -u | wc -l)
nrf=$(echo "scale=4; $unique / $total" | bc)
echo "NRF: $nrf"
# PBC1: PCR Bottleneck Coefficient 1 (M1/Mdistinct)
# M1 = locations with exactly 1 read
# Mdistinct = distinct genomic locations
bedtools bamtobed -i chip.bam | \
awk '{print $1":"$2"-"$3}' | \
sort | uniq -c | \
awk '{
if($1==1) m1++
mdist++
} END {
print "M1:", m1
print "Mdistinct:", mdist
print "PBC1:", m1/mdist
}'
Library Complexity Thresholds
| Metric | Severe | Mild | None |
|---|---|---|---|
| NRF | < 0.5 | 0.5 - 0.8 | > 0.8 |
| PBC1 | < 0.5 | 0.5 - 0.8 | > 0.8 |
| PBC2 | < 1 | 1 - 3 | > 3 |
IDR (Irreproducibility Discovery Rate)
Measures consistency between biological replicates by comparing ranked peak lists.
Run IDR Analysis
bash
# Call peaks on each replicate
macs3 callpeak -t rep1.bam -c input.bam -n rep1 -g hs
macs3 callpeak -t rep2.bam -c input.bam -n rep2 -g hs
# Sort by signal value (column 7)
sort -k7,7nr rep1_peaks.narrowPeak > rep1_sorted.narrowPeak
sort -k7,7nr rep2_peaks.narrowPeak > rep2_sorted.narrowPeak
# Run IDR
idr --samples rep1_sorted.narrowPeak rep2_sorted.narrowPeak \
--input-file-type narrowPeak \
--rank signal.value \
--output-file idr_output.txt \
--plot idr_plot.pdf \
--log-output-file idr.log
IDR with Pooled Peaks
bash
# Call peaks on pooled data
samtools merge -f pooled.bam rep1.bam rep2.bam
macs3 callpeak -t pooled.bam -c input.bam -n pooled -g hs
# Run IDR with oracle
idr --samples rep1_sorted.narrowPeak rep2_sorted.narrowPeak \
--peak-list pooled_peaks.narrowPeak \
--input-file-type narrowPeak \
--rank signal.value \
--output-file idr_oracle.txt
Interpret IDR Results
bash
# Count peaks at different IDR thresholds awk '$5 >= 540' idr_output.txt | wc -l # IDR < 0.05 (conservative) awk '$5 >= 415' idr_output.txt | wc -l # IDR < 0.1 (optimal) # IDR output columns: # 1-3: chr, start, end # 4: name # 5: scaled IDR (-125 * log2(IDR)) # 6: strand # 7: signal (from rep1) # 8: signal (from rep2) # 9: local IDR # 10: global IDR
IDR Self-Consistency Check
bash
# Split one sample and check self-consistency
samtools view -s 0.5 chip.bam -b > pseudo_rep1.bam
samtools view -s 2.5 chip.bam -b > pseudo_rep2.bam
# Call peaks on pseudo-replicates
macs3 callpeak -t pseudo_rep1.bam -c input.bam -n pseudo1 -g hs
macs3 callpeak -t pseudo_rep2.bam -c input.bam -n pseudo2 -g hs
# Run IDR
idr --samples pseudo1_peaks.narrowPeak pseudo2_peaks.narrowPeak \
--input-file-type narrowPeak \
--output-file self_idr.txt
IDR Quality Guidelines
| Comparison | Expected IDR Peaks | Notes |
|---|---|---|
| True replicates | > 70% of pooled | Biological concordance |
| Pseudo-replicates | > 80% of sample | Technical consistency |
| Rep vs Pooled | ~100% of rep peaks | Subset relationship |
deepTools QC Metrics
plotFingerprint
bash
# Assess enrichment with fingerprint plot
plotFingerprint \
-b chip.bam input.bam \
--labels ChIP Input \
-o fingerprint.pdf \
--outRawCounts fingerprint.tab \
--outQualityMetrics fingerprint_qc.txt
# Good ChIP shows curve shifted right of diagonal
# Input follows diagonal
computeMatrix and plotProfile
bash
# TSS enrichment
computeMatrix reference-point \
-S chip.bw \
-R genes.bed \
--referencePoint TSS \
-a 3000 -b 3000 \
-o matrix.gz
plotProfile \
-m matrix.gz \
-o tss_enrichment.pdf \
--perGroup
plotCorrelation
bash
# Sample correlation
multiBamSummary bins \
-b rep1.bam rep2.bam rep3.bam \
-o results.npz
plotCorrelation \
-in results.npz \
--corMethod spearman \
--whatToPlot heatmap \
-o correlation.pdf \
--outFileCorMatrix correlation.tab
Complete QC Pipeline
bash
#!/bin/bash sample=$1 input=$2 peaks=$3 echo "=== ChIP-seq QC Report: $sample ===" > qc_report.txt # FRiP reads_in_peaks=$(bedtools intersect -a $sample -b $peaks -u | samtools view -c -) total_reads=$(samtools view -c -F 260 $sample) frip=$(echo "scale=4; $reads_in_peaks / $total_reads" | bc) echo "FRiP: $frip" >> qc_report.txt # Cross-correlation Rscript run_spp.R -c=$sample -out=cc.txt -odir=. nsc=$(cut -f9 cc.txt) rsc=$(cut -f10 cc.txt) echo "NSC: $nsc" >> qc_report.txt echo "RSC: $rsc" >> qc_report.txt # Library complexity unique=$(samtools view -F 260 $sample | cut -f1-4 | sort -u | wc -l) nrf=$(echo "scale=4; $unique / $total_reads" | bc) echo "NRF: $nrf" >> qc_report.txt # Fingerprint plotFingerprint -b $sample $input -o fingerprint.pdf --outQualityMetrics fingerprint_qc.txt cat qc_report.txt
QC Summary Table
| Metric | Tool | Ideal Value |
|---|---|---|
| FRiP | bedtools/featureCounts | > 0.05 (TF), > 0.1 (histone) |
| NSC | phantompeakqualtools | > 1.1 |
| RSC | phantompeakqualtools | > 1.0 |
| NRF | samtools/bedtools | > 0.8 |
| PBC1 | bedtools | > 0.8 |
| IDR (replicates) | idr | > 70% concordance |
Related Skills
- •peak-calling - Call peaks before QC analysis
- •alignment-files - BAM statistics and filtering
- •differential-binding - Compare conditions after QC
- •atac-seq/atac-qc - Similar QC for ATAC-seq