Functional Enrichment (HOMER + R)

Overview

•Validate input: Accept BED/peak files with genomic coordinates or gene lists; check format and genome assembly.
•Map regions to genes: Convert regions to a unique gene set using HOMER annotatePeaks.pl.
•Run GO enrichment: Use HOMER findGO.pl (or annotatePeaks.pl -go) for BP/MF/CC.
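•Run KEGG enrichment: Read kegg.txt from the same output directory; the HOMER GO run (findGO.pl or annotatePeaks.pl -go) writes KEGG and other pathway tables alongside the GO tables.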
•Collect outputs: Save tidy tables for downstream plotting and a compact summary of top terms.
•Visualize in R: Create barplots and dotplots (GO/KEGG) with ggplot2 from standardized outputs.
•QC & troubleshooting: Provide checks for genome mismatch, chromosome naming, and low-signal inputs.

Inputs & Outputs

Inputs (choose one):

Option 1: Input is a genomic region file (BED/narrowPeak/broadPeak)

Supported genomic region formats:

•BED: standard genomic interval format (chrom, start, end in the first three columns)
•narrowPeak: ENCODE narrowPeak format (BED6+4)
•broadPeak: ENCODE broadPeak format (BED6+3)

Option 2: Input is a gene list (txt)

•gene_list.txt with one official gene symbol per line (no header). Optionally, also provide gene_list_background.txt in the same format to serve as the background set.
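The choice between Option 1 and Option 2 can be checked from the first data line of the file. The following is a minimal sketch, not part of the original tooling: the function name classify_input and its rules (tab-delimited with integer start/end means a region file, a single token means a gene list) are assumptions made for illustration.

```python
def classify_input(lines):
    """Return 'regions' or 'gene_list' based on the first data line.

    `lines` is any iterable of text lines, for example an open file handle.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        # Skip blank lines and common UCSC-style header/comment lines.
        if not line.strip() or line.startswith(("#", "track ", "browser ")):
            continue
        fields = line.split("\t")
        # A region line has at least chrom, start, end with integer coordinates.
        if len(fields) >= 3 and fields[1].isdigit() and fields[2].isdigit():
            if int(fields[1]) >= int(fields[2]):
                raise ValueError(f"start >= end in region line: {line!r}")
            return "regions"
        # A gene list has exactly one token per line.
        if len(line.split()) == 1:
            return "gene_list"
        raise ValueError(f"unrecognized input line: {line!r}")
    raise ValueError("input contains no data lines")
```

Only the first data line is inspected, so this is a quick sanity check rather than full validation.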

Outputs (directory layout):


${sample}_functional_enrichment/
    results/
      ${sample}.anno_genomic_features.txt
      ${sample}.anno_genomic_features_stats.txt
      biological_process.txt
      cellular_component.txt  
      molecular_function.txt  

      kegg.txt                
      biocyc.txt              
      chromosome.txt  
      cosmic.txt
      interactions.txt  
      interpro.txt
      gene3d.txt
      pathwayInteractionDB.txt
      pfam.txt
      prints.txt    
      prosite.txt   
      reactome.txt
      smpdb.txt
      wikipathways.txt

      gwas.txt          
      lipidmaps.txt           
      msigdb.txt                
      smart.txt

    tables/
      ${sample}.gene_list.txt
      go_bp.tsv
      go_mf.tsv
      go_cc.tsv
      kegg.tsv
    logs/
      ${sample}.anno_genomic_features.log # if genome region file is provided
      findGO.log

Decision Tree

Step 0 — Gather Required Information from the User

Before calling any tool, ask the user:

•Sample name (sample): used as prefix and for the output directory ${sample}_functional_enrichment.
•Genome assembly (genome): e.g. hg38, mm10, danRer11. Never guess or auto-detect it.

Step 1: Initialize Project

•Create the directory for this project:

Call:

•mcp__project-init-tools__project_init

with:

•sample: the user-provided sample name
•task: functional_enrichment

The tool will:

•Create ${sample}_functional_enrichment directory.
•Return the full path of the ${sample}_functional_enrichment directory, referred to below as ${proj_dir}.

Step 2: Prepare the genome for HOMER

Call:

•mcp__homer-tools__check_genome_installation

With:

•genome: the user-provided genome assembly, e.g. hg38, mm10, danRer11

The tool will:

•Check if the genome is installed in HOMER.
•If not, install the genome.

Step 3 (Optional): Standardize chromosome names for BED files

This step is optional. Only perform this step if the input file is a BED file. If the input file is a gene list, skip this step.

Chromosome names are converted to the chr-prefixed form:

•1 → chr1
•MT → chrM

Call:

•mcp__file-format-tools__standardize_bed_chrom_names

with:

•input_bed: the user-provided BED file
•output_bed: the path to save the standardized BED file

The tool will:

•Standardize the chromosome names in the BED file.
•Return the path of the standardized BED file.
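The renaming rule can be sketched as below. This illustrates the conversion only and is not the implementation of mcp__file-format-tools__standardize_bed_chrom_names; the function names are hypothetical, and leaving unplaced scaffolds and other unrecognized names untouched is an assumption.

```python
def standardize_chrom(name):
    """Map 1 -> chr1, X -> chrX, MT -> chrM; leave other names unchanged."""
    if name in ("MT", "M"):
        return "chrM"
    if name.isdigit() or name in ("X", "Y"):
        return "chr" + name
    # Already-prefixed names and scaffolds are passed through as-is.
    return name


def standardize_bed_lines(lines):
    """Yield BED lines with the first column renamed; headers pass through."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track ", "browser ")):
            yield line
            continue
        # Split off only the first column so the rest of the line is preserved.
        chrom, sep, rest = line.partition("\t")
        yield standardize_chrom(chrom) + sep + rest
```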

Step 4 (Optional): Convert gene ID to gene symbol

This step is optional. Only perform this step if the input file is a gene list file. If the input file is a BED file, skip this step.

Call:

•mcp__mygene-tools__convert_gene_ids_mygene

With:

•input_ids_file: the user-provided gene list file. May end with .txt.
•scopes: the source ID type for mygene (e.g., 'ensembl.gene', 'symbol', 'entrezgene', 'uniprot', or a comma-separated list).
•fields: the comma-separated target fields to retrieve from mygene (e.g., 'symbol,ensembl.gene,uniprot,entrezgene').
•species: the species for mygene (e.g., 'human', 'mouse', 'zebrafish', or NCBI taxon ID like '9606').
•out_file: the path to save the converted gene list file. In this skill, place the file under the ${sample}_functional_enrichment directory returned by mcp__project-init-tools__project_init
•batch_size: the batch size for mygene.querymany (default 1000).

The tool will:

•Convert the gene IDs to gene symbols.
•Return the path of the converted gene list file.
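For reference, the conversion can be sketched with the mygene Python client. This assumes the MCP tool wraps mygene's querymany, which the batch_size description suggests but the document does not confirm; the helper names are hypothetical. The third-party import is deferred so the result-parsing helper can be used on its own.

```python
def symbols_from_hits(hits):
    """Collect unique symbols (first-seen order) and unmapped queries
    from a querymany-style list of result dictionaries."""
    seen, symbols, unmapped = set(), [], []
    for hit in hits:
        symbol = hit.get("symbol")
        if hit.get("notfound") or not symbol:
            unmapped.append(hit.get("query"))
        elif symbol not in seen:
            seen.add(symbol)
            symbols.append(symbol)
    return symbols, unmapped


def convert_ids(ids, scopes="ensembl.gene", species="human"):
    """Query mygene.info for symbols; requires network access and `pip install mygene`."""
    import mygene  # imported lazily: only needed for the live query

    client = mygene.MyGeneInfo()
    hits = client.querymany(ids, scopes=scopes, fields="symbol", species=species)
    return symbols_from_hits(hits)
```

Reporting the unmapped IDs back to the user is worthwhile, because a large unmapped fraction usually points to the wrong scopes or species.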

Step 5: GO enrichment analysis

Option 1: from genomic regions file

Only if the input file is a BED file. If the input file is a gene list, call tools in Option 2.

•Annotate the genomic regions using HOMER's annotatePeaks.pl with the -go option. If the user also provides a background genomic region file (for example, a control peak file), call this tool for that file as well, using a different ${sample} name for the background sample.

Call: mcp__homer-tools__annotate_genomic_features

With:

•sample: the user-provided sample name
•proj_dir: directory to save the genomic feature annotation results. In this skill, it is the full path of the ${sample}_functional_enrichment directory returned by mcp__project-init-tools__project_init
•regions_bed: the user-provided regions file in BED format. May end with .bed, .narrowPeak, .broadPeak, etc.
•genome: the user-provided genome assembly, e.g. hg38, mm10, danRer11
•ann: custom HOMER annotation file created by assignGenomeAnnotation.pl (default: None).
•size_given: keep original region sizes (default: True)
•cpg: include CpG information (default: False)
•go: True to perform GO enrichment analysis.

The tool will:

•Annotate the genomic regions using HOMER's annotatePeaks.pl.
•Return the paths of the annotated regions files under the ${proj_dir}/results/ directory and the path of the log file under the ${proj_dir}/logs/ directory:
  •${proj_dir}/results/${sample}.anno_genomic_features.txt
  •${proj_dir}/results/${sample}.anno_genomic_features_stats.txt
  •${proj_dir}/logs/${sample}.anno_genomic_features.log

•(Optional) Extract the genes from the annotated regions file if they are necessary for later analysis or if the user requests the target gene list. Otherwise, skip this step.

Call: mcp__file-format-tools__extract_gene_list

With:

•sample: the user-provided sample name
•proj_dir: directory to save the genomic feature annotation results. In this skill, it is the full path of the ${sample}_functional_enrichment directory returned by mcp__project-init-tools__project_init

The tool will:

•Extract the genes from the annotated regions file.
•Return the path of the gene list file under the ${proj_dir}/tables/ directory:
  •${proj_dir}/tables/${sample}.gene_list.txt
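The extraction step amounts to reading one column of the annotation table and de-duplicating it. The sketch below assumes the annotatePeaks.pl output has a "Gene Name" column, as current HOMER releases do; the function name is hypothetical and this is not the code of mcp__file-format-tools__extract_gene_list.

```python
import csv


def extract_gene_names(handle, column="Gene Name"):
    """Return unique, non-empty gene names from a tab-delimited annotation table."""
    # QUOTE_NONE: free-text description columns may contain stray quote characters.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found in header: {reader.fieldnames}")
    seen, genes = set(), []
    for row in reader:
        gene = (row.get(column) or "").strip()
        # Regions without an assigned gene are empty or marked NA.
        if gene and gene != "NA" and gene not in seen:
            seen.add(gene)
            genes.append(gene)
    return genes
```

Writing the result one symbol per line gives a file in the same format as the gene_list.txt input described under Option 2.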

Option 2: from gene list file

Only if the input file is a gene list file. If the input file is a BED file, call tools in Option 1.

Call: mcp__homer-tools__gene_function_enrichment

With:

•sample: the user-provided sample name
•proj_dir: directory to save the GO & KEGG enrichment results. In this skill, it is the full path of the ${sample}_functional_enrichment directory returned by mcp__project-init-tools__project_init
•gene_list_file: the user-provided gene list file. May end with .txt.
•organism: the user-provided organism name, e.g. human, mouse, zebrafish, etc.
•background_gene_list_file: the user-provided background gene list file. May end with .txt. If not provided, set this parameter to None.

The tool will:

•Find the GO enrichment for the gene list.
•Return the paths of the GO & KEGG enrichment results under the ${proj_dir}/results/ directory:
  •${proj_dir}/results/biological_process.txt
  •${proj_dir}/results/kegg.txt
  •... other GO and KEGG enrichment result files.
•Return the path of the log file under the ${proj_dir}/logs/ directory:
  •${proj_dir}/logs/${sample}.find_go_and_kegg_enrichment.log
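The internals of mcp__homer-tools__gene_function_enrichment are not shown here. Assuming it wraps HOMER's findGO.pl (positional arguments: ID file, organism, output directory; optional -bg background file), the underlying call could be assembled roughly as follows. The helper names are hypothetical.

```python
import subprocess


def build_findgo_command(gene_list_file, organism, out_dir, background=None):
    """Return the argv list for a findGO.pl run; nothing is executed here."""
    cmd = ["findGO.pl", str(gene_list_file), organism, str(out_dir)]
    if background is not None:
        # Restrict the gene universe to the supplied background list.
        cmd += ["-bg", str(background)]
    return cmd


def run_findgo(cmd, log_path):
    """Run the command, sending HOMER's progress messages to the log file."""
    with open(log_path, "w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT, check=True)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with file names that contain spaces.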

Alternative: direct from BED (kegg.txt is written into the same -go output directory as the GO tables)
annotatePeaks.pl peaks.bed hg38 -go results/{run}/tables/go_dir -genomeOntology results/{run}/tables/genome_ontology_dir > results/{run}/tables/peaks.annotated.txt

Notes & Best Practices

•Genome & naming: Ensure the HOMER genome key matches the species; chromosome naming must be consistent (chr1 vs 1).
•BED format: Tab-delimited, ≥3 columns, 0-based coordinates, no header.
•Multiple testing: Prefer FDR (BH) if provided; otherwise fall back to the P-value.
•Background set: -bg helps reduce bias; choose a reasonable universe (e.g., all expressed or all accessible regions → genes).
•Direct-from-BED: annotatePeaks.pl -go is convenient and also writes kegg.txt; the gene-list route yields uniform TSVs for plotting.
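When a result table carries only P-values, one option beyond the fallback above is to compute Benjamini-Hochberg adjusted values directly. The sketch below is a plain-Python version of the standard BH step-up procedure; the function name is hypothetical. If the table reports natural-log P-values, convert them with math.exp first.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

The adjustment depends on the number of terms tested, so apply it per ontology table rather than to a pre-filtered list of top terms.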

Troubleshooting

•Many NAs after annotation: Check genome version, chromosome naming, BED formatting, and headers.
•Empty/weak enrichment: Ensure sufficient genes (suggest ≥50), verify species of symbols, tune thresholds or background.
•Column name drift: HOMER versions may differ; adjust R column mappings if needed.
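One way to absorb column name drift is to match headers case-insensitively against a small alias table before building the standardized TSVs. The sketch below is written in Python for illustration; the same idea carries over to the R column mappings. The alias table and function names are hypothetical, and the header spellings are assumptions based on typical HOMER GO output, so check them against the files your HOMER version actually writes.

```python
# Hypothetical alias table: standardized name -> accepted header spellings
# (compared after lowercasing and collapsing underscores/whitespace).
ALIASES = {
    "term_id": ("termid", "term id"),
    "term": ("term",),
    "logp": ("logp", "log p"),
    "genes_in_term": ("genes in term",),
    "target_genes_in_term": ("target genes in term",),
}


def _normalize(name):
    return " ".join(name.lower().replace("_", " ").split())


def map_columns(header):
    """Map standardized names to the actual column names found in `header`."""
    lookup = {_normalize(col): col for col in header}
    mapping = {}
    for standard, aliases in ALIASES.items():
        for alias in aliases:
            if alias in lookup:
                mapping[standard] = lookup[alias]
                break
    return mapping


def top_terms(rows, mapping, n=10):
    """Return the n (term, logP) pairs with the most negative log P-value."""
    term_col, logp_col = mapping["term"], mapping["logp"]
    scored = [(row[term_col], float(row[logp_col])) for row in rows]
    return sorted(scored, key=lambda pair: pair[1])[:n]
```

A standardized name that is missing from the returned mapping signals that the header has drifted beyond the alias table and needs a manual look.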