# BioGeoBEARS Biogeographic Analysis
## Overview
BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) performs probabilistic inference of ancestral geographic ranges on phylogenetic trees. This skill helps set up complete biogeographic analyses by:
- Validating and reformatting input files (phylogenetic tree and geographic distribution data)
- Generating organized analysis folder structure
- Creating customized RMarkdown analysis scripts
- Guiding users through parameter selection and model choices
- Producing publication-ready visualizations
## When to Use This Skill
Use this skill when users request:
- "Analyze biogeography on my phylogeny"
- "Reconstruct ancestral ranges for my species"
- "Run BioGeoBEARS analysis"
- "Which areas did my ancestors occupy?"
- "Test biogeographic models (DEC, DIVALIKE, BAYAREALIKE)"
The skill triggers when users mention phylogenetic biogeography, ancestral area reconstruction, or provide tree + distribution data.
## Required Inputs
Users must provide:
- Phylogenetic tree (Newick format, .nwk, .tre, or .tree file)
  - Must be rooted
  - Tip labels will be matched to geography file
  - Branch lengths required
- Geographic distribution data (any tabular format)
  - Species names (matching tree tips)
  - Presence/absence data for different geographic areas
  - Can be CSV, TSV, Excel, or already in PHYLIP format
## Workflow
### Step 1: Gather Information
When a user requests a BioGeoBEARS analysis, ask for:
- Input file paths:
  - "What is the path to your phylogenetic tree file?"
  - "What is the path to your geographic distribution file?"
- Analysis parameters (if not specified):
  - Maximum range size (how many areas can a species occupy simultaneously?)
  - Which models to compare (default: all six - DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J)
  - Output directory name (default: "biogeobears_analysis")
Use the AskUserQuestion tool to gather this information efficiently:
Example questions:
- "Maximum range size" - options based on number of areas (e.g., for 4 areas: "All 4 areas", "3 areas", "2 areas")
- "Models to compare" - options: "All 6 models (recommended)", "Only base models (DEC, DIVALIKE, BAYAREALIKE)", "Only +J models", "Custom selection"
- "Visualization type" - options: "Pie charts (show probabilities)", "Text labels (show most likely states)", "Both"
### Step 2: Validate and Prepare Input Files
#### Validate Tree File
Use the Read tool to check the tree file:
# In R, basic validation:
library(ape)
tr <- read.tree("path/to/tree.nwk")
print(paste("Tips:", length(tr$tip.label)))
print(paste("Rooted:", is.rooted(tr)))
print(tr$tip.label) # Check species names
Verify:
- File can be parsed as Newick
- Tree is rooted (if not, ask user which outgroup to use)
- Note the tip labels for geography file validation
#### Validate and Reformat Geography File
Use scripts/validate_geography_file.py to validate or reformat the geography file.
If file is already in PHYLIP format (starts with numbers):
python scripts/validate_geography_file.py path/to/geography.txt --validate --tree path/to/tree.nwk
This checks:
- Correct tab delimiters
- Species names match tree tips
- Binary codes are correct length
- No spaces in species names or binary codes
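The format checks listed above can be illustrated with a small sketch. This is not the code of `scripts/validate_geography_file.py`; the function name `check_geography_lines` is hypothetical, and the layout it assumes (a header with taxon count, area count and parenthesized area labels, followed by `name<TAB>code` rows) is an assumption based on the BioGeoBEARS example files.

```python
def check_geography_lines(lines):
    """Return a list of problems found in PHYLIP-style geography lines.

    Assumed layout (not taken from the skill's script):
      header: "<n_taxa> <n_areas> (A B C)"   whitespace- or tab-separated
      rows:   "<species_name>\t<binary_code>"
    """
    if not lines:
        return ["file is empty"]
    header = lines[0].split()  # splits on tabs and spaces alike
    try:
        n_taxa, n_areas = int(header[0]), int(header[1])
    except (IndexError, ValueError):
        return ["header must start with two integers (taxa, areas)"]

    problems = []
    rows = [ln for ln in lines[1:] if ln.strip()]  # ignore blank lines
    if len(rows) != n_taxa:
        problems.append(f"header says {n_taxa} taxa, found {len(rows)} rows")
    for lineno, row in enumerate(rows, start=2):
        parts = row.split("\t")
        if len(parts) != 2:
            problems.append(f"line {lineno}: expected name<TAB>code")
            continue
        name, code = parts
        if " " in name or " " in code:
            problems.append(f"line {lineno}: spaces are not allowed")
        if len(code) != n_areas or set(code) - {"0", "1"}:
            problems.append(f"line {lineno}: code must be {n_areas} characters of 0/1")
    return problems


example = ["2\t3 (A B C)", "sp_one\t110", "sp_two\t001"]
print(check_geography_lines(example))  # an empty list means no problems were found
```

Matching species names against tree tips is a separate check and is left out of this sketch.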
If file is in CSV/TSV format (needs reformatting):
python scripts/validate_geography_file.py path/to/distribution.csv --reformat -o geography.data --delimiter ","
Or for tab-delimited:
python scripts/validate_geography_file.py path/to/distribution.txt --reformat -o geography.data --delimiter tab
The script will:
- Detect area names from header row
- Convert presence/absence data to binary (handles "1", "present", "TRUE", etc.)
- Remove spaces from species names (replace with underscores)
- Create properly formatted PHYLIP file
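As a rough illustration of the reformatting steps above, the following sketch converts a species-by-area table into PHYLIP-style geography text. It is not the actual script: `csv_to_geography` and the `TRUTHY` set are hypothetical names, the list of accepted "present" spellings is an assumption, and so is the header layout (taxon count, tab, area count, space, parenthesized area labels).

```python
import csv
import io

# Assumed spellings that count as "present"; anything else is treated as absent.
TRUTHY = {"1", "present", "true", "yes", "x"}


def csv_to_geography(csv_text, delimiter=","):
    """Convert a species-by-area table (text) to PHYLIP-style geography text."""
    rows = list(csv.reader(io.StringIO(csv_text), delimiter=delimiter))
    areas = [a.strip() for a in rows[0][1:]]  # header row: first column holds species
    body = [r for r in rows[1:] if r and r[0].strip()]  # skip empty rows

    out = [f"{len(body)}\t{len(areas)} ({' '.join(areas)})"]
    for row in body:
        name = "_".join(row[0].split())  # replace spaces with underscores
        code = "".join("1" if cell.strip().lower() in TRUTHY else "0" for cell in row[1:])
        out.append(f"{name}\t{code}")
    return "\n".join(out) + "\n"


table = "species,A,B\nHomo sapiens,1,present\nPan troglodytes,0,TRUE\n"
print(csv_to_geography(table))
```

Because unrecognized cell values silently become 0 in this sketch, the converted file should still be validated before it is used.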
Always validate the reformatted file before proceeding:
python scripts/validate_geography_file.py geography.data --validate --tree path/to/tree.nwk
### Step 3: Set Up Analysis Folder Structure
Create an organized directory for the analysis:
biogeobears_analysis/
├── input/
│   ├── tree.nwk                 # Original or copied tree
│   ├── geography.data           # Validated/reformatted geography file
│   └── original_data/           # Original input files
│       ├── original_tree.nwk
│       └── original_distribution.csv
├── scripts/
│   └── run_biogeobears.Rmd      # Generated RMarkdown script
├── results/                     # Created by analysis (output directory)
│   ├── [MODEL]_result.Rdata     # Saved model results
│   └── plots/                   # Visualization outputs
│       ├── [MODEL]_pie.pdf
│       └── [MODEL]_text.pdf
└── README.md                    # Analysis documentation
Create this structure programmatically:
mkdir -p biogeobears_analysis/input/original_data
mkdir -p biogeobears_analysis/scripts
mkdir -p biogeobears_analysis/results/plots
# Copy files
cp path/to/tree.nwk biogeobears_analysis/input/
cp geography.data biogeobears_analysis/input/
cp original_files biogeobears_analysis/input/original_data/
### Step 4: Generate RMarkdown Analysis Script
Use the template at scripts/biogeobears_analysis_template.Rmd and customize it with user parameters.
Copy and customize the template:
cp scripts/biogeobears_analysis_template.Rmd biogeobears_analysis/scripts/run_biogeobears.Rmd
Create a parameter file or modify the YAML header in the Rmd to use the user's specific settings:
Example customization via R code:
# Edit YAML parameters programmatically or provide as params when rendering
rmarkdown::render(
"biogeobears_analysis/scripts/run_biogeobears.Rmd",
params = list(
tree_file = "../input/tree.nwk",
geog_file = "../input/geography.data",
max_range_size = 4,
models = "DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J",
output_dir = "../results"
),
output_file = "../results/biogeobears_report.html"
)
Or create a run script:
#!/bin/bash
# biogeobears_analysis/run_analysis.sh (the shebang must be the first line of the file)
cd "$(dirname "$0")/scripts"
R -e "rmarkdown::render('run_biogeobears.Rmd', params = list(
tree_file = '../input/tree.nwk',
geog_file = '../input/geography.data',
max_range_size = 4,
models = 'DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J',
output_dir = '../results'
), output_file = '../results/biogeobears_report.html')"
### Step 5: Create README Documentation
Generate a README.md in the analysis directory explaining:
- What files are present
- How to run the analysis
- What parameters were used
- How to interpret results
Example:
# BioGeoBEARS Analysis
## Overview
Biogeographic analysis of [NUMBER] species across [NUMBER] geographic areas.
## Input Data
- **Tree**: `input/tree.nwk` ([NUMBER] tips)
- **Geography**: `input/geography.data` ([NUMBER] species × [NUMBER] areas)
- **Areas**: [A, B, C, ...]
## Parameters
- Maximum range size: [NUMBER]
- Models tested: [LIST]
## Running the Analysis
### Option 1: Using RMarkdown directly
```r
library(rmarkdown)
render("scripts/run_biogeobears.Rmd",
       output_file = "../results/biogeobears_report.html")
```
### Option 2: Using the run script
bash run_analysis.sh
## Outputs
Results will be saved in `results/`:
- `biogeobears_report.html` - Full analysis report with visualizations
- `[MODEL]_result.Rdata` - Saved R objects for each model
- `plots/[MODEL]_pie.pdf` - Ancestral range reconstructions (pie charts)
- `plots/[MODEL]_text.pdf` - Ancestral range reconstructions (text labels)
## Interpreting Results
The HTML report includes:
- **Model Comparison** - AIC scores, AIC weights, best-fit model
- **Parameter Estimates** - Dispersal (d), extinction (e), founder-event (j) rates
- **Likelihood Ratio Tests** - Statistical comparisons of nested models
- **Ancestral Range Plots** - Visualizations on phylogeny
- **Session Info** - R package versions for reproducibility
## Model Descriptions
- **DEC**: Dispersal-Extinction-Cladogenesis (general-purpose)
- **DIVALIKE**: Emphasizes vicariance
- **BAYAREALIKE**: Emphasizes sympatric speciation
- **+J**: Adds founder-event speciation parameter
See references/biogeobears_details.md for detailed model descriptions.
## Installation Requirements
# Install BioGeoBEARS
install.packages("rexpokit")
install.packages("cladoRcpp")
library(devtools)
devtools::install_github(repo="nmatzke/BioGeoBEARS")
# Other packages
install.packages(c("ape", "rmarkdown", "knitr", "kableExtra"))
### Step 6: Provide User Instructions
After setting up the analysis, provide clear instructions to the user:
Analysis Setup Complete!
Directory structure created at: biogeobears_analysis/
📁 Files created:
✓ input/tree.nwk - Phylogenetic tree ([N] tips)
✓ input/geography.data - Geographic distribution data (validated)
✓ scripts/run_biogeobears.Rmd - RMarkdown analysis script
✓ README.md - Documentation and instructions
✓ run_analysis.sh - Convenience script to run analysis
📋 Next steps:
1. Review the README.md for analysis details
2. Install BioGeoBEARS if not already installed:
   install.packages("rexpokit")
   install.packages("cladoRcpp")
   library(devtools)
   devtools::install_github(repo="nmatzke/BioGeoBEARS")
3. Run the analysis:
   cd biogeobears_analysis
   bash run_analysis.sh
   Or in R:
   setwd("biogeobears_analysis")
   rmarkdown::render("scripts/run_biogeobears.Rmd", output_file = "../results/biogeobears_report.html")
4. View results:
   - Open results/biogeobears_report.html in web browser
   - Check results/plots/ for PDF visualizations
⏱️ Expected runtime: [ESTIMATE based on tree size]
- Small trees (<50 tips): 5-15 minutes
- Medium trees (50-100 tips): 15-60 minutes
- Large trees (>100 tips): 1-4 hours
💡 The HTML report includes model comparison, parameter estimates, and visualization of ancestral ranges on your phylogeny.
## Analysis Parameter Guidance
When users ask for guidance on parameters, consult `references/biogeobears_details.md` and provide recommendations:
### Maximum Range Size
**Ask**: "What's the maximum number of areas a species in your group can realistically occupy?"
Common approaches:
- **Conservative**: Number of areas - 1 (prevents unrealistic cosmopolitan ancestral ranges)
- **Permissive**: All areas (if biologically plausible)
- **Data-driven**: Maximum observed in extant species
**Impact**: The number of possible ranges grows combinatorially with this value, so larger settings sharply increase computation time
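To make the cost of a larger maximum range size concrete, the sketch below counts candidate ranges as all area subsets up to the chosen size, and also computes the "data-driven" option from presence/absence codes. Both function names are hypothetical, and counting the empty (null) range as a state is an assumption about the default setup.

```python
from math import comb


def count_states(n_areas, max_range_size, include_null=True):
    """Number of candidate ranges: all area subsets of size <= max_range_size."""
    smallest = 0 if include_null else 1  # size 0 is the empty (null) range
    return sum(comb(n_areas, k) for k in range(smallest, max_range_size + 1))


def max_observed_range(codes):
    """Largest number of occupied areas among codes such as '0110'."""
    return max(code.count("1") for code in codes)


# With 8 areas, the state count is 37, 163 and 256 for max sizes 2, 4 and 8.
for size in (2, 4, 8):
    print(size, count_states(8, size))

# Data-driven choice: the widest range seen among extant species.
print(max_observed_range(["0110", "1000", "1110"]))
```

Since the transition matrix has one row and one column per state, even a modest reduction in the maximum range size can shrink the problem considerably.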
### Model Selection
**Default recommendation**: Run all 6 models for comprehensive comparison
- DEC, DIVALIKE, BAYAREALIKE (base models)
- DEC+J, DIVALIKE+J, BAYAREALIKE+J (+J variants)
**Rationale**:
- Model comparison is key to inference
- +J parameter is often significant
- Small additional computational cost
If computation is a concern, suggest starting with DEC and DEC+J.
### Visualization Options
**Pie charts** (`plotwhat = "pie"`):
- Show probability distributions across all possible states
- Better for conveying uncertainty
- Can be cluttered with many areas
**Text labels** (`plotwhat = "text"`):
- Show only maximum likelihood state
- Cleaner, easier to read
- Doesn't show uncertainty
**Recommendation**: Generate both in the analysis (template does this automatically)
## Common Issues and Troubleshooting
### Species Name Mismatches
**Symptom**: Error about species in tree not in geography file (or vice versa)
**Solution**: Use the validation script with `--tree` option to identify mismatches, then either:
1. Edit the geography file to match tree tip labels
2. Edit tree tip labels to match geography file
3. Remove species that aren't in both
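Finding the mismatched names comes down to two set differences. The sketch below assumes the tip labels and geography names have already been read into lists; `name_mismatches` is a hypothetical helper, not part of the validation script.

```python
def name_mismatches(tree_tips, geography_names):
    """Return (names only in the tree, names only in the geography file), sorted."""
    tips, names = set(tree_tips), set(geography_names)
    return sorted(tips - names), sorted(names - tips)


only_in_tree, only_in_geography = name_mismatches(
    ["Sp_a", "Sp_b", "Sp_c"],  # illustrative tip labels
    ["Sp_b", "Sp_c", "Sp_d"],  # illustrative geography names
)
print("Missing from geography file:", only_in_tree)
print("Missing from tree:", only_in_geography)
```

Both lists should be empty before the analysis is run.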
### Tree Not Rooted
**Symptom**: Error about unrooted tree
**Solution**:
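```r
library(ape)
tr <- read.tree("tree.nwk")
# resolve.root = TRUE adds a bifurcating root node so that is.rooted() returns TRUE
tr <- root(tr, outgroup = "outgroup_species_name", resolve.root = TRUE)
write.tree(tr, "tree_rooted.nwk")
```
Ask user which species to use as outgroup.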
### Formatting Errors in Geography File
**Symptom**: Validation errors about tabs, spaces, or binary codes
**Solution**: Use the reformat option:
python scripts/validate_geography_file.py input.csv --reformat -o geography.data
### Optimization Fails to Converge
**Symptom**: NA values in parameter estimates or very negative log-likelihoods
**Possible causes**:
- Tree and geography data mismatch
- All species in same area (no variation)
- Unrealistic max_range_size
**Solution**: Check input data quality and try a simpler model first (DEC only)
### Very Slow Runtime
**Causes**:
- Large number of areas (>6-7 areas gets slow)
- Large max_range_size
- Many tips (>200)
**Solutions**:
- Reduce max_range_size
- Combine geographic areas if appropriate
- Use `force_sparse = TRUE` in run object
- Run on HPC cluster
## Resources
This skill includes:
### scripts/
- `validate_geography_file.py` - Validates and reformats geography files
  - Checks PHYLIP format compliance
  - Validates against tree tip labels
  - Reformats from CSV/TSV to PHYLIP
  - Usage: `python validate_geography_file.py --help`
- `biogeobears_analysis_template.Rmd` - RMarkdown template for complete analysis
  - Model fitting for DEC, DIVALIKE, BAYAREALIKE (with/without +J)
  - Model comparison with AIC, AICc, weights
  - Likelihood ratio tests
  - Parameter visualization
  - Ancestral range plotting
  - Customizable via YAML parameters
### references/
- `biogeobears_details.md` - Comprehensive reference including:
  - Detailed model descriptions
  - Input file format specifications
  - Parameter interpretation guidelines
  - Plotting options and customization
  - Citations and further reading
  - Computational considerations
Load this reference when:
- Users ask about specific models
- Parameter estimates need explaining
- Troubleshooting complex issues
- Users want detailed methodology for publications
## Best Practices
1. Always validate input files before analysis - saves time debugging later
2. Organize analysis in a dedicated directory - keeps everything together and reproducible
3. Run all 6 models by default - model comparison is crucial for biogeographic inference
4. Document parameters and decisions - analysis README helps with reproducibility
5. Generate both visualization types - pie charts for uncertainty, text labels for clarity
6. Save intermediate results - the RMarkdown template does this automatically
7. Check parameter estimates - unrealistic values suggest data or model issues
8. Provide context with visualizations - explain what dispersal/extinction rates mean for the user's system
## Output Interpretation
When presenting results to users, explain:
### Model Selection
- AIC weights give the relative support for each model within the compared set (they sum to 1) and can be read as the approximate probability that a model is the best of those tested
- ΔAIC < 2: Models essentially equivalent
- ΔAIC 2-7: Considerably less support
- ΔAIC > 10: Essentially no support
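For readers who want to see where these numbers come from, here is a minimal sketch of the standard AIC, ΔAIC and Akaike weight calculations. The helper name `akaike_table` is hypothetical, and the log-likelihoods in the example are made up for illustration, not output from any real analysis.

```python
from math import exp


def akaike_table(models):
    """models maps name -> (log_likelihood, n_free_params).

    Returns name -> (AIC, delta_AIC, Akaike weight).
    """
    aic = {name: 2 * k - 2 * lnl for name, (lnl, k) in models.items()}
    best = min(aic.values())
    # Relative likelihood of each model given the data, then normalize to sum to 1.
    rel = {name: exp(-0.5 * (value - best)) for name, value in aic.items()}
    total = sum(rel.values())
    return {name: (aic[name], aic[name] - best, rel[name] / total) for name in aic}


# Illustrative (made-up) values: DEC has 2 free parameters (d, e), DEC+J has 3 (d, e, j).
table = akaike_table({"DEC": (-40.0, 2), "DEC+J": (-35.0, 3)})
for name, (aic, delta, weight) in table.items():
    print(f"{name}: AIC={aic:.1f}  dAIC={delta:.1f}  weight={weight:.3f}")
```

The weights are only meaningful relative to the set of models that was actually compared; adding or removing a model changes all of them.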
### Parameter Estimates
- d (dispersal rate): Higher = more range expansions
- e (extinction rate): Higher = more local extinctions
- j (founder-event rate): Higher = more jump dispersal at speciation
- Ratio d/e: > 1 favors expansion, < 1 favors contraction
### Ancestral Ranges
- Pie charts: Larger slices = higher probability
- Colors: Represent areas (single area = bright color, multiple areas = blended)
- Node labels: Most likely ancestral range
- Split events (at corners): Range changes at speciation
### Statistical Tests
- LRT p < 0.05: +J parameter significantly improves fit
- High AIC weight (>0.7): Strong evidence for one model
- Similar AIC weights: Model uncertainty - report results from multiple models
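The likelihood ratio test between a base model and its +J variant can be sketched with the standard library alone, since the two models differ by one free parameter and the chi-square tail probability with one degree of freedom has a closed form. `lrt_one_df` is a hypothetical helper and the log-likelihoods are made-up illustrative values.

```python
from math import erfc, sqrt


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (test statistic, p-value). For 1 degree of freedom the chi-square
    survival function equals erfc(sqrt(stat / 2)).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))  # guard against tiny negative values
    return stat, erfc(sqrt(stat / 2.0))


# Illustrative (made-up) log-likelihoods for DEC (null) and DEC+J (alternative).
stat, p = lrt_one_df(-40.0, -35.0)
print(f"LRT statistic = {stat:.2f}, p = {p:.4f}")
```

The test only applies to nested pairs such as DEC vs DEC+J; comparisons across families (for example DEC vs DIVALIKE) rely on AIC instead.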
## Example Usage
User: "I have a phylogeny of 30 bird species and their distributions across 5 islands. Can you help me figure out where their ancestors lived?"
Claude (using this skill):
1. Ask for tree and distribution file paths
2. Validate tree file (check 30 tips, rooted)
3. Validate/reformat geography file (5 areas)
4. Ask about max_range_size (suggest 4 areas)
5. Ask about models (suggest all 6)
6. Set up biogeobears_analysis/ directory structure
7. Copy template RMarkdown script with parameters
8. Generate README.md and run_analysis.sh
9. Provide clear instructions to run analysis
10. Explain expected outputs and how to interpret them
Result: User has complete, ready-to-run analysis with documentation
## Attribution
This skill was created based on:
- BioGeoBEARS package by Nicholas Matzke
- Tutorial resources from http://phylo.wikidot.com/biogeobears
- Example workflows from the BioGeoBEARS GitHub repository
## Additional Notes
Time estimate for skill execution:
- File validation: 1-2 minutes
- Directory setup: < 1 minute
- Total setup time: 5-10 minutes
Analysis runtime (separate from skill execution):
- Depends on tree size and number of areas
- Small datasets (<50 tips, ≤5 areas): 10-30 minutes
- Large datasets (>100 tips, >5 areas): 1-6 hours
Installation requirements (user must have):
- R (≥4.0)
- BioGeoBEARS R package
- Supporting packages: ape, rmarkdown, knitr, kableExtra
- Python 3 (for validation script)
When to consult references/:
- Load `biogeobears_details.md` when users need detailed explanations of models, parameters, or interpretation
- Reference it for troubleshooting complex issues
- Use it to help users write methods sections for publications