BERIL Research Observatory - Onboarding
Welcome the user and orient them to the system, then route them to the right context based on their goal.
Phase 1: System Overview
Present this information directly (no file reads needed):
What is BERDL?
The KBase BER Data Lakehouse (BERDL) is an on-prem Delta Lakehouse (Spark SQL) hosting 35 databases across 9 tenants. Key collections:
| Collection | Scale | What it contains |
|---|---|---|
| `kbase_ke_pangenome` | 293K genomes, 1B genes, 27K species | Species-level pangenomes from GTDB r214: gene clusters, ANI, functional annotations (eggNOG), pathway predictions (GapMind), environmental embeddings (AlphaEarth) |
| `kbase_genomes` | 293K genomes, 253M proteins | Structural genomics (contigs, features, protein sequences) in CDM format |
| `kbase_msd_biochemistry` | 56K reactions, 46K molecules | ModelSEED biochemical reactions and compounds for metabolic modeling |
| `kescience_fitnessbrowser` | 48 organisms, 27M fitness scores | Genome-wide mutant fitness from RB-TnSeq experiments |
| `enigma_coral` | 3K taxa, 7K genomes | ENIGMA SFA environmental microbiology |
| `nmdc_arkin` | 48 studies, 3M+ metabolomics | NMDC multi-omics (annotations, embeddings, metabolomics, proteomics) |
| PhageFoundry (5 DBs) | Various | Species-specific genome browsers for phage-host research |
| `planetmicrobe_planetmicrobe` | 2K samples, 6K experiments | Marine microbial ecology |
Repo Structure
    projects/                # Science projects (each has README.md + notebooks/ + data/)
    docs/                    # Shared knowledge base
      collections.md         # Full database inventory
      schemas/               # Per-collection schema docs
      pitfalls.md            # SQL gotchas, data sparsity, common errors
      performance.md         # Query strategies for large tables
      research_ideas.md      # Future research directions
      overview.md            # Scientific context and data generation workflow
      discoveries.md         # Running log of insights
    .claude/skills/          # Agent skills
    data/                    # Shared data extracts reusable across projects
Available Skills
| Skill | What it does |
|---|---|
| `/berdl` | Query BERDL databases via REST API or Spark SQL |
| `/berdl-discover` | Explore and document a new BERDL database |
| `/hypothesis` | Generate testable research questions from BERDL data |
| `/research-plan` | Refine a hypothesis with literature review, data feasibility checks, and a structured plan |
| `/notebook` | Generate Jupyter notebooks from a research plan with PySpark boilerplate |
| `/literature-review` | Search PubMed, Europe PMC, and other sources for relevant biological literature |
| `/synthesize` | Read analysis outputs, compare against literature, and draft findings |
| `/submit` | Submit a project for automated review |
Existing Projects
Discover projects dynamically — run ls projects/ to list them. Read the first line of each projects/*/README.md to get titles. Present the list to the user so they can see what's been done.
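The discovery step can be sketched in Python as follows. This is a minimal illustration, not part of the repo: the helper name `list_project_titles` is hypothetical, and it assumes the title is the first line of each README, optionally prefixed with Markdown heading markers.

```python
from pathlib import Path


def list_project_titles(projects_dir="projects"):
    """Return {project_name: title}, taking the title from the first README line.

    Hypothetical helper: projects without a README.md are skipped.
    """
    root = Path(projects_dir)
    titles = {}
    if not root.is_dir():
        return titles
    for readme in sorted(root.glob("*/README.md")):
        with readme.open(encoding="utf-8") as fh:
            first_line = fh.readline().strip()
        # Drop leading Markdown heading markers such as "# ".
        titles[readme.parent.name] = first_line.lstrip("#").strip()
    return titles
```

Calling `list_project_titles()` from the repo root returns a dictionary that can be printed as a short list for the user.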
How Projects Work
Each project lives in projects/<name>/ with:
- `README.md`: Research question, hypothesis, approach, key findings, reproduction instructions
- `notebooks/`: Analysis notebooks with saved outputs (results must be visible without re-running)
- `data/`: Extracted/processed data (large files gitignored)
- `figures/`: Key visualizations as standalone PNGs
- `requirements.txt`: Python dependencies
- `REVIEW.md`: Automated review (generated by `/submit`)
Reproducibility is required: notebooks must be committed with outputs, figures must be saved to `figures/`, and the README must include a `## Reproduction` section. See `PROJECT.md` for full standards.
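A pre-flight check for these requirements might look like the sketch below. It is an illustration only: `check_reproducibility` is a hypothetical helper, not the logic used by `/submit`, and it assumes that a notebook counts as "committed with outputs" when at least one of its code cells has saved output.

```python
import json
from pathlib import Path


def check_reproducibility(project_dir):
    """Return a list of problems; an empty list means the basic checks passed."""
    project = Path(project_dir)
    problems = []

    readme = project / "README.md"
    if not readme.is_file():
        problems.append("README.md is missing")
    else:
        lines = readme.read_text(encoding="utf-8").splitlines()
        if not any(line.strip().startswith("## Reproduction") for line in lines):
            problems.append("README.md has no '## Reproduction' section")

    if not (project / "figures").is_dir():
        problems.append("figures/ directory is missing")

    notebooks = project / "notebooks"
    if notebooks.is_dir():
        for nb_path in sorted(notebooks.glob("*.ipynb")):
            cells = json.loads(nb_path.read_text(encoding="utf-8")).get("cells", [])
            code_cells = [c for c in cells if c.get("cell_type") == "code"]
            # Assumption: one code cell with saved output is enough to pass.
            if code_cells and not any(c.get("outputs") for c in code_cells):
                problems.append(f"{nb_path.name} has no saved outputs")

    return problems
```

The function only reports what it finds; whether a given problem should block a submission is left to `PROJECT.md`.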
Phase 2: Interactive Routing
Ask the user which of these they want to do:
- Start a new research project
- Explore BERDL data
- Review published literature
- Continue an existing project
- Understand the system
Then follow the appropriate path below.
Path 1: Start a New Research Project
Read these files:
- `docs/research_ideas.md`: existing ideas and their status
- `docs/collections.md`: what data is available
- `projects/cog_analysis/README.md`: example of a well-structured project
Then:
- Summarize the high-priority research ideas that are still PROPOSED
- Mention cross-project integration opportunities
- Present the full research workflow:
  - `/hypothesis`: generate a testable research question
  - `/research-plan`: refine with literature review, check data feasibility, produce a structured plan
  - `/notebook`: generate analysis notebooks from the plan
  - Run notebooks on BERDL JupyterHub
  - `/synthesize`: interpret results, compare against literature, draft findings
  - `/submit`: pre-submission checks and automated review
Path 2: Explore BERDL Data
Read these files:
- `docs/collections.md`: full database inventory
- `docs/pitfalls.md`: critical gotchas before querying
Then:
- Summarize what databases are available and their scale
- Highlight cross-collection relationships (pangenome <-> genomes <-> biochemistry <-> fitness)
- Suggest using `/berdl` to start querying
- Suggest using `/berdl-discover` if they want to explore a database not yet documented in `docs/schemas/`
- Warn about the key pitfalls (see Critical Pitfalls below)
Path 3: Review Published Literature
Suggest using /literature-review to search biological databases. This is useful for:
- Checking what's already known about an organism or pathway before querying BERDL
- Finding published pangenome analyses to compare against BERDL data
- Supporting a hypothesis with existing citations
- Discovering methods and approaches used in similar studies
MCP setup check: The pubmed-search MCP server is configured in .claude/settings.json. It runs via uvx pubmed-search-mcp. If it's not working:
- Ensure `uv` is installed: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Optionally add `NCBI_EMAIL` and `NCBI_API_KEY` to `.env` for faster PubMed access (3→10 requests/sec)
- The skill falls back to WebSearch if the MCP server is unavailable
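The first two checks can be automated with a short script. The sketch below is an assumption-laden illustration: `read_env_file` and `check_pubmed_mcp_prereqs` are hypothetical helpers, and the parser handles only plain `KEY=VALUE` lines, not the full syntax that dotenv tools accept.

```python
import shutil
from pathlib import Path


def read_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def check_pubmed_mcp_prereqs(env_path=".env"):
    """Report whether uv is on PATH and which optional NCBI settings are set."""
    env = read_env_file(env_path)
    return {
        "uv_installed": shutil.which("uv") is not None,
        "NCBI_EMAIL": bool(env.get("NCBI_EMAIL")),
        "NCBI_API_KEY": bool(env.get("NCBI_API_KEY")),
    }
```

A `False` for either NCBI entry is not an error, since both settings are optional; only a missing `uv` prevents the MCP server from starting.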
Path 4: Continue an Existing Project
Steps:
- Run `ls projects/` and list all projects for the user to choose from
- Read the chosen project's `README.md`
- Check if a `REVIEW.md` exists in that project directory (read it if so)
- Summarize where the project stands: what's done, what's next
- Suggest using `/submit` when the project is ready for review
Path 5: Understand the System
Read these files:
- `PROJECT.md`: high-level goals and structure
- `docs/collections.md`: database inventory
- `docs/overview.md`: scientific context and data workflow
Then:
- Walk through the dual goals (science + knowledge capture)
- Explain the documentation workflow (tag discoveries, update pitfalls)
- Mention the UI can be browsed at the BERDL JupyterHub
- List the available skills and what each does
- Point to `docs/research_ideas.md` for future directions
Critical Pitfalls (always mention)
Regardless of path chosen, surface these early:
- Species IDs contain `--`: In bare SQL, `--` starts a comment, but it is fine inside quoted strings. Use exact equality (`WHERE id = 's__Escherichia_coli--RS_GCF_000005845.2'`), not LIKE patterns.
- Large tables need filters: Never full-scan `gene` (1B rows) or `genome_ani` (420M rows). Always filter by species or genome ID.
- AlphaEarth embeddings cover only 28% of genomes (83K/293K): Check coverage before relying on them.
- Notebooks must run on BERDL JupyterHub: `get_spark_session()` is only available in JupyterHub kernels. Develop locally, then upload and run on the hub.
- Auth token: Stored in `.env` as `KBASE_AUTH_TOKEN` (not `KB_AUTH_TOKEN`).
- String-typed numeric columns: Many databases store numbers as strings. Always CAST before comparisons.
- Gene clusters are species-specific: Cluster IDs cannot be compared across species. Use COG/KEGG/PFAM for cross-species comparisons.
- Avoid unnecessary `.toPandas()`: `.toPandas()` pulls all data to the driver node and can be very slow or cause OOM errors. Use PySpark DataFrame operations for filtering, joins, and aggregations. Only convert to pandas for final small results (plotting, CSV export).
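Several of the pitfalls above can be guarded against with small helpers. The sketch below is illustrative only: the function names are hypothetical, and every table and column name passed to them is a placeholder that must be checked against the schema docs. It uses only the PySpark `DataFrame.limit()` and `DataFrame.toPandas()` methods.

```python
def exact_match_sql(table, column, value):
    """Build a query that filters by exact equality on a quoted string."""
    # Refuse values that would need escaping rather than guess the dialect rules.
    if "'" in value or "\\" in value:
        raise ValueError("value contains a quote or backslash")
    # The '--' in species IDs is harmless here because it sits inside quotes.
    return f"SELECT * FROM {table} WHERE {column} = '{value}'"


def numeric_filter_sql(table, column, threshold):
    """CAST a string-typed numeric column before comparing it."""
    return f"SELECT * FROM {table} WHERE CAST({column} AS DOUBLE) > {float(threshold)}"


def to_small_pandas(df, max_rows=10_000):
    """Convert a Spark DataFrame to pandas only after capping its size."""
    # limit() is part of the Spark plan, so at most max_rows are collected.
    return df.limit(max_rows).toPandas()
```

In a JupyterHub kernel, one plausible use is to obtain `spark = get_spark_session()`, build a query with `exact_match_sql`, and pass `spark.sql(query)` to `to_small_pandas` once the result is known to be small. The `max_rows` default of 10,000 is an arbitrary choice for illustration, not a documented limit.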