
berdl_start

Start your journey with the BERIL Research Observatory. For new users, users who want initial orientation, or users asking what they can do.

SKILL.md
---
name: berdl_start
description: Get started with the BERIL Research Observatory. Use when a user is new, wants orientation, or asks what they can do.
allowed-tools: Read, Bash
user-invocable: true
---

BERIL Research Observatory - Onboarding

Welcome the user and orient them to the system, then route them to the right context based on their goal.

Phase 1: System Overview

Present this information directly (no file reads needed):

What is BERDL?

The KBase BER Data Lakehouse (BERDL) is an on-prem Delta Lakehouse (Spark SQL) hosting 35 databases across 9 tenants. Key collections:

| Collection | Scale | What it contains |
|---|---|---|
| kbase_ke_pangenome | 293K genomes, 1B genes, 27K species | Species-level pangenomes from GTDB r214: gene clusters, ANI, functional annotations (eggNOG), pathway predictions (GapMind), environmental embeddings (AlphaEarth) |
| kbase_genomes | 293K genomes, 253M proteins | Structural genomics (contigs, features, protein sequences) in CDM format |
| kbase_msd_biochemistry | 56K reactions, 46K molecules | ModelSEED biochemical reactions and compounds for metabolic modeling |
| kescience_fitnessbrowser | 48 organisms, 27M fitness scores | Genome-wide mutant fitness from RB-TnSeq experiments |
| enigma_coral | 3K taxa, 7K genomes | ENIGMA SFA environmental microbiology |
| nmdc_arkin | 48 studies, 3M+ metabolomics | NMDC multi-omics (annotations, embeddings, metabolomics, proteomics) |
| PhageFoundry (5 DBs) | Various | Species-specific genome browsers for phage-host research |
| planetmicrobe_planetmicrobe | 2K samples, 6K experiments | Marine microbial ecology |

Repo Structure

projects/           # Science projects (each has README.md + notebooks/ + data/)
docs/               # Shared knowledge base
  collections.md    # Full database inventory
  schemas/          # Per-collection schema docs
  pitfalls.md       # SQL gotchas, data sparsity, common errors
  performance.md    # Query strategies for large tables
  research_ideas.md # Future research directions
  overview.md       # Scientific context and data generation workflow
  discoveries.md    # Running log of insights
.claude/skills/     # Agent skills
data/               # Shared data extracts reusable across projects

Available Skills

| Skill | What it does |
|---|---|
| /berdl | Query BERDL databases via REST API or Spark SQL |
| /berdl-discover | Explore and document a new BERDL database |
| /hypothesis | Generate testable research questions from BERDL data |
| /research-plan | Refine a hypothesis with literature review, data feasibility checks, and a structured plan |
| /notebook | Generate Jupyter notebooks from a research plan with PySpark boilerplate |
| /literature-review | Search PubMed, Europe PMC, and other sources for relevant biological literature |
| /synthesize | Read analysis outputs, compare against literature, and draft findings |
| /submit | Submit a project for automated review |

Existing Projects

Discover projects dynamically — run ls projects/ to list them. Read the first line of each projects/*/README.md to get titles. Present the list to the user so they can see what's been done.
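
The discovery step above can be sketched in Python as an alternative to running ls by hand. This is a minimal sketch, not part of the skill itself: the function name list_project_titles is hypothetical, and it assumes each README.md carries the project title on its first line, possibly as a markdown heading.

```python
from pathlib import Path


def list_project_titles(projects_dir="projects"):
    """Return {project_name: first README line} for each project directory."""
    titles = {}
    for project in sorted(Path(projects_dir).iterdir()):
        readme = project / "README.md"
        if not project.is_dir() or not readme.is_file():
            continue  # skip stray files and projects without a README
        lines = readme.read_text(encoding="utf-8").splitlines()
        first_line = lines[0] if lines else ""
        # Drop leading markdown heading markers such as "# " (assumed title style)
        titles[project.name] = first_line.lstrip("#").strip()
    return titles


if __name__ == "__main__":
    if Path("projects").is_dir():
        for name, title in list_project_titles().items():
            print(f"{name}: {title}")
```

Directories without a README.md are skipped rather than reported, so the output only lists projects that can be summarized for the user.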

How Projects Work

Each project lives in projects/<name>/ with:

  • README.md — Research question, hypothesis, approach, key findings, reproduction instructions
  • notebooks/ — Analysis notebooks with saved outputs (results must be visible without re-running)
  • data/ — Extracted/processed data (large files gitignored)
  • figures/ — Key visualizations as standalone PNGs
  • requirements.txt — Python dependencies
  • REVIEW.md — Automated review (generated by /submit)

Reproducibility is required: notebooks must be committed with outputs, figures must be saved to figures/, and README must include a ## Reproduction section. See PROJECT.md for full standards.
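
A rough pre-check of these reproducibility requirements might look like the sketch below. It is an illustration only, not the check that /submit runs: the function name check_reproducibility is hypothetical, and "has saved outputs" is approximated as "at least one code cell in the notebook JSON has a non-empty outputs list". PROJECT.md remains the authority on the full standards.

```python
import json
from pathlib import Path


def check_reproducibility(project_dir):
    """Return a list of problems; an empty list means every check passed."""
    project = Path(project_dir)
    problems = []

    # README must exist and contain a "## Reproduction" heading line.
    readme = project / "README.md"
    if not readme.is_file():
        problems.append("README.md is missing")
    elif not any(line.strip() == "## Reproduction"
                 for line in readme.read_text(encoding="utf-8").splitlines()):
        problems.append("README.md has no '## Reproduction' section")

    # Figures must be saved as PNG files under figures/.
    figures = project / "figures"
    if not figures.is_dir() or not any(figures.glob("*.png")):
        problems.append("figures/ has no PNG files")

    # Heuristic (assumption): a notebook counts as "committed with outputs"
    # if at least one of its code cells has a non-empty outputs list.
    notebooks = project / "notebooks"
    if notebooks.is_dir():
        for nb_path in sorted(notebooks.glob("*.ipynb")):
            notebook = json.loads(nb_path.read_text(encoding="utf-8"))
            code_cells = [c for c in notebook.get("cells", [])
                          if c.get("cell_type") == "code"]
            if code_cells and not any(c.get("outputs") for c in code_cells):
                problems.append(f"{nb_path.name} has no saved outputs")

    return problems
```

Calling check_reproducibility("projects/<name>") before /submit gives a quick list of gaps to fix.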


Phase 2: Interactive Routing

Ask the user which of these they want to do:

  1. Start a new research project
  2. Explore BERDL data
  3. Review published literature
  4. Continue an existing project
  5. Understand the system

Then follow the appropriate path below.


Path 1: Start a New Research Project

Read these files:

  • docs/research_ideas.md — existing ideas and their status
  • docs/collections.md — what data is available
  • projects/cog_analysis/README.md — example of a well-structured project

Then:

  • Summarize the high-priority research ideas that are still PROPOSED
  • Mention cross-project integration opportunities
  • Present the full research workflow:
    1. /hypothesis — generate a testable research question
    2. /research-plan — refine with literature review, check data feasibility, produce a structured plan
    3. /notebook — generate analysis notebooks from the plan
    4. Run notebooks on BERDL JupyterHub
    5. /synthesize — interpret results, compare against literature, draft findings
    6. /submit — pre-submission checks and automated review

Path 2: Explore BERDL Data

Read these files:

  • docs/collections.md — full database inventory
  • docs/pitfalls.md — critical gotchas before querying

Then:

  • Summarize what databases are available and their scale
  • Highlight cross-collection relationships (pangenome <-> genomes <-> biochemistry <-> fitness)
  • Suggest using /berdl to start querying
  • Suggest using /berdl-discover if they want to explore a database not yet documented in docs/schemas/
  • Warn about the key pitfalls (see Critical Pitfalls below)

Path 3: Review Published Literature

Suggest using /literature-review to search biological databases. This is useful for:

  • Checking what's already known about an organism or pathway before querying BERDL
  • Finding published pangenome analyses to compare against BERDL data
  • Supporting a hypothesis with existing citations
  • Discovering methods and approaches used in similar studies

MCP setup check: The pubmed-search MCP server is configured in .claude/settings.json. It runs via uvx pubmed-search-mcp. If it's not working:

  1. Ensure uv is installed: curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Optionally add NCBI_EMAIL and NCBI_API_KEY to .env for faster PubMed access (3→10 requests/sec)
  3. The skill falls back to WebSearch if the MCP server is unavailable
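
The setup check above could be scripted roughly as follows. This is a sketch under assumptions: parse_env and pubmed_setup_report are hypothetical helpers, and .env is assumed to hold plain KEY=VALUE lines. It only reports what is present and does not start or test the MCP server.

```python
import shutil
from pathlib import Path


def parse_env(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Strip optional surrounding quotes from the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def pubmed_setup_report(env_path=".env"):
    """Report whether uvx is on PATH and which optional NCBI keys are set."""
    path = Path(env_path)
    env = parse_env(path.read_text(encoding="utf-8")) if path.is_file() else {}
    return {
        "uvx_on_path": shutil.which("uvx") is not None,
        "NCBI_EMAIL": bool(env.get("NCBI_EMAIL")),
        "NCBI_API_KEY": bool(env.get("NCBI_API_KEY")),
    }


if __name__ == "__main__":
    for check, ok in pubmed_setup_report().items():
        print(f"{check}: {'ok' if ok else 'missing'}")
```

A missing uvx points back to step 1. Missing NCBI keys are not an error, since they are optional.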

Path 4: Continue an Existing Project

Steps:

  1. Run ls projects/ and list all projects for the user to choose from
  2. Read the chosen project's README.md
  3. Check if a REVIEW.md exists in that project directory (read it if so)
  4. Summarize where the project stands: what's done, what's next
  5. Suggest using /submit when the project is ready for review

Path 5: Understand the System

Read these files:

  • PROJECT.md — high-level goals and structure
  • docs/collections.md — database inventory
  • docs/overview.md — scientific context and data workflow

Then:

  • Walk through the dual goals (science + knowledge capture)
  • Explain the documentation workflow (tag discoveries, update pitfalls)
  • Mention the UI can be browsed at the BERDL JupyterHub
  • List the available skills and what each does
  • Point to docs/research_ideas.md for future directions

Critical Pitfalls (always mention)

Regardless of path chosen, surface these early:

  1. Species IDs contain -- — This is fine inside quoted strings in SQL. Use exact equality (WHERE id = 's__Escherichia_coli--RS_GCF_000005845.2'), not LIKE patterns.
  2. Large tables need filters — Never full-scan gene (1B rows) or genome_ani (420M rows). Always filter by species or genome ID.
  3. AlphaEarth embeddings cover only 28% of genomes (83K/293K) — check coverage before relying on them.
  4. Notebooks must run on BERDL JupyterHub: get_spark_session() is only available in JupyterHub kernels. Develop locally, then upload and run on the hub.
  5. Auth token — stored in .env as KBASE_AUTH_TOKEN (not KB_AUTH_TOKEN).
  6. String-typed numeric columns — Many databases store numbers as strings. Always CAST before comparisons.
  7. Gene clusters are species-specific — Cannot compare cluster IDs across species. Use COG/KEGG/PFAM for cross-species comparisons.
  8. Avoid unnecessary .toPandas(): calling .toPandas() pulls all data to the driver node and can be very slow or cause OOM errors. Use PySpark DataFrame operations for filtering, joins, and aggregations. Only convert to pandas for final small results (plotting, CSV export).
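
Several of these pitfalls (exact species-ID matching, filtering large tables, CAST on string-typed numbers, bounded .toPandas()) can be combined in one query pattern. The sketch below is illustrative only: the helper names, the table name, the numeric column, and the threshold are placeholders rather than real BERDL schema, and the id column is taken from the WHERE example in pitfall 1. The spark argument is assumed to be the session returned by get_spark_session() on JupyterHub.

```python
def species_filter_query(table, species_id, numeric_column, minimum):
    """Build a Spark SQL query with an exact species match and a numeric CAST.

    `table` and `numeric_column` are placeholders; check docs/schemas/ for real names.
    """
    # Refuse IDs that could break out of the quoted literal instead of guessing
    # at escaping rules. The "--" inside the ID is harmless once it is quoted.
    if "'" in species_id or "\\" in species_id:
        raise ValueError("species_id must not contain quotes or backslashes")
    return (
        f"SELECT * FROM {table} "
        f"WHERE id = '{species_id}' "  # exact equality, not LIKE
        f"AND CAST({numeric_column} AS DOUBLE) >= {float(minimum)}"
    )


def bounded_pandas(spark, query, limit=1000):
    """Run the query in Spark and convert only a bounded result to pandas."""
    # limit() is applied in Spark, so at most `limit` rows reach the driver.
    return spark.sql(query).limit(limit).toPandas()


if __name__ == "__main__":
    print(species_filter_query(
        "some_database.some_table",  # hypothetical table name
        "s__Escherichia_coli--RS_GCF_000005845.2",
        "some_score",  # hypothetical string-typed numeric column
        2,
    ))
```

On the hub, bounded_pandas(get_spark_session(), query) would then return a small pandas DataFrame suitable for plotting or CSV export, while the filtering itself stays in Spark.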