AgentSkillsCN

berdl

查询 KBase BERDL(BER 数据湖仓)数据库。适用于用户希望探索泛基因组数据、查询物种信息、获取基因组统计、分析基因簇、访问功能注释,或查询生物化学数据时使用。

SKILL.md
--- frontmatter
name: berdl
description: Query the KBase BERDL (BER Data Lakehouse) databases. Use when the user asks to explore pangenome data, query species information, get genome statistics, analyze gene clusters, access functional annotations, or query biochemistry data.
allowed-tools: Bash, Read

BERDL Data Lakehouse Query Skill

Query the KBase BERDL Data Lakehouse containing pangenome and biochemistry data.

Available Databases

DatabaseModuleDescription
kbase_ke_pangenomepangenome.md293K genomes, 27K species pangenomes
kbase_msd_biochemistrybiochemistry.mdModelSEED reactions and compounds
kescience_fitnessbrowserSee docs/schemas/fitnessbrowser.md48 organisms, 27M fitness measurements
kbase_genomesSee docs/schemas/genomes.md293K genomes, 253M protein sequences

Cross-database patterns: cross-database.md — joining pangenome ↔ biochemistry ↔ fitness ↔ NCBI

For the full inventory of 35 databases across 9 tenants, see docs/collections.md.

Read the appropriate module for database-specific tables, schemas, and query patterns. Read query-patterns.md before writing any SQL — it contains mandatory safety rules and performance guidance.

Authentication

All API requests require the token from .env:

bash
AUTH_TOKEN=$(grep "KBASE_AUTH_TOKEN" .env | cut -d'"' -f2)

API Endpoints

Base URL: https://hub.berdl.kbase.us/apis/mcp/

List Databases

bash
curl -s -X POST \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"use_hms": true, "filter_by_namespace": true}' \
  https://hub.berdl.kbase.us/apis/mcp/delta/databases/list

List Tables

bash
curl -s -X POST \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"database": "DATABASE_NAME", "use_hms": true}' \
  https://hub.berdl.kbase.us/apis/mcp/delta/databases/tables/list

Get Schema

bash
curl -s -X POST \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"database": "DATABASE_NAME", "table": "TABLE_NAME"}' \
  https://hub.berdl.kbase.us/apis/mcp/delta/databases/tables/schema

Count Rows

bash
curl -s -X POST \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"database": "DATABASE_NAME", "table": "TABLE_NAME"}' \
  https://hub.berdl.kbase.us/apis/mcp/delta/tables/count

Sample Data

bash
curl -s -X POST \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"database": "DATABASE_NAME", "table": "TABLE_NAME", "limit": 5}' \
  https://hub.berdl.kbase.us/apis/mcp/delta/tables/sample

Execute SQL Query

bash
curl -s -X POST \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT * FROM database.table LIMIT 10", "limit": 1000, "offset": 0}' \
  https://hub.berdl.kbase.us/apis/mcp/delta/tables/query

Structured SELECT (SQL-injection safe)

bash
curl -s -X POST \
  -H "Authorization: Bearer $AUTH_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "database": "DATABASE_NAME",
    "table": "TABLE_NAME",
    "columns": [{"column": "col1"}, {"column": "col2"}],
    "order_by": [{"column": "col1", "direction": "DESC"}],
    "limit": 20
  }' \
  https://hub.berdl.kbase.us/apis/mcp/delta/tables/select

Common Patterns

Pagination

sql
SELECT * FROM database.table
ORDER BY id
LIMIT 1000 OFFSET 0  -- First page

SELECT * FROM database.table
ORDER BY id
LIMIT 1000 OFFSET 1000  -- Second page

Always use ORDER BY for deterministic pagination.

Output Formatting

bash
... | python3 -m json.tool

Instructions for Claude

  1. Read auth token from .env first
  2. Read query-patterns.md — contains mandatory validation checklist and performance tiers
  3. Read the appropriate module for the target database (pangenome, biochemistry, etc.)
  4. Read cross-database.md if the query spans multiple databases
  5. Start with schema exploration if unfamiliar with table structure
  6. Check row counts with /count endpoint before querying large tables
  7. Use appropriate endpoint: /sample for inspection, /count for counts, /query for SQL
  8. Run the validation checklist from query-patterns.md before executing any SQL query
  9. Handle pagination for large result sets
  10. Include ORDER BY in queries for consistent pagination

Query Validation (mandatory)

Before executing any query, verify against the checklist in query-patterns.md:

  • Partitioned column filter present?
  • Large tables guarded?
  • Results bounded?
  • Types cast correctly?
  • Species IDs quoted?
  • Annotation NULLs filtered?
  • ORDER BY for pagination?
  • Correct JOIN keys?

Performance Tiers

Expected Result SizeStrategy
< 100K rowsREST API, .toPandas() OK
100K – 10M rowsFilter + aggregate in SQL first
> 10M rowsPySpark on JupyterHub only

Error Handling

ErrorMeaningSolution
504 Gateway TimeoutQuery took too longSimplify query, add filters, switch to JupyterHub
524 Origin TimeoutServer didn't respondRetry after a few seconds
503 "cannot schedule new futures"Spark executor restartingWait 30s, retry
Empty responseQuery failed silentlyCheck query syntax, verify table exists
Auth errorsInvalid or expired tokenValidate KBASE_AUTH_TOKEN in .env

Rule of thumb: If the REST API fails twice, switch to JupyterHub with spark.sql().

Adding New Databases

Use the /berdl-discover skill to introspect new databases and generate module files.

Pitfall Detection

When you encounter errors, unexpected results, retry cycles, performance issues, or data surprises during this task, follow the pitfall-capture protocol. Read .claude/skills/pitfall-capture/SKILL.md and follow its instructions to determine whether the issue should be added to docs/pitfalls.md.