Research Hypothesis Generation Skill
Generate testable research questions using BERDL data.
Available Data for Research
Pangenome Collection (kbase_ke_pangenome)
Scale: 293,059 genomes across 27,690 species from GTDB r214
Key Data Types:
- •Pangenome structure: Core, accessory, and singleton gene classifications per species
- •Functional annotations: COG categories, KEGG pathways, GO terms, EC numbers, PFAM domains
- •Genome quality: CheckM completeness/contamination, assembly statistics
- •Phylogenetic context: Full GTDB taxonomy, intra-species ANI values
- •Environmental context: Sample metadata and environmental embeddings (~28% of genomes)
- •Sequence relationships: Pairwise ANI values (421M pairs)
Tables:
- •genome (293,059) - Genome metadata
- •pangenome (27,690) - Per-species pangenome statistics
- •gene_cluster - Core/accessory/singleton classification
- •eggnog_mapper_annotations (93M+) - Functional annotations
- •genome_ani (421M) - Pairwise ANI values
- •gtdb_metadata - Quality metrics and assembly stats
- •sample, ncbi_env - Environmental metadata
ModelSEED Biochemistry (kbase_msd_biochemistry)
Scale: Comprehensive biochemistry reference database
Key Data Types:
- •Reactions: 56,012 biochemical reactions with thermodynamic data
- •Compounds: 45,708 metabolites with structures
- •Stoichiometry: Reaction-compound relationships
- •Molecular structures: SMILES, InChIKey representations
Tables:
- •reaction (56,012) - Biochemical reactions with deltaG
- •molecule (45,708) - Chemical compounds
- •reagent (262,517) - Reaction stoichiometry
- •structure (97,490) - Molecular structures
Research Question Templates
Comparative Genomics
- •Core vs Accessory Function
  - •"Which COG/KEGG categories are enriched in core vs accessory genes of [species]?"
  - •"Do essential metabolic pathways localize to the core genome?"
  - •"What functional categories are over-represented in singleton genes?"
- •Pangenome Architecture
  - •"Do pathogens have more open or closed pangenomes than environmental bacteria?"
  - •"Is pangenome openness correlated with genome size or habitat diversity?"
  - •"Which taxonomic groups show the largest accessory genomes?"
- •Phylogenetic Patterns
  - •"How does intra-species ANI correlate with core genome size?"
  - •"Do closely related species share more accessory genes than expected by chance?"
  - •"At what phylogenetic depth do functional categories diverge?"
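Several of the questions above ask whether two per-species quantities are correlated (for example intra-species ANI and core genome size). A minimal sketch of a rank-based (Spearman) correlation, using only the Python standard library, is shown here. The input values are made up for illustration and are not BERDL results; in practice they would come from a query against the pangenome tables.

```python
from math import sqrt

def average_ranks(values):
    """Return 1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all values tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-species values, for illustration only.
mean_ani = [95.8, 96.5, 97.2, 98.1, 99.0]
core_size = [1800, 2100, 2050, 2600, 3100]
rho = spearman(mean_ani, core_size)
```

A rank-based measure is used in this sketch because gene counts and ANI values are unlikely to be linearly related; it does not correct for phylogenetic non-independence between species.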
Ecological Genomics
- •Environment-Genome Relationships
  - •"Do genomes from similar environments share more genes than expected from phylogeny?"
  - •"Which gene functions are enriched in specific environmental niches?"
  - •"Can environmental embeddings predict gene content?"
- •Niche Adaptation
  - •"What accessory genes distinguish host-associated vs free-living strains?"
  - •"Are there core metabolic differences between aquatic and terrestrial bacteria?"
  - •"Which pathways show environment-specific expansion or loss?"
Metabolic Analysis
- •Pathway Conservation
  - •"Which metabolic pathways are universally conserved vs lineage-specific?"
  - •"Do core genes encode different metabolic functions than accessory genes?"
  - •"What is the distribution of transport reactions across bacterial phyla?"
- •Thermodynamic Constraints
  - •"Are thermodynamically favorable reactions (strongly negative deltaG) more conserved?"
  - •"Do essential pathways contain irreversible reactions?"
  - •"Is reaction reversibility correlated with gene essentiality?"
Functional Evolution
- •Gene Family Dynamics
  - •"Which gene families show the highest rates of gain/loss across species?"
  - •"Are certain COG categories more prone to horizontal transfer?"
  - •"Do gene clusters show functional coherence in accessory regions?"
Hypothesis Development Framework
When developing a hypothesis, specify these components:
1. Observation
What pattern do you see or expect in the data?
Example: "Pathogenic species may have smaller core genomes and larger accessory genomes than environmental species."
2. Null Hypothesis (H0)
What would you expect by chance or in the absence of the effect?
Example: "Core genome fraction is independent of pathogenic lifestyle."
3. Test Strategy
What SQL queries or analyses would test this?
-- Compare core fraction between pathogens and non-pathogens
-- (Requires external pathogen classification or use of specific taxa)
-- This query returns the pathogen group only; rerun it with other taxa
-- and a different category label to obtain the comparison group.
SELECT
    s.GTDB_species,
    p.no_genomes,
    CAST(p.no_core AS FLOAT) / p.no_gene_clusters AS core_fraction,
    'pathogen' AS category
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
    ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
WHERE s.GTDB_species LIKE '%Salmonella%'
   OR s.GTDB_species LIKE '%Staphylococcus_aureus%'
   OR s.GTDB_species LIKE '%Streptococcus_pneumoniae%'
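Once core fractions are available for both groups, the null hypothesis from step 2 can be tested directly. The following is a minimal sketch of a two-sided permutation test on the difference in group means, using only the Python standard library. The function name `permutation_test` and the input values are illustrative assumptions, not part of BERDL, and the numbers are made up rather than query results.

```python
import random
from statistics import fmean

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = fmean(group_a) - fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Shuffle group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = fmean(pooled[:n_a]) - fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value

# Hypothetical core fractions, for illustration only.
pathogen_core_fraction = [0.21, 0.18, 0.25, 0.30, 0.22]
other_core_fraction = [0.41, 0.38, 0.35, 0.44, 0.40]
observed_diff, p_value = permutation_test(pathogen_core_fraction, other_core_fraction)
```

A permutation test is chosen here because it makes no distributional assumption about core fractions. It still treats species as independent observations, so it does not address the phylogenetic confounder.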
4. Potential Confounders
What factors might create spurious associations?
- •Sampling bias: Over-represented species may not be representative
- •Phylogenetic signal: Related species share traits; must control for phylogeny
- •Genome quality: Incomplete genomes affect gene counts
- •Annotation completeness: Some genomes may have better functional annotations
- •Definition thresholds: Core gene definitions depend on prevalence cutoffs
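The genome quality confounder above can be reduced by filtering on the CheckM columns before counting genes. This is a minimal sketch that assumes rows have already been fetched as dictionaries; the `genome_id` key, the example rows, and the 90% / 5% thresholds are assumptions chosen for illustration, not BERDL defaults.

```python
def passes_quality(row, min_completeness=90.0, max_contamination=5.0):
    """Return True if a genome row meets both CheckM thresholds.

    The default thresholds are illustrative assumptions; choose values
    that suit the analysis.
    """
    return (
        row["checkm_completeness"] >= min_completeness
        and row["checkm_contamination"] <= max_contamination
    )

# Hypothetical rows, shaped like a result from gtdb_metadata.
genomes = [
    {"genome_id": "g1", "checkm_completeness": 99.2, "checkm_contamination": 0.4},
    {"genome_id": "g2", "checkm_completeness": 82.5, "checkm_contamination": 1.1},
    {"genome_id": "g3", "checkm_completeness": 95.0, "checkm_contamination": 4.9},
    {"genome_id": "g4", "checkm_completeness": 97.3, "checkm_contamination": 7.8},
]

kept = [g["genome_id"] for g in genomes if passes_quality(g)]
```

Reporting how many genomes each filter removes makes it easier to judge whether the filter itself introduces a sampling bias.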
Example Hypothesis Projects
Project 1: COG Category Analysis
Question: Are certain functional categories preferentially core vs accessory?
Approach:
SELECT
    ann.COG_category,
    gc.is_core,
    COUNT(*) AS gene_count
FROM kbase_ke_pangenome.gene_cluster gc
JOIN kbase_ke_pangenome.eggnog_mapper_annotations ann
    ON gc.gene_cluster_id = ann.query_name
WHERE gc.gtdb_species_clade_id = '[SPECIES_ID]'
GROUP BY ann.COG_category, gc.is_core
ORDER BY ann.COG_category, gc.is_core
Expected patterns:
- •Core: J (translation), L (replication), F (nucleotide metabolism)
- •Accessory: V (defense), X (mobilome), S (unknown function)
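To check whether a category is really enriched in the core genome, the counts from the Project 1 query can be arranged into a 2x2 table per COG category and tested. The following is a minimal sketch of an odds ratio and a one-sided Fisher's exact test written with the standard library only. The function names and the counts are illustrative assumptions, not BERDL results.

```python
from math import comb

def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]]; inf if b*c is zero."""
    return (a * d) / (b * c) if b * c else float("inf")

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment in the first row.

    Table layout (rows: core / accessory, columns: in category / not):
        a = core genes in the category      b = core genes not in it
        c = accessory genes in the category d = accessory genes not in it
    Returns P(X >= a) under the hypergeometric null.
    """
    row1 = a + b          # all core genes
    col1 = a + c          # all genes in the category
    n = a + b + c + d     # all genes
    tail = 0
    for x in range(a, min(row1, col1) + 1):
        tail += comb(row1, x) * comb(n - row1, col1 - x)
    return tail / comb(n, col1)

# Hypothetical counts for one COG category, for illustration only.
core_in, core_out = 120, 1880
accessory_in, accessory_out = 30, 2970

ratio = odds_ratio(core_in, core_out, accessory_in, accessory_out)
p_enriched = fisher_exact_greater(core_in, core_out, accessory_in, accessory_out)
```

Because one test is run per COG category, the resulting p-values should be corrected for multiple testing before any category is reported as enriched.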
Project 2: Pangenome Openness Survey
Question: Which taxonomic groups have the most open pangenomes?
Approach:
SELECT
    SPLIT_PART(s.GTDB_taxonomy, ';', 2) AS phylum,
    AVG(CAST(p.no_singleton_gene_clusters AS FLOAT) / p.no_gene_clusters) AS avg_singleton_frac,
    AVG(CAST(p.no_core AS FLOAT) / p.no_gene_clusters) AS avg_core_frac,
    COUNT(*) AS n_species
FROM kbase_ke_pangenome.pangenome p
JOIN kbase_ke_pangenome.gtdb_species_clade s
    ON p.gtdb_species_clade_id = s.gtdb_species_clade_id
WHERE p.no_genomes >= 10
GROUP BY SPLIT_PART(s.GTDB_taxonomy, ';', 2)
HAVING COUNT(*) >= 10
ORDER BY avg_singleton_frac DESC
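The query above groups by the second semicolon-separated field of the taxonomy string. The following minimal sketch shows the same split in Python, assuming the standard GTDB lineage format with rank prefixes such as `d__` and `p__`. The example lineage is written out by hand to show the format and should be checked against the actual GTDB_taxonomy values.

```python
RANKS = ("domain", "phylum", "class", "order", "family", "genus", "species")

def parse_gtdb_taxonomy(taxonomy):
    """Split a GTDB lineage string into a rank -> name dictionary.

    Rank prefixes such as 'p__' are removed when present.
    """
    parts = [part.strip() for part in taxonomy.split(";")]
    return {
        rank: (part.split("__", 1)[1] if "__" in part else part)
        for rank, part in zip(RANKS, parts)
    }

# Illustrative lineage string, assumed to follow the GTDB format.
lineage = (
    "d__Bacteria;p__Bacillota;c__Bacilli;o__Staphylococcales;"
    "f__Staphylococcaceae;g__Staphylococcus;s__Staphylococcus aureus"
)
taxonomy = parse_gtdb_taxonomy(lineage)
phylum = taxonomy["phylum"]
```

Grouping at other ranks (class, family, genus) only requires reading a different key, which is useful when a phylum-level summary hides variation within large phyla.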
Project 3: Environment-Function Association
Question: Do genomes from aquatic environments encode different functions?
Approach:
- •Identify genomes with environmental metadata
- •Compare COG distributions between environment categories
- •Test for enrichment using appropriate statistics
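Comparing COG distributions means running one enrichment test per category, so the p-values need a multiple-testing correction. One common choice, shown here as a sketch and not as a BERDL requirement, is the Benjamini-Hochberg procedure for controlling the false discovery rate. The p-values are made up for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical per-category p-values from enrichment tests.
raw_p = {"J": 0.001, "V": 0.02, "X": 0.03, "S": 0.5}
categories = list(raw_p)
adjusted_p = dict(zip(categories, benjamini_hochberg([raw_p[c] for c in categories])))

# Categories passing an assumed 5% false discovery rate threshold.
significant = [c for c in categories if adjusted_p[c] <= 0.05]
```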
Generating Hypotheses: Step-by-Step
When a user describes their research interest:
- •Identify relevant BERDL tables and columns
  - •Which tables contain the variables needed?
  - •What are the join keys between tables?
- •Suggest 2-3 testable hypotheses
  - •Frame each as a null hypothesis that can be rejected
  - •Ensure data exists to address the question
- •Provide an example SQL query
  - •Working query that returns relevant data
  - •Include appropriate filters and aggregations
- •Note potential confounders
  - •Sampling bias, phylogenetic non-independence
  - •Data quality issues, annotation completeness
- •Link to similar existing analyses
  - •Reference the projects/ directory for examples
  - •Suggest building on existing work
- •Suggest literature review
  - •After generating hypotheses, suggest: "Use /literature-review to check whether this hypothesis has been explored in published research"
  - •Literature review can reveal: prior results, established methods, confounders identified by others, and gaps that BERDL could uniquely fill
- •Suggest research plan
  - •After presenting hypotheses and literature review suggestion, also suggest: "Use /research-plan to refine this hypothesis with a literature review, check data feasibility, and create a structured plan for your analysis"
Data Limitations to Consider
- •Taxonomic bias: Some species vastly over-represented (e.g., E. coli, S. aureus)
- •Annotation completeness: ~40% of genes lack functional annotation
- •Environmental metadata: Only ~28% of genomes have environmental embeddings
- •Pangenome definitions: Based on specific clustering parameters
- •ModelSEED coverage: Not all reactions have thermodynamic data
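One way to limit the taxonomic bias noted above is to cap how many genomes any single species contributes to a genome-level analysis. The following is a minimal sketch of seeded random subsampling; the function name, the cap, and the identifiers are illustrative assumptions.

```python
import random
from collections import defaultdict

def cap_genomes_per_species(genome_species_pairs, max_per_species, seed=0):
    """Keep at most `max_per_species` randomly chosen genomes per species."""
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for genome_id, species in genome_species_pairs:
        by_species[species].append(genome_id)

    kept = []
    for species in sorted(by_species):
        ids = sorted(by_species[species])
        if len(ids) > max_per_species:
            # Seeded sampling keeps the subsample reproducible.
            ids = sorted(rng.sample(ids, max_per_species))
        kept.extend((genome_id, species) for genome_id in ids)
    return kept

# Hypothetical (genome, species) pairs, for illustration only.
pairs = [
    ("g1", "sp_A"), ("g2", "sp_A"), ("g3", "sp_A"), ("g4", "sp_A"),
    ("g5", "sp_B"),
]
capped = cap_genomes_per_species(pairs, max_per_species=2)
```

Repeating the analysis with several seeds gives a rough sense of how sensitive a result is to which genomes were sampled.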
Quick Reference: Key Columns for Hypotheses
| Research Area | Key Tables | Key Columns |
|---|---|---|
| Pangenome structure | pangenome, gene_cluster | no_core, no_aux_genome, is_core |
| Functional annotations | eggnog_mapper_annotations | COG_category, KEGG_Pathway, EC |
| Taxonomy | gtdb_species_clade, gtdb_taxonomy_r214v1 | GTDB_taxonomy, phylum, genus |
| Quality metrics | gtdb_metadata | checkm_completeness, checkm_contamination |
| Environment | sample, ncbi_env | Environmental attributes |
| Biochemistry | reaction, molecule | deltag, reversibility, is_transport |
Pitfall Detection
When you encounter errors, unexpected results, retry cycles, performance issues, or data surprises during this task, follow the pitfall-capture protocol. Read .claude/skills/pitfall-capture/SKILL.md and follow its instructions to determine whether the issue should be added to docs/pitfalls.md.