Biological Sequence Retrieval
Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.
Workflow Overview
Phase 0: Clarify (if needed)
↓
Phase 1: Disambiguate Gene/Organism
↓
Phase 2: Search & Retrieve (Internal)
↓
Phase 3: Report Sequence Profile
Phase 0: Clarification (When Needed)
Ask the user ONLY if:
- Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
- Sequence type unclear (mRNA, genomic, protein?)
- Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)
Skip clarification for:
- Specific accession numbers (NC_, NM_, U*, etc.)
- Clear organism + gene combinations
- Complete genome requests with organism specified
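The two lists above can be read as a small decision function. The following is a minimal sketch, not part of the tool set: the function name `clarification_questions`, its parameters, and the accession regular expression are all assumptions made for illustration, and the pattern only approximates real accession formats.

```python
import re

# Hypothetical pattern (assumption): RefSeq-style accessions such as NC_000913.3
# or GenBank-style accessions such as U00096.3, with an optional version suffix.
ACCESSION_RE = re.compile(r"^(?:[A-Z]{2}_[A-Z]{0,4}\d+|[A-Z]{1,2}\d{5,8})(?:\.\d+)?$")


def clarification_questions(query, organism=None, gene=None, seq_type=None):
    """Return the questions to ask before searching; an empty list means proceed."""
    if query and ACCESSION_RE.match(query.strip().upper()):
        return []  # a specific accession needs no clarification
    questions = []
    if gene and not organism:
        questions.append(f"Which organism should {gene} come from?")
    if gene and not seq_type:
        questions.append("Which sequence type: genomic, mRNA, or protein?")
    return questions
```

A complete-genome request with the organism given passes no `gene`, so it yields no questions, which matches the skip list.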
Phase 1: Gene/Organism Disambiguation
1.1 Resolve Identifiers
from tooluniverse import ToolUniverse

tu = ToolUniverse()
tu.load_tools()

# Strategy depends on input type.
# user_provided_accession, user_provided_gene_and_organism, organism, and gene
# stand for values parsed from the user's request.
if user_provided_accession:
    # Direct retrieval based on accession type
    accession = user_provided_accession
elif user_provided_gene_and_organism:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )
1.2 Accession Type Decision Tree
CRITICAL: Accession prefix determines which tools to use.
| Prefix | Type | Use With |
|---|---|---|
| NC_* | RefSeq chromosome | NCBI only |
| NM_* | RefSeq mRNA | NCBI only |
| NR_* | RefSeq ncRNA | NCBI only |
| NP_* | RefSeq protein | NCBI only |
| XM_* | RefSeq predicted mRNA | NCBI only |
| U*, M*, K*, X* | GenBank | NCBI or ENA |
| CP* | GenBank genome | NCBI or ENA |
| NZ_* | RefSeq genome (typically prokaryotic) | NCBI only |
| EMBL format | EMBL | ENA preferred |
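The prefix table can be turned into a lookup so that tool selection does not depend on memory. This is a sketch only: `classify_accession` and `REFSEQ_PREFIXES` are hypothetical names, and treating every other "two letters plus underscore" prefix (for example XP_ or WP_) as RefSeq is an assumption of this sketch rather than something the table states.

```python
# NCBI-only prefixes copied from the table above.
REFSEQ_PREFIXES = {
    "NC_": "RefSeq chromosome",
    "NM_": "RefSeq mRNA",
    "NR_": "RefSeq ncRNA",
    "NP_": "RefSeq protein",
    "XM_": "RefSeq predicted mRNA",
}


def classify_accession(accession):
    """Return (type, allowed_sources) for an accession string."""
    acc = accession.strip().upper()
    for prefix, label in REFSEQ_PREFIXES.items():
        if acc.startswith(prefix):
            return label, ["NCBI"]
    # Assumption: any other two-letter + underscore prefix is also RefSeq.
    if len(acc) > 3 and acc[2] == "_" and acc[:2].isalpha():
        return "RefSeq (other)", ["NCBI"]
    # Everything else is treated as a GenBank/EMBL accession.
    return "GenBank/EMBL", ["NCBI", "ENA"]
```

The second element of the result is the list of sources that may be queried, so a caller can skip the ENA tools whenever "ENA" is absent.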
1.3 Identity Resolution Checklist
- Organism confirmed (scientific name)
- Gene symbol/name identified
- Sequence type determined (genomic/mRNA/protein)
- Strain specified (if relevant)
- Accession prefix identified → tool selection
Phase 2: Data Retrieval (Internal)
Retrieve silently. Do NOT narrate the search process.
2.1 Search for Sequences
# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,        # Optional
    keywords=keywords,    # Optional
    seq_type=seq_type,    # complete_genome, mrna, refseq
    limit=10
)
# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)
2.2 Retrieve Sequence Data
# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)
# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)
2.3 ENA Alternative (for GenBank/EMBL accessions)
# Only for non-RefSeq accessions!
if not accession.startswith(("NC_", "NM_", "NR_", "NP_", "XM_", "XR_", "XP_", "NZ_")):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)
    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)
Fallback Chains
| Primary | Fallback | Notes |
|---|---|---|
| NCBI_get_sequence | ENA tools (GenBank/EMBL accessions only) | Use when NCBI is unavailable |
| ena_get_entry | NCBI_get_sequence | Use when ENA lacks the record (ENA does not hold RefSeq) |
| NCBI_search_nucleotide | Retry with broader keywords | Use when the search returns no results |
Critical Rule: Never call ENA tools with RefSeq accessions (NC_, NM_, etc.); they return 404 errors.
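The fallback chain and the critical rule combine into one guard: try NCBI first, and fall back to ENA only when the accession is not RefSeq. The sketch below assumes the two fetchers are passed in as plain callables (hypothetical wrappers around `NCBI_get_sequence` and `ena_get_sequence_fasta`); `is_refseq` and `fetch_with_fallback` are illustrative names, and the underscore test is a heuristic, not an official check.

```python
def is_refseq(accession):
    """Heuristic: RefSeq accessions start with two letters and an underscore."""
    acc = accession.strip().upper()
    return len(acc) > 3 and acc[2] == "_" and acc[:2].isalpha()


def fetch_with_fallback(accession, fetch_ncbi, fetch_ena):
    """Return (source, data), trying NCBI first and ENA second.

    fetch_ncbi and fetch_ena are caller-supplied callables that take an
    accession and either return data or raise an exception.
    """
    try:
        return "NCBI", fetch_ncbi(accession)
    except Exception:
        if is_refseq(accession):
            # ENA does not hold RefSeq records, so re-raise the NCBI error.
            raise
        return "ENA", fetch_ena(accession)
```

Injecting the fetchers keeps the guard testable without network access and leaves the real tool calls unchanged.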
Phase 3: Report Sequence Profile
Output Structure
Present as a Sequence Profile Report. Hide search process.
# Sequence Profile: [Gene/Organism]

**Search Summary**
- Query: [gene] in [organism]
- Database: NCBI Nucleotide
- Results: [N] sequences found

---

## Primary Sequence

### [Accession]: [Definition/Title]

| Attribute | Value |
|-----------|-------|
| **Accession** | [accession] |
| **Type** | RefSeq / GenBank |
| **Organism** | [scientific name] |
| **Strain** | [strain if applicable] |
| **Length** | [X,XXX bp / aa] |
| **Molecule** | DNA / mRNA / Protein |
| **Topology** | Linear / Circular |

**Curation Level**: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party

### Sequence Statistics

| Statistic | Value |
|-----------|-------|
| **Length** | [X,XXX] bp |
| **GC Content** | [XX.X]% |
| **Genes** | [N] (if genome) |
| **CDS** | [N] (if annotated) |

### Sequence Preview

```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```
Annotations Summary (from GenBank format)
| Feature | Count | Examples |
|---|---|---|
| CDS | [N] | [gene names] |
| tRNA | [N] | - |
| rRNA | [N] | 16S, 23S |
| Regulatory | [N] | promoters |
Alternative Sequences
Ranked by relevance and curation level:
| Accession | Type | Length | Description | ENA Compatible |
|---|---|---|---|---|
| NC_000913.3 | RefSeq | 4.6 Mb | E. coli K-12 reference | ✗ |
| U00096.3 | GenBank | 4.6 Mb | E. coli K-12 | ✓ |
| CP001509.3 | GenBank | 4.6 Mb | E. coli DH10B | ✓ |
Cross-Database References
| Database | Accession | Link |
|---|---|---|
| RefSeq | [NC_*] | [NCBI link] |
| GenBank | [U*] | [NCBI link] |
| ENA/EMBL | [same as GenBank] | [ENA link] |
| BioProject | [PRJNA*] | [link] |
| BioSample | [SAMN*] | [link] |
Download Options
Formats Available
| Format | Description | Use Case |
|---|---|---|
| FASTA | Sequence only | BLAST, alignment |
| GenBank | Sequence + annotations | Gene analysis |
| GFF3 | Annotations only | Genome browsers |
Direct Commands
# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)
# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)
Related Sequences
Other Strains/Isolates
| Accession | Strain | Similarity | Notes |
|---|---|---|---|
| [acc1] | [strain1] | 99.9% | [notes] |
| [acc2] | [strain2] | 99.5% | [notes] |
Protein Products (if applicable)
| Protein Accession | Product Name | Length |
|---|---|---|
| [NP_*] | [protein name] | [X] aa |
Retrieved: [date]
Database: NCBI Nucleotide
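The Length and GC Content fields of the report template can be derived directly from the FASTA text returned by the sequence tools. The helper below is a sketch under the assumption that the input holds a single FASTA record as a plain string; `fasta_stats` is a hypothetical name, and ambiguous bases (N, S, and so on) are counted in the length but not as G or C.

```python
def fasta_stats(fasta_text):
    """Return header, length, and GC percentage for single-record FASTA text."""
    lines = fasta_text.strip().splitlines()
    # The header is the first line without its leading ">".
    header = lines[0][1:].strip() if lines and lines[0].startswith(">") else ""
    # Join every non-header line into one uppercase sequence string.
    seq = "".join(line.strip() for line in lines if not line.startswith(">")).upper()
    gc_count = seq.count("G") + seq.count("C")
    gc_percent = round(100.0 * gc_count / len(seq), 1) if seq else 0.0
    return {"header": header, "length": len(seq), "gc_percent": gc_percent}
```

For protein sequences the GC figure is meaningless, so a caller would report only the length (in aa) in that case.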
---

## Curation Level Tiers (Aligned with Evidence Grading)

### Sequence Curation Levels

| Tier | Symbol | Accession Prefix | Description | Evidence Equivalent |
|------|--------|------------------|-------------|---------------------|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard | ★★★ |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted | ★★☆ |
| GenBank Validated | ●●○○ | Various | Submitted, some curation | ★★☆ |
| GenBank Direct | ●○○○ | Various | Direct submission | ★☆☆ |
| Third Party | ○○○○ | TPA_ | Third-party annotation | ★☆☆ |

### Data Reliability Mapping

| Data Type | Reliability | Notes |
|-----------|-------------|-------|
| RefSeq curated sequence | ★★★ | Gold standard for reference |
| RefSeq annotations | ★★★ | Validated gene models |
| GenBank sequence | ★★☆ | Submitted, generally reliable |
| GenBank annotations | ★☆☆ | Submitter-provided, verify |
| Predicted genes (XM_) | ★★☆ | Computational, may lack validation |
| Genome assembly | ★★★ to ★☆☆ | Depends on assembly quality |

Include in report:

```markdown
**Curation Level**: ●●●● RefSeq Reference (★★★)
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use

**Data Reliability Note**:
- Sequence: ★★★ (experimentally derived)
- Gene annotations: ★★★ (curated models)
- Variant annotations: ★★☆ (computational)
```
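The tier table also gives a natural ordering for the "ranked by relevance and curation level" list of alternative sequences. The following sketch ranks accessions by prefix only; `curation_tier` and `rank_by_curation` are hypothetical names. Because the table lists "Various" prefixes for the two GenBank tiers, the prefix alone cannot separate them, so this sketch lumps them into one fallback tier (an assumption).

```python
# Prefix groups from the tier table; list order encodes rank (0 = best curated).
TIER_PREFIXES = [
    ("RefSeq Reference", "●●●●", ("NC_", "NM_", "NP_")),
    ("RefSeq Predicted", "●●●○", ("XM_", "XP_", "XR_")),
]
# Assumption: everything else falls into a combined GenBank tier.
FALLBACK_TIER = ("GenBank (validated or direct)", "●●○○ or ●○○○")


def curation_tier(accession):
    """Return (rank, tier name, symbol) for an accession."""
    acc = accession.strip().upper()
    for rank, (name, symbol, prefixes) in enumerate(TIER_PREFIXES):
        if acc.startswith(prefixes):
            return rank, name, symbol
    return len(TIER_PREFIXES), FALLBACK_TIER[0], FALLBACK_TIER[1]


def rank_by_curation(accessions):
    """Sort accessions best-curated first; ties keep their input order."""
    return sorted(accessions, key=lambda acc: curation_tier(acc)[0])
```

Since `sorted` is stable, any relevance ordering already present in the search results is preserved within each tier.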
Completeness Checklist
Every sequence report MUST include:
Per Sequence (Required)
- Accession number
- Organism (scientific name)
- Sequence type (DNA/RNA/protein)
- Length
- Curation level
- Database source
Search Summary (Required)
- Query parameters
- Number of results
- Ranking rationale
Include Even If Limited
- Alternative sequences (or "Only one sequence found")
- Cross-database references (or "No cross-references available")
- Download instructions
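The required items in this checklist can be checked mechanically before a report is returned. The sketch below assumes the report is held as a plain dictionary with a `search_summary` dict and a `sequences` list; that structure, the field names, and `missing_fields` itself are hypothetical and chosen only to mirror the checklist.

```python
# Field names are assumptions that mirror the checklist items above.
REQUIRED_SEQUENCE_FIELDS = (
    "accession", "organism", "sequence_type", "length", "curation_level", "database",
)
REQUIRED_SUMMARY_FIELDS = ("query", "result_count", "ranking_rationale")


def missing_fields(report):
    """Return dotted paths of required fields that are absent or empty."""
    missing = []
    summary = report.get("search_summary", {})
    for key in REQUIRED_SUMMARY_FIELDS:
        if summary.get(key) in (None, ""):
            missing.append(f"search_summary.{key}")
    for index, seq in enumerate(report.get("sequences", [])):
        for key in REQUIRED_SEQUENCE_FIELDS:
            if seq.get(key) in (None, ""):
                missing.append(f"sequences[{index}].{key}")
    return missing
```

A result count of 0 is treated as present, since "no results" is still a reportable search outcome.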
Common Use Cases
Reference Genome
User: "Get E. coli K-12 complete genome"
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)
# Return NC_000913.3 (RefSeq reference)
Gene Sequence
User: "Find human BRCA1 mRNA"
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)
Specific Accession
User: "Get sequence for NC_045512.2" → Direct retrieval with full metadata
Strain Comparison
User: "Compare E. coli K-12 and O157:H7 genomes" → Search both strains, provide comparison table
Error Handling
| Error | Response |
|---|---|
| "No search criteria provided" | Add organism, gene, or keywords |
| "ENA 404 error" | Accession is likely RefSeq → use NCBI only |
| "No results found" | Broaden search, check spelling, try synonyms |
| "Sequence too large" | Note size, provide download link instead of preview |
| "API rate limit" | Tools auto-retry; if persistent, wait briefly |
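For the "No results found" row, one way to broaden a search is to drop optional filters one at a time and retry. The generator below sketches that idea; `broadened_queries` and the default drop order (keywords, then strain, then sequence type) are assumptions, not behaviour of the search tool.

```python
def broadened_queries(params, drop_order=("keywords", "strain", "seq_type")):
    """Yield the original parameters, then progressively relaxed copies.

    Filters are removed cumulatively in drop_order; parameters whose value
    is None are ignored from the start.
    """
    current = {key: value for key, value in params.items() if value is not None}
    yield dict(current)
    for key in drop_order:
        if key in current:
            del current[key]
            yield dict(current)
```

A caller would run the search with each yielded dictionary in turn and stop at the first attempt that returns results, noting in the report which filters were relaxed.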
Tool Reference
NCBI Tools (All Accessions)
| Tool | Purpose |
|---|---|
| NCBI_search_nucleotide | Search by gene/organism |
| NCBI_fetch_accessions | Convert UIDs to accessions |
| NCBI_get_sequence | Retrieve sequence data |
ENA Tools (GenBank/EMBL Only)
| Tool | Purpose |
|---|---|
| ena_get_entry | Entry metadata |
| ena_get_sequence_fasta | FASTA sequence |
| ena_get_entry_summary | Summary info |
Search Parameters Reference
NCBI_search_nucleotide
| Parameter | Description | Example |
|---|---|---|
| operation | Always "search" | "search" |
| organism | Scientific name | "Homo sapiens" |
| gene | Gene symbol | "BRCA1" |
| strain | Specific strain | "K-12" |
| keywords | Free text | "complete genome" |
| seq_type | Sequence type | "complete_genome", "mrna", "refseq" |
| limit | Max results | 10 |
NCBI_get_sequence
| Parameter | Description | Example |
|---|---|---|
| operation | Always "fetch_sequence" | "fetch_sequence" |
| accession | Accession number | "NC_000913.3" |
| format | Output format | "fasta", "genbank" |
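The search parameters above can be assembled and checked in one place before the tool is called. This is a sketch: `build_search_params` is a hypothetical helper, and it assumes the three `seq_type` values listed in the table are the only valid ones, which the source does not confirm.

```python
# Assumption: the seq_type values listed in the table are exhaustive.
VALID_SEQ_TYPES = {"complete_genome", "mrna", "refseq"}


def build_search_params(organism=None, gene=None, strain=None,
                        keywords=None, seq_type=None, limit=10):
    """Build keyword arguments for NCBI_search_nucleotide, omitting unset values."""
    if not (organism or gene or keywords):
        raise ValueError("No search criteria provided: add organism, gene, or keywords")
    if seq_type is not None and seq_type not in VALID_SEQ_TYPES:
        raise ValueError(f"Unsupported seq_type: {seq_type!r}")
    params = {
        "operation": "search",
        "organism": organism,
        "gene": gene,
        "strain": strain,
        "keywords": keywords,
        "seq_type": seq_type,
        "limit": limit,
    }
    # Drop parameters the caller did not set.
    return {key: value for key, value in params.items() if value is not None}
```

The result can be unpacked into the call, for example `tu.tools.NCBI_search_nucleotide(**build_search_params(organism="Homo sapiens", gene="BRCA1", seq_type="mrna"))`.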