Biological Sequence Retrieval

Retrieve DNA, RNA, and protein sequences with proper disambiguation and cross-database handling.

Workflow Overview

code

Phase 0: Clarify (if needed)
    ↓
Phase 1: Disambiguate Gene/Organism
    ↓
Phase 2: Search & Retrieve (Internal)
    ↓
Phase 3: Report Sequence Profile

Phase 0: Clarification (When Needed)

Ask the user ONLY if:

• Gene name exists in multiple organisms (e.g., "BRCA1" → human or mouse?)
• Sequence type unclear (mRNA, genomic, protein?)
• Strain/isolate matters (e.g., E. coli → K-12, O157:H7, etc.)

Skip clarification for:

• Specific accession numbers (NC_, NM_, U*, etc.)
• Clear organism + gene combinations
• Complete genome requests with organism specified
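
The "skip clarification" rule for accession numbers can be checked mechanically before asking the user anything. The sketch below is one possible approach, not part of ToolUniverse: `find_accession` and its regular expression are hypothetical, and the pattern covers only the common RefSeq and classic GenBank nucleotide shapes, so it should be treated as a heuristic.

```python
import re
from typing import Optional

# Assumed, non-exhaustive accession shapes (illustrative only):
#   RefSeq:  two letters, "_", optional letters, digits  (NC_045512, NZ_CP...)
#   GenBank: 1 letter + 5 digits, or 2 letters + 6 or 8 digits (U00096, CP001509)
# Either may carry a ".version" suffix.
ACCESSION_RE = re.compile(
    r"\b(?:[A-Z]{2}_[A-Z]{0,6}\d{5,}|[A-Z]\d{5}|[A-Z]{2}(?:\d{8}|\d{6}))(?:\.\d+)?\b"
)


def find_accession(text: str) -> Optional[str]:
    """Return the first accession-like token in the request, or None."""
    match = ACCESSION_RE.search(text)
    return match.group(0) if match else None
```

If `find_accession` returns a value, Phase 0 can be skipped and the accession passed straight to Phase 1.2. A request such as "Find human BRCA1 mRNA" yields `None` and goes through normal disambiguation.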

Phase 1: Gene/Organism Disambiguation

1.1 Resolve Identifiers

python

from tooluniverse import ToolUniverse
tu = ToolUniverse()
tu.load_tools()

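# Inputs parsed from the user's request (placeholder values for illustration)
user_provided_accession = None  # e.g., "NC_000913.3"
organism, gene = "Homo sapiens", "BRCA1"

# Strategy depends on input type
if user_provided_accession:
    # Direct retrieval; the accession prefix decides which tools apply (see 1.2)
    accession = user_provided_accession
elif organism and gene:
    # Search NCBI Nucleotide
    result = tu.tools.NCBI_search_nucleotide(
        operation="search",
        organism=organism,
        gene=gene,
        limit=10
    )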

1.2 Accession Type Decision Tree

CRITICAL: Accession prefix determines which tools to use.

Prefix	Type	Use With
NC_*	RefSeq chromosome	NCBI only
NM_*	RefSeq mRNA	NCBI only
NR_*	RefSeq ncRNA	NCBI only
NP_*	RefSeq protein	NCBI only
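XM_*	RefSeq predicted mRNA	NCBI only
XR_*	RefSeq predicted ncRNA	NCBI only
XP_*	RefSeq predicted protein	NCBI only
NZ_*	RefSeq genomic (complete or WGS)	NCBI only
U, M, K, X	GenBank	NCBI or ENA
CP	GenBank genome	NCBI or ENA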
EMBL format	EMBL	ENA preferred
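
The decision tree can be reduced to one check, because RefSeq accessions share a distinctive shape: two letters followed by an underscore. The following is a minimal sketch under that assumption; `allowed_databases` is a hypothetical helper, not a ToolUniverse function.

```python
import re

# RefSeq accessions start with two letters and an underscore (NC_, NM_, XP_, NZ_, ...).
REFSEQ_RE = re.compile(r"^[A-Z]{2}_")


def allowed_databases(accession: str) -> tuple:
    """Return the databases that may be queried for this accession."""
    acc = accession.strip().upper()
    if REFSEQ_RE.match(acc):
        return ("NCBI",)          # RefSeq records are served by NCBI only
    return ("NCBI", "ENA")        # GenBank/EMBL accessions are available in both
```

Checking the shape rather than a fixed prefix list means less common RefSeq prefixes are also routed to NCBI only.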

1.3 Identity Resolution Checklist

• Organism confirmed (scientific name)
• Gene symbol/name identified
• Sequence type determined (genomic/mRNA/protein)
• Strain specified (if relevant)
• Accession prefix identified → tool selection
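
One way to track the checklist is a small record that reports which items are still open. This is an illustrative sketch; the `ResolvedQuery` class and its field names are assumptions chosen to mirror the checklist, not an existing API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ResolvedQuery:
    organism: Optional[str] = None   # scientific name
    gene: Optional[str] = None       # gene symbol or name
    seq_type: Optional[str] = None   # e.g. "mrna", "complete_genome"
    strain: Optional[str] = None
    accession: Optional[str] = None

    def unresolved(self, strain_matters: bool = False) -> List[str]:
        """List checklist items that still need an answer."""
        if self.accession:
            return []  # a specific accession needs no further disambiguation
        required = ["organism", "seq_type"]
        if self.seq_type != "complete_genome":
            required.insert(1, "gene")  # genome requests do not need a gene
        if strain_matters:
            required.append("strain")
        return [name for name in required if not getattr(self, name)]
```

An empty list means Phase 1 is complete. Any remaining names are candidates for a Phase 0 question.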

Phase 2: Data Retrieval (Internal)

Retrieve silently. Do NOT narrate the search process.

2.1 Search for Sequences

python

# Search NCBI Nucleotide
result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism=organism,
    gene=gene,
    strain=strain,  # Optional
    keywords=keywords,  # Optional
    seq_type=seq_type,  # complete_genome, mrna, refseq
    limit=10
)

# Get accession numbers from UIDs
accessions = tu.tools.NCBI_fetch_accessions(
    operation="fetch_accession",
    uids=result["data"]["uids"]
)

2.2 Retrieve Sequence Data

python

# Get sequence in desired format
sequence = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="fasta"  # or "genbank"
)

# GenBank format for annotations
annotations = tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession=accession,
    format="genbank"
)
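
The GenBank record is what feeds the feature counts in the report. Assuming the tool returns the standard GenBank flat-file text as a string (an assumption about the return value, which may be wrapped in a dict in practice), a rough feature tally could look like this. `count_features` is a hypothetical helper.

```python
import re
from collections import Counter

# In the flat-file feature table, a feature key starts after exactly five spaces;
# qualifier lines are indented further, so they do not match.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+\S")


def count_features(genbank_text: str) -> Counter:
    """Count feature keys (CDS, tRNA, rRNA, ...) in the FEATURES section."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
        elif line.startswith(("ORIGIN", "CONTIG", "//")):
            in_features = False  # sequence lines must not be parsed as features
        elif in_features:
            match = FEATURE_LINE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts
```

For anything beyond counting, a dedicated parser is a safer choice than this line-based sketch.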

2.3 ENA Alternative (for GenBank/EMBL accessions)

python

# Only for non-RefSeq accessions! (RefSeq prefixes are two letters plus "_")
if not accession.startswith(("NC_", "NM_", "NR_", "NP_", "XM_", "XP_", "XR_", "NZ_")):
    # ENA entry info
    entry = tu.tools.ena_get_entry(accession=accession)
    
    # ENA FASTA
    fasta = tu.tools.ena_get_sequence_fasta(accession=accession)
    
    # ENA summary
    summary = tu.tools.ena_get_entry_summary(accession=accession)

Fallback Chains

Primary	Fallback	Notes
NCBI_get_sequence	ENA tools (GenBank/EMBL accessions only)	NCBI unavailable
ena_get_entry	NCBI_get_sequence	ENA does not hold RefSeq records
NCBI_search_nucleotide	Try broader keywords	No results

Critical Rule: Never try ENA tools with RefSeq accessions (NC_, NM_, etc.) - they will return 404 errors.
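
The fallback chain and the critical rule can be combined in one guard. The sketch below takes the two fetchers as plain callables so it does not depend on any particular tool signature. `fetch_with_fallback` and `is_refseq` are hypothetical names, and in practice the callables would wrap `NCBI_get_sequence` and `ena_get_sequence_fasta`.

```python
from typing import Callable, Tuple


def is_refseq(accession: str) -> bool:
    """RefSeq accessions are two letters followed by an underscore."""
    return len(accession) > 2 and accession[:2].isalpha() and accession[2] == "_"


def fetch_with_fallback(
    accession: str,
    ncbi_fetch: Callable[[str], str],
    ena_fetch: Callable[[str], str],
) -> Tuple[str, str]:
    """Try NCBI first; fall back to ENA only for non-RefSeq accessions."""
    try:
        return "NCBI", ncbi_fetch(accession)
    except Exception as ncbi_error:
        if is_refseq(accession):
            # ENA does not hold RefSeq records, so there is nothing to fall back to.
            raise RuntimeError(
                f"NCBI failed for RefSeq accession {accession}; ENA is not an option"
            ) from ncbi_error
        return "ENA", ena_fetch(accession)
```

Returning the source name alongside the data lets the report state which database actually served the sequence.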

Phase 3: Report Sequence Profile

Output Structure

Present as a Sequence Profile Report. Hide search process.

markdown

# Sequence Profile: [Gene/Organism]

**Search Summary**
- Query: [gene] in [organism]
- Database: NCBI Nucleotide
- Results: [N] sequences found

---

## Primary Sequence

### [Accession]: [Definition/Title]

| Attribute | Value |
|-----------|-------|
| **Accession** | [accession] |
| **Type** | RefSeq / GenBank |
| **Organism** | [scientific name] |
| **Strain** | [strain if applicable] |
| **Length** | [X,XXX bp / aa] |
| **Molecule** | DNA / mRNA / Protein |
| **Topology** | Linear / Circular |

**Curation Level**: ●●●● RefSeq Reference / ●●●○ RefSeq Predicted / ●●○○ GenBank Validated / ●○○○ GenBank Direct / ○○○○ Third Party

### Sequence Statistics
| Statistic | Value |
|-----------|-------|
| **Length** | [X,XXX] bp |
| **GC Content** | [XX.X]% |
| **Genes** | [N] (if genome) |
| **CDS** | [N] (if annotated) |

### Sequence Preview
```fasta
>[accession] [definition]
ATGCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCG
ATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGATCGA
... [truncated, full sequence in download]
```

Annotations Summary (from GenBank format)

Feature	Count	Examples
CDS	[N]	[gene names]
tRNA	[N]	-
rRNA	[N]	16S, 23S
Regulatory	[N]	promoters

Alternative Sequences

Ranked by relevance and curation level:

Accession	Type	Length	Description	ENA Compatible
NC_000913.3	RefSeq	4.6 Mb	E. coli K-12 reference	✗
U00096.3	GenBank	4.6 Mb	E. coli K-12	✓
CP001509.3	GenBank	4.6 Mb	E. coli BL21(DE3)	✓

Cross-Database References

Database	Accession	Link
RefSeq	[NC_*]	[NCBI link]
GenBank	[U*]	[NCBI link]
ENA/EMBL	[same as GenBank]	[ENA link]
BioProject	[PRJNA*]	[link]
BioSample	[SAMN*]	[link]

Download Options

Formats Available

Format	Description	Use Case
FASTA	Sequence only	BLAST, alignment
GenBank	Sequence + annotations	Gene analysis
GFF3	Annotations only	Genome browsers

Direct Commands

python

# FASTA format
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="fasta"
)

# GenBank format (with annotations)
tu.tools.NCBI_get_sequence(
    operation="fetch_sequence",
    accession="[accession]",
    format="genbank"
)

Related Sequences

Other Strains/Isolates

Accession	Strain	Similarity	Notes
[acc1]	[strain1]	99.9%	[notes]
[acc2]	[strain2]	99.5%	[notes]

Protein Products (if applicable)

Protein Accession	Product Name	Length
[NP_*]	[protein name]	[X] aa

Retrieved: [date] Database: NCBI Nucleotide
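
The Length and GC Content fields and the truncated Sequence Preview in the template above can be derived directly from the FASTA text. This is a minimal sketch assuming a single nucleotide record returned as a plain string; `fasta_stats` and `fasta_preview` are hypothetical helpers.

```python
def fasta_stats(fasta_text: str) -> dict:
    """Compute header, length, and GC percentage for a single-record FASTA string."""
    lines = fasta_text.strip().splitlines()
    header = lines[0][1:] if lines and lines[0].startswith(">") else ""
    seq = "".join(l.strip() for l in lines if not l.startswith(">")).upper()
    gc = seq.count("G") + seq.count("C")
    gc_percent = round(100 * gc / len(seq), 1) if seq else 0.0
    return {"header": header, "length": len(seq), "gc_percent": gc_percent}


def fasta_preview(fasta_text: str, max_lines: int = 2) -> str:
    """Keep the header plus the first max_lines sequence lines."""
    lines = fasta_text.strip().splitlines()
    shown = lines[: max_lines + 1]
    if len(lines) > len(shown):
        shown.append("... [truncated, full sequence in download]")
    return "\n".join(shown)
```

For protein records the GC figure is meaningless and should be omitted from the statistics table.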

---

## Curation Level Tiers (Aligned with Evidence Grading)

### Sequence Curation Levels
| Tier | Symbol | Accession Prefix | Description | Evidence Equivalent |
|------|--------|------------------|-------------|---------------------|
| RefSeq Reference | ●●●● | NC_, NM_, NP_ | NCBI-curated, gold standard | ★★★ |
| RefSeq Predicted | ●●●○ | XM_, XP_, XR_ | Computationally predicted | ★★☆ |
| GenBank Validated | ●●○○ | Various | Submitted, some curation | ★★☆ |
| GenBank Direct | ●○○○ | Various | Direct submission | ★☆☆ |
| Third Party | ○○○○ | TPA_ | Third-party annotation | ★☆☆ |
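
For the two RefSeq tiers, the symbol can be looked up from the accession prefix alone. The GenBank and third-party tiers are listed with "Various" prefixes, so they need record-level metadata and are deliberately left unresolved here. `tier_from_prefix` is a hypothetical helper, shown only as a sketch of the lookup.

```python
from typing import Optional, Tuple

# Prefix groups copied from the tier table above.
REFSEQ_REFERENCE = ("NC_", "NM_", "NP_")
REFSEQ_PREDICTED = ("XM_", "XP_", "XR_")


def tier_from_prefix(accession: str) -> Optional[Tuple[str, str]]:
    """Return (tier name, symbol) when the prefix alone decides the tier."""
    acc = accession.strip().upper()
    if acc.startswith(REFSEQ_REFERENCE):
        return ("RefSeq Reference", "●●●●")
    if acc.startswith(REFSEQ_PREDICTED):
        return ("RefSeq Predicted", "●●●○")
    return None  # GenBank and other tiers need record metadata, not just the prefix
```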

### Data Reliability Mapping
| Data Type | Reliability | Notes |
|-----------|-------------|-------|
| RefSeq curated sequence | ★★★ | Gold standard for reference |
| RefSeq annotations | ★★★ | Validated gene models |
| GenBank sequence | ★★☆ | Submitted, generally reliable |
| GenBank annotations | ★☆☆ | Submitter-provided, verify |
| Predicted genes (XM_) | ★★☆ | Computational, may lack validation |
| Genome assembly | ★★★-★☆☆ | Depends on assembly quality |

Include in report:
```markdown
**Curation Level**: ●●●● RefSeq Reference (★★★)
- Curated by NCBI RefSeq project
- Regular updates and validation
- Recommended for reference use

**Data Reliability Note**:
- Sequence: ★★★ (experimentally derived)
- Gene annotations: ★★★ (curated models)
- Variant annotations: ★★☆ (computational)
```

Completeness Checklist

Every sequence report MUST include:

Per Sequence (Required)

Search Summary (Required)

• Query parameters
• Number of results
• Ranking rationale

Include Even If Limited

• Alternative sequences (or "Only one sequence found")
• Cross-database references (or "No cross-references available")
• Download instructions
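
A finished report can be checked against this list before it is returned. The sketch below is a simple substring check, assuming the section titles from the report template are used verbatim. `missing_sections` and the `REQUIRED_SECTIONS` tuple are illustrative, not part of any tool.

```python
from typing import List

# Section titles taken from the report template; adjust if the template changes.
REQUIRED_SECTIONS = (
    "Search Summary",
    "Alternative Sequences",
    "Cross-Database References",
    "Download Options",
)


def missing_sections(report: str) -> List[str]:
    """Return required section titles that do not appear in the report text."""
    lowered = report.lower()
    return [title for title in REQUIRED_SECTIONS if title.lower() not in lowered]
```

A section that has nothing to report should still appear with its fallback sentence (for example "Only one sequence found"), so that the check passes.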

Common Use Cases

Reference Genome

User: "Get E. coli K-12 complete genome"

python

result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Escherichia coli",
    strain="K-12",
    seq_type="complete_genome",
    limit=3
)
# Return NC_000913.3 (RefSeq reference)

Gene Sequence

User: "Find human BRCA1 mRNA"

python

result = tu.tools.NCBI_search_nucleotide(
    operation="search",
    organism="Homo sapiens",
    gene="BRCA1",
    seq_type="mrna",
    limit=10
)

Specific Accession

User: "Get sequence for NC_045512.2" → Direct retrieval with full metadata

Strain Comparison

User: "Compare E. coli K-12 and O157:H7 genomes" → Search both strains, provide comparison table
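
For the strain comparison case, the per-strain results can be collected into one markdown table. This sketch assumes each strain has already been summarized as a dict; the column names and the `comparison_table` helper are hypothetical.

```python
from typing import Iterable, Sequence


def comparison_table(
    records: Iterable[dict],
    columns: Sequence[str] = ("accession", "strain", "length_bp", "gc_percent"),
) -> str:
    """Render one markdown table row per strain, with 'n/a' for missing values."""
    lines = [
        "| " + " | ".join(columns) + " |",
        "|" + "|".join("---" for _ in columns) + "|",
    ]
    for record in records:
        cells = [str(record.get(column, "n/a")) for column in columns]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)
```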

Error Handling

Error	Response
"No search criteria provided"	Add organism, gene, or keywords
"ENA 404 error"	Accession is likely RefSeq → use NCBI only
"No results found"	Broaden search, check spelling, try synonyms
"Sequence too large"	Note size, provide download link instead of preview
"API rate limit"	Tools auto-retry; if persistent, wait briefly

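The "No results found" response (broaden the search) can be expressed as a short ladder of progressively looser queries. The sketch below takes the search as a caller-supplied callable that returns a list of UIDs, so it makes no assumption about the real tool's return shape. `search_with_broadening` and the order of the relaxation steps are illustrative choices.

```python
from typing import Callable, List, Optional, Tuple


def search_with_broadening(
    search: Callable[..., List[str]],
    organism: str,
    gene: Optional[str] = None,
    strain: Optional[str] = None,
    keywords: Optional[str] = None,
) -> Tuple[Optional[dict], List[str]]:
    """Retry the search with fewer constraints until something is found."""
    attempts = [
        {"organism": organism, "gene": gene, "strain": strain, "keywords": keywords},
        {"organism": organism, "gene": gene, "keywords": keywords},  # drop strain
        {"organism": organism, "gene": gene},                        # drop keywords
        {"organism": organism, "keywords": gene},                    # gene as free text
    ]
    tried = []
    for attempt in attempts:
        params = {key: value for key, value in attempt.items() if value}
        if params in tried:
            continue  # skip duplicates created by empty fields
        tried.append(params)
        uids = search(**params)
        if uids:
            return params, uids
    return None, []
```

The parameters that finally matched are returned with the hits, which gives the report a basis for its ranking rationale.
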
Tool Reference

NCBI Tools (All Accessions)

Tool	Purpose
`NCBI_search_nucleotide`	Search by gene/organism
`NCBI_fetch_accessions`	Convert UIDs to accessions
`NCBI_get_sequence`	Retrieve sequence data

ENA Tools (GenBank/EMBL Only)

Tool	Purpose
`ena_get_entry`	Entry metadata
`ena_get_sequence_fasta`	FASTA sequence
`ena_get_entry_summary`	Summary info

Search Parameters Reference

NCBI_search_nucleotide

Parameter	Description	Example
`operation`	Always "search"	"search"
`organism`	Scientific name	"Homo sapiens"
`gene`	Gene symbol	"BRCA1"
`strain`	Specific strain	"K-12"
`keywords`	Free text	"complete genome"
`seq_type`	Sequence type	"complete_genome", "mrna", "refseq"
`limit`	Max results	10

NCBI_get_sequence

Parameter	Description	Example
`operation`	Always "fetch_sequence"	"fetch_sequence"
`accession`	Accession number	"NC_000913.3"
`format`	Output format	"fasta", "genbank"
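
A pre-flight check on the search parameters avoids the "No search criteria provided" error before any call is made. The sketch below assumes the `seq_type` values listed in the table are the only accepted ones, which the table does not guarantee. `validate_search_params` is a hypothetical helper.

```python
from typing import List

# Assumption: the table's examples are the full set of accepted values.
ALLOWED_SEQ_TYPES = {"complete_genome", "mrna", "refseq"}


def validate_search_params(params: dict) -> List[str]:
    """Return a list of problems; an empty list means the call can proceed."""
    problems = []
    if not any(params.get(key) for key in ("organism", "gene", "keywords")):
        problems.append("No search criteria provided: add organism, gene, or keywords")
    seq_type = params.get("seq_type")
    if seq_type is not None and seq_type not in ALLOWED_SEQ_TYPES:
        problems.append(f"Unknown seq_type: {seq_type!r}")
    limit = params.get("limit", 10)
    if not isinstance(limit, int) or limit < 1:
        problems.append("limit must be a positive integer")
    return problems
```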