ORF Verifier Tool
Verify that expected amino acid sequences are present in plasmid DNA using six-frame translation.
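As an illustration of the six-frame translation mentioned above, the following is a minimal Python sketch and not the tool's actual implementation. It assumes the standard genetic code (NCBI translation table 1) and a linear sequence; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an A/C/G/T sequence."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate codon by codon; codons not in the table (e.g. with N) become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna: str) -> dict:
    """Return {(strand, frame): protein} for the three frames on each strand."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            frames[(strand, frame)] = translate(seq[frame:])
    return frames


if __name__ == "__main__":
    for key, protein in six_frame_translation("ATGAAATAG").items():
        print(key, protein)
```

An expected amino acid sequence is then "present" if it occurs as a substring of any of the six translations. This sketch does not handle ORFs that cross the origin of a circular plasmid.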
CLI Usage
Use the CLI at scripts/orf_verifier_cli.py:
Basic Verification
bash
scripts/orf_verifier_cli.py --plasmid /path/to/plasmid.gb --aa-sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK" --name "GFP"
With NCBI Accession
bash
scripts/orf_verifier_cli.py --accession MN514974 --aa-sequence "MHHHHHH..." --name "His6-protein"
JSON Output
bash
scripts/orf_verifier_cli.py --plasmid plasmid.fasta --aa-sequence "MHHHHHH..." --json
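To call the CLI from another script, one possible wrapper is sketched below. It uses only flags shown in this document. The helper names are hypothetical, and three things are assumptions: that --plasmid and --accession are mutually exclusive, that the CLI exits with status 0 on success, and that with --json it prints only JSON to stdout. The JSON schema is not documented here, so the result is returned unparsed beyond json.loads.

```python
import json
import subprocess
import sys

CLI_PATH = "scripts/orf_verifier_cli.py"  # path as given in this document


def build_command(aa_sequence, plasmid=None, accession=None, name=None):
    """Assemble the argument list from flags documented in this README."""
    # Assumption: exactly one sequence source is given per invocation.
    if (plasmid is None) == (accession is None):
        raise ValueError("provide exactly one of plasmid or accession")
    cmd = [sys.executable, CLI_PATH, "--aa-sequence", aa_sequence, "--json"]
    cmd += ["--plasmid", plasmid] if plasmid else ["--accession", accession]
    if name:
        cmd += ["--name", name]
    return cmd


def run_verifier(**kwargs):
    """Run the CLI and parse its stdout as JSON (schema not assumed)."""
    result = subprocess.run(
        build_command(**kwargs), capture_output=True, text=True, check=True
    )
    return json.loads(result.stdout)
```

For example, run_verifier(aa_sequence="MHHHHHH...", plasmid="plasmid.fasta") would run the same command as the JSON example above, with the elided sequence replaced by the real one.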
Options
- •--plasmid, -p FILE: Path to plasmid file (FASTA or GenBank format)
- •--accession, -a ID: NCBI accession number to fetch
- •--aa-sequence, -s SEQUENCE: Amino acid sequence to verify (required)
- •--name, -n NAME: Name for the ORF (default: Query)
- •--allow-alt-start: Allow alternative start codons (GTG, TTG)
- •--min-identity FLOAT: Minimum identity threshold (0-1, default: 1.0)
- •--max-mismatches INT: Maximum allowed amino acid mismatches (default: 0)
- •--disallow-internal-met: Disallow internal methionine start codons
- •--json: Output results as JSON
- •--list-tags: List available tag definitions
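To make the --min-identity and --max-mismatches thresholds concrete, here is a hedged Python sketch of one way they could be evaluated. It assumes an ungapped, position-by-position comparison of equal-length sequences and assumes both thresholds must hold at once; how the tool actually aligns sequences and combines the two options is not stated in this document. The function names are hypothetical.

```python
def compare_ungapped(expected: str, observed: str):
    """Return (mismatches, identity) for two equal-length protein sequences."""
    if not expected or len(expected) != len(observed):
        raise ValueError("this sketch only compares non-empty, equal-length sequences")
    mismatches = sum(a != b for a, b in zip(expected, observed))
    identity = (len(expected) - mismatches) / len(expected)
    return mismatches, identity


def passes_thresholds(expected, observed, min_identity=1.0, max_mismatches=0):
    """Assumption: a candidate must satisfy both thresholds to be accepted."""
    mismatches, identity = compare_ungapped(expected, observed)
    return identity >= min_identity and mismatches <= max_mismatches


if __name__ == "__main__":
    # One substitution in ten residues.
    print(compare_ungapped("MSKGEELFTG", "MSKGEELFTA"))
    print(passes_thresholds("MSKGEELFTG", "MSKGEELFTA"))
    print(passes_thresholds("MSKGEELFTG", "MSKGEELFTA", min_identity=0.9, max_mismatches=1))
```

With the documented defaults (identity 1.0, zero mismatches), only an exact match passes in this sketch.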
Verification Results
Status Types
- •verified: ORF found at exactly one location
- •not-found: ORF not found in any reading frame
- •indeterminate: Multiple possible locations found
Placement Information
- •Strand: Plus (+) or Minus (-) strand
- •Frame: Reading frame (0, 1, or 2)
- •Position: Nucleotide start and end positions (1-indexed in output)
- •Wraps origin: Whether the ORF crosses the plasmid origin
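The status types and placement fields above can be illustrated with a short Python sketch that searches a circular plasmid for an exact protein match. This is not the tool's code. The conventions below are assumptions made for the example: frame is the 0-based start offset on the searched strand modulo 3, and start/end are the leftmost and rightmost 1-indexed plus-strand coordinates, so start is greater than end when the ORF wraps the origin. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"  # standard genetic code (NCBI table 1), codons ordered T, C, A, G
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def find_placements(plasmid, protein):
    """Return one dict per exact match of `protein` on a circular plasmid."""
    plasmid = plasmid.upper()
    n, orf_len = len(plasmid), 3 * len(protein)
    if not protein or orf_len > n:
        return []
    placements = []
    for strand, seq in (("+", plasmid), ("-", plasmid.translate(COMPLEMENT)[::-1])):
        # Append the first orf_len - 1 bases so ORFs crossing the origin are contiguous.
        extended = seq + seq[:orf_len - 1]
        for offset in range(3):
            translated = translate(extended[offset:])
            hit = translated.find(protein)
            while hit != -1:
                s = offset + 3 * hit  # 0-based start on the searched strand
                if strand == "+":
                    left, right = s + 1, (s + orf_len - 1) % n + 1
                else:
                    # Map reverse-complement indices back to plus-strand coordinates.
                    left, right = (n - s - orf_len) % n + 1, n - s
                placements.append({
                    "strand": strand,
                    "frame": s % 3,
                    "start": left,
                    "end": right,
                    "wraps_origin": s + orf_len > n,
                })
                hit = translated.find(protein, hit + 1)
    return placements


def placement_status(placements):
    """Map the number of matches to the status types listed above."""
    if not placements:
        return "not-found"
    return "verified" if len(placements) == 1 else "indeterminate"


if __name__ == "__main__":
    hits = find_placements("GAAACCCCAT", "MK")  # ATG AAA spans the origin
    print(placement_status(hits), hits)
```

Extending each strand by the ORF length minus one base is what lets a plain substring search find matches that cross the origin without reporting any match twice.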
Auto-Detected Tags
The tool automatically detects common protein tags:
- •His6, His10 (polyhistidine)
- •N-His6-TEV, N-His6-MBP-N10-TEV
- •StrepII, FLAG, HA, Myc
- •TEV site, HRV 3C site
- •SUMO
- •GGGGS linkers
View all tags:
bash
scripts/orf_verifier_cli.py --list-tags
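Tag detection of this kind can be approximated by a motif search over the protein sequence. The Python sketch below uses widely published sequences for a few of the tags listed above; the tool's own definitions (shown by --list-tags) may differ, and the dictionary and function names here are hypothetical.

```python
# Commonly used tag sequences. Assumption: these match the tool's definitions
# only approximately; consult --list-tags for the authoritative list.
TAGS = {
    "His6": "HHHHHH",
    "StrepII": "WSHPQFEK",
    "FLAG": "DYKDDDDK",
    "HA": "YPYDVPDYA",
    "Myc": "EQKLISEEDL",
    "TEV site": "ENLYFQ",       # core recognition motif, cleaved before a following G or S
    "HRV 3C site": "LEVLFQGP",
    "GGGGS linker": "GGGGS",
}


def detect_tags(protein: str):
    """Return (name, start, end) with 1-indexed positions, sorted by start."""
    hits = []
    for name, motif in TAGS.items():
        pos = protein.find(motif)
        while pos != -1:
            hits.append((name, pos + 1, pos + len(motif)))
            pos = protein.find(motif, pos + 1)
    return sorted(hits, key=lambda h: h[1])


if __name__ == "__main__":
    for hit in detect_tags("MHHHHHHENLYFQGSDYKDDDDK"):
        print(hit)
```

A plain substring search reports overlapping hits, so a His10 run would be reported as several His6 matches; a real implementation would likely prefer the longest tag at each position.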
Annotation Verification Mode
Batch verify all CDS annotations in pLannotate-annotated GenBank files against UniProt reference sequences.
Usage
bash
# Verify single plasmid
python3 scripts/orf_verifier_cli.py verify-annotations plasmid.gbk --targets-only

# Verify multiple files
python3 scripts/orf_verifier_cli.py verify-annotations *.gbk --targets-only --summary

# Verify directory of sequencing results
python3 scripts/orf_verifier_cli.py verify-annotations sequencing_results/ --targets-only -o report.md
Options
- •--targets-only: Only verify target proteins (HUMAN, MOUSE, BOVIN), skip E. coli vector components
- •--organism, -O CODE: Filter to specific organism (e.g., HUMAN)
- •--summary: Show only summary, not per-protein details
- •--json: Output as JSON
- •--output, -o FILE: Write to file instead of stdout
Result Status
- •PASS: 100% identity to UniProt reference (includes valid N/C-terminal truncations)
- •FAIL: Less than 100% identity (mutations, deletions, insertions)
- •ERROR: Could not verify (UniProt lookup failed, translation error)
Example Output
code
============================================================
ANNOTATION VERIFICATION SUMMARY
============================================================
Files processed: 10
Total PASS: 42
Total FAIL: 2
Total ERROR: 0
[PASS] plasmid_1.gbk: 7/7 passed
[FAIL] plasmid_2.gbk: 3/4 passed
Handling Truncations
N-terminal and C-terminal truncations are reported as PASS when the expressed region is 100% identical to the corresponding region of the UniProt sequence. A note in the output gives the truncation position:
code
[PASS] MED14_HUMAN (O60244)
Length: 1409 aa (UniProt: 1454 aa)
Identity: 100.0%
Note: N-term truncated: starts at UniProt position 46
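The truncation rule can be sketched in Python as a contiguous-substring check against the reference sequence. This is a simplified illustration and not the tool's implementation: it assumes an exact, ungapped match, the function name is hypothetical, and the wording of the C-terminal note is invented for the example (only the N-terminal note format appears above).

```python
def classify_against_reference(expressed: str, reference: str):
    """Return (status, note) following the PASS/FAIL rules described above."""
    start = reference.find(expressed) if expressed else -1
    if start == -1:
        return "FAIL", "expressed region is not an exact contiguous match"
    notes = []
    if start > 0:
        notes.append(f"N-term truncated: starts at UniProt position {start + 1}")
    end = start + len(expressed)  # 1-indexed position of the last matched residue
    if end < len(reference):
        # Hypothetical note format; not taken from the tool's output.
        notes.append(f"C-term truncated: ends at UniProt position {end}")
    return "PASS", "; ".join(notes)


if __name__ == "__main__":
    reference = "MACDEFGHIK"
    print(classify_against_reference("DEFGHIK", reference))  # N-terminal truncation
    print(classify_against_reference("DEFGHIR", reference))  # one substitution
```

In the MED14_HUMAN example, a 1409 aa expressed region that starts at UniProt position 46 and runs to the end of the 1454 aa reference (1454 - 46 + 1 = 1409) would produce only the N-terminal note under this rule.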