Optimize prompts through evolutionary search: generate variants, score with llm_judge, keep the fittest, mutate and recombine, repeat.
The Loop
1. Setup
Define before starting:
- •Seed prompt: the starting prompt to optimize
- •Test inputs: 3-5 representative inputs the prompt must handle
- •Judge criterion: a specific, binary pass/fail criterion (see
llm_judgeskill) - •Generations: how many rounds to run (default: 3)
- •Population size: variants per generation (default: 4)
2. Generate variants
Launch parallel sub-agents, each using discussion_partners to produce one variant. Apply two genetic operators:
Mutation (always): Literally change words in the prompt. Swap synonyms, delete filler, add specificity, reorder sentences, strengthen or weaken qualifiers. Each mutation should change 1-5 words or phrases — small, targeted edits, not wholesale rewrites.
Here is a prompt: [SEED]. Mutate it: change 1-5 specific words or phrases to potentially improve [CRITERION]. Keep the core intent. Return ONLY the new prompt text.
Crossover (generation 2+): Literally splice sections from two parent prompts. Split each parent into numbered sections (by paragraph or logical block), then take section 1 from parent A, section 2 from parent B, etc. — producing a child that inherits concrete text from both parents.
Here are two prompts that scored well: Parent A: [PROMPT_A] Parent B: [PROMPT_B] Split each into sections. Combine: take some sections from A and some from B to create a child prompt. Then apply 1-3 word-level mutations. Return ONLY the new prompt text.
For generation 1, all variants mutate from the seed. For later generations, half the population uses crossover from the top 2 parents, half uses mutation from a random survivor.
3. Score each variant
For each variant, for each test input:
- •Run the variant prompt against a test model (via
discussion_partners) - •Score the output using
llm_judgewith the defined criterion - •Record pass/fail per input
Variant fitness = pass rate across all test inputs.
4. Select
Keep the top 50% of variants (minimum 2). These become parents for the next generation.
5. Repeat
Mutate and recombine the surviving variants to fill the population. Run for the specified number of generations.
6. Return the winner
Output the highest-fitness prompt with its score and the specific test results.
Example
Seed: "Summarize this article in 3 bullet points" Criterion: "Each bullet captures a distinct key point, no redundancy, no hallucinated claims" Test inputs: [3 different articles] Generations: 3, Population: 4
Generation 1: 4 variants of the seed, scored → keep top 2 Generation 2: 2 survivors + 2 offspring (mutated/recombined), scored → keep top 2 Generation 3: same → return winner
Sub-Agent Orchestration
Each scoring round involves population × test_inputs parallel sub-agent calls. Structure:
- •Variant generation: parallel sub-agents, each uses
discussion_partnersto create one variant - •Output generation: parallel sub-agents, each runs one variant × one test input via
discussion_partners - •Judging: parallel sub-agents, each scores one output via
llm_judgepattern
Maximize parallelism at each step. Wait for all results before proceeding to selection.
When to Stop Early
- •If the top variant achieves 100% pass rate for 2 consecutive generations
- •If fitness plateaus (no improvement for 2 generations)