To build great AI products, you must transition from subjective "vibe checks" to systematic measurement. This process identifies exactly where an LLM is failing and creates a feedback loop for continuous improvement.
Phase 1: Open Coding (The "Benevolent Dictator" Phase)
Before automating, you must manually ground yourself in the data. Appoint one "Benevolent Dictator" (typically the Product Manager or a domain expert) to define what "good" looks like.
- Sample the Data: Extract 50–100 "traces" (logs of full LLM interactions) from your observability tool (e.g., Braintrust, LangSmith, Phoenix).
- Note the Upstream Error: Read each trace. If something is wrong, write a brief, informal note (an "Open Code") describing the first thing that went wrong.
  - Rule: Don't overthink it. Use specific language (e.g., "hallucinated virtual tour," "didn't confirm call transfer") rather than just "bad."
- Stop at Saturation: Continue until you stop learning new ways the system fails (Theoretical Saturation).
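The sampling and saturation steps can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names, the fixed seed, and the 20-trace window are assumptions, and it only works if you reuse the same short label whenever you see a failure you have already noted.

```python
import random


def sample_traces(traces, k=100, seed=0):
    """Pick up to k traces uniformly at random (seeded so the sample is reproducible)."""
    rng = random.Random(seed)
    return rng.sample(traces, min(k, len(traces)))


def reached_saturation(open_codes, window=20):
    """Rough saturation check (hypothetical heuristic).

    open_codes: notes in review order; "" means the trace had no error.
    Returns True when the last `window` traces produced no label that
    had not already appeared earlier in the review.
    """
    if len(open_codes) <= window:
        return False  # not enough history to judge yet
    seen_before = {code for code in open_codes[:-window] if code}
    recent = {code for code in open_codes[-window:] if code}
    return recent <= seen_before
```

Treat the check as a prompt to stop and reflect, not as a hard rule. Free-text notes rarely match exactly, so the reviewer's own sense of "nothing new here" remains the real signal.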
Phase 2: Axial Coding (Categorization)
Synthesize your mess of notes into actionable categories using an LLM.
- Export Notes: Put your open codes into a CSV or spreadsheet.
- Synthesize Failure Modes: Use an LLM (Claude or ChatGPT) to group your notes into 5–7 "Axial Codes" (failure categories).
  - Prompt Pattern: "Analyze these manual notes from AI traces and group them into actionable failure categories (Axial Codes). Each category should represent a specific product problem."
- Map Back: Use a spreadsheet formula or an LLM to assign every trace to one of these buckets.
- Prioritize: Create a pivot table to count the frequency of each category. Focus your engineering efforts on the highest-frequency or highest-risk buckets.
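If you would rather script the pivot table than build it in a spreadsheet, the count can be sketched with the standard library. The input shape (a list of `(trace_id, axial_code)` pairs, with an empty code for traces that had no error) and the function name are assumptions for illustration.

```python
from collections import Counter


def failure_frequencies(labeled_traces):
    """Count traces per axial code.

    labeled_traces: iterable of (trace_id, axial_code); "" or None = no error.
    Returns a list of (axial_code, count, share_of_failures), most frequent first.
    """
    counts = Counter(code for _, code in labeled_traces if code)
    total = sum(counts.values())
    return [(code, n, n / total) for code, n in counts.most_common()]


if __name__ == "__main__":
    rows = [
        ("t1", "Handoff Protocol Violation"),
        ("t2", "Capability Misrepresentation"),
        ("t3", "Handoff Protocol Violation"),
        ("t4", ""),  # trace with no error
    ]
    for code, n, share in failure_frequencies(rows):
        print(f"{code}: {n} ({share:.0%})")
```

Frequency is only half of the prioritization. A rare bucket can still deserve attention first if the risk is high, and that judgment is not something this count captures.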
Phase 3: Build the "LLM as Judge"
For complex, subjective failures (like "human handoff quality"), create an automated evaluator.
- Write the Judge Prompt: Create a separate prompt for an LLM whose only job is to evaluate one specific failure mode.
- Enforce Binary Scoring: Require the judge to output only True or False.
  - Note: Avoid 1–5 or 1–10 scales. They result in "weasel" metrics (e.g., a score of 3.7) that provide no clear direction for improvement.
- Define Rules: Include specific criteria from your "Benevolent Dictator" notes.
  - Example: "Output True if the user explicitly asked for a human and the assistant responded with a tool call without acknowledging the request."
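A single-purpose judge with strict binary output might look like the sketch below. The prompt wording beyond the rule quoted above, the names `JUDGE_PROMPT`, `parse_verdict`, and `judge`, and the `call_llm` callable (any function that takes a prompt string and returns the model's text) are all assumptions; plug in whichever client your stack uses.

```python
# Hypothetical judge prompt for one failure mode; the rule is taken from
# the example above, the surrounding wording is an assumption.
JUDGE_PROMPT = """You are evaluating one failure mode: human handoff.
Output True if the user explicitly asked for a human and the assistant
responded with a tool call without acknowledging the request.
Otherwise output False.
Output only the single word True or False.

Trace:
{trace}
"""


def parse_verdict(raw):
    """Accept only True/False; reject scores, hedges, and explanations."""
    text = raw.strip().rstrip(".").lower()
    if text == "true":
        return True
    if text == "false":
        return False
    raise ValueError(f"Judge returned a non-binary answer: {raw!r}")


def judge(trace, call_llm):
    """Run the judge on one trace. call_llm: prompt str -> response str."""
    return parse_verdict(call_llm(JUDGE_PROMPT.format(trace=trace)))
```

Raising on anything other than True or False is a deliberate choice here: a judge that drifts into prose or numeric scores should fail loudly rather than be silently coerced into a label.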
Phase 4: Alignment & Validation
Never ship an eval until you know the judge matches human judgment.
- Create an Agreement Matrix: Compare the Judge's True/False labels against your manual labels from Phase 1.
- Review Mismatches: Specifically look at:
  - False Positives: Judge said error, Human said no error.
  - False Negatives: Human said error, Judge said no error.
- Iterate: Refine the Judge's prompt until its labels agree with the "Benevolent Dictator" on at least 80–90% of traces.
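The agreement matrix is simple enough to compute by hand, but a short script makes it easy to rerun after every prompt change. In this sketch, True means "error" for both the human and the judge, matching the convention above; the function names and the dictionary keys are assumptions.

```python
def agreement_matrix(human_labels, judge_labels):
    """Cross-tabulate human vs. judge labels (True = error).

    Returns counts for tp, fp, fn, tn plus the indices of mismatched
    traces so they can be reviewed one by one.
    """
    if len(human_labels) != len(judge_labels):
        raise ValueError("Label lists must be the same length")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    mismatches = []
    for i, (human, judged) in enumerate(zip(human_labels, judge_labels)):
        if judged and human:
            counts["tp"] += 1
        elif judged and not human:
            counts["fp"] += 1  # judge said error, human said no error
            mismatches.append(i)
        elif human and not judged:
            counts["fn"] += 1  # human said error, judge said no error
            mismatches.append(i)
        else:
            counts["tn"] += 1
    return counts, mismatches


def agreement_rate(counts):
    """Share of traces where judge and human gave the same label."""
    total = sum(counts.values())
    return (counts["tp"] + counts["tn"]) / total if total else 0.0
```

Look at the false positive and false negative counts separately as well as the overall rate. When errors are rare, a judge that almost always answers False can still score high agreement while missing most real failures.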
Examples
Example 1: Real Estate AI Assistant
- Context: AI is supposed to book apartment tours.
- Open Code: "AI told the user a virtual tour was available when the property only offers in-person tours."
- Axial Code: "Capability Misrepresentation."
- Judge Logic: "Check the 'Property Context' tool output. If 'virtual_tour' is False, but the LLM response contains 'virtual tour,' output True (Error)."
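Because this judge logic is a plain lookup plus a string match, it can be sketched as a code check with no LLM involved. The `virtual_tour` field comes from the judge logic above; representing the tool output as a dictionary, and the function name, are assumptions.

```python
def capability_misrepresentation(property_context, response):
    """Return True (error) if the response mentions a virtual tour
    for a property whose context says none is offered.

    property_context: dict parsed from the 'Property Context' tool output
    (assumed shape). A missing 'virtual_tour' key is treated as False.
    """
    offers_virtual_tour = bool(property_context.get("virtual_tour", False))
    mentions_virtual_tour = "virtual tour" in response.lower()
    return mentions_virtual_tour and not offers_virtual_tour
```

Note the limit of the literal rule: a response such as "we don't offer a virtual tour for this unit" also contains the phrase and would be flagged. If that pattern shows up in your mismatches, it is a sign this failure mode needs an LLM judge rather than a substring check.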
Example 2: Customer Support Handoff
- Context: AI should hand off to a human for sensitive issues.
- Open Code: "User said they were frustrated with a leak, AI just gave a generic maintenance link."
- Axial Code: "Handoff Protocol Violation."
- Judge Logic: "Search for sentiment indicating frustration or emergency. If found, did the AI offer a human transfer? If not, output True (Error)."
Common Pitfalls
- Likert Scales: Using 1–5 scales makes it impossible to know if a change in score is meaningful. Use binary True/False.
- Automating Too Early: Do not let an LLM do the initial "Open Coding." It lacks the product context to know what "janky" looks like for your specific business.
- Committee Judging: Don't use a committee to define "good." Appoint one person with the best domain taste to be the final arbiter (The Benevolent Dictator).
- Chasing Generic Metrics: Don't rely on generic evals like "hallucination score" or "cosine similarity." They rarely correlate with product-specific success.