Convergence Detection Debugger
Overview
The AI Counsel convergence detection system (deliberation/convergence.py) determines when models have reached consensus and can stop deliberating early. It uses semantic similarity comparison between consecutive rounds, with voting outcomes taking precedence when available.
Common Issue: Convergence not detected → wasted API calls
Common Issue: Early stopping not triggering → deliberation runs full max_rounds
Common Issue: Semantic vs voting status conflict → confusing results
Diagnostic Workflow
Step 1: Examine the Transcript
What to look for:
- Are responses actually similar between rounds?
- Is there a "Convergence Information" section?
- What are the reported status and similarity scores?
- Are there votes? What's the voting outcome?
File location:
# Transcripts are in project root
ls -lt transcripts/*.md | head -5
# Open most recent (ls already prints the transcripts/ prefix)
open "$(ls -t transcripts/*.md | head -1)"
Convergence section example:
## Convergence Information
- **Status**: refining (40.00% - 85.00% similarity)
- **Average Similarity**: 72.31%
- **Minimum Similarity**: 68.45%
Voting section example (overrides semantic status):
## Final Voting Results
- **Winner**: TypeScript ✓
- **Status**: majority_decision
- **Tally**: TypeScript: 2, JavaScript: 1
Missing convergence section?
→ Check if round_number <= min_rounds_before_check (see Step 2)
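When several transcripts need checking, pulling these fields out programmatically can be quicker than opening each file. The following is a minimal sketch, assuming transcripts use exactly the field labels shown in the example above; `parse_convergence_info` is a hypothetical helper and not part of AI Counsel.

```python
from typing import Optional
import re

HEADER = "## Convergence Information"
# Field labels are assumed to match the transcript example above.
FIELD = re.compile(r"\*\*(Status|Average Similarity|Minimum Similarity)\*\*:\s*(.+)")


def parse_convergence_info(transcript_text: str) -> Optional[dict]:
    """Return the convergence fields from a transcript, or None if the section is missing."""
    start = transcript_text.find(HEADER)
    if start == -1:
        return None  # possibly round_number <= min_rounds_before_check
    body = transcript_text[start + len(HEADER):]
    # Stop at the next second-level heading so voting fields are not mixed in
    next_heading = body.find("\n## ")
    if next_heading != -1:
        body = body[:next_heading]
    return {label: value.strip() for label, value in FIELD.findall(body)}


sample = """## Convergence Information
- **Status**: refining (40.00% - 85.00% similarity)
- **Average Similarity**: 72.31%
- **Minimum Similarity**: 68.45%

## Final Voting Results
- **Status**: majority_decision
"""
info = parse_convergence_info(sample)
print(info)
```

A `None` result corresponds to the "Missing convergence section?" case and points at the configuration check in Step 2.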
Step 2: Check Configuration
Read the config:
cat config.yaml
Key settings to verify:
deliberation:
  convergence_detection:
    enabled: true                        # Must be true
    semantic_similarity_threshold: 0.85  # Convergence if ALL participants >= this
    divergence_threshold: 0.40           # Diverging if ANY participant < this
    min_rounds_before_check: 1           # Must be <= (total_rounds - 1)
    consecutive_stable_rounds: 2         # Require this many stable rounds
  early_stopping:
    enabled: true                        # Must be true for model-controlled stopping
    threshold: 0.66                      # Fraction of models that must want to stop (2/3)
    respect_min_rounds: true             # Won't stop before defaults.rounds
Common misconfigurations:
| Problem | Cause | Fix |
|---|---|---|
| No convergence info in transcript | min_rounds_before_check too high | For 2-round deliberation, use min_rounds_before_check: 1 |
| Early stopping never triggers | respect_min_rounds: true but models converge before defaults.rounds | Set to false or reduce defaults.rounds |
| Convergence threshold too strict | semantic_similarity_threshold: 0.95 | Lower to 0.80-0.85 for practical convergence |
| Everything marked "diverging" | divergence_threshold: 0.70 too high | Use default 0.40 (similarity rarely falls below 40%) |
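The checks in the table above can be automated with a small script. This is a sketch only: `check_convergence_config` is a hypothetical helper, it takes the `deliberation` mapping as a plain dict rather than loading config.yaml, and the 0.90 and 0.50 cutoffs are illustrative assumptions, not values defined by AI Counsel.

```python
def check_convergence_config(deliberation: dict, total_rounds: int) -> list:
    """Return warnings for the common misconfigurations listed above."""
    warnings = []
    conv = deliberation.get("convergence_detection", {})
    early = deliberation.get("early_stopping", {})

    if not conv.get("enabled", False):
        warnings.append("convergence_detection.enabled is not true")
    if conv.get("min_rounds_before_check", 1) > total_rounds - 1:
        warnings.append("min_rounds_before_check exceeds total_rounds - 1")
    # Assumed cutoff: anything above 0.90 is flagged as likely too strict
    if conv.get("semantic_similarity_threshold", 0.85) > 0.90:
        warnings.append("semantic_similarity_threshold may be too strict")
    # Assumed cutoff: anything above 0.50 is flagged as likely too high
    if conv.get("divergence_threshold", 0.40) > 0.50:
        warnings.append("divergence_threshold may be too high")
    if not early.get("enabled", False):
        warnings.append("early_stopping.enabled is not true")
    return warnings


misconfigured = {
    "convergence_detection": {
        "enabled": True,
        "semantic_similarity_threshold": 0.95,
        "divergence_threshold": 0.70,
        "min_rounds_before_check": 2,
    },
    "early_stopping": {"enabled": True, "threshold": 0.66},
}
problems = check_convergence_config(misconfigured, total_rounds=2)
for problem in problems:
    print("WARNING:", problem)
```

With a 2-round deliberation, this example flags the round constraint and both thresholds.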
Step 3: Check Backend Selection
The system auto-selects the best available backend:
- SentenceTransformerBackend (best) - requires sentence-transformers
- TFIDFBackend (good) - requires scikit-learn
- JaccardBackend (fallback) - zero dependencies (word overlap)
Check what's installed:
python -c "import sentence_transformers; print('✓ SentenceTransformer available')" 2>/dev/null || echo "✗ SentenceTransformer not available"
python -c "import sklearn; print('✓ TF-IDF available')" 2>/dev/null || echo "✗ TF-IDF not available"
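The same check can be done from Python without importing the heavy packages. The sketch below mirrors the preference order listed above; `expected_backend` and `BACKEND_REQUIREMENTS` are hypothetical names, and the project's own selection logic in deliberation/convergence.py remains the authority on what is actually chosen.

```python
from importlib.util import find_spec
from typing import Callable

# Preference order taken from the list above; None means no extra dependency.
BACKEND_REQUIREMENTS = [
    ("SentenceTransformerBackend", "sentence_transformers"),
    ("TFIDFBackend", "sklearn"),
    ("JaccardBackend", None),
]


def module_installed(module: str) -> bool:
    """True if the top-level module can be found, without importing it."""
    return find_spec(module) is not None


def expected_backend(is_available: Callable[[str], bool] = module_installed) -> str:
    """Name of the first backend whose dependency is available."""
    for backend, module in BACKEND_REQUIREMENTS:
        if module is None or is_available(module):
            return backend
    return "JaccardBackend"  # not reached with the list above; kept as a safe default


print(f"Expected backend: {expected_backend()}")
```

If the printed name differs from the backend reported in the server logs, the server is probably running in a different Python environment from the one being inspected.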
Check what backend was used:
# Look at server logs
tail -50 mcp_server.log | grep -i "backend\|similarity"
Expected log output:
INFO - ConvergenceDetector initialized with SentenceTransformerBackend
INFO - Using SentenceTransformerBackend (best accuracy)
If using Jaccard (fallback):
- Similarity scores will be lower (word overlap only)
- Semantic paraphrasing NOT detected
- Consider installing optional dependencies:
# Install enhanced backends
pip install -r requirements-optional.txt
# Or individually
pip install sentence-transformers  # Best
pip install scikit-learn  # Good
Step 4: Debug Semantic Similarity Scores
Scenario: Responses look identical but similarity is low
Possible causes:
- Using Jaccard backend (doesn't understand semantics)
- Responses have different formatting/structure
- Models added different examples/details
Test similarity manually:
# Create test script: test_similarity.py
from deliberation.convergence import ConvergenceDetector
from models.config import load_config
config = load_config("config.yaml")
detector = ConvergenceDetector(config)
text1 = "I prefer TypeScript for type safety and better tooling"
text2 = "TypeScript is better because it has types and good IDE support"
score = detector.backend.compute_similarity(text1, text2)
print(f"Similarity: {score:.2%}")
print(f"Backend: {detector.backend.__class__.__name__}")
Run it:
python test_similarity.py
Expected results by backend:
- SentenceTransformer: 75-85% (understands semantic similarity)
- TF-IDF: 50-65% (word importance weighting)
- Jaccard: 30-45% (simple word overlap)
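To see why word overlap scores paraphrases so much lower than a semantic backend does, it helps to compute Jaccard similarity by hand. This is a standalone sketch of the textbook formula, not the project's JaccardBackend, which may tokenize or normalize differently and so produce different numbers for the same texts.

```python
import re


def jaccard_similarity(a: str, b: str) -> float:
    """Word-overlap similarity: |A & B| / |A | B| over lowercase word sets."""
    words_a = set(re.findall(r"\w+", a.lower()))
    words_b = set(re.findall(r"\w+", b.lower()))
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


text1 = "I prefer TypeScript for type safety and better tooling"
text2 = "TypeScript is better because it has types and good IDE support"

score = jaccard_similarity(text1, text2)
print(f"Jaccard similarity: {score:.2%}")
```

Only exact word matches count, so "type" and "types" or "tooling" and "IDE support" contribute nothing even though a reader treats them as the same point.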
Step 5: Debug Voting vs Semantic Status Conflicts
The voting outcome ALWAYS overrides semantic similarity status when votes are present.
Status precedence (highest to lowest):
- unanimous_consensus - All models voted for the same option
- majority_decision - 2+ models agreed (e.g., 2-1 vote)
- tie - Equal votes for all options (e.g., 1-1-1)
- Semantic status - Only used if no votes present:
  - converged - All participants ≥85% similar
  - refining - Between 40% and 85% similarity
  - diverging - Any participant <40% similar
  - impasse - Stable disagreement over multiple rounds
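The precedence rules above can be sketched as a single function. This is an illustration under stated assumptions rather than the logic in convergence.py: `resolve_status` is a hypothetical name, a "majority" is taken to mean one option with strictly more votes than any other, and impasse is left out because it depends on history across rounds.

```python
from collections import Counter
from typing import Sequence


def resolve_status(
    votes: Sequence[str],
    min_similarity: float,
    convergence_threshold: float = 0.85,
    divergence_threshold: float = 0.40,
) -> str:
    """Voting outcome wins when votes exist; otherwise fall back to similarity."""
    if votes:
        tally = Counter(votes).most_common()
        if len(tally) == 1:
            return "unanimous_consensus"
        # Assumption: a strict plurality counts as a majority decision
        if tally[0][1] > tally[1][1]:
            return "majority_decision"
        return "tie"
    if min_similarity >= convergence_threshold:
        return "converged"
    if min_similarity < divergence_threshold:
        return "diverging"
    return "refining"


# Low similarity is ignored once votes are present
print(resolve_status(["TypeScript", "TypeScript", "JavaScript"], min_similarity=0.30))
# No votes: the semantic status is reported
print(resolve_status([], min_similarity=0.6845))
```

This is why a transcript can show a low similarity score next to a majority_decision status without the two being in conflict.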
Check if votes are being parsed:
# Search transcript for VOTE markers
grep -A 5 "VOTE:" "$(ls -t transcripts/*.md | head -1)"
Expected vote format in model responses:
VOTE: {"option": "TypeScript", "confidence": 0.85, "rationale": "Type safety is crucial", "continue_debate": false}
If votes aren't being parsed:
- Check that models are outputting exact VOTE: {json} format
- Verify JSON is valid (use online JSON validator)
- Check logs for parsing errors:
grep -i "vote\|parse" mcp_server.log
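A vote line can also be validated locally instead of pasting it into an online tool. The sketch below uses only the standard library; `extract_vote` is a hypothetical helper, and the project's real parser may accept or reject inputs differently.

```python
import json
from typing import Optional

VOTE_MARKER = "VOTE:"


def extract_vote(response_text: str) -> Optional[dict]:
    """Return the first JSON object following a VOTE: marker, or None."""
    marker_at = response_text.find(VOTE_MARKER)
    if marker_at == -1:
        return None
    # raw_decode does not skip leading whitespace, so strip it first
    payload = response_text[marker_at + len(VOTE_MARKER):].lstrip()
    try:
        # Parses one JSON value and ignores any text after it
        vote, _ = json.JSONDecoder().raw_decode(payload)
    except json.JSONDecodeError:
        return None
    return vote if isinstance(vote, dict) else None


response = (
    "TypeScript catches more bugs before runtime.\n"
    'VOTE: {"option": "TypeScript", "confidence": 0.85, '
    '"rationale": "Type safety is crucial", "continue_debate": false}\n'
)
vote = extract_vote(response)
print(vote)
print("continue_debate present:", vote is not None and "continue_debate" in vote)
```

A `None` result for a response that visibly contains a vote usually means the JSON is malformed, for example single quotes or a trailing comma.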
Step 6: Debug Early Stopping
Early stopping requires:
- early_stopping.enabled: true in config
- At least threshold fraction of models set continue_debate: false
- Current round ≥ defaults.rounds (if respect_min_rounds: true)
Example: 3 models, threshold 0.66 (66%)
- Round 1: All say continue_debate: true → continues
- Round 2: 2 models say continue_debate: false → stops (2/3 = 66.7%)
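The three requirements and the worked example can be expressed as a short decision function. This is a sketch of the rule as described here, not the engine's code; `should_stop_early` and its parameter names are hypothetical, with `min_rounds` standing in for defaults.rounds.

```python
from typing import Sequence


def should_stop_early(
    continue_flags: Sequence[bool],
    threshold: float,
    current_round: int,
    min_rounds: int,
    respect_min_rounds: bool = True,
    enabled: bool = True,
) -> bool:
    """True when enough models set continue_debate to false."""
    if not enabled or not continue_flags:
        return False
    if respect_min_rounds and current_round < min_rounds:
        return False
    want_stop = sum(1 for flag in continue_flags if not flag)
    return want_stop / len(continue_flags) >= threshold


# Round 1: all three models want to continue
print(should_stop_early([True, True, True], 0.66, current_round=1, min_rounds=1))
# Round 2: two of three want to stop, and 2/3 is at least 0.66
print(should_stop_early([False, False, True], 0.66, current_round=2, min_rounds=2))
```

Note that a threshold written as 0.67 would not be met by 2 of 3 models, since 2/3 is slightly below it.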
Debug steps:
- Check if enabled:
grep -A 3 "early_stopping:" config.yaml
- Check model votes in transcript:
# Look for continue_debate flags
grep -i "continue_debate" "$(ls -t transcripts/*.md | head -1)"
- Check logs for early stop decision:
grep -i "early stop\|continue_debate" mcp_server.log | tail -20
- Common issues:
| Problem | Cause | Solution |
|---|---|---|
| Models want to stop but deliberation continues | respect_min_rounds: true and not at min rounds yet | Wait for min rounds or set to false |
| Threshold not met | Only 1/3 models want to stop (33% < 66%) | Need 2/3 consensus |
| Not enabled | enabled: false | Set to true |
| Models not outputting flag | Vote JSON missing continue_debate field | Add to model prompts |
Step 7: Debug Impasse Detection
Impasse = stable disagreement over multiple rounds
Requirements:
- Status is diverging (min_similarity < 0.40)
- consecutive_stable_rounds threshold reached (default: 2)
Check impasse logic:
# Read convergence.py lines 380-385
# Impasse is only detected if diverging AND stable
Common issue: Never reaches impasse
- Models keep changing positions → not stable
- Divergence threshold too low → never marks as "diverging"
- Need at least 2-3 rounds of consistent disagreement
Manual check:
# Look at similarity scores across rounds
grep "Minimum Similarity" "$(ls -t transcripts/*.md | head -1)"
If similarity jumps around (45% → 25% → 60%):
→ Models aren't stable, impasse won't trigger
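The per-round minimum similarities from the grep above can be fed into a simple check. This sketch approximates "stable" as "diverging in each of the last consecutive_stable_rounds rounds"; that reading is an assumption, `is_impasse` is a hypothetical name, and the actual condition in convergence.py may be defined differently.

```python
from typing import Sequence


def is_impasse(
    min_similarities: Sequence[float],
    divergence_threshold: float = 0.40,
    consecutive_stable_rounds: int = 2,
) -> bool:
    """True when each of the last N rounds fell below the divergence threshold."""
    if consecutive_stable_rounds < 1 or len(min_similarities) < consecutive_stable_rounds:
        return False
    recent = min_similarities[-consecutive_stable_rounds:]
    return all(score < divergence_threshold for score in recent)


# Scores jump around: the latest round is not diverging, so no impasse
print(is_impasse([0.45, 0.25, 0.60]))
# Two consecutive diverging rounds: impasse under this sketch's definition
print(is_impasse([0.45, 0.30, 0.25]))
```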
Step 8: Performance Diagnostics
If convergence detection is slow:
- Check if SentenceTransformer is downloading models:
# First run downloads ~500MB model
tail -f mcp_server.log | grep -i "loading\|download"
- Model is cached after first load:
  - Subsequent deliberations are instant (model reused from memory)
  - Cache is per-process (each server restart reloads)
- Check computation time:
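# Add timing to test script (reuses detector, text1, text2 from test_similarity.py)
import time
start = time.perf_counter()  # monotonic, high-resolution clock intended for timing
score = detector.backend.compute_similarity(text1, text2)
elapsed = time.perf_counter() - start
print(f"Computation time: {elapsed*1000:.2f}ms")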
Expected times:
- SentenceTransformer: 50-200ms per comparison (first run slower)
- TF-IDF: 10-50ms per comparison
- Jaccard: <1ms per comparison
Quick Reference
Configuration Parameters
# Convergence thresholds
semantic_similarity_threshold: 0.85  # Range: 0.0-1.0, higher = stricter
divergence_threshold: 0.40  # Range: 0.0-1.0, lower = fewer rounds marked "diverging"
# Round constraints
min_rounds_before_check: 1  # Must be <= (total_rounds - 1)
consecutive_stable_rounds: 2  # Stability requirement
# Early stopping
early_stopping.threshold: 0.66  # Fraction of models needed (0.5 = majority)
early_stopping.respect_min_rounds: true  # Honor defaults.rounds minimum
Status Definitions
| Status | Meaning | Similarity Range |
|---|---|---|
| converged | All participants agree | ≥85% (by default) |
| refining | Moderate agreement | 40-85% |
| diverging | Low agreement | <40% |
| impasse | Stable disagreement | <40% for 2+ rounds |
| unanimous_consensus | All voted same (overrides semantic) | N/A (voting) |
| majority_decision | 2+ voted same (overrides semantic) | N/A (voting) |
| tie | Equal votes (overrides semantic) | N/A (voting) |
Common Fixes
Problem: No convergence info in transcript
# Fix: Lower min_rounds_before_check
min_rounds_before_check: 1  # For 2-round deliberations
Problem: Never converges despite identical responses
# Fix: Install better backend
pip install sentence-transformers
Problem: Early stopping not working
# Fix: Check these settings
early_stopping:
  enabled: true
  threshold: 0.66
  respect_min_rounds: false  # Allow stopping before min rounds
Problem: Everything marked "diverging"
# Fix: Lower divergence threshold
divergence_threshold: 0.40  # Default (not 0.70)
Files to Check
- Config: /Users/harrison/Github/ai-counsel/config.yaml
- Engine: /Users/harrison/Github/ai-counsel/deliberation/convergence.py
- Transcripts: /Users/harrison/Github/ai-counsel/transcripts/*.md
- Logs: /Users/harrison/Github/ai-counsel/mcp_server.log
- Schema: /Users/harrison/Github/ai-counsel/models/schema.py (Vote models)
- Config models: /Users/harrison/Github/ai-counsel/models/config.py
Testing Convergence Detection
Create integration test:
# tests/integration/test_convergence_debug.py
import pytest
from deliberation.convergence import ConvergenceDetector
from models.config import load_config
from models.schema import RoundResponse, Participant
def test_convergence_identical_responses():
    """Test that identical responses trigger convergence."""
    config = load_config("config.yaml")
    detector = ConvergenceDetector(config)
    # Create identical responses
    round1 = [
        RoundResponse(participant="claude", response="TypeScript is best", vote=None),
        RoundResponse(participant="codex", response="TypeScript is best", vote=None),
    ]
    round2 = [
        RoundResponse(participant="claude", response="TypeScript is best", vote=None),
        RoundResponse(participant="codex", response="TypeScript is best", vote=None),
    ]
    result = detector.check_convergence(round2, round1, round_number=2)
    assert result is not None, "Should check convergence at round 2"
    assert result.avg_similarity > 0.90, f"Identical responses should be >90% similar, got {result.avg_similarity}"
    print(f"Backend: {detector.backend.__class__.__name__}")
    print(f"Similarity: {result.avg_similarity:.2%}")
    print(f"Status: {result.status}")
Run test:
pytest tests/integration/test_convergence_debug.py -v -s
Advanced Debugging
Enable Debug Logging
# Add to deliberation/engine.py or server.py
import logging
logging.basicConfig(level=logging.DEBUG)
Inspect Backend State
# In Python shell or test
from deliberation.convergence import ConvergenceDetector
from models.config import load_config
config = load_config("config.yaml")
detector = ConvergenceDetector(config)
print(f"Backend: {detector.backend.__class__.__name__}")
print(f"Threshold: {detector.config.semantic_similarity_threshold}")
print(f"Min rounds: {detector.config.min_rounds_before_check}")
print(f"Consecutive stable: {detector.config.consecutive_stable_rounds}")
Compare All Backends
# test_all_backends.py
from deliberation.convergence import (
    JaccardBackend,
    TFIDFBackend,
    SentenceTransformerBackend,
)
text1 = "I prefer TypeScript for type safety"
text2 = "TypeScript is better because it has types"
# Jaccard has no dependencies, so it is always available
backends = {
    "Jaccard": JaccardBackend(),
}
try:
    backends["TF-IDF"] = TFIDFBackend()
except ImportError:
    print("TF-IDF not available")
try:
    backends["SentenceTransformer"] = SentenceTransformerBackend()
except ImportError:
    print("SentenceTransformer not available")
# Score the same pair with every backend that loaded
for name, backend in backends.items():
    score = backend.compute_similarity(text1, text2)
    print(f"{name:20s}: {score:.2%}")
Summary
Always check in order:
- ✅ Transcript - Does convergence info appear?
- ✅ Config - Are thresholds reasonable?
- ✅ Backend - Is the best backend installed?
- ✅ Voting - Are votes being parsed correctly?
- ✅ Early stopping - Is it enabled and configured correctly?
- ✅ Logs - Any errors or warnings?
Most common fixes:
- Lower min_rounds_before_check to 1 for short deliberations
- Install sentence-transformers for better semantic detection
- Set early_stopping.respect_min_rounds: false for faster stopping
- Lower semantic_similarity_threshold from 0.95 to 0.85
- Check that models output valid VOTE: JSON with continue_debate field