Your AI agent made a bad decision. A customer is unhappy. Leadership wants answers.
"Why did this happen?" is a harder question for AI agents than for traditional software. The behavior isn't deterministic. The reasoning isn't explicit. The inputs aren't always logged.
This is why debugging AI agents requires a different approach.
The Debugging Challenge
Traditional debugging: Read the stack trace, find the bug, fix the code.
AI agent debugging:
The failure could be in the input, the model, the prompt, the execution, or the outcome measurement. Often it's a combination.
The Five Whys for AI
Start with the symptom, dig to the cause:
Symptom: Customer received an inappropriate refund denial
1. Why was the refund denied? The agent decided the request was outside policy.
2. Why did the agent think it was outside policy? The order was 94 days old, exceeding the 90-day window.
3. Why didn't the agent escalate? Confidence was 0.92, above the 0.85 escalation threshold.
4. Why was confidence so high for a borderline case? The prompt didn't emphasize edge case uncertainty.
5. Why doesn't the prompt address edge cases? This scenario wasn't in the test suite.
Root cause: Missing test coverage for edge cases near policy boundaries.
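The root cause points directly at the fix: tests that pin down behavior at the policy boundary. A minimal sketch, assuming a `decide_refund` wrapper around the agent; the policy logic here is a stand-in, not the real agent:

```python
def decide_refund(age_days, exceptions_allowed=True):
    # Stand-in policy: inside the window -> approve; slightly outside
    # with exceptions allowed -> escalate; otherwise deny.
    if age_days <= 90:
        return "approve"
    if age_days <= 100 and exceptions_allowed:
        return "escalate"
    return "deny"

# Boundary cases that were missing from the test suite.
assert decide_refund(89) == "approve"
assert decide_refund(90) == "approve"
assert decide_refund(94) == "escalate"   # the case that failed in production
assert decide_refund(101) == "deny"
```

Tests like these turn "edge case near the boundary" from a vague worry into a regression gate.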
The Debugging Stack
Layer by layer investigation:
Layer 1: Outcome Analysis
What actually happened?
{
"investigation": "case-4892",
"outcome": {
"expected": "refund approved or escalated",
"actual": "refund denied",
"customer_impact": "high",
"escalation_triggered": false
}
}
Layer 2: Decision Analysis
What decision was made and why?
{
"decision": {
"action": "deny_refund",
"confidence": 0.92,
"reasoning": "Order exceeds 90-day return window",
"alternatives_considered": [
{"action": "approve", "score": 0.05},
{"action": "escalate", "score": 0.03}
]
}
}
Layer 3: Context Analysis
What information was available?
{
"context": {
"customer": {
"tenure_years": 5,
"lifetime_value": 12000,
"satisfaction_history": "excellent"
},
"order": {
"age_days": 94,
"amount": 127.50,
"product_condition": "unopened"
},
"policy": {
"return_window_days": 90,
"exceptions_allowed": true
}
}
}
Wait: exceptions_allowed is true. The agent had the information it needed to make an exception but didn't use it.
Layer 4: Prompt Analysis
What instructions guided the decision?
# Current Prompt (problematic)
You are a refund agent. Apply the 90-day return policy strictly.
# Better Prompt
You are a refund agent. The standard return window is 90 days.
For orders slightly outside the window (90-100 days), consider:
- Customer history and lifetime value
- Product condition
- Reason for delay
If in doubt, escalate to human review.
Layer 5: Model Analysis
Did the model behave as expected given the prompt?
Compare with similar cases:
- Same prompt, similar input, different output?
- Different prompt, same input, better output?
The Debugging Checklist
Work through systematically:
- Reproduce the issue
  - Same input produces same output?
  - Consistent across multiple runs?
- Verify the context
  - All relevant information was available?
  - Information was correctly formatted?
  - No data corruption or staleness?
- Analyze the decision
  - Decision matches the reasoning?
  - Confidence level appropriate?
  - Alternatives reasonably scored?
- Review the prompt
  - Instructions clear for this scenario?
  - Edge cases addressed?
  - Escalation criteria defined?
- Test the model
  - Model capable of correct answer?
  - Temperature/parameters appropriate?
  - Model version as expected?
- Check the execution
  - Decision executed correctly?
  - Side effects as intended?
  - Error handling worked?
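The first checklist item can be automated. A sketch of a reproduction check, assuming a `decide` callable that returns the agent's action and confidence:

```python
def reproduce(decide, context, runs=3):
    """Run the same decision several times and report whether it is stable.
    `decide` is an assumed callable returning {'action': ..., 'confidence': ...}."""
    outputs = [decide(context) for _ in range(runs)]
    actions = {o["action"] for o in outputs}
    return {
        "consistent": len(actions) == 1,
        "actions": sorted(actions),
        "confidence_range": (
            min(o["confidence"] for o in outputs),
            max(o["confidence"] for o in outputs),
        ),
    }

# Deterministic stand-in agent for illustration.
report = reproduce(
    lambda ctx: {"action": "deny_refund", "confidence": 0.92},
    {"order": {"age_days": 94}},
)
```

If the report comes back inconsistent, the problem is at least partly sampling variance, and the rest of the checklist should be run against multiple transcripts, not one.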
Building Debug Tooling
Essential debugging capabilities:
Decision Replay
Re-run a decision with the same context:
empress debug replay --action-id act-4892-a7b8c9
# Output:
Original decision: deny_refund (confidence: 0.92)
Replay decision: deny_refund (confidence: 0.91)
Result: Consistent reproduction
Prompt Experimentation
Test alternative prompts:
empress debug prompt-test \
--action-id act-4892-a7b8c9 \
--prompt-variant prompts/refund-v2.md
# Output:
Original prompt: deny_refund
New prompt: escalate_to_human
Confidence delta: -0.34
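Under the hood, a prompt test is just the same context run through two prompts with the results diffed. A sketch, with `fake_decide` standing in for the real decision harness:

```python
def prompt_test(decide, context, original_prompt, variant_prompt):
    """Re-run one decision under two prompts and diff the outcomes.
    `decide(prompt, context)` is an assumed harness entry point."""
    before = decide(original_prompt, context)
    after = decide(variant_prompt, context)
    return {
        "original": before["action"],
        "variant": after["action"],
        "confidence_delta": round(after["confidence"] - before["confidence"], 2),
        "changed": before["action"] != after["action"],
    }

# Stand-in harness: the variant prompt makes the borderline case escalate.
def fake_decide(prompt, context):
    if "escalate" in prompt and context["order"]["age_days"] > 90:
        return {"action": "escalate_to_human", "confidence": 0.58}
    return {"action": "deny_refund", "confidence": 0.92}

result = prompt_test(
    fake_decide,
    {"order": {"age_days": 94}},
    "Apply the 90-day return policy strictly.",
    "If in doubt, escalate to human review.",
)
```

The key property is that only the prompt varies; context and model parameters stay fixed, so any change in the output is attributable to the prompt.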
Counterfactual Analysis
What if the context was different?
empress debug counterfactual \
--action-id act-4892-a7b8c9 \
--modify "order.age_days=89"
# Output:
Original: deny_refund
Counterfactual: approve_refund
Boundary identified: order.age_days=90
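The boundary search behind a counterfactual run can be sketched as a bisection over one context field, assuming the decision is monotone in that field and that `decide` takes a context dict and returns an action string:

```python
def find_boundary(decide, context, field, lo, hi):
    """Binary-search the smallest value of `field` at which the decision
    flips away from its value at `lo`. Assumes monotone behavior in `field`
    and that the decision at `hi` differs from the decision at `lo`."""
    base = decide({**context, field: lo})
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if decide({**context, field: mid}) == base:
            lo = mid
        else:
            hi = mid
    return hi

# Stand-in decision rule mirroring the 90-day window.
boundary = find_boundary(
    lambda ctx: "approve" if ctx["age_days"] <= 90 else "deny",
    {}, "age_days", 80, 100,
)
# boundary is the first age at which the decision flips
```

Bisection keeps this cheap: roughly log2 of the search range in re-runs, instead of one run per candidate value.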
Batch Analysis
Find similar cases:
empress debug similar \
--action-id act-4892-a7b8c9 \
--limit 50
# Output:
Found 12 similar decisions
- 8 correctly handled
- 4 potential issues
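Finding similar cases can start with a crude distance over the decision context. A toy sketch using two numeric fields; a real system would use richer features and embeddings:

```python
def similar_cases(target, history, limit=50):
    """Rank past cases by a crude context distance to the target case."""
    def distance(case):
        # Weight order age in days directly; scale amount so a $100
        # difference counts like one day (an arbitrary assumption).
        return (abs(case["age_days"] - target["age_days"])
                + abs(case["amount"] - target["amount"]) / 100)
    return sorted(history, key=distance)[:limit]

history = [
    {"id": 1, "age_days": 94, "amount": 130.00},
    {"id": 2, "age_days": 30, "amount": 50.00},
    {"id": 3, "age_days": 92, "amount": 120.00},
]
closest = similar_cases({"age_days": 94, "amount": 127.50}, history, limit=2)
```

Even a rough ranking like this is enough to answer the triage question: is this failure a one-off, or the visible edge of a cluster?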
Pattern Recognition
Common failure patterns:
The Confidence Trap
High confidence in wrong decision because:
- Training data didn't include edge cases
- Prompt encourages certainty
- Model overconfident on unfamiliar patterns
Fix: Add uncertainty calibration, lower confidence thresholds for edge cases.
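One way to sketch that fix is a context-sensitive escalation threshold: demand more confidence near the policy boundary. The 80-100 day band and the 0.95 bar below are illustrative assumptions, not real policy values:

```python
def escalation_threshold(context, base=0.85):
    """Raise the confidence bar near the 90-day boundary so borderline
    cases escalate instead of being auto-decided (assumed heuristic)."""
    age = context["order"]["age_days"]
    return 0.95 if 80 <= age <= 100 else base

def should_escalate(decision, context):
    return decision["confidence"] < escalation_threshold(context)

# The incident case: 0.92 confidence now falls below the 0.95 bar.
```

With this in place, case #4892 (confidence 0.92, order age 94 days) would have escalated rather than auto-denied.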
The Context Gap
Decision made without crucial information:
- Data not passed to agent
- Data format unexpected
- Data stale or corrupted
Fix: Validate context completeness before decisions.
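A completeness check can be a simple schema walk before any decision is made. The required-field schema below is an assumption for illustration:

```python
REQUIRED_FIELDS = {  # assumed schema for refund decisions
    "customer": ["tenure_years", "lifetime_value"],
    "order": ["age_days", "amount"],
    "policy": ["return_window_days", "exceptions_allowed"],
}

def validate_context(context):
    """Return the list of missing fields; only decide when it is empty."""
    missing = []
    for section, fields in REQUIRED_FIELDS.items():
        data = context.get(section, {})
        missing += [f"{section}.{f}" for f in fields if f not in data]
    return missing
```

Gating the decision on an empty `missing` list turns a silent context gap into a visible, loggable failure.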
The Prompt Blindspot
Prompt doesn't address the scenario:
- Edge case not anticipated
- Instructions ambiguous
- Conflicting guidance
Fix: Expand prompt coverage, add explicit edge case handling.
The Execution Failure
Correct decision, wrong action:
- API call failed silently
- Side effect not completed
- State not updated
Fix: Improve execution validation and error handling.
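A sketch of that execution validation: perform the action, then read back state and fail loudly if the side effect is missing. `api_call` and `verify` are assumed integration points, and the in-memory store stands in for the real refund system:

```python
def execute_and_verify(action, api_call, verify):
    """Execute a decision, then confirm its side effect actually landed.
    Raising here prevents the silent-failure pattern described above."""
    api_call(action)
    if not verify(action):
        raise RuntimeError(f"side effect for {action['type']} not observed")
    return True

# Toy in-memory store standing in for the real system of record.
store = {}
ok = execute_and_verify(
    {"type": "refund", "order_id": "4892"},
    lambda a: store.update({a["order_id"]: "refunded"}),
    lambda a: store.get(a["order_id"]) == "refunded",
)
```

The verification reads back from the system of record rather than trusting the API call's return value, which is what catches the "call succeeded, state unchanged" class of failures.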
Post-Incident Documentation
After debugging, document:
## Incident: Case #4892
**Summary**: Customer denied refund for 94-day-old order despite policy allowing exceptions.
**Root Cause**: Prompt instructed strict policy application without guidance for edge cases.
**Impact**: 1 customer affected, CSAT -1, remediation completed.
**Fix**: Updated prompt to include edge case handling. Added escalation trigger for orders 90-100 days old.
**Prevention**: Added test cases for policy boundary scenarios.
**Timeline**:
- 14:32 Decision made
- 15:47 Customer complaint received
- 16:20 Investigation started
- 17:10 Root cause identified
- 18:00 Fix deployed
The Empress Approach
Empress provides debugging tools built for AI agents:
- Full context capture at decision time
- Decision replay with identical inputs
- Prompt experimentation sandbox
- Counterfactual analysis for boundary testing
- Pattern detection across similar failures
When things go wrong, you have the tools to understand why, and the data to prevent recurrence.
Debugging AI agents is hard. Having the right observability makes it systematic.