Traditional observability was built for traditional software. It tracks:
- Requests and responses
- Errors and exceptions
- Latency and throughput
- Resource utilization
For AI agents, this misses the point entirely.
The Fundamental Difference
Traditional software executes deterministic logic. Given the same input, it produces the same output. Observability focuses on whether it executed correctly.
AI agents make decisions. Given the same input, they might choose differently based on context, confidence, and reasoning. Observability must focus on what they decided and why.
What Is a Decision?
A decision occurs when an agent:
- Has multiple possible actions
- Evaluates them against some criteria
- Chooses one over the others
- Acts on that choice
```json
{
  "type": "decision",
  "timestamp": "2025-03-02T14:23:17Z",
  "agent": "customer-support-agent",
  "context": {
    "customer_id": "cust_8x7k2m",
    "issue_type": "billing_dispute",
    "customer_sentiment": "frustrated",
    "account_value": 12500,
    "previous_interactions": 3
  },
  "options": [
    {"action": "deny_refund", "confidence": 0.23},
    {"action": "partial_refund", "confidence": 0.31},
    {"action": "full_refund", "confidence": 0.67},
    {"action": "escalate_to_human", "confidence": 0.45}
  ],
  "decision": "full_refund",
  "reasoning": "High-value customer with legitimate concern. Cost of churn exceeds refund amount.",
  "confidence": 0.67
}
```
This single decision record tells you more than 10,000 action logs ever could.
Why Decisions Matter More Than Actions
Actions Are Implementation Details
Knowing that an agent "called the refund API" tells you nothing about whether it should have.
Decisions Reveal Reasoning
Understanding why an agent chose an action lets you evaluate whether that reasoning is sound.
Decisions Are Auditable
When a regulator asks "why did your AI do this?", you need to explain the decision, not the API call.
Decisions Enable Improvement
You can't improve what you don't understand. Decision logs let you:
- Identify patterns in bad decisions
- Understand confidence calibration
- Find edge cases where reasoning fails
The Decision-First Architecture
Every agent follows an observe-orient-decide-act loop. Your observability should capture:
- The decision (with full context)
- The outcome (what actually happened)
The action itself is just plumbing.
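As a sketch, the two record types this loop produces can be modeled as plain dataclasses. Field names mirror the JSON example above; they are illustrative, not Empress's exact schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    agent: str
    context: dict        # everything the agent saw when choosing
    options: list        # e.g. [{"action": ..., "confidence": ...}, ...]
    decision: str        # the chosen action
    confidence: float
    reasoning: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class Outcome:
    decision_id: str     # links back to the Decision record
    result: str          # e.g. "success" or "failure"
    metrics: dict = field(default_factory=dict)
```

Note what is absent: there is no record type for the action itself. The `Outcome` linked to the `Decision` tells you whether the choice worked, which is the part that matters.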
Implementing Decision-First Logging
Capture the Decision Point
```python
from empress import log_decision

# When your agent makes a choice
options = agent.evaluate_options(context)
chosen = agent.select_best(options)

log_decision(
    agent=agent.id,
    context=context.to_dict(),
    options=[{
        "action": opt.action,
        "confidence": opt.confidence,
        "reasoning": opt.reasoning
    } for opt in options],
    decision=chosen.action,
    confidence=chosen.confidence
)

# Then execute
result = agent.execute(chosen)
```
Capture the Outcome
```python
from empress import log_outcome

log_outcome(
    agent=agent.id,
    decision_id=decision.id,  # Link to the decision
    result=result.status,
    metrics={
        "customer_satisfied": result.csat_score,
        "resolution_time_minutes": result.duration,
        "cost_incurred": result.cost
    }
)
```
Skip the Action Details
```python
# DON'T do this
log("Calling refund API")
log("API returned 200")
log("Updating database")
log("Database updated")
log("Sending confirmation email")
log("Email sent")

# DO this
log_decision(decision)
# ... execute ...
log_outcome(outcome)
```
Decision Quality Metrics
Once you're logging decisions, you can measure decision quality:
Confidence Calibration
Are high-confidence decisions actually more likely to succeed?
```sql
SELECT
  FLOOR(confidence * 10) / 10 AS confidence_bucket,
  COUNT(*) AS decisions,
  AVG(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) AS success_rate
FROM decisions
JOIN outcomes ON decisions.id = outcomes.decision_id
GROUP BY confidence_bucket
ORDER BY confidence_bucket;
```
If 90% confidence decisions succeed 60% of the time, your agent is overconfident.
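The same bucketing can be computed directly in application code. This sketch uses hard-coded `(confidence, succeeded)` pairs as illustrative stand-ins for real decision/outcome records:

```python
from collections import defaultdict

# (confidence, succeeded) pairs -- illustrative data, not real logs
decisions = [
    (0.90, True), (0.90, False), (0.92, True),
    (0.95, False), (0.55, True), (0.50, False),
]

buckets = defaultdict(lambda: [0, 0])  # bucket -> [decision count, success count]
for confidence, succeeded in decisions:
    b = int(confidence * 10) / 10  # same bucketing as FLOOR(confidence * 10) / 10
    buckets[b][0] += 1
    buckets[b][1] += int(succeeded)

for b in sorted(buckets):
    count, wins = buckets[b]
    print(f"confidence ~{b:.1f}: {wins}/{count} succeeded ({wins / count:.0%})")
```

If the 0.9 bucket shows a success rate far below 90%, you have direct evidence of overconfidence and can feed that back into the agent's option scoring.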
Option Diversity
Are agents considering enough options?
```sql
SELECT
  agent_id,
  AVG(options_count) AS avg_options_considered
FROM decisions
GROUP BY agent_id;
```
Agents consistently considering only one option aren't really deciding.
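The same check can run over decision records in application code. Agent ids and option counts here are made up for illustration:

```python
from statistics import mean

# Each record: (agent_id, number of options the agent evaluated) -- illustrative
decisions = [
    ("support-agent", 4), ("support-agent", 3),
    ("triage-agent", 1), ("triage-agent", 1),
]

by_agent = {}
for agent_id, options_count in decisions:
    by_agent.setdefault(agent_id, []).append(options_count)

for agent_id, counts in by_agent.items():
    avg = mean(counts)
    flag = "  <- not really deciding" if avg <= 1.0 else ""
    print(f"{agent_id}: avg options considered = {avg:.1f}{flag}")
```

An average stuck at 1.0 usually means the "decision" is hard-coded upstream and the agent is rubber-stamping it.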
Reasoning Patterns
What reasoning leads to good vs bad outcomes?
This requires more sophisticated analysis, but decision logs make it possible.
The Empress Advantage
Empress is built for decision-first observability. The platform:
- Structures decisions with first-class support for options, context, and reasoning
- Links outcomes to the decisions that caused them
- Visualizes decision patterns across agents and time
- Alerts on decision anomalies, not just action failures
Traditional observability tools can store decision data, but they're not built for it. Empress is.
Getting Started
- Identify decision points in your agent logic
- Instrument decisions with full context
- Link outcomes back to decisions
- Stop logging actions that don't add insight
Your observability data will become smaller, cheaper to store, and dramatically more useful.
Focus on decisions. Everything else is noise.