Traditional observability was built for traditional software. It tracks:
- Requests and responses
- Errors and exceptions
- Latency and throughput
- Resource utilization
For AI agents, this misses the point entirely.
The Fundamental Difference
Traditional software executes deterministic logic. Given the same input, it produces the same output. Observability focuses on whether it executed correctly.
AI agents make decisions. Given the same input, they might choose differently based on context, confidence, and reasoning. Observability must focus on what they decided and why.
What Is a Decision?
A decision occurs when an agent:
- Has multiple possible actions
- Evaluates them against some criteria
- Chooses one over the others
- Acts on that choice
```json
{
  "type": "decision",
  "timestamp": "2025-03-02T14:23:17Z",
  "agent": "customer-support-agent",
  "context": {
    "customer_id": "cust_8x7k2m",
    "issue_type": "billing_dispute",
    "customer_sentiment": "frustrated",
    "account_value": 12500,
    "previous_interactions": 3
  },
  "options": [
    {"action": "deny_refund", "confidence": 0.23},
    {"action": "partial_refund", "confidence": 0.31},
    {"action": "full_refund", "confidence": 0.67},
    {"action": "escalate_to_human", "confidence": 0.45}
  ],
  "decision": "full_refund",
  "reasoning": "High-value customer with legitimate concern. Cost of churn exceeds refund amount.",
  "confidence": 0.67
}
```
This single decision record tells you more than 10,000 action logs ever could.
Why Decisions Matter More Than Actions
Actions Are Implementation Details
Knowing that an agent "called the refund API" tells you nothing about whether it should have.
Decisions Reveal Reasoning
Understanding why an agent chose an action lets you evaluate whether that reasoning is sound.
Decisions Are Auditable
When a regulator asks "why did your AI do this?", you need to explain the decision, not the API call.
Decisions Enable Improvement
You can't improve what you don't understand. Decision logs let you:
- Identify patterns in bad decisions
- Understand confidence calibration
- Find edge cases where reasoning fails
The Decision-First Architecture
Every agent follows an observe-orient-decide-act loop. Your observability should capture:
- The decision (with full context)
- The outcome (what actually happened)
The action itself is just plumbing.
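As a sketch, the two record types this loop produces can be modeled as plain dataclasses. Field names mirror the JSON example above; they are illustrative, not Empress's exact schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Decision:
    agent: str
    context: dict        # everything the agent saw when choosing
    options: list        # e.g. [{"action": ..., "confidence": ...}, ...]
    decision: str        # the chosen action
    confidence: float
    reasoning: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

@dataclass
class Outcome:
    decision_id: str     # links back to the Decision record
    result: str          # e.g. "success" or "failure"
    metrics: dict = field(default_factory=dict)
```

Note what is absent: there is no record type for the action itself. The `Outcome` linked to the `Decision` tells you whether the choice worked, which is the part that matters.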
Implementing Decision-First Logging
Capture the Decision Point
```python
from empress import log_decision

# When your agent makes a choice
options = agent.evaluate_options(context)
chosen = agent.select_best(options)

log_decision(
    agent=agent.id,
    context=context.to_dict(),
    options=[{
        "action": opt.action,
        "confidence": opt.confidence,
        "reasoning": opt.reasoning
    } for opt in options],
    decision=chosen.action,
    confidence=chosen.confidence
)

# Then execute
result = agent.execute(chosen)
```
Capture the Outcome
```python
from empress import log_outcome

log_outcome(
    agent=agent.id,
    decision_id=decision.id,  # Link to the decision
    result=result.status,
    metrics={
        "customer_satisfied": result.csat_score,
        "resolution_time_minutes": result.duration,
        "cost_incurred": result.cost
    }
)
```
Skip the Action Details
```python
# DON'T do this
log("Calling refund API")
log("API returned 200")
log("Updating database")
log("Database updated")
log("Sending confirmation email")
log("Email sent")

# DO this
log_decision(decision)
# ... execute ...
log_outcome(outcome)
```
Decision Quality Metrics
Once you're logging decisions, you can measure decision quality:
Confidence Calibration
Are high-confidence decisions actually more likely to succeed?
```sql
SELECT
  FLOOR(confidence * 10) / 10 AS confidence_bucket,
  COUNT(*) AS decisions,
  AVG(CASE WHEN outcome = 'success' THEN 1 ELSE 0 END) AS success_rate
FROM decisions
JOIN outcomes ON decisions.id = outcomes.decision_id
GROUP BY confidence_bucket
ORDER BY confidence_bucket;
```
If 90% confidence decisions succeed 60% of the time, your agent is overconfident.
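The same bucketing can be computed directly in application code. This sketch uses hard-coded `(confidence, succeeded)` pairs as illustrative stand-ins for real decision/outcome records:

```python
from collections import defaultdict

# (confidence, succeeded) pairs -- illustrative data, not real logs
decisions = [
    (0.90, True), (0.90, False), (0.92, True),
    (0.95, False), (0.55, True), (0.50, False),
]

buckets = defaultdict(lambda: [0, 0])  # bucket -> [decision count, success count]
for confidence, succeeded in decisions:
    b = int(confidence * 10) / 10  # same bucketing as FLOOR(confidence * 10) / 10
    buckets[b][0] += 1
    buckets[b][1] += int(succeeded)

for b in sorted(buckets):
    count, wins = buckets[b]
    print(f"confidence ~{b:.1f}: {wins}/{count} succeeded ({wins / count:.0%})")
```

If the 0.9 bucket shows a success rate far below 90%, you have direct evidence of overconfidence and can feed that back into the agent's option scoring.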
Option Diversity
Are agents considering enough options?
```sql
SELECT
  agent_id,
  AVG(options_count) AS avg_options_considered
FROM decisions
GROUP BY agent_id;
```
Agents consistently considering only one option aren't really deciding.
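The same check can run over decision records in application code. Agent ids and option counts here are made up for illustration:

```python
from statistics import mean

# Each record: (agent_id, number of options the agent evaluated) -- illustrative
decisions = [
    ("support-agent", 4), ("support-agent", 3),
    ("triage-agent", 1), ("triage-agent", 1),
]

by_agent = {}
for agent_id, options_count in decisions:
    by_agent.setdefault(agent_id, []).append(options_count)

for agent_id, counts in by_agent.items():
    avg = mean(counts)
    flag = "  <- not really deciding" if avg <= 1.0 else ""
    print(f"{agent_id}: avg options considered = {avg:.1f}{flag}")
```

An average stuck at 1.0 usually means the "decision" is hard-coded upstream and the agent is rubber-stamping it.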
Reasoning Patterns
What reasoning leads to good vs bad outcomes?
This requires more sophisticated analysis, but decision logs make it possible.
The Empress Advantage
Empress is built for decision-first observability. The platform:
- Structures decisions with first-class support for options, context, and reasoning
- Links outcomes to the decisions that caused them
- Visualizes decision patterns across agents and time
- Alerts on decision anomalies, not just action failures
Traditional observability tools can store decision data, but they're not built for it. Empress is.
Getting Started
- Identify decision points in your agent logic
- Instrument decisions with full context
- Link outcomes back to decisions
- Stop logging actions that don't add insight
Your observability data will become smaller, cheaper to store, and dramatically more useful.
Focus on decisions. Everything else is noise.