Real-time observability is seductive. Watching metrics update live feels powerful. Dashboards with sub-second refresh rates feel sophisticated.
But real-time observability is expensive. And for most use cases, it's unnecessary.
The Cost of Real-Time
Real-time observability requires:
- Stream processing infrastructure (Kafka, Kinesis, Flink)
- Real-time storage (time-series databases, hot storage)
- Continuous queries (always-on compute)
- WebSocket connections (persistent server resources)
Versus batch:
- Batch storage (cheap object storage)
- Scheduled processing (run only when needed)
- Periodic queries (compute on demand)
- Standard HTTP (stateless, cacheable)
Cost difference: 5-20x
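To see where a multiple in that range comes from, here is a back-of-envelope comparison. All unit prices below are hypothetical placeholders, chosen only to illustrate the shape of the math; real prices vary widely by provider and volume.

```python
# Illustrative (hypothetical) unit costs -- not real provider pricing.
EVENTS_PER_DAY = 10_000_000

# Streaming path: brokers + hot storage + always-on query compute
stream_cost_per_event = 0.000002   # ingestion + stream processing (assumed)
hot_storage_per_event = 0.000001   # time-series hot storage (assumed)
realtime_daily = EVENTS_PER_DAY * (stream_cost_per_event + hot_storage_per_event)

# Batch path: object storage + one scheduled scan
cold_storage_per_event = 0.0000001  # object storage (assumed)
batch_scan_per_event = 0.0000002    # periodic query compute (assumed)
batch_daily = EVENTS_PER_DAY * (cold_storage_per_event + batch_scan_per_event)

print(f"real-time: ${realtime_daily:.2f}/day, "
      f"batch: ${batch_daily:.2f}/day, "
      f"ratio: {realtime_daily / batch_daily:.0f}x")
```

With these placeholder numbers the ratio lands at 10x; shifting any assumption moves it within the 5-20x band.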
When Real-Time Matters
Real-time observability is justified when:
1. Immediate Action Is Required
If you need to act within seconds, you need to observe within seconds.
Examples:
- Fraud detection (block transaction before it completes)
- Safety systems (stop the robot before it crashes)
- Production incidents (alert on-call before users notice)
Test: "If I knew about this 5 minutes later, would outcomes be materially worse?"
2. Human Operators Are Watching
If someone is actively monitoring a screen, real-time feedback helps.
Examples:
- Live customer support observing agent interactions
- Operations center during high-traffic events
- Testing and debugging in development
Test: "Is there a human ready to act on this information right now?"
3. Cascading Failures Are Possible
Some failures compound rapidly. Early detection prevents escalation.
Examples:
- Agent retry storms
- Queue backlog buildup
- Resource exhaustion
Test: "Does the problem get 10x worse if I don't catch it in the first minute?"
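The three tests above can be folded into a single checklist helper. This is an illustrative sketch, not part of any real API; the parameter names simply restate the three criteria.

```python
def justify_realtime(immediate_action_required: bool,
                     human_actively_watching: bool,
                     failure_compounds_fast: bool) -> bool:
    """Real-time observability is justified if ANY of the three tests pass."""
    return (immediate_action_required
            or human_actively_watching
            or failure_compounds_fast)
```

If all three answers are "no", the event belongs on the batch path.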
When Batch Is Better
For everything else, batch observation is superior.
Performance Analysis
You don't need real-time metrics to analyze agent performance over the last week.
# Run daily, not continuously
daily_performance = empress.query("""
    SELECT agent_id,
           COUNT(*) as decisions,
           AVG(confidence) as avg_confidence,
           SUM(CASE WHEN outcome = 'success' THEN 1.0 ELSE 0 END) / COUNT(*) as success_rate
    FROM decisions
    WHERE date = CURRENT_DATE - 1
    GROUP BY agent_id
""")
Cost Attribution
Calculating costs per agent, per task, per customer—none of this needs to be real-time.
# Run weekly
cost_report = empress.query("""
    SELECT agent_id, customer_id,
           SUM(token_cost) as total_cost,
           COUNT(*) as actions
    FROM agent_actions
    WHERE date BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
    GROUP BY agent_id, customer_id
""")
Compliance Reporting
Auditors don't need real-time dashboards. They need accurate historical reports.
# Run monthly
compliance_report = empress.query("""
    SELECT decision_type,
           COUNT(*) as total,
           SUM(CASE WHEN human_reviewed THEN 1 ELSE 0 END) as human_reviewed,
           SUM(CASE WHEN human_reviewed THEN 1.0 ELSE 0 END) / COUNT(*) as review_rate
    FROM decisions
    WHERE date BETWEEN DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
                   AND DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 day'
    GROUP BY decision_type
""")
Trend Analysis
Trends are, by definition, not real-time. They emerge over time.
# Run daily
trends = empress.query("""
    SELECT DATE_TRUNC('day', timestamp) as day,
           AVG(confidence) as avg_confidence
    FROM decisions
    WHERE timestamp > CURRENT_DATE - 30
    GROUP BY day
    ORDER BY day
""")
The Hybrid Approach
Most organizations need both. The key is using each appropriately.
Real-Time Layer
- Scope: Errors, anomalies, safety-critical events
- Volume: <1% of total events
- Retention: Hours to days (hot storage)
- Cost: Higher, but limited volume
Batch Layer
- Scope: All decisions, outcomes, audit trail
- Volume: 100% of meaningful events
- Retention: Months to years (cold storage)
- Cost: Lower, handles full volume
Implementation Pattern
At Event Time
def on_agent_event(event):
    # Always: write to batch storage (cheap)
    batch_store.write(event)

    # Conditionally: send to real-time pipeline (expensive)
    if requires_realtime(event):
        realtime_pipeline.send(event)

def requires_realtime(event):
    return (
        event.type == "error" or
        event.type == "anomaly" or
        event.confidence < 0.5 or  # Low confidence decisions
        event.requires_human_approval
    )
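To make the routing predicate concrete, here is how it behaves on a few stub events. The `Event` class is an assumed stand-in modeling only the fields the predicate reads, and the predicate is restated so the snippet runs on its own.

```python
from dataclasses import dataclass

# Minimal stand-in for the event object the handler receives;
# only the fields the routing predicate reads are modeled.
@dataclass
class Event:
    type: str
    confidence: float = 1.0
    requires_human_approval: bool = False

def requires_realtime(event):
    return (
        event.type == "error" or
        event.type == "anomaly" or
        event.confidence < 0.5 or  # Low confidence decisions
        event.requires_human_approval
    )

# Routine high-confidence decisions stay on the cheap batch path
assert requires_realtime(Event(type="decision", confidence=0.9)) is False
# Errors and low-confidence decisions take the real-time path
assert requires_realtime(Event(type="error")) is True
assert requires_realtime(Event(type="decision", confidence=0.3)) is True
```

In practice the vast majority of events fail every condition, which is what keeps the real-time layer under 1% of total volume.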
Dashboard Design
// Real-time: Only critical metrics
const realtimeMetrics = [
  'active_errors',
  'decisions_pending_human_review',
  'queue_depth'
];

// Batch: Everything else
const batchMetrics = [
  'daily_decision_count',
  'weekly_success_rate',
  'cost_per_task',
  'agent_performance_ranking'
];
Alert Configuration
# Real-time alerts (immediate notification)
realtime_alerts:
  - name: "Error Rate Spike"
    condition: "error_rate > 0.05"
    latency: "< 30 seconds"
  - name: "Human Review Backlog"
    condition: "pending_reviews > 100"
    latency: "< 1 minute"

# Batch alerts (daily digest)
batch_alerts:
  - name: "Low Confidence Trend"
    condition: "avg_confidence < 0.7 for 3 days"
    schedule: "daily"
  - name: "Cost Anomaly"
    condition: "daily_cost > 2x rolling_avg"
    schedule: "daily"
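A batch alert condition like "avg_confidence < 0.7 for 3 days" reduces to a scan over daily aggregates. The sketch below shows how a daily job might evaluate it; the function and field names are illustrative, not part of any real alerting engine.

```python
def low_confidence_trend(daily_averages, threshold=0.7, days=3):
    """Fire only when avg_confidence stays under the threshold for
    `days` consecutive days.

    daily_averages: list of (day, avg_confidence) tuples, ordered by day.
    """
    if len(daily_averages) < days:
        return False
    recent = daily_averages[-days:]
    return all(avg < threshold for _, avg in recent)

# One healthy day within the window keeps the alert quiet
assert low_confidence_trend([("d1", 0.80), ("d2", 0.65), ("d3", 0.60)]) is False
# Three consecutive low days trigger it
assert low_confidence_trend([("d1", 0.65), ("d2", 0.68), ("d3", 0.60)]) is True
```

Because the check runs once per day against batch aggregates, it costs a single query rather than a continuously evaluated stream.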
The Empress Approach
Empress defaults to batch processing with real-time opt-in:
// Default: Batch (cost-effective)
empress.log(decision);
// Opt-in: Real-time when needed
empress.log(decision, { realtime: true });
// Automatic: Errors are always real-time
empress.logError(error); // Always real-time
Dashboards show batch metrics by default, with real-time views for operations centers.
The result: full observability at batch prices, with real-time capability when you need it.
Don't pay for real-time observability you don't need. Aggregate when you can. Stream only when you must.