Real-time observability is seductive. Watching metrics update live feels powerful. Dashboards with sub-second refresh rates feel sophisticated.
But real-time observability is expensive. And for most use cases, it's unnecessary.
The Cost of Real-Time
Real-time observability requires:
- Stream processing infrastructure (Kafka, Kinesis, Flink)
- Real-time storage (time-series databases, hot storage)
- Continuous queries (always-on compute)
- WebSocket connections (persistent server resources)
Versus batch:
- Batch storage (cheap object storage)
- Scheduled processing (run only when needed)
- Periodic queries (compute on demand)
- Standard HTTP (stateless, cacheable)
Cost difference: 5-20x
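To see where a multiple in that range comes from, here is a back-of-envelope comparison. All unit prices below are hypothetical placeholders, chosen only to illustrate the shape of the math; real prices vary widely by provider and volume.

```python
# Illustrative (hypothetical) unit costs -- not real provider pricing.
EVENTS_PER_DAY = 10_000_000

# Streaming path: brokers + hot storage + always-on query compute
stream_cost_per_event = 0.000002   # ingestion + stream processing (assumed)
hot_storage_per_event = 0.000001   # time-series hot storage (assumed)
realtime_daily = EVENTS_PER_DAY * (stream_cost_per_event + hot_storage_per_event)

# Batch path: object storage + one scheduled scan
cold_storage_per_event = 0.0000001  # object storage (assumed)
batch_scan_per_event = 0.0000002    # periodic query compute (assumed)
batch_daily = EVENTS_PER_DAY * (cold_storage_per_event + batch_scan_per_event)

print(f"real-time: ${realtime_daily:.2f}/day, "
      f"batch: ${batch_daily:.2f}/day, "
      f"ratio: {realtime_daily / batch_daily:.0f}x")
```

With these placeholder numbers the ratio lands at 10x; shifting any assumption moves it within the 5-20x band.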
When Real-Time Matters
Real-time observability is justified when:
1. Immediate Action Is Required
If you need to act within seconds, you need to observe within seconds.
Examples:
- Fraud detection (block transaction before it completes)
- Safety systems (stop the robot before it crashes)
- Production incidents (alert on-call before users notice)
Test: "If I knew about this 5 minutes later, would outcomes be materially worse?"
2. Human Operators Are Watching
If someone is actively monitoring a screen, real-time feedback helps.
Examples:
- Live customer support observing agent interactions
- Operations center during high-traffic events
- Testing and debugging in development
Test: "Is there a human ready to act on this information right now?"
3. Cascading Failures Are Possible
Some failures compound rapidly. Early detection prevents escalation.
Examples:
- Agent retry storms
- Queue backlog buildup
- Resource exhaustion
Test: "Does the problem get 10x worse if I don't catch it in the first minute?"
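The three tests above can be folded into a single checklist helper. This is an illustrative sketch, not part of any real API; the parameter names simply restate the three criteria.

```python
def justify_realtime(immediate_action_required: bool,
                     human_actively_watching: bool,
                     failure_compounds_fast: bool) -> bool:
    """Real-time observability is justified if ANY of the three tests pass."""
    return (immediate_action_required
            or human_actively_watching
            or failure_compounds_fast)
```

If all three answers are "no", the event belongs on the batch path.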
When Batch Is Better
For everything else, batch observation is superior.
Performance Analysis
You don't need real-time metrics to analyze agent performance over the last week.
# Run daily, not continuously
daily_performance = empress.query("""
    SELECT agent_id,
           COUNT(*) as decisions,
           AVG(confidence) as avg_confidence,
           SUM(CASE WHEN outcome = 'success' THEN 1.0 ELSE 0 END) / COUNT(*) as success_rate
    FROM decisions
    WHERE date = CURRENT_DATE - 1
    GROUP BY agent_id
""")
Cost Attribution
Calculating costs per agent, per task, per customer—none of this needs to be real-time.
# Run weekly
cost_report = empress.query("""
    SELECT agent_id, customer_id,
           SUM(token_cost) as total_cost,
           COUNT(*) as actions
    FROM agent_actions
    WHERE date BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
    GROUP BY agent_id, customer_id
""")
Compliance Reporting
Auditors don't need real-time dashboards. They need accurate historical reports.
# Run monthly
compliance_report = empress.query("""
    SELECT decision_type,
           COUNT(*) as total,
           SUM(CASE WHEN human_reviewed THEN 1 ELSE 0 END) as human_reviewed,
           SUM(CASE WHEN human_reviewed THEN 1.0 ELSE 0 END) / COUNT(*) as review_rate
    FROM decisions
    WHERE date BETWEEN DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
                   AND DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 day'
    GROUP BY decision_type
""")
Trend Analysis
Trends are, by definition, not real-time. They emerge over time.
# Run daily
trends = empress.query("""
    SELECT DATE_TRUNC('day', timestamp) as day,
           AVG(confidence) as avg_confidence
    FROM decisions
    WHERE timestamp > CURRENT_DATE - 30
    GROUP BY day
    ORDER BY day
""")
The Hybrid Approach
Most organizations need both. The key is using each appropriately.
Real-Time Layer
- Scope: Errors, anomalies, safety-critical events
- Volume: <1% of total events
- Retention: Hours to days (hot storage)
- Cost: Higher, but limited volume
Batch Layer
- Scope: All decisions, outcomes, audit trail
- Volume: 100% of meaningful events
- Retention: Months to years (cold storage)
- Cost: Lower, handles full volume
Implementation Pattern
At Event Time
def on_agent_event(event):
    # Always: write to batch storage (cheap)
    batch_store.write(event)

    # Conditionally: send to real-time pipeline (expensive)
    if requires_realtime(event):
        realtime_pipeline.send(event)

def requires_realtime(event):
    return (
        event.type == "error" or
        event.type == "anomaly" or
        event.confidence < 0.5 or  # Low confidence decisions
        event.requires_human_approval
    )
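To make the routing predicate concrete, here is how it behaves on a few stub events. The `Event` class is an assumed stand-in modeling only the fields the predicate reads, and the predicate is restated so the snippet runs on its own.

```python
from dataclasses import dataclass

# Minimal stand-in for the event object the handler receives;
# only the fields the routing predicate reads are modeled.
@dataclass
class Event:
    type: str
    confidence: float = 1.0
    requires_human_approval: bool = False

def requires_realtime(event):
    return (
        event.type == "error" or
        event.type == "anomaly" or
        event.confidence < 0.5 or  # Low confidence decisions
        event.requires_human_approval
    )

# Routine high-confidence decisions stay on the cheap batch path
assert requires_realtime(Event(type="decision", confidence=0.9)) is False
# Errors and low-confidence decisions take the real-time path
assert requires_realtime(Event(type="error")) is True
assert requires_realtime(Event(type="decision", confidence=0.3)) is True
```

In practice the vast majority of events fail every condition, which is what keeps the real-time layer under 1% of total volume.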
Dashboard Design
// Real-time: Only critical metrics
const realtimeMetrics = [
  'active_errors',
  'decisions_pending_human_review',
  'queue_depth'
];

// Batch: Everything else
const batchMetrics = [
  'daily_decision_count',
  'weekly_success_rate',
  'cost_per_task',
  'agent_performance_ranking'
];
Alert Configuration
# Real-time alerts (immediate notification)
realtime_alerts:
  - name: "Error Rate Spike"
    condition: "error_rate > 0.05"
    latency: "< 30 seconds"
  - name: "Human Review Backlog"
    condition: "pending_reviews > 100"
    latency: "< 1 minute"

# Batch alerts (daily digest)
batch_alerts:
  - name: "Low Confidence Trend"
    condition: "avg_confidence < 0.7 for 3 days"
    schedule: "daily"
  - name: "Cost Anomaly"
    condition: "daily_cost > 2x rolling_avg"
    schedule: "daily"
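A batch alert condition like "avg_confidence < 0.7 for 3 days" reduces to a scan over daily aggregates. The sketch below shows how a daily job might evaluate it; the function and field names are illustrative, not part of any real alerting engine.

```python
def low_confidence_trend(daily_averages, threshold=0.7, days=3):
    """Fire only when avg_confidence stays under the threshold for
    `days` consecutive days.

    daily_averages: list of (day, avg_confidence) tuples, ordered by day.
    """
    if len(daily_averages) < days:
        return False
    recent = daily_averages[-days:]
    return all(avg < threshold for _, avg in recent)

# One healthy day within the window keeps the alert quiet
assert low_confidence_trend([("d1", 0.80), ("d2", 0.65), ("d3", 0.60)]) is False
# Three consecutive low days trigger it
assert low_confidence_trend([("d1", 0.65), ("d2", 0.68), ("d3", 0.60)]) is True
```

Because the check runs once per day against batch aggregates, it costs a single query rather than a continuously evaluated stream.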
The Empress Approach
Empress defaults to batch processing with real-time opt-in:
// Default: Batch (cost-effective)
empress.log(decision);
// Opt-in: Real-time when needed
empress.log(decision, { realtime: true });
// Automatic: Errors are always real-time
empress.logError(error); // Always real-time
Dashboards show batch metrics by default, with real-time views for operations centers.
The result: full observability at batch prices, with real-time capability when you need it.
Don't pay for real-time observability you don't need. Aggregate when you can. Stream only when you must.