Technical | February 24, 2025 | 5 min read

Real-Time vs Batch: When to Observe Live and When to Aggregate

Not everything needs real-time observability. Knowing when to aggregate saves money and improves signal quality.

Empress Team
AI Operations & Observability

Real-time observability is seductive. Watching metrics update live feels powerful. Dashboards with sub-second refresh rates feel sophisticated.

But real-time observability is expensive. And for most use cases, it's unnecessary.

The Cost of Real-Time

Real-time observability requires:

flowchart LR
    A[Event] --> B[Stream Processing]
    B --> C[Real-Time Store]
    C --> D[Live Dashboard]
    style B fill:#f97316,color:#fff
    style C fill:#f97316,color:#fff

  • Stream processing infrastructure (Kafka, Kinesis, Flink)
  • Real-time storage (time-series databases, hot storage)
  • Continuous queries (always-on compute)
  • WebSocket connections (persistent server resources)

Versus batch:

flowchart LR
    A[Events] --> B[Batch Storage]
    B --> C[Scheduled Processing]
    C --> D[Dashboard Refresh]

  • Batch storage (cheap object storage)
  • Scheduled processing (run only when needed)
  • Periodic queries (compute on demand)
  • Standard HTTP (stateless, cacheable)

Cost difference: a real-time pipeline typically costs 5-20x more than a batch one.

When Real-Time Matters

Real-time observability is justified when:

1. Immediate Action Is Required

If you need to act within seconds, you need to observe within seconds.

Examples:

  • Fraud detection (block transaction before it completes)
  • Safety systems (stop the robot before it crashes)
  • Production incidents (alert on-call before users notice)

Test: "If I knew about this 5 minutes later, would outcomes be materially worse?"

2. Human Operators Are Watching

If someone is actively monitoring a screen, real-time feedback helps.

Examples:

  • Live customer support observing agent interactions
  • Operations center during high-traffic events
  • Testing and debugging in development

Test: "Is there a human ready to act on this information right now?"

3. Cascading Failures Are Possible

Some failures compound rapidly. Early detection prevents escalation.

Examples:

  • Agent retry storms
  • Queue backlog buildup
  • Resource exhaustion

Test: "Does the problem get 10x worse if I don't catch it in the first minute?"
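
The three tests above can be read as a simple checklist: if any one passes, the signal earns a place in the real-time path; otherwise it goes to batch. The following is a minimal sketch of that rule, not a prescribed implementation. The `Signal` type, its field names, and the sample signals are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    """Answers to the three tests for one kind of event (hypothetical type)."""
    name: str
    worse_if_delayed_5_min: bool   # Test 1: immediate action is required
    human_watching_now: bool       # Test 2: an operator is ready to act
    compounds_within_minute: bool  # Test 3: cascading failure is possible


def needs_realtime(signal: Signal) -> bool:
    """Real-time is justified if any one of the three tests passes."""
    return (
        signal.worse_if_delayed_5_min
        or signal.human_watching_now
        or signal.compounds_within_minute
    )


# Illustrative signals only; the answers are assumptions, not measurements.
signals = [
    Signal("fraud_check", True, False, False),
    Signal("agent_retry_storm", False, False, True),
    Signal("weekly_cost_report", False, False, False),
]

routing = {s.name: ("real-time" if needs_realtime(s) else "batch") for s in signals}
print(routing)
```

Writing the answers down per signal makes the default explicit: a signal that cannot pass any of the three tests stays in batch.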

When Batch Is Better

For everything else, batch observation is superior.

Performance Analysis

You don't need real-time metrics to analyze agent performance over the last week.

# Run daily, not continuously
daily_performance = empress.query("""
  SELECT agent_id,
         COUNT(*) as decisions,
         AVG(confidence) as avg_confidence,
         -- 1.0 forces decimal division; integer / integer would truncate to 0 or 1
         SUM(CASE WHEN outcome='success' THEN 1.0 ELSE 0 END) / COUNT(*) as success_rate
  FROM decisions
  WHERE date = CURRENT_DATE - 1
  GROUP BY agent_id
""")

Cost Attribution

Calculating costs per agent, per task, or per customer does not need to be real-time.

# Run weekly
cost_report = empress.query("""
  SELECT agent_id, customer_id,
         SUM(token_cost) as total_cost,
         COUNT(*) as actions
  FROM agent_actions
  WHERE date BETWEEN CURRENT_DATE - 7 AND CURRENT_DATE - 1
  GROUP BY agent_id, customer_id
""")

Compliance Reporting

Auditors don't need real-time dashboards. They need accurate historical reports.

# Run monthly
compliance_report = empress.query("""
  SELECT decision_type,
         COUNT(*) as total,
         SUM(CASE WHEN human_reviewed THEN 1 ELSE 0 END) as human_reviewed,
         -- 1.0 forces decimal division; integer / integer would truncate to 0 or 1
         SUM(CASE WHEN human_reviewed THEN 1.0 ELSE 0 END) / COUNT(*) as review_rate
  FROM decisions
  WHERE date BETWEEN DATE_TRUNC('month', CURRENT_DATE - INTERVAL '1 month')
                 AND DATE_TRUNC('month', CURRENT_DATE) - INTERVAL '1 day'
  GROUP BY decision_type
""")

Trend Analysis

Trends are, by definition, not real-time. They emerge over time.

# Run daily
trends = empress.query("""
  SELECT DATE_TRUNC('day', timestamp) as day,
         AVG(confidence) as avg_confidence
  FROM decisions
  WHERE timestamp > CURRENT_DATE - 30
  GROUP BY day
  ORDER BY day
""")

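Once the daily averages are available, the trend itself can be summarized as a single number. The sketch below fits a least-squares slope to a list of daily values, assuming they arrive oldest first, one per day, with no gaps. The function name `trend_slope` and the sample values are made up for illustration and are not part of any Empress API.

```python
def trend_slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (change per day)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two daily values to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


# Made-up daily avg_confidence values, oldest first.
daily_avg_confidence = [0.82, 0.81, 0.80, 0.79, 0.78]
slope = trend_slope(daily_avg_confidence)
print(f"avg_confidence changes by {slope:+.3f} per day")
```

A slope like this only means something over many days of data, which is why a daily batch job is a natural place to compute it.
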
The Hybrid Approach

Most organizations need both. The key is using each appropriately.

flowchart TD
    A[Agent Event] --> B{Needs Immediate Action?}
    B -->|Yes| C[Real-Time Pipeline]
    B -->|No| D[Batch Pipeline]
    C --> E[Real-Time Alerts]
    C --> F[Live Dashboard]
    D --> G[Data Lake]
    G --> H[Scheduled Analysis]
    H --> I[Reports & Insights]

Real-Time Layer

  • Scope: Errors, anomalies, safety-critical events
  • Volume: <1% of total events
  • Retention: Hours to days (hot storage)
  • Cost: Higher, but limited volume

Batch Layer

  • Scope: All decisions, outcomes, audit trail
  • Volume: 100% of meaningful events
  • Retention: Months to years (cold storage)
  • Cost: Lower, handles full volume
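
To see why the split keeps costs close to batch-only, the figures quoted in this article (under 1% of events in the real-time layer, a 5-20x cost difference) can be combined in a rough calculation. This is a sketch under simplifying assumptions: cost scales linearly with event volume, the 5-20x multiplier applies per event, and every event is written to batch whether or not it is also streamed. The function name `hybrid_cost_ratio` is hypothetical.

```python
def hybrid_cost_ratio(realtime_share: float, realtime_multiplier: float) -> float:
    """Cost of the hybrid setup relative to a batch-only baseline of 1.0.

    Every event goes to batch (cost 1.0 per unit of volume); the real-time
    share is additionally streamed at `realtime_multiplier` times that cost.
    """
    if not 0.0 <= realtime_share <= 1.0:
        raise ValueError("realtime_share must be between 0 and 1")
    return 1.0 + realtime_share * realtime_multiplier


# 1% of events streamed, at the low and high end of the 5-20x range.
for multiplier in (5, 20):
    hybrid = hybrid_cost_ratio(0.01, multiplier)
    print(f"multiplier {multiplier}x: hybrid is about {hybrid:.2f}x batch-only, "
          f"versus {multiplier}x for streaming everything")
```

Under these assumptions the real-time layer adds roughly 5-20% on top of the batch bill, instead of multiplying the whole bill by 5-20.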

Implementation Pattern

At Event Time

def on_agent_event(event):
    # Always: Write to batch storage (cheap)
    batch_store.write(event)

    # Conditionally: Send to real-time pipeline (expensive)
    if requires_realtime(event):
        realtime_pipeline.send(event)

def requires_realtime(event):
    return (
        event.type == "error" or
        event.type == "anomaly" or
        event.confidence < 0.5 or  # Low confidence decisions
        event.requires_human_approval
    )
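
To try the routing rule end to end, the version below swaps the storage and streaming clients for in-memory stand-ins. `AgentEvent`, `InMemoryBatchStore`, and `InMemoryRealtimePipeline` are hypothetical names introduced for this sketch; a real deployment would replace them with whatever object-storage and streaming clients are in use. The sinks are passed as arguments here (a small departure from the snippet above) so the function can be exercised without globals.

```python
from dataclasses import dataclass, field


@dataclass
class AgentEvent:
    """Hypothetical event shape; only the fields the routing rule reads."""
    type: str
    confidence: float = 1.0
    requires_human_approval: bool = False


@dataclass
class InMemoryBatchStore:
    """Stand-in for cheap object storage: keeps every event in a list."""
    events: list = field(default_factory=list)

    def write(self, event: AgentEvent) -> None:
        self.events.append(event)


@dataclass
class InMemoryRealtimePipeline:
    """Stand-in for a stream producer: keeps only the events sent to it."""
    events: list = field(default_factory=list)

    def send(self, event: AgentEvent) -> None:
        self.events.append(event)


def requires_realtime(event: AgentEvent) -> bool:
    """Same rule as above: errors, anomalies, low confidence, human approval."""
    return (
        event.type in ("error", "anomaly")
        or event.confidence < 0.5
        or event.requires_human_approval
    )


def on_agent_event(event, batch_store, realtime_pipeline) -> None:
    batch_store.write(event)           # always: full record goes to batch
    if requires_realtime(event):
        realtime_pipeline.send(event)  # only the urgent subset is streamed


batch_store = InMemoryBatchStore()
realtime_pipeline = InMemoryRealtimePipeline()

# Made-up events for illustration.
sample_events = [
    AgentEvent("decision", confidence=0.92),
    AgentEvent("decision", confidence=0.41),  # low confidence
    AgentEvent("error"),
    AgentEvent("decision", confidence=0.88, requires_human_approval=True),
    AgentEvent("decision", confidence=0.75),
]
for event in sample_events:
    on_agent_event(event, batch_store, realtime_pipeline)

print(len(batch_store.events), "events in batch,",
      len(realtime_pipeline.events), "sent to real-time")
```

Because every event is written to batch first, the batch layer stays a complete record even when the real-time path is narrowed or temporarily unavailable.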

Dashboard Design

// Real-time: Only critical metrics
const realtimeMetrics = [
  'active_errors',
  'decisions_pending_human_review',
  'queue_depth'
];

// Batch: Everything else
const batchMetrics = [
  'daily_decision_count',
  'weekly_success_rate',
  'cost_per_task',
  'agent_performance_ranking'
];

Alert Configuration

# Real-time alerts (immediate notification)
realtime_alerts:
  - name: "Error Rate Spike"
    condition: "error_rate > 0.05"
    latency: "< 30 seconds"

  - name: "Human Review Backlog"
    condition: "pending_reviews > 100"
    latency: "< 1 minute"

# Batch alerts (daily digest)
batch_alerts:
  - name: "Low Confidence Trend"
    condition: "avg_confidence < 0.7 for 3 days"
    schedule: "daily"

  - name: "Cost Anomaly"
    condition: "daily_cost > 2x rolling_avg"
    schedule: "daily"
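
The two batch conditions are simple enough to evaluate in a scheduled job over daily aggregates. The sketch below shows one possible reading of them. The function names are hypothetical, and the rolling-average window (all prior days in the list) is an assumption, since the configuration above does not specify one.

```python
def low_confidence_trend(daily_avg_confidence, threshold=0.7, days=3):
    """True if each of the most recent `days` values is below `threshold`."""
    recent = daily_avg_confidence[-days:]
    return len(recent) == days and all(value < threshold for value in recent)


def cost_anomaly(daily_costs, factor=2.0):
    """True if the latest day's cost exceeds `factor` times the mean of prior days."""
    if len(daily_costs) < 2:
        return False  # no history to compare against
    *history, today = daily_costs
    rolling_avg = sum(history) / len(history)
    return today > factor * rolling_avg


# Made-up daily aggregates, oldest first.
confidence_by_day = [0.74, 0.69, 0.68, 0.66]
cost_by_day = [100.0, 110.0, 90.0, 250.0]

alerts = []
if low_confidence_trend(confidence_by_day):
    alerts.append("Low Confidence Trend")
if cost_anomaly(cost_by_day):
    alerts.append("Cost Anomaly")

print(alerts)
```

Both checks need several days of history before they can fire at all, so evaluating them more often than daily would add cost without surfacing anything sooner.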

The Empress Approach

Empress defaults to batch processing with real-time opt-in:

// Default: Batch (cost-effective)
empress.log(decision);

// Opt-in: Real-time when needed
empress.log(decision, { realtime: true });

// Automatic: Errors are always real-time
empress.logError(error);  // Always real-time

Dashboards show batch metrics by default, with real-time views for operations centers.

The result: full observability at batch prices, with real-time capability when you need it.

Don't pay for real-time observability you don't need. Aggregate when you can. Stream only when you must.
