The observability strategy that works for your pilot deployment will not work at scale.
I've watched teams learn this lesson the hard way—usually around 3 AM when their logging infrastructure falls over during a traffic spike.
Here's how to evolve your observability as you scale.
The Scaling Stages
Each stage requires fundamental changes to your approach.
Stage 1: Pilot (1-10 Agents)
What Works
At pilot scale, you can log almost everything:
```python
# This is fine for 10 agents
def log_everything(event):
    empress.log({
        "timestamp": event.timestamp,
        "agent": event.agent,
        "type": event.type,
        "details": event.to_dict(),      # Full event data
        "context": get_full_context(),   # All context
        "trace": get_full_trace(),       # Complete trace
    })
```
- Full event details
- Complete context
- Detailed traces
- Real-time everything
Volume
- Agents: 10
- Events per agent per day: 1,000
- Total daily events: 10,000
- Daily storage: ~10 MB
- Monthly cost: ~$50
Mistakes to Avoid
Don't get attached to this approach. It won't scale.
Stage 2: Production (10-100 Agents)
What Changes
At 100 agents, your log volume is 10x what it was in the pilot, and costs start to matter.
```python
# Start being selective
def log_selectively(event):
    if event.is_decision or event.is_outcome or event.is_error:
        empress.log({
            "timestamp": event.timestamp,
            "agent": event.agent,
            "type": event.type,
            "summary": event.summarize(),        # Not full details
            "key_context": event.key_factors(),  # Not all context
        })
```
Key Adjustments
| Pilot Approach | Production Approach |
|---|---|
| Log full events | Log summaries |
| All context | Key factors only |
| Real-time everything | Real-time for alerts only |
| Detailed traces | Sampled traces |
Volume
- Agents: 100
- Events per agent per day: 1,000
- Total daily events: 100,000
- With selective logging: 10,000
- Daily storage: ~10 MB
- Monthly cost: ~$200
New Capabilities Needed
- Sampling: Not every successful operation needs a trace
- Aggregation: Combine similar events
- Tiered storage: Hot/warm/cold based on age
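A minimal sketch of the first two capabilities, sampling and aggregation. This is not the Empress API; the sample rate, event shape, and function names are illustrative assumptions:

```python
import random
from collections import Counter

SAMPLE_RATE = 0.05  # Assumption: trace 5% of successful operations

def should_trace(event: dict, rng=random.random) -> bool:
    """Errors are always traced; successes are sampled."""
    if event.get("type") == "error":
        return True
    return rng() < SAMPLE_RATE

def aggregate(events: list) -> dict:
    """Collapse similar events into per-type counts."""
    return dict(Counter(e["type"] for e in events))
```

The `rng` parameter exists only so the sampling decision is testable; in production you would tune the rate per event type rather than use one global constant.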
Stage 3: Scale (100-1,000 Agents)
What Changes
At 1,000 agents, architecture matters more than strategy.
```python
# Batch and aggregate
def log_at_scale(events: list):
    # Batch writes
    batch = []
    for event in events:
        if should_log(event):
            batch.append(event.to_minimal_dict())
            if len(batch) >= 100:
                empress.log_batch(batch)
                batch = []
    if batch:  # Flush the final partial batch
        empress.log_batch(batch)

    # Aggregate metrics
    metrics = aggregate_metrics(events)
    empress.log_metrics(metrics)
```
Key Adjustments
| Production Approach | Scale Approach |
|---|---|
| Direct logging | Buffered logging |
| Per-event writes | Batch writes |
| Central aggregation | Distributed aggregation |
| Single storage tier | Multi-tier storage |
Volume
- Agents: 1,000
- Potential events per day: 1,000,000
- After signal filtering: 30,000
- After aggregation: 10,000
- Daily storage: ~20 MB
- Monthly cost: ~$500
New Capabilities Needed
- Distributed tracing: Correlate across agent boundaries
- Cardinality management: Limit unique metric dimensions
- Automated tiering: Move data between storage tiers
- Query federation: Search across storage tiers
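Cardinality management is the least obvious item on this list, so here is a hypothetical sketch: cap how many unique values a metric dimension may take and fold the overflow into a catch-all bucket. The cap and the `"__other__"` label are illustrative assumptions:

```python
class CardinalityLimiter:
    """Cap the unique values of a metric dimension (e.g. agent_id)."""

    def __init__(self, max_values: int = 100):
        self.max_values = max_values
        self.seen: set = set()

    def limit(self, value: str) -> str:
        if value in self.seen:
            return value
        if len(self.seen) < self.max_values:
            self.seen.add(value)
            return value
        return "__other__"  # Overflow bucket keeps series count bounded
```

At 1,000 agents, an unbounded `agent_id` dimension alone creates 1,000 time series per metric; capping it keeps query cost predictable.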
Stage 4: Enterprise (1,000+ Agents)
What Changes
At enterprise scale, governance becomes critical.
```python
# Policy-driven logging
def log_with_policy(event, policy: LoggingPolicy):
    # Check if logging is required
    if not policy.should_log(event):
        return

    # Apply transformations
    sanitized = policy.sanitize(event)
    enriched = policy.enrich(sanitized)

    # Route to appropriate destination
    destination = policy.route(enriched)
    destination.log(enriched)

    # Apply retention
    policy.schedule_retention(enriched.id)
```
Governance Requirements
| Concern | Solution |
|---|---|
| Data sovereignty | Regional storage routing |
| Retention policies | Automated lifecycle management |
| Access control | Role-based log access |
| Cost allocation | Per-team/project attribution |
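For the data-sovereignty row, regional routing can be as simple as a lookup from the event's origin region to a storage destination. Region names and destinations below are illustrative assumptions, not Empress configuration:

```python
# Hypothetical region-to-destination map for data sovereignty
REGION_DESTINATIONS = {
    "eu": "logs-eu-west",
    "us": "logs-us-east",
}
DEFAULT_DESTINATION = "logs-us-east"

def route_event(event: dict) -> str:
    """Pick a storage destination based on the event's origin region."""
    return REGION_DESTINATIONS.get(event.get("region"), DEFAULT_DESTINATION)
```

The important design property is that routing happens at write time, so EU-origin data never lands outside its region even transiently.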
Volume
- Agents: 10,000
- Potential events per day: 10,000,000
- After policy filtering: 100,000
- After aggregation: 30,000
- Daily storage: ~60 MB
- Monthly cost: ~$2,000
New Capabilities Needed
- Policy engine: Centralized logging rules
- Multi-region: Geographic distribution
- Self-service: Teams manage their own logging
- FinOps: Cost visibility and allocation
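FinOps starts with attribution: summing log volume per team and converting it to a dollar estimate. The per-GB rate and event shape here are illustrative assumptions:

```python
from collections import defaultdict

COST_PER_GB = 0.50  # Assumption: blended storage + query rate

def attribute_costs(events: list) -> dict:
    """Sum bytes logged per team and convert to a dollar estimate."""
    bytes_by_team = defaultdict(int)
    for e in events:
        bytes_by_team[e["team"]] += e["bytes"]
    return {team: b / 1e9 * COST_PER_GB for team, b in bytes_by_team.items()}
```

Once each team sees its own line item, over-logging tends to fix itself without central mandates.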
The Anti-Patterns
1. "We'll Optimize Later"
- Pilot cost: $50/month
- Production (unoptimized): $5,000/month
- Scale (unoptimized): $500,000/month
You won't optimize later. You'll panic.
2. "Just Add More Storage"
Storage is cheap. Query is not. Your analytics will grind to a halt before your storage budget does.
3. "Log Everything, Filter at Query Time"
```sql
-- This query at scale
SELECT * FROM events
WHERE agent_id = 'x'
  AND timestamp > NOW() - INTERVAL '1 day';
-- Scans 10 TB to return 10 KB
-- Cost: ~$50 per query
-- Time: ~10 minutes
```
Filter at write time, not query time.
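A write-time filter can be a one-line predicate applied before anything reaches storage. This sketch reuses the selective-logging rule from Stage 2; the signal set is an assumption:

```python
KEEP_TYPES = {"decision", "outcome", "error"}  # Assumed high-signal types

def write_filter(events: list) -> list:
    """Drop low-signal events before they are ever written."""
    return [e for e in events if e.get("type") in KEEP_TYPES]
```

Everything the filter drops is data you never pay to store, index, or scan.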
4. "One Size Fits All"
Different agents have different observability needs. A high-risk financial agent needs more logging than a content recommendation agent.
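One way to avoid the one-size-fits-all trap is to attach an observability tier to each agent. The tier names, sample rates, and retention periods below are illustrative assumptions:

```python
# Hypothetical per-agent observability tiers: risk drives logging depth
LOGGING_TIERS = {
    "high_risk": {"trace_rate": 1.0,  "retention_days": 365},
    "standard":  {"trace_rate": 0.1,  "retention_days": 90},
    "low_risk":  {"trace_rate": 0.01, "retention_days": 30},
}

def tier_for(agent: dict) -> dict:
    """Fall back to the standard tier when an agent declares none."""
    name = agent.get("tier", "standard")
    return LOGGING_TIERS.get(name, LOGGING_TIERS["standard"])
```

A financial agent would declare `high_risk` and get full traces with a year of retention; a recommendation agent at `low_risk` gets 1% sampling and 30 days.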
The Scaling Checklist
Before Going to Production (10+ agents)
- Implemented selective logging (decisions, outcomes, errors)
- Summarizing context instead of logging raw inputs
- Sampling traces for successful operations
- Batch writes enabled
Before Going to Scale (100+ agents)
- Distributed aggregation in place
- Multi-tier storage configured
- Cardinality limits enforced
- Query optimization implemented
Before Going to Enterprise (1,000+ agents)
- Policy engine deployed
- Regional routing configured
- Self-service dashboards available
- Cost attribution implemented
The Empress Scaling Model
Empress is designed to scale with you:
- Pilot: Simple SDK, direct logging
- Production: Automatic batching, built-in sampling
- Scale: Distributed aggregation, tiered storage
- Enterprise: Policy engine, multi-region, FinOps dashboard
You don't have to rebuild your observability at each stage. The platform evolves with your deployment.
Start simple. Scale smart. That's sustainable observability.