A startup I advised was spending $47,000 per month on observability. Their AI agents processed about 10,000 tasks per month.
That's $4.70 per task—just for logging.
Their agents were creating value at roughly $3 per task.
They were losing money on observability alone.
## The Obvious Costs

### Storage
Every log entry costs money to store. At scale, this adds up fast.
- Daily log volume: 50 GB
- Monthly storage: 1.5 TB
- Annual storage: 18 TB
- Cost per TB: $23/month (cloud storage)
- Annual storage cost: $5,000+
But storage is cheap, right? Keep reading.
### Ingestion
Getting data into your observability system costs more than storing it.
- Log entries per day: 10 million
- Ingestion cost: $0.50 per million
- Daily ingestion: $5
- Monthly ingestion: $150
- Annual ingestion: $1,800
Still manageable. But we're not done.
### Query and Analysis
This is where costs explode. Querying large datasets is expensive.
- Queries per day: 500
- Average data scanned: 100 GB per query
- Cost per GB scanned: $0.005
- Daily query cost: $250
- Monthly query cost: $7,500
- Annual query cost: $90,000
Now we're talking real money.
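The three line items above reduce to straightforward arithmetic. A quick sanity check using this section's figures:

```python
# Back-of-the-envelope observability cost model (figures from this section).

# Storage: 50 GB/day at $23 per TB-month
daily_volume_gb = 50
monthly_volume_tb = daily_volume_gb * 30 / 1000       # 1.5 TB
annual_volume_tb = monthly_volume_tb * 12             # 18 TB

# Ingestion: 10M entries/day at $0.50 per million
monthly_ingest = 10_000_000 / 1_000_000 * 0.50 * 30   # $150
annual_ingest = monthly_ingest * 12                   # $1,800

# Query: 500 queries/day, 100 GB scanned each, at $0.005/GB
monthly_query = 500 * 100 * 0.005 * 30                # $7,500
annual_query = monthly_query * 12                     # $90,000

print(annual_query / annual_ingest)  # queries cost 50x what ingestion does
```

Query spend dwarfs the other line items, which is why scan discipline matters more than cheaper storage tiers.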
### Retention
Compliance often requires keeping logs for years. Multiply everything above by your retention period.
## The Hidden Costs
The financial costs are obvious. The hidden costs are worse.
### Alert Fatigue
When you log everything, anomaly detection becomes impossible. Every spike looks like every other spike.
| Logging Strategy | Daily Alerts | Actionable | Response Rate |
|---|---|---|---|
| Log everything | 200+ | 3% | 12% |
| Signal only | 8 | 85% | 94% |
Your team stops responding to alerts because most are noise. Then a real problem gets ignored.
### Debugging Speed
More logs should mean faster debugging, right?
Wrong.
- Time to find root cause (over-logging): 4.2 hours average
- Time to find root cause (signal-focused): 23 minutes average
When you're searching through millions of irrelevant entries, finding the signal takes forever. When you only log signal, you go straight to the answer.
### Decision Paralysis
With too much data, teams struggle to act. Every metric has a counter-metric. Every trend has an exception.
"Our error rate is up, but our throughput is also up, and our p95 latency is down, but our p99 is up, and..."
Signal-focused observability gives you clear answers: this is working, this isn't, here's what to do.
### Compliance Risk
Paradoxically, logging more can increase compliance risk.
- Data retention: You're now storing data you might need to delete
- Privacy: More logs mean more places for PII to hide
- Discovery: In litigation, everything you logged is discoverable
Log what you need for compliance. Not more.
## The Math of Signal vs Noise
Let's compare two approaches for a workload of 10,000 daily tasks:
### Over-Logging Approach
- Events per task: 150
- Daily events: 1,500,000
- Event size: 500 bytes
- Daily data: 750 MB
- Monthly data: 22.5 GB
- Monthly cost (all-in): $2,400
- Cost per task: $0.008
### Signal-Focused Approach
- Events per task: 3 (decision, outcome, any exceptions)
- Daily events: 30,000
- Event size: 800 bytes (richer context)
- Daily data: 24 MB
- Monthly data: 720 MB
- Monthly cost (all-in): $85
- Cost per task: $0.00028

Savings: roughly 96%
And the signal-focused approach provides better debugging and compliance capabilities because every event is meaningful.
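The comparison reduces to a small function. A sketch (the all-in monthly costs are taken as given, and cost per task divides the monthly bill by a full month of tasks):

```python
def logging_profile(tasks_per_day, events_per_task, event_bytes, monthly_cost):
    """Volume and per-task cost for one logging strategy."""
    daily_mb = tasks_per_day * events_per_task * event_bytes / 1_000_000
    monthly_gb = daily_mb * 30 / 1000
    cost_per_task = monthly_cost / (tasks_per_day * 30)
    return daily_mb, monthly_gb, cost_per_task

over_logging = logging_profile(10_000, 150, 500, 2_400)
signal_only = logging_profile(10_000, 3, 800, 85)
savings = 1 - 85 / 2_400

print(over_logging)      # (750.0, 22.5, 0.008)
print(f"{savings:.0%}")  # 96%
```

The savings ratio is independent of how you normalize per task: it is simply one monthly bill divided by the other.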
## How to Reduce Costs
### 1. Audit Your Current Logging
What are you actually logging? Most teams don't know.
```sql
-- What's in your logs?
SELECT event_type, COUNT(*) AS volume
FROM logs
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY event_type
ORDER BY volume DESC
LIMIT 20;
```
You'll likely find that 80% of your log volume comes from 5% of event types.
### 2. Apply the Decision Framework
For each high-volume event type, ask:
- Is this a decision, outcome, or exception?
- Have we ever acted on this data?
- Is this required for compliance?
If none of the above: stop logging it.
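The three questions can be encoded as a filter applied to each event type during the audit. A minimal sketch (the names are illustrative, not a real API):

```python
SIGNAL_KINDS = {"decision", "outcome", "exception"}

def should_keep(kind: str, ever_acted_on: bool, compliance_required: bool) -> bool:
    """Keep an event type only if it passes one of the three questions."""
    return kind in SIGNAL_KINDS or ever_acted_on or compliance_required

# Triage examples:
print(should_keep("decision", False, False))   # True: it's a decision
print(should_keep("cache_hit", False, True))   # True: compliance requires it
print(should_keep("cache_hit", False, False))  # False: stop logging it
```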
### 3. Aggregate, Don't Enumerate
Instead of logging every cache hit:
```json
// Don't do this
{"event": "cache_hit", "key": "user:123"}
{"event": "cache_hit", "key": "user:456"}
{"event": "cache_hit", "key": "user:789"}
// ... 10,000 more
```
Aggregate:
```json
// Do this
{
  "event": "cache_performance",
  "period": "5m",
  "hits": 10000,
  "misses": 47,
  "hit_rate": 0.9953
}
```
Same insight, 99.99% less data.
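One way to produce that aggregate is an in-memory counter flushed once per window. A minimal sketch (the `emit` callback stands in for whatever sink you use):

```python
import time
from collections import Counter

class CacheStatsWindow:
    """Count hits/misses in memory; emit one aggregate event per window."""

    def __init__(self, period_seconds=300, emit=print):
        self.period = period_seconds
        self.emit = emit                      # sink for the aggregate event
        self.counts = Counter()
        self.window_start = time.monotonic()

    def record(self, hit: bool):
        self.counts["hits" if hit else "misses"] += 1
        if time.monotonic() - self.window_start >= self.period:
            self.flush()

    def flush(self):
        total = self.counts["hits"] + self.counts["misses"]
        if total:
            self.emit({
                "event": "cache_performance",
                "period": f"{self.period // 60}m",
                "hits": self.counts["hits"],
                "misses": self.counts["misses"],
                "hit_rate": round(self.counts["hits"] / total, 4),
            })
        self.counts = Counter()
        self.window_start = time.monotonic()

stats = CacheStatsWindow()
for _ in range(10_000):
    stats.record(hit=True)
for _ in range(47):
    stats.record(hit=False)
stats.flush()  # one event instead of 10,047
```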
### 4. Sample High-Volume Events
If you must log operational data, sample it.
```python
import random

def should_log_operation():
    return random.random() < 0.01  # 1% sample

if should_log_operation():
    empress.log_operation(...)
```
You can still detect trends and anomalies with sampled data.
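One refinement worth considering: random sampling can keep some of a task's events and drop the rest. Hashing a stable key instead gives every event of a sampled task the same decision (a common technique, sketched here; not an Empress feature):

```python
import hashlib

def should_log_operation(task_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampling: same task_id, same answer, every time."""
    digest = hashlib.sha256(task_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# Every event for a sampled task is kept, so its trace stays complete.
print(should_log_operation("task-42") == should_log_operation("task-42"))  # True
```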
## The Empress Approach
Empress is priced on decisions and outcomes—not log volume. This aligns incentives: you're not penalized for capturing rich context around the events that matter.
Log fewer events with more detail. Pay less. Get better insights.
That's how observability should work.