A startup I advised was spending $47,000 per month on observability. Their AI agents processed about 10,000 tasks per month.
That's $4.70 per task—just for logging.
Their agents were creating value at roughly $3 per task.
They were losing money on observability alone.
## The Obvious Costs

### Storage
Every log entry costs money to store. At scale, this adds up fast.
- Daily log volume: 50 GB
- Monthly storage: 1.5 TB
- Annual storage: 18 TB
- Cost per TB: $23/month (cloud storage)
- Annual storage cost: $5,000+
But storage is cheap, right? Keep reading.
### Ingestion
Getting data into your observability system costs more than storing it.
- Log entries per day: 10 million
- Ingestion cost: $0.50 per million
- Daily ingestion: $5
- Monthly ingestion: $150
- Annual ingestion: $1,800
Still manageable. But we're not done.
### Query and Analysis
This is where costs explode. Querying large datasets is expensive.
- Queries per day: 500
- Average data scanned: 100 GB per query
- Cost per GB scanned: $0.005
- Daily query cost: $250
- Monthly query cost: $7,500
- Annual query cost: $90,000
Now we're talking real money.
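The three line items above reduce to straightforward arithmetic. A quick sanity check using this section's figures:

```python
# Back-of-the-envelope observability cost model (figures from this section).

# Storage: 50 GB/day at $23 per TB-month
daily_volume_gb = 50
monthly_volume_tb = daily_volume_gb * 30 / 1000       # 1.5 TB
annual_volume_tb = monthly_volume_tb * 12             # 18 TB

# Ingestion: 10M entries/day at $0.50 per million
monthly_ingest = 10_000_000 / 1_000_000 * 0.50 * 30   # $150
annual_ingest = monthly_ingest * 12                   # $1,800

# Query: 500 queries/day, 100 GB scanned each, at $0.005/GB
monthly_query = 500 * 100 * 0.005 * 30                # $7,500
annual_query = monthly_query * 12                     # $90,000

print(annual_query / annual_ingest)  # queries cost 50x what ingestion does
```

Query spend dwarfs the other line items, which is why scan discipline matters more than cheaper storage tiers.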
### Retention
Compliance often requires keeping logs for years. Multiply everything above by your retention period.
## The Hidden Costs
The financial costs are obvious. The hidden costs are worse.
### Alert Fatigue
When you log everything, anomaly detection becomes impossible. Every spike looks like every other spike.
| Logging Strategy | Daily Alerts | Actionable | Response Rate |
|---|---|---|---|
| Log everything | 200+ | 3% | 12% |
| Signal only | 8 | 85% | 94% |
Your team stops responding to alerts because most are noise. Then a real problem gets ignored.
### Debugging Speed
More logs should mean faster debugging, right?
Wrong.
- Time to find root cause (over-logging): 4.2 hours average
- Time to find root cause (signal-focused): 23 minutes average
When you're searching through millions of irrelevant entries, finding the signal takes forever. When you only log signal, you go straight to the answer.
### Decision Paralysis
With too much data, teams struggle to act. Every metric has a counter-metric. Every trend has an exception.
"Our error rate is up, but our throughput is also up, and our p95 latency is down, but our p99 is up, and..."
Signal-focused observability gives you clear answers: this is working, this isn't, here's what to do.
### Compliance Risk
Paradoxically, logging more can increase compliance risk.
- Data retention: You're now storing data you might need to delete
- Privacy: More logs mean more places for PII to hide
- Discovery: In litigation, everything you logged is discoverable
Log what you need for compliance. Not more.
## The Math of Signal vs Noise
Let's compare two approaches for a workload of 10,000 daily tasks:
### Over-Logging Approach
- Events per task: 150
- Daily events: 1,500,000
- Event size: 500 bytes
- Daily data: 750 MB
- Monthly data: 22.5 GB
- Monthly cost (all-in): $2,400
- Cost per task: $0.008
### Signal-Focused Approach
- Events per task: 3 (decision, outcome, any exceptions)
- Daily events: 30,000
- Event size: 800 bytes (richer context)
- Daily data: 24 MB
- Monthly data: 720 MB
- Monthly cost (all-in): $85
- Cost per task: $0.00028

Savings: roughly 96%
And the signal-focused approach provides better debugging and compliance capabilities because every event is meaningful.
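The comparison reduces to a small function. A sketch (the all-in monthly costs are taken as given, and cost per task divides the monthly bill by a full month of tasks):

```python
def logging_profile(tasks_per_day, events_per_task, event_bytes, monthly_cost):
    """Volume and per-task cost for one logging strategy."""
    daily_mb = tasks_per_day * events_per_task * event_bytes / 1_000_000
    monthly_gb = daily_mb * 30 / 1000
    cost_per_task = monthly_cost / (tasks_per_day * 30)
    return daily_mb, monthly_gb, cost_per_task

over_logging = logging_profile(10_000, 150, 500, 2_400)
signal_only = logging_profile(10_000, 3, 800, 85)
savings = 1 - 85 / 2_400

print(over_logging)      # (750.0, 22.5, 0.008)
print(f"{savings:.0%}")  # 96%
```

The savings ratio is independent of how you normalize per task: it is simply one monthly bill divided by the other.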
## How to Reduce Costs
### 1. Audit Your Current Logging
What are you actually logging? Most teams don't know.
```sql
-- What's in your logs?
SELECT event_type, COUNT(*) AS volume
FROM logs
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY event_type
ORDER BY volume DESC
LIMIT 20;
```
You'll likely find that 80% of your log volume comes from 5% of event types.
### 2. Apply the Decision Framework
For each high-volume event type, ask:
- Is this a decision, outcome, or exception?
- Have we ever acted on this data?
- Is this required for compliance?
If none of the above: stop logging it.
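The three questions can be encoded as a filter applied to each event type during the audit. A minimal sketch (the names are illustrative, not a real API):

```python
SIGNAL_KINDS = {"decision", "outcome", "exception"}

def should_keep(kind: str, ever_acted_on: bool, compliance_required: bool) -> bool:
    """Keep an event type only if it passes one of the three questions."""
    return kind in SIGNAL_KINDS or ever_acted_on or compliance_required

# Triage examples:
print(should_keep("decision", False, False))   # True: it's a decision
print(should_keep("cache_hit", False, True))   # True: compliance requires it
print(should_keep("cache_hit", False, False))  # False: stop logging it
```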
### 3. Aggregate, Don't Enumerate
Instead of logging every cache hit:
```json
// Don't do this
{"event": "cache_hit", "key": "user:123"}
{"event": "cache_hit", "key": "user:456"}
{"event": "cache_hit", "key": "user:789"}
// ... 10,000 more
```
Aggregate:
```json
// Do this
{
  "event": "cache_performance",
  "period": "5m",
  "hits": 10000,
  "misses": 47,
  "hit_rate": 0.9953
}
```
Same insight, 99.99% less data.
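One way to produce that aggregate is an in-memory counter flushed once per window. A minimal sketch (the `emit` callback stands in for whatever sink you use):

```python
import time
from collections import Counter

class CacheStatsWindow:
    """Count hits/misses in memory; emit one aggregate event per window."""

    def __init__(self, period_seconds=300, emit=print):
        self.period = period_seconds
        self.emit = emit                      # sink for the aggregate event
        self.counts = Counter()
        self.window_start = time.monotonic()

    def record(self, hit: bool):
        self.counts["hits" if hit else "misses"] += 1
        if time.monotonic() - self.window_start >= self.period:
            self.flush()

    def flush(self):
        total = self.counts["hits"] + self.counts["misses"]
        if total:
            self.emit({
                "event": "cache_performance",
                "period": f"{self.period // 60}m",
                "hits": self.counts["hits"],
                "misses": self.counts["misses"],
                "hit_rate": round(self.counts["hits"] / total, 4),
            })
        self.counts = Counter()
        self.window_start = time.monotonic()

stats = CacheStatsWindow()
for _ in range(10_000):
    stats.record(hit=True)
for _ in range(47):
    stats.record(hit=False)
stats.flush()  # one event instead of 10,047
```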
### 4. Sample High-Volume Events
If you must log operational data, sample it.
```python
import random

def should_log_operation():
    return random.random() < 0.01  # 1% sample

if should_log_operation():
    empress.log_operation(...)
```
You can still detect trends and anomalies with sampled data.
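One refinement worth considering: random sampling can keep some of a task's events and drop the rest. Hashing a stable key instead gives every event of a sampled task the same decision (a common technique, sketched here; not an Empress feature):

```python
import hashlib

def should_log_operation(task_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampling: same task_id, same answer, every time."""
    digest = hashlib.sha256(task_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate

# Every event for a sampled task is kept, so its trace stays complete.
print(should_log_operation("task-42") == should_log_operation("task-42"))  # True
```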
## The Empress Approach
Empress is priced on decisions and outcomes—not log volume. This aligns incentives: you're not penalized for capturing rich context around the events that matter.
Log fewer events with more detail. Pay less. Get better insights.
That's how observability should work.