When compliance teams get involved, logging expands. Every question about "what if we need..." adds another field. Every audit finding adds another event type.
Before long, you're logging everything and drowning in data.
There's a better way: define the minimum viable audit trail and stick to it.
What Auditors Actually Need
I've been through dozens of AI system audits. Here's what auditors consistently ask:
- What decision was made?
- When was it made?
- What information was available?
- Why was this choice selected?
- Who (or what) made the decision?
- What was the outcome?
- Was human oversight involved?
That's it. Seven questions. Everything else is nice-to-have.
The MVAT Schema
Minimum Viable Audit Trail. One record per decision.
{
  "id": "dec_a1b2c3d4",
  "timestamp": "2025-02-23T14:30:00Z",
  "agent": {
    "id": "support-agent-prod",
    "version": "2.3.1"
  },
  "decision": {
    "type": "refund_approval",
    "choice": "approve",
    "confidence": 0.87
  },
  "context": {
    "summary": "Customer requested refund for order #12345, $127.50, within policy window",
    "key_factors": ["order_age_days: 12", "customer_tier: gold", "reason: product_defective"]
  },
  "reasoning": "Order within 30-day window, customer is gold tier, defect claim consistent with product batch issues",
  "outcome": {
    "status": "completed",
    "result": "refund_processed",
    "timestamp": "2025-02-23T14:30:02Z"
  },
  "human_oversight": {
    "required": false,
    "provided": false
  }
}
That's 15 fields. Not 150.
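One way to keep the schema honest is to validate records against it before they're written. Here's a minimal validator sketch (the helper name and required-path list are illustrative, not part of the schema itself). Note that the outcome fields are excluded because they're appended after the decision is logged.

```python
# Required MVAT fields at decision time, as dotted paths.
# Outcome fields are omitted: they are written later by a separate update.
REQUIRED_PATHS = [
    ("id",), ("timestamp",),
    ("agent", "id"), ("agent", "version"),
    ("decision", "type"), ("decision", "choice"), ("decision", "confidence"),
    ("context", "summary"), ("context", "key_factors"),
    ("reasoning",),
    ("human_oversight", "required"), ("human_oversight", "provided"),
]

def missing_fields(record: dict) -> list[str]:
    """Return dotted paths for any required MVAT field absent from the record."""
    missing = []
    for path in REQUIRED_PATHS:
        node = record
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing
```

Run this check at write time and reject (or alert on) incomplete records, so gaps surface immediately rather than during an audit.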
Breaking Down Each Component
1. Identity (id, timestamp)
Unique identifier and when it happened. Non-negotiable basics.
{
  "id": "dec_a1b2c3d4",
  "timestamp": "2025-02-23T14:30:00Z"
}
2. Agent (agent.id, agent.version)
Which agent, which version. Critical for reproducibility.
{
  "agent": {
    "id": "support-agent-prod",
    "version": "2.3.1"
  }
}
Not needed: Internal agent configuration, model parameters, prompt templates. These can be retrieved from version control if needed.
3. Decision (type, choice, confidence)
What kind of decision, what was chosen, how confident.
{
  "decision": {
    "type": "refund_approval",
    "choice": "approve",
    "confidence": 0.87
  }
}
Not needed: All options considered, detailed scoring of each option. The choice and confidence are sufficient for most audits.
4. Context (summary, key_factors)
What information was available. Summarized, not raw.
{
  "context": {
    "summary": "Customer requested refund for order #12345, $127.50, within policy window",
    "key_factors": ["order_age_days: 12", "customer_tier: gold", "reason: product_defective"]
  }
}
Critical insight: You don't need to log the entire input. You need to log enough to understand why this input led to this decision.
Not needed: Full customer record, complete order history, raw API responses.
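The summarization step can live in a small function at the boundary between your raw data and the audit log. A sketch, assuming hypothetical `order` and `customer` dicts with the field names shown:

```python
def summarize_context(order: dict, customer: dict) -> dict:
    """Distill raw records into the compact MVAT context block.

    Input field names (id, amount, age_days, refund_reason, tier) are
    illustrative; adapt them to your own data model.
    """
    return {
        "summary": (
            f"Customer requested refund for order #{order['id']}, "
            f"${order['amount']:.2f}, within policy window"
        ),
        "key_factors": [
            f"order_age_days: {order['age_days']}",
            f"customer_tier: {customer['tier']}",
            f"reason: {order['refund_reason']}",
        ],
    }
```

Keeping this logic in one place also gives you a single spot to review for accidental PII leakage into summaries.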
5. Reasoning
Why this choice was made. Human-readable.
{
  "reasoning": "Order within 30-day window, customer is gold tier, defect claim consistent with product batch issues"
}
This is the most important field. It's what auditors actually read.
6. Outcome (status, result, timestamp)
What happened after the decision.
{
  "outcome": {
    "status": "completed",
    "result": "refund_processed",
    "timestamp": "2025-02-23T14:30:02Z"
  }
}
Not needed: Detailed execution logs, API call traces, retry attempts.
7. Human Oversight
Was human review required? Did it happen?
{
  "human_oversight": {
    "required": false,
    "provided": false
  }
}
If provided:
{
  "human_oversight": {
    "required": true,
    "provided": true,
    "reviewer": "agent_jane",
    "review_timestamp": "2025-02-23T14:31:00Z",
    "review_action": "approved_as_recommended"
  }
}
What You Don't Need
Raw Inputs
Don't log:
{
  "raw_input": {
    "customer": { /* 50 fields */ },
    "order": { /* 30 fields */ },
    "history": [ /* 100 interactions */ ]
  }
}
Do log:
{
  "context": {
    "summary": "Gold customer, order $127.50, within policy window",
    "key_factors": ["order_age_days: 12", "customer_tier: gold"]
  }
}
Internal Processing
Don't log:
{
  "processing": {
    "embedding_generated": true,
    "vector_search_results": 47,
    "relevance_scores": [0.92, 0.87, 0.81],
    "context_window_tokens": 3847,
    "model_calls": 3
  }
}
This is implementation detail, not audit trail.
Successful Operations
Don't log every successful API call, database query, or cache hit. Log failures and anomalies.
PII Beyond Identifiers
Don't log customer names, emails, addresses. Log customer IDs that can be joined to source systems if needed.
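A simple way to enforce this is a denylist filter applied to any dict before it reaches the audit log. A sketch with an assumed set of sensitive keys; a real deployment would tune the list (and probably recurse into nested dicts):

```python
# Assumed set of top-level keys considered PII; extend for your domain.
PII_KEYS = {"name", "email", "address", "phone"}

def strip_pii(record: dict) -> dict:
    """Drop known PII keys, keeping join-safe identifiers like customer_id."""
    return {k: v for k, v in record.items() if k not in PII_KEYS}
```

Identifiers such as `customer_id` survive, so an auditor with source-system access can still resolve the full record when genuinely needed.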
Storage Calculation
With MVAT, a typical decision record is ~500 bytes.
Decisions per day: 10,000
Bytes per decision: 500
Daily storage: 5 MB
Monthly storage: 150 MB
Annual storage: 1.8 GB
5-year retention: 9 GB
Compare to "log everything":
Events per day: 500,000
Bytes per event: 1,000
Daily storage: 500 MB
Monthly storage: 15 GB
Annual storage: 180 GB
5-year retention: 900 GB
100x difference.
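The arithmetic above is easy to reproduce for your own volumes; a one-function sketch (1 GB = 10^9 bytes, 365 days/year, matching the round numbers used here):

```python
def annual_gb(decisions_per_day: int, bytes_per_record: int) -> float:
    """Annual audit-trail storage in GB (1 GB = 1e9 bytes)."""
    return decisions_per_day * bytes_per_record * 365 / 1e9

mvat = annual_gb(10_000, 500)          # ~1.8 GB/year
everything = annual_gb(500_000, 1_000)  # ~180 GB/year
```

Plug in your own decision rate and measured record size before committing to a retention policy.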
Implementation
The Logger
def log_decision(
    agent_id: str,
    agent_version: str,
    decision_type: str,
    choice: str,
    confidence: float,
    context_summary: str,
    key_factors: list[str],
    reasoning: str,
    human_required: bool = False,
    human_review: dict | None = None,
) -> str:
    # Generate the id once so it can be both logged and returned.
    decision_id = generate_id()
    empress.log({
        "id": decision_id,
        "timestamp": now_iso(),
        "agent": {
            "id": agent_id,
            "version": agent_version
        },
        "decision": {
            "type": decision_type,
            "choice": choice,
            "confidence": confidence
        },
        "context": {
            "summary": context_summary,
            "key_factors": key_factors
        },
        "reasoning": reasoning,
        "human_oversight": {
            "required": human_required,
            "provided": human_review is not None,
            **(human_review or {})
        }
    })
    return decision_id

def log_outcome(decision_id: str, status: str, result: str):
    empress.update(decision_id, {
        "outcome": {
            "status": status,
            "result": result,
            "timestamp": now_iso()
        }
    })
Usage
# When decision is made
decision_id = log_decision(
    agent_id="support-agent-prod",
    agent_version="2.3.1",
    decision_type="refund_approval",
    choice="approve",
    confidence=0.87,
    context_summary="Customer requested refund for order #12345, $127.50",
    key_factors=["order_age_days: 12", "customer_tier: gold"],
    reasoning="Within policy window, gold tier customer, valid defect claim"
)
# After action completes
log_outcome(decision_id, "completed", "refund_processed")
The Audit Test
To validate your MVAT:
- Pull 10 random decision records
- For each, have someone unfamiliar with the system read it
- Ask them: "What happened and why?"
- If they can answer correctly, your audit trail is sufficient
If they can't, add detail to those specific areas. Don't expand everything.
Less data. Full auditability. That's the minimum viable audit trail.