You updated a prompt. Your agent now behaves differently. But which version is running in production? And how do you roll back if something goes wrong?
AI agent versioning is more complex than traditional software versioning: an agent's behavior depends on its code, prompts, model, and configuration. Change any one of them and you have a new version.
What Defines an Agent Version?
Each component can change independently:
| Component | Example Change | Behavior Impact |
|---|---|---|
| Code | New action handler | New capabilities |
| Prompts | Updated system prompt | Different decisions |
| Model | GPT-4 → GPT-4-turbo | Speed/cost/quality |
| Config | Temperature 0.7 → 0.3 | Determinism |
| Skills | Added new skill | New capabilities |
A version is only fully defined when all components are specified.
Semantic Versioning for Agents
Adapt semver for AI agents:
```
v{MAJOR}.{MINOR}.{PATCH}-{MODEL}-{PROMPT_HASH}
```
Example: `v2.3.1-gpt4t-a7b8c9`
- MAJOR: Breaking capability changes
- MINOR: New capabilities, backward compatible
- PATCH: Bug fixes, prompt tweaks
- MODEL: Model identifier
- PROMPT_HASH: First 6 chars of prompt hash
This tells you exactly what's running.
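A sketch of building that string, assuming SHA-256 as the prompt hash (the scheme above doesn't mandate a particular hash function):

```python
import hashlib

def agent_version(major: int, minor: int, patch: int,
                  model_tag: str, prompt_text: str) -> str:
    """Build a v{MAJOR}.{MINOR}.{PATCH}-{MODEL}-{PROMPT_HASH} string."""
    # First 6 hex chars of the prompt's SHA-256 (hash choice is an assumption)
    prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:6]
    return f"v{major}.{minor}.{patch}-{model_tag}-{prompt_hash}"
```

Because the hash is derived from the prompt text itself, any prompt edit produces a visibly different version string.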
Tracking Versions in Actions
Every action should record the agent version:
```json
{
  "actor": {
    "name": "Support Agent",
    "account": {
      "name": "support-agent-v2.3.1-gpt4t-a7b8c9"
    }
  },
  "verb": { "id": "resolved" },
  "object": { "id": "ticket-4892" },
  "context": {
    "extensions": {
      "agent_version": {
        "semantic": "2.3.1",
        "code_commit": "abc123",
        "prompt_version": "a7b8c9",
        "model": "gpt-4-turbo",
        "config_hash": "def456"
      }
    }
  }
}
```
Now you can:
- Compare performance across versions
- Identify which version caused an issue
- Roll back to a specific state
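Stamping can live in one helper so no action ships without its version. A sketch with a hypothetical `stamp_version` function:

```python
def stamp_version(action: dict, version_info: dict) -> dict:
    """Attach the full agent version to an action record before it is emitted."""
    extensions = action.setdefault("context", {}).setdefault("extensions", {})
    extensions["agent_version"] = dict(version_info)  # copy so later mutation doesn't leak in
    return action

# Hypothetical usage: the version_info dict mirrors the extension shown above
version_info = {"semantic": "2.3.1", "code_commit": "abc123",
                "prompt_version": "a7b8c9", "model": "gpt-4-turbo",
                "config_hash": "def456"}
action = stamp_version({"verb": {"id": "resolved"},
                        "object": {"id": "ticket-4892"}}, version_info)
```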
Deployment Strategies
Blue-Green Deployment
Run two versions simultaneously, switch traffic:
1. Deploy the new version (Green) alongside the old one (Blue)
2. Test it in production
3. Switch traffic to Green
4. Keep Blue running, ready for rollback
Canary Deployment
Gradually shift traffic:
1. Deploy the new version
2. Route 10% of traffic to it
3. Monitor metrics
4. Increase traffic if healthy
5. Roll back if not
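The routing step can be as small as a deterministic hash bucket. A sketch (the version labels and the 10% default are assumptions):

```python
import hashlib

STABLE, CANARY = "v2.2.0", "v2.3.0"

def route(user_id: str, canary_fraction: float = 0.10) -> str:
    """Deterministically bucket a user so the same user always hits the same version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return CANARY if bucket < canary_fraction * 100 else STABLE
```

Hashing the user ID (rather than calling `random`) keeps each user on one version across requests, which keeps their experience consistent and the metrics clean.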
Shadow Deployment
Run new version in parallel, compare results:
The user sees only v2.2.0 results. v2.3.0 runs silently. You compare decisions without risk.
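A sketch of the shadow wrapper, with `live_agent` and `shadow_agent` as stand-ins for the two versions:

```python
def shadow_run(request, live_agent, shadow_agent, log):
    """Serve the live answer; run the shadow version and record any disagreement."""
    live = live_agent(request)
    try:
        shadow = shadow_agent(request)
        log.append({"request": request, "live": live,
                    "shadow": shadow, "match": live == shadow})
    except Exception as exc:  # a shadow failure must never affect the user
        log.append({"request": request, "error": str(exc)})
    return live  # the user only ever sees the live version's result
```

Note the try/except: the shadow version is allowed to crash, time out, or misbehave without any user-visible effect — that isolation is the whole point of the pattern.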
A/B Testing Agents
Different versions for different users:
```json
{
  "experiment": "support-agent-v2.3",
  "variants": [
    {
      "name": "control",
      "version": "v2.2.0",
      "traffic": 0.5
    },
    {
      "name": "treatment",
      "version": "v2.3.0",
      "traffic": 0.5
    }
  ],
  "metrics": [
    "resolution_rate",
    "customer_satisfaction",
    "cost_per_resolution"
  ]
}
```
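Given a config in that shape, assignment can be made sticky by hashing the user ID against the experiment name. A sketch with a hypothetical `assign_variant` helper:

```python
import hashlib

def assign_variant(user_id: str, experiment: dict) -> dict:
    """Map a user to a variant by cumulative traffic share; sticky per user."""
    seed = f"{experiment['experiment']}:{user_id}".encode()
    point = int(hashlib.sha256(seed).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant in experiment["variants"]:
        cumulative += variant["traffic"]
        if point < cumulative:
            return variant
    return experiment["variants"][-1]  # guard against rounding in traffic sums
```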
Track outcomes by variant:
| Metric | Control (v2.2.0) | Treatment (v2.3.0) | Delta |
|---|---|---|---|
| Resolution Rate | 87.2% | 91.4% | +4.2% |
| CSAT | 4.2 | 4.4 | +0.2 |
| Cost/Resolution | $0.45 | $0.38 | -15% |
Statistical significance tells you when to promote the new version.
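For the resolution-rate comparison, a two-proportion z-test is one standard choice. A sketch, assuming 5,000 tickets per variant (the counts are illustrative, not from real traffic):

```python
import math

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-proportion z-test; |z| beyond ~1.96 means p < 0.05 (two-sided)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 87.2% vs 91.4% over 5,000 tickets each -> z well past 1.96
z = two_proportion_z(4360, 5000, 4570, 5000)
```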
Rollback Procedures
When things go wrong, roll back fast:
Immediate Rollback Triggers
- Error rate > 10%
- Latency > 2x baseline
- Cost > 2x baseline
- Safety violation detected
Rollback Process
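Both the triggers and the process can be scripted so rollback is a decision, not an investigation. A minimal sketch, assuming hypothetical metric names and a `deploy` callable supplied by your platform:

```python
def should_rollback(current: dict, baseline: dict) -> bool:
    """Evaluate the immediate-rollback triggers for the current metric window."""
    return (current["error_rate"] > 0.10
            or current["latency_p95_ms"] > 2 * baseline["latency_p95_ms"]
            or current["cost_per_action"] > 2 * baseline["cost_per_action"]
            or current.get("safety_violations", 0) > 0)

def rollback(manifest: dict, deploy) -> str:
    """Redeploy the previous version recorded in the running manifest."""
    target = manifest["rollback"]["previous_version"]
    deploy(target)  # e.g., re-apply that version's stored manifest
    return target
```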
What Gets Rolled Back
Be explicit about rollback scope:
| Component | Rollback Method |
|---|---|
| Code | Deploy previous commit |
| Prompts | Restore from version control |
| Model | Update config to previous model |
| Config | Restore from config store |
All components should be version-controlled and restorable.
Version Comparison Dashboard
Compare versions side-by-side:
```
┌─────────────────────────────────────────────────────────────┐
│ VERSION COMPARISON: v2.2.0 vs v2.3.0                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ SUCCESS RATE         LATENCY (p95)        COST/ACTION       │
│ v2.2.0: 87.2%        v2.2.0: 1,247ms      v2.2.0: $0.45     │
│ v2.3.0: 91.4%        v2.3.0: 847ms        v2.3.0: $0.38     │
│ Δ: +4.2% ✓           Δ: -32% ✓            Δ: -15% ✓         │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│ DECISION DISTRIBUTION                                       │
│                                                             │
│ v2.2.0: Approve 45% │ Deny 35% │ Escalate 20%               │
│ v2.3.0: Approve 48% │ Deny 32% │ Escalate 20%               │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│ SAMPLE DECISIONS                                            │
│                                                             │
│ Case #4892: v2.2.0=Deny     │ v2.3.0=Approve │ Changed      │
│ Case #4893: v2.2.0=Approve  │ v2.3.0=Approve │ Same         │
│ Case #4894: v2.2.0=Escalate │ v2.3.0=Approve │ Changed      │
└─────────────────────────────────────────────────────────────┘
```
Prompt Version Control
Prompts deserve the same rigor as code:
```
prompts/
├── support-agent/
│   ├── system.md
│   ├── classification.md
│   └── response.md
├── CHANGELOG.md
└── VERSION
```
Each prompt change gets:
- A commit message explaining why
- A version bump
- Testing before deployment
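The PROMPT_HASH in the version string can be derived directly from this directory. A sketch, assuming prompts are Markdown files and a truncated SHA-256 as the hash:

```python
import hashlib
from pathlib import Path

def prompt_version(prompt_dir: str) -> str:
    """Hash every prompt file (sorted, so ordering is stable) into one short version."""
    digest = hashlib.sha256()
    for path in sorted(Path(prompt_dir).glob("*.md")):
        digest.update(path.name.encode())  # renames change the version too
        digest.update(path.read_bytes())
    return digest.hexdigest()[:6]
```

Editing any prompt file, adding one, or renaming one yields a new version string, so a stale prompt can never hide behind an unchanged code commit.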
The Version Manifest
Deploy with a manifest that specifies everything:
```yaml
# agent-manifest.yaml
version: "2.3.1"
deployed: "2025-02-18T14:00:00Z"
components:
  code:
    commit: "abc123def456"
    image: "agents/support:2.3.1"
  prompts:
    version: "a7b8c9"
    repository: "prompts/support-agent"
  model:
    provider: "openai"
    name: "gpt-4-turbo"
    version: "2025-01-15"
  config:
    temperature: 0.7
    max_tokens: 2000
rollback:
  previous_version: "2.3.0"
  manifest: "s3://deployments/support-agent/2.3.0.yaml"
```
This manifest is your source of truth for what's running.
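A deploy pipeline can refuse manifests that are missing pieces. A sketch that validates an already-parsed manifest dict (field names follow the example above; the exact check list is an assumption):

```python
def validate_manifest(manifest: dict) -> list:
    """Return a list of problems; an empty list means the manifest is deployable."""
    problems = []
    if not manifest.get("version"):
        problems.append("missing semantic version")
    components = manifest.get("components", {})
    for key in ("code", "prompts", "model", "config"):
        if key not in components:
            problems.append(f"missing component: {key}")
    if "previous_version" not in manifest.get("rollback", {}):
        problems.append("no rollback target recorded")
    return problems
```

Rejecting a manifest with no rollback target is deliberate: a version you cannot roll back from should never reach production in the first place.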
The Empress Approach
Empress tracks agent versions automatically:
- Version capture in every action
- Performance comparison across versions
- Deployment integration with CI/CD
- Rollback support with one-click restore
- A/B testing with statistical analysis
You deploy with confidence because you can see exactly how each version performs, and roll back in seconds if needed.
Versioning isn't bureaucracy. It's how you move fast without breaking things.