Private BetaWe're currently in closed beta.Join the waitlist
BlogTechnical
TechnicalFebruary 18, 20255 min read

AI Agent Versioning: Deploy with Confidence

Unlike traditional software, AI agents can change behavior without code changes. Here's how to version and deploy safely.

Empress Team
AI Operations & Observability

You updated a prompt. Your agent now behaves differently. But which version is running in production? And how do you roll back if something goes wrong?

AI agent versioning is more complex than traditional software versioning. The behavior depends on code, prompts, models, and configuration. Change any of them, and you have a new version.

What Defines an Agent Version?

flowchart TD A[Agent Version] --> B[Code] A --> C[Prompts] A --> D[Model] A --> E[Config] A --> F[Skills]

Each component can change independently:

Component Example Change Behavior Impact
Code New action handler New capabilities
Prompts Updated system prompt Different decisions
Model GPT-4 → GPT-4-turbo Speed/cost/quality
Config Temperature 0.7 → 0.3 Determinism
Skills Added new skill New capabilities

A version is only fully defined when all components are specified.

Semantic Versioning for Agents

Adapt semver for AI agents:

v{MAJOR}.{MINOR}.{PATCH}-{MODEL}-{PROMPT_HASH}

Example: v2.3.1-gpt4t-a7b8c9

  • MAJOR: Breaking capability changes
  • MINOR: New capabilities, backward compatible
  • PATCH: Bug fixes, prompt tweaks
  • MODEL: Model identifier
  • PROMPT_HASH: First 6 chars of prompt hash

This tells you exactly what's running.

Tracking Versions in Actions

Every action should record the agent version:

{
  "actor": {
    "name": "Support Agent",
    "account": {
      "name": "support-agent-v2.3.1-gpt4t-a7b8c9"
    }
  },
  "verb": { "id": "resolved" },
  "object": { "id": "ticket-4892" },
  "context": {
    "extensions": {
      "agent_version": {
        "semantic": "2.3.1",
        "code_commit": "abc123",
        "prompt_version": "a7b8c9",
        "model": "gpt-4-turbo",
        "config_hash": "def456"
      }
    }
  }
}

Now you can:

  • Compare performance across versions
  • Identify which version caused an issue
  • Roll back to a specific state

Deployment Strategies

Blue-Green Deployment

Run two versions simultaneously, switch traffic:

flowchart LR A[Load Balancer] --> B[Blue: v2.2.0] A -.-> C[Green: v2.3.0] style B fill:#3b82f6 style C fill:#22c55e
  1. Deploy new version (Green)
  2. Test in production
  3. Switch traffic
  4. Keep old version ready for rollback

Canary Deployment

Gradually shift traffic:

flowchart LR A[Traffic] --> B{Router} B -->|90%| C[v2.2.0] B -->|10%| D[v2.3.0]
  1. Deploy new version
  2. Route 10% of traffic
  3. Monitor metrics
  4. Increase if healthy
  5. Roll back if not

Shadow Deployment

Run new version in parallel, compare results:

flowchart TD A[Request] --> B[v2.2.0 Primary] A --> C[v2.3.0 Shadow] B --> D[Response to User] C --> E[Log for Comparison]

The user sees only v2.2.0 results. v2.3.0 runs silently. You compare decisions without risk.

A/B Testing Agents

Different versions for different users:

{
  "experiment": "support-agent-v2.3",
  "variants": [
    {
      "name": "control",
      "version": "v2.2.0",
      "traffic": 0.5
    },
    {
      "name": "treatment",
      "version": "v2.3.0",
      "traffic": 0.5
    }
  ],
  "metrics": [
    "resolution_rate",
    "customer_satisfaction",
    "cost_per_resolution"
  ]
}

Track outcomes by variant:

Metric Control (v2.2.0) Treatment (v2.3.0) Delta
Resolution Rate 87.2% 91.4% +4.2%
CSAT 4.2 4.4 +0.2
Cost/Resolution $0.45 $0.38 -15%

Statistical significance tells you when to promote the new version.

Rollback Procedures

When things go wrong, roll back fast:

Immediate Rollback Triggers

  • Error rate > 10%
  • Latency > 2x baseline
  • Cost > 2x baseline
  • Safety violation detected

Rollback Process

flowchart TD A[Issue Detected] --> B{Automated Trigger?} B -->|Yes| C[Automatic Rollback] B -->|No| D[Human Decision] D --> E{Rollback?} E -->|Yes| C E -->|No| F[Continue Monitoring] C --> G[Switch to Previous Version] G --> H[Verify Health] H --> I[Post-Incident Review]

What Gets Rolled Back

Be explicit about rollback scope:

Component Rollback Method
Code Deploy previous commit
Prompts Restore from version control
Model Update config to previous model
Config Restore from config store

All components should be version-controlled and restorable.

Version Comparison Dashboard

Compare versions side-by-side:

┌─────────────────────────────────────────────────────────────┐
│  VERSION COMPARISON: v2.2.0 vs v2.3.0                       │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  SUCCESS RATE        LATENCY (p95)      COST/ACTION         │
│  v2.2.0: 87.2%       v2.2.0: 1,247ms    v2.2.0: $0.45      │
│  v2.3.0: 91.4%       v2.3.0: 847ms      v2.3.0: $0.38      │
│  Δ: +4.2% ✓          Δ: -32% ✓          Δ: -15% ✓          │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│  DECISION DISTRIBUTION                                       │
│                                                             │
│  v2.2.0: Approve 45% │ Deny 35% │ Escalate 20%             │
│  v2.3.0: Approve 48% │ Deny 32% │ Escalate 20%             │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│  SAMPLE DECISIONS                                            │
│                                                             │
│  Case #4892:  v2.2.0=Deny  │  v2.3.0=Approve  │ Changed    │
│  Case #4893:  v2.2.0=Approve │ v2.3.0=Approve │ Same       │
│  Case #4894:  v2.2.0=Escalate│ v2.3.0=Approve │ Changed    │
└─────────────────────────────────────────────────────────────┘

Prompt Version Control

Prompts deserve the same rigor as code:

prompts/
├── support-agent/
│   ├── system.md
│   ├── classification.md
│   └── response.md
├── CHANGELOG.md
└── VERSION

Each prompt change gets:

  • A commit message explaining why
  • A version bump
  • Testing before deployment

The Version Manifest

Deploy with a manifest that specifies everything:

# agent-manifest.yaml
version: "2.3.1"
deployed: "2025-02-18T14:00:00Z"

components:
  code:
    commit: "abc123def456"
    image: "agents/support:2.3.1"

  prompts:
    version: "a7b8c9"
    repository: "prompts/support-agent"

  model:
    provider: "openai"
    name: "gpt-4-turbo"
    version: "2025-01-15"

  config:
    temperature: 0.7
    max_tokens: 2000

rollback:
  previous_version: "2.3.0"
  manifest: "s3://deployments/support-agent/2.3.0.yaml"

This manifest is your source of truth for what's running.

The Empress Approach

Empress tracks agent versions automatically:

  • Version capture in every action
  • Performance comparison across versions
  • Deployment integration with CI/CD
  • Rollback support with one-click restore
  • A/B testing with statistical analysis

You deploy with confidence because you can see exactly how each version performs, and roll back in seconds if needed.

Versioning isn't bureaucracy. It's how you move fast without breaking things.

Share this article
Now in private beta

Ready to see what your AI agents do?

Complete observability for autonomous systems. One platform for compliance, operations, and intelligence.