You updated a prompt. Your agent now behaves differently. But which version is running in production? And how do you roll back if something goes wrong?
AI agent versioning is more complex than traditional software versioning: an agent's behavior depends on its code, prompts, model, and configuration. Change any one of them and you have a new version.
What Defines an Agent Version?
Each component can change independently:
| Component | Example Change | Behavior Impact |
|---|---|---|
| Code | New action handler | New capabilities |
| Prompts | Updated system prompt | Different decisions |
| Model | GPT-4 → GPT-4-turbo | Speed/cost/quality |
| Config | Temperature 0.7 → 0.3 | Determinism |
| Skills | Added new skill | New capabilities |
A version is only fully defined when all components are specified.
Semantic Versioning for Agents
Adapt semver for AI agents:
```
v{MAJOR}.{MINOR}.{PATCH}-{MODEL}-{PROMPT_HASH}
```
Example: `v2.3.1-gpt4t-a7b8c9`
- MAJOR: Breaking capability changes
- MINOR: New capabilities, backward compatible
- PATCH: Bug fixes, prompt tweaks
- MODEL: Model identifier
- PROMPT_HASH: First 6 chars of prompt hash
This tells you exactly what's running.
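A sketch of building that string, assuming SHA-256 as the prompt hash (the scheme above doesn't mandate a particular hash function):

```python
import hashlib

def agent_version(major: int, minor: int, patch: int,
                  model_tag: str, prompt_text: str) -> str:
    """Build a v{MAJOR}.{MINOR}.{PATCH}-{MODEL}-{PROMPT_HASH} string."""
    # First 6 hex chars of the prompt's SHA-256 (hash choice is an assumption)
    prompt_hash = hashlib.sha256(prompt_text.encode("utf-8")).hexdigest()[:6]
    return f"v{major}.{minor}.{patch}-{model_tag}-{prompt_hash}"
```

Because the hash is derived from the prompt text itself, any prompt edit produces a visibly different version string.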
Tracking Versions in Actions
Every action should record the agent version:
```json
{
  "actor": {
    "name": "Support Agent",
    "account": {
      "name": "support-agent-v2.3.1-gpt4t-a7b8c9"
    }
  },
  "verb": { "id": "resolved" },
  "object": { "id": "ticket-4892" },
  "context": {
    "extensions": {
      "agent_version": {
        "semantic": "2.3.1",
        "code_commit": "abc123",
        "prompt_version": "a7b8c9",
        "model": "gpt-4-turbo",
        "config_hash": "def456"
      }
    }
  }
}
```
Now you can:
- Compare performance across versions
- Identify which version caused an issue
- Roll back to a specific state
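Stamping can live in one helper so no action ships without its version. A sketch with a hypothetical `stamp_version` function:

```python
def stamp_version(action: dict, version_info: dict) -> dict:
    """Attach the full agent version to an action record before it is emitted."""
    extensions = action.setdefault("context", {}).setdefault("extensions", {})
    extensions["agent_version"] = dict(version_info)  # copy so later mutation doesn't leak in
    return action

# Hypothetical usage: the version_info dict mirrors the extension shown above
version_info = {"semantic": "2.3.1", "code_commit": "abc123",
                "prompt_version": "a7b8c9", "model": "gpt-4-turbo",
                "config_hash": "def456"}
action = stamp_version({"verb": {"id": "resolved"},
                        "object": {"id": "ticket-4892"}}, version_info)
```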
Deployment Strategies
Blue-Green Deployment
Run two versions simultaneously, switch traffic:
1. Deploy the new version (Green) alongside the old one (Blue)
2. Test it in production
3. Switch traffic to Green
4. Keep Blue running, ready for rollback
Canary Deployment
Gradually shift traffic:
1. Deploy the new version
2. Route 10% of traffic to it
3. Monitor metrics
4. Increase traffic if healthy
5. Roll back if not
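The routing step can be as small as a deterministic hash bucket. A sketch (the version labels and the 10% default are assumptions):

```python
import hashlib

STABLE, CANARY = "v2.2.0", "v2.3.0"

def route(user_id: str, canary_fraction: float = 0.10) -> str:
    """Deterministically bucket a user so the same user always hits the same version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return CANARY if bucket < canary_fraction * 100 else STABLE
```

Hashing the user ID (rather than calling `random`) keeps each user on one version across requests, which keeps their experience consistent and the metrics clean.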
Shadow Deployment
Run new version in parallel, compare results:
The user sees only v2.2.0 results. v2.3.0 runs silently. You compare decisions without risk.
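A sketch of the shadow wrapper, with `live_agent` and `shadow_agent` as stand-ins for the two versions:

```python
def shadow_run(request, live_agent, shadow_agent, log):
    """Serve the live answer; run the shadow version and record any disagreement."""
    live = live_agent(request)
    try:
        shadow = shadow_agent(request)
        log.append({"request": request, "live": live,
                    "shadow": shadow, "match": live == shadow})
    except Exception as exc:  # a shadow failure must never affect the user
        log.append({"request": request, "error": str(exc)})
    return live  # the user only ever sees the live version's result
```

Note the try/except: the shadow version is allowed to crash, time out, or misbehave without any user-visible effect — that isolation is the whole point of the pattern.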
A/B Testing Agents
Different versions for different users:
```json
{
  "experiment": "support-agent-v2.3",
  "variants": [
    {
      "name": "control",
      "version": "v2.2.0",
      "traffic": 0.5
    },
    {
      "name": "treatment",
      "version": "v2.3.0",
      "traffic": 0.5
    }
  ],
  "metrics": [
    "resolution_rate",
    "customer_satisfaction",
    "cost_per_resolution"
  ]
}
```
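Given a config in that shape, assignment can be made sticky by hashing the user ID against the experiment name. A sketch with a hypothetical `assign_variant` helper:

```python
import hashlib

def assign_variant(user_id: str, experiment: dict) -> dict:
    """Map a user to a variant by cumulative traffic share; sticky per user."""
    seed = f"{experiment['experiment']}:{user_id}".encode()
    point = int(hashlib.sha256(seed).hexdigest(), 16) % 10_000 / 10_000
    cumulative = 0.0
    for variant in experiment["variants"]:
        cumulative += variant["traffic"]
        if point < cumulative:
            return variant
    return experiment["variants"][-1]  # guard against rounding in traffic sums
```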
Track outcomes by variant:
| Metric | Control (v2.2.0) | Treatment (v2.3.0) | Delta |
|---|---|---|---|
| Resolution Rate | 87.2% | 91.4% | +4.2% |
| CSAT | 4.2 | 4.4 | +0.2 |
| Cost/Resolution | $0.45 | $0.38 | -15% |
Statistical significance tells you when to promote the new version.
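For the resolution-rate comparison, a two-proportion z-test is one standard choice. A sketch, assuming 5,000 tickets per variant (the counts are illustrative, not from real traffic):

```python
import math

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> float:
    """Two-proportion z-test; |z| beyond ~1.96 means p < 0.05 (two-sided)."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 87.2% vs 91.4% over 5,000 tickets each -> z well past 1.96
z = two_proportion_z(4360, 5000, 4570, 5000)
```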
Rollback Procedures
When things go wrong, roll back fast:
Immediate Rollback Triggers
- Error rate > 10%
- Latency > 2x baseline
- Cost > 2x baseline
- Safety violation detected
Rollback Process
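Both the triggers and the process can be scripted so rollback is a decision, not an investigation. A minimal sketch, assuming hypothetical metric names and a `deploy` callable supplied by your platform:

```python
def should_rollback(current: dict, baseline: dict) -> bool:
    """Evaluate the immediate-rollback triggers for the current metric window."""
    return (current["error_rate"] > 0.10
            or current["latency_p95_ms"] > 2 * baseline["latency_p95_ms"]
            or current["cost_per_action"] > 2 * baseline["cost_per_action"]
            or current.get("safety_violations", 0) > 0)

def rollback(manifest: dict, deploy) -> str:
    """Redeploy the previous version recorded in the running manifest."""
    target = manifest["rollback"]["previous_version"]
    deploy(target)  # e.g., re-apply that version's stored manifest
    return target
```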
What Gets Rolled Back
Be explicit about rollback scope:
| Component | Rollback Method |
|---|---|
| Code | Deploy previous commit |
| Prompts | Restore from version control |
| Model | Update config to previous model |
| Config | Restore from config store |
All components should be version-controlled and restorable.
Version Comparison Dashboard
Compare versions side-by-side:
```
┌─────────────────────────────────────────────────────────────┐
│ VERSION COMPARISON: v2.2.0 vs v2.3.0                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│ SUCCESS RATE         LATENCY (p95)        COST/ACTION       │
│ v2.2.0: 87.2%        v2.2.0: 1,247ms      v2.2.0: $0.45     │
│ v2.3.0: 91.4%        v2.3.0: 847ms        v2.3.0: $0.38     │
│ Δ: +4.2% ✓           Δ: -32% ✓            Δ: -15% ✓         │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│ DECISION DISTRIBUTION                                       │
│                                                             │
│ v2.2.0: Approve 45% │ Deny 35% │ Escalate 20%               │
│ v2.3.0: Approve 48% │ Deny 32% │ Escalate 20%               │
│                                                             │
├─────────────────────────────────────────────────────────────┤
│ SAMPLE DECISIONS                                            │
│                                                             │
│ Case #4892: v2.2.0=Deny     │ v2.3.0=Approve │ Changed      │
│ Case #4893: v2.2.0=Approve  │ v2.3.0=Approve │ Same         │
│ Case #4894: v2.2.0=Escalate │ v2.3.0=Approve │ Changed      │
└─────────────────────────────────────────────────────────────┘
```
Prompt Version Control
Prompts deserve the same rigor as code:
```
prompts/
├── support-agent/
│   ├── system.md
│   ├── classification.md
│   └── response.md
├── CHANGELOG.md
└── VERSION
```
Each prompt change gets:
- A commit message explaining why
- A version bump
- Testing before deployment
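The PROMPT_HASH in the version string can be derived directly from this directory. A sketch, assuming prompts are Markdown files and a truncated SHA-256 as the hash:

```python
import hashlib
from pathlib import Path

def prompt_version(prompt_dir: str) -> str:
    """Hash every prompt file (sorted, so ordering is stable) into one short version."""
    digest = hashlib.sha256()
    for path in sorted(Path(prompt_dir).glob("*.md")):
        digest.update(path.name.encode())  # renames change the version too
        digest.update(path.read_bytes())
    return digest.hexdigest()[:6]
```

Editing any prompt file, adding one, or renaming one yields a new version string, so a stale prompt can never hide behind an unchanged code commit.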
The Version Manifest
Deploy with a manifest that specifies everything:
```yaml
# agent-manifest.yaml
version: "2.3.1"
deployed: "2025-02-18T14:00:00Z"
components:
  code:
    commit: "abc123def456"
    image: "agents/support:2.3.1"
  prompts:
    version: "a7b8c9"
    repository: "prompts/support-agent"
  model:
    provider: "openai"
    name: "gpt-4-turbo"
    version: "2025-01-15"
  config:
    temperature: 0.7
    max_tokens: 2000
rollback:
  previous_version: "2.3.0"
  manifest: "s3://deployments/support-agent/2.3.0.yaml"
```
This manifest is your source of truth for what's running.
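A deploy pipeline can refuse manifests that are missing pieces. A sketch that validates an already-parsed manifest dict (field names follow the example above; the exact check list is an assumption):

```python
def validate_manifest(manifest: dict) -> list:
    """Return a list of problems; an empty list means the manifest is deployable."""
    problems = []
    if not manifest.get("version"):
        problems.append("missing semantic version")
    components = manifest.get("components", {})
    for key in ("code", "prompts", "model", "config"):
        if key not in components:
            problems.append(f"missing component: {key}")
    if "previous_version" not in manifest.get("rollback", {}):
        problems.append("no rollback target recorded")
    return problems
```

Rejecting a manifest with no rollback target is deliberate: a version you cannot roll back from should never reach production in the first place.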
The Empress Approach
Empress tracks agent versions automatically:
- Version capture in every action
- Performance comparison across versions
- Deployment integration with CI/CD
- Rollback support with one-click restore
- A/B testing with statistical analysis
You deploy with confidence because you can see exactly how each version performs, and roll back in seconds if needed.
Versioning isn't bureaucracy. It's how you move fast without breaking things.