
Your exec team asks: "How do we know this AI feature is still accurate six months after launch?" You say, "We tested it thoroughly." They ask: "Show me the test results." You realize: the evaluation dataset changed three times since launch, and no one versioned it.
This isn't a hypothetical. I've watched legal reviews stall for weeks because teams couldn't reproduce their original accuracy claims. The model was identical. The test data had drifted.
NeurIPS and ICML now require **artifact checklists**—researchers must publish code, data, and reproduction instructions. The standard is simple: if another team can't reproduce your results, your paper gets flagged.
Enterprise AI should adopt the same discipline. Your evaluation dataset is an artifact. Version it like code, or accept that your accuracy claims are aspirational.

Eval Dataset as Code:

**Commit 1: Initial Dataset (v1.0)**
- 100 labeled examples, sourced from Q4 2024 support tickets
- Annotator: Senior analyst (internal ID: SA-003)
- Date: 2024-12-15
- Hash: a3f7b2c (immutable)

**Commit 2: Dataset Expansion (v1.1)**
- +50 edge cases from production errors (Jan 2025)
- Annotator: Same (SA-003)
- Date: 2025-01-10
- Hash: d8e4f9a
- Change log: "Added failure modes from first 2 weeks GA"

**Commit 3: Domain Shift (v2.0)**
- New use case (switched from support → sales)
- Re-labeled all 150 examples for new context
- Annotator: Sales ops lead (SO-012)
- Date: 2025-03-01
- Hash: k2m9n5p
- Breaking change: Not comparable to v1.x
Why This Matters: When legal asks, "Can you prove accuracy hasn't degraded?"—you run the current model on eval dataset v1.0 (locked hash) and compare to launch metrics. If you didn't version the dataset, you're guessing.
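That "locked hash" check takes only a few lines. Here is a minimal Python sketch, assuming the eval set lives in a JSONL file; the file name and record format are illustrative, not part of any specific pipeline:

```python
import hashlib
from pathlib import Path

def dataset_hash(path: Path) -> str:
    """Content hash of an eval dataset file, truncated git-style."""
    return hashlib.sha256(path.read_bytes()).hexdigest()[:7]

def verify_locked(path: Path, locked_hash: str) -> bool:
    """True if the dataset on disk still matches its locked hash."""
    return dataset_hash(path) == locked_hash

# Lock v1.0 at launch, then re-check before any audit run.
if __name__ == "__main__":
    path = Path("eval_set_v1.0.jsonl")  # hypothetical filename
    path.write_text('{"input": "ticket text", "label": "billing"}\n')
    locked = dataset_hash(path)          # record this in the change log
    assert verify_locked(path, locked)   # passes until anyone edits the file
```

If `verify_locked` fails during an audit, you know the comparison to launch metrics is invalid before you spend a single inference dollar.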
What to put under version control:
- Dataset files
- Evaluation scripts
- Results logs
- Access control
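One way to tie those four artifact classes together is a manifest committed alongside each dataset version. This is a sketch of a hypothetical manifest format; the paths, annotator IDs, and field names are assumptions for illustration:

```python
import json
from pathlib import Path

# Hypothetical manifest linking dataset, script, results, and ownership
# for one locked eval version.
manifest = {
    "dataset": {"path": "eval_sets/eval_set_v1.0.jsonl", "hash": "a3f7b2c"},
    "eval_script": {"path": "eval.py", "version": "1.0.0"},
    "results_log": {"path": "results/2024-12-15.json"},
    "access": {"owner": "SA-003", "write": "locked"},
}

# Commit this file next to the dataset it describes.
Path("manifest_v1.0.json").write_text(json.dumps(manifest, indent=2))
```

A reviewer can then answer "who labeled this, with what script, and where are the results?" by reading one file instead of reconstructing history from chat logs.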
Feature: AI-generated patient summaries for physicians.
V1.0 Eval Set (Launch, Dec 2024):
- Hash: f7a3b9e

Month 3 Post-Launch (March 2025):
What If We Hadn't Versioned?

Make evaluation reproducible with one command:
```
# Reproduces launch accuracy
$ python eval.py --model v1.2.0 --dataset v1.0 --output results/2025-03-15.json

Output:
Model: gpt-4-turbo-20241215
Dataset: eval_set_v1.0 (hash: f7a3b9e, 200 examples)
Precision: 0.89
Recall: 0.91
F1: 0.90
Runtime: 42s
Cost: $0.18
```
If another PM can't run this command and get the same numbers, your evaluation isn't reproducible.
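A skeleton of what such an `eval.py` could look like, as a minimal sketch: the CLI flags match the command above, but the dataset naming scheme, stub metrics, and output schema are placeholder assumptions to swap for your real scoring loop.

```python
import argparse
import hashlib
import json
import time
from pathlib import Path

def run_eval(model: str, dataset_path: Path) -> dict:
    """Placeholder scoring loop; replace with real model calls and metrics."""
    examples = [json.loads(line) for line in dataset_path.read_text().splitlines()]
    # ... call the model on each example, compare predictions to labels ...
    return {"precision": 0.0, "recall": 0.0, "f1": 0.0, "n_examples": len(examples)}

parser = argparse.ArgumentParser(description="Reproducible eval run")
parser.add_argument("--model", required=True)
parser.add_argument("--dataset", required=True)
parser.add_argument("--output", required=True)

# Demo invocation; in production these come from the command line.
args = parser.parse_args(
    ["--model", "v1.2.0", "--dataset", "v1.0", "--output", "results.json"]
)

dataset_path = Path(f"eval_set_{args.dataset}.jsonl")  # assumed naming scheme
dataset_path.write_text('{"input": "ticket text", "label": "billing"}\n')  # stub data

start = time.time()
record = {
    "model": args.model,
    "dataset": args.dataset,
    "dataset_hash": hashlib.sha256(dataset_path.read_bytes()).hexdigest()[:7],
    "metrics": run_eval(args.model, dataset_path),
    "runtime_s": round(time.time() - start, 2),
}
Path(args.output).write_text(json.dumps(record, indent=2))
```

Logging the dataset hash into every results file is the design choice that matters: six months later, the JSON itself proves which frozen dataset produced which numbers.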
Versioning eval sets takes 10 minutes. Reproducing accuracy claims without versioned data takes 3 weeks.
When your CISO asks, "Prove this AI feature is still safe," you'll either show a reproducible eval pipeline—or scramble to recreate the evidence.
Reproducibility isn't a research luxury. It's an enterprise requirement.
Alex Welcing is a Senior AI Product Manager with 1,000+ production commits. He versions evaluation datasets like code because audits don't accept "we tested it once in 2024."