Why this matters
In real NLP systems, data and prompts change constantly: new tickets, new product terms, new compliance phrases, better few-shot examples. Without clear versioning, you cannot reproduce results, debug regressions, or safely roll back. Versioning makes your work auditable, reversible, and collaborative.
- Releases: Ship model changes tied to a specific dataset and prompt version.
- Investigations: Trace a production issue to the exact prompt and data snapshot.
- Compliance: Prove what the model saw and how it was instructed on a given date.
Who this is for
- NLP Engineers who ship models and LLM features into production.
- Data Scientists who run iterative experiments and need reproducibility.
- MLOps/Platform engineers who standardize workflows for teams.
Prerequisites
- Basic command of Git and structured data formats (CSV/Parquet/JSONL).
- Comfort with experiment tracking concepts (runs, metrics, parameters).
- Understanding of prompts: templates, variables, and few-shot examples.
Concept explained simply
Versioning is giving every important artifact a unique, immutable label. For NLP, that means datasets (and splits), schemas/labels, evaluation sets, prompt templates, few-shot examples, and RAG indexes. When anything changes, you bump a version and store a short note describing why.
Mental model
Think of your system as a recipe:
- Ingredients: dataset snapshot, label schema, evaluation set, retrieval index.
- Spices: prompt template, few-shot examples, system message.
- Oven settings: random seed, decoding params (temperature, top-p), tokenizer.
If you label each piece and record them together, anyone can bake the same cake later.
What to version (quick checklist)
- Dataset snapshot and split manifests
- Annotation schema/label taxonomy
- Eval set and scoring script
- Prompt template(s) and variables
- Few-shot example set
- RAG corpus and index build config
- Decoding parameters and random seed
- Code commit that executed the run
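To make the checklist concrete, here is a minimal sketch (in Python) of a single record that carries every item above through a run; the field names and example values are illustrative, not a required schema.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RunManifest:
    """One record per experiment run, covering each item on the checklist above."""
    dataset_version: str   # dataset snapshot + split manifests
    schema_version: str    # annotation schema / label taxonomy
    eval_set_version: str  # eval set + scoring script
    prompt_version: str    # prompt template(s) and variables
    fewshot_version: str   # few-shot example set
    index_version: str     # RAG corpus and index build config
    decoding_params: dict  # temperature, top-p, max tokens, ...
    seed: int              # random seed
    code_commit: str       # git commit that executed the run

# Example values only; use your own artifact names.
manifest = RunManifest(
    dataset_version="support_tickets@1.2.1",
    schema_version="ticket_labels@2.0.0",
    eval_set_version="triage_eval@1.0.0",
    prompt_version="triage_prompt@3.0.0",
    fewshot_version="fewshots/triage@1.1.0",
    index_version="index/kb@1.4.0",
    decoding_params={"temperature": 0.2, "top_p": 0.95},
    seed=42,
    code_commit="<git-sha>",
)
print(json.dumps(asdict(manifest), indent=2))
```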
Core components
- Immutable storage: never overwrite; create a new versioned artifact.
- Semantic versioning: MAJOR.MINOR.PATCH
  - MAJOR: breaking changes (schema/label changes, variable renames).
  - MINOR: backward-compatible additions (more data, new few-shot examples).
  - PATCH: small fixes (typo corrections, annotation bugfixes).
- Provenance: always record who/when/why for each change.
- Lineage: link runs to exact data/prompt versions and parameters.
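These rules are simple enough to encode. Below is a minimal sketch of a version-bump helper plus a provenance entry (who/when/why); the function names and changelog fields are assumptions for illustration, not a standard API.

```python
from datetime import datetime, timezone

def bump_version(version: str, change: str) -> str:
    """Bump a MAJOR.MINOR.PATCH string according to the rules above.

    change: "major" (breaking), "minor" (backward-compatible addition),
            or "patch" (small fix).
    """
    major, minor, patch = (int(p) for p in version.split("."))
    if change == "major":
        return f"{major + 1}.0.0"
    if change == "minor":
        return f"{major}.{minor + 1}.0"
    if change == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

def changelog_entry(artifact: str, new_version: str, author: str, reason: str) -> dict:
    """Provenance: record who/when/why for every version bump."""
    return {
        "artifact": artifact,
        "version": new_version,
        "author": author,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }

# A schema change is breaking, so the bump is MAJOR.
print(bump_version("1.2.1", "major"))  # 2.0.0
```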
Good defaults you can adopt today
- Dataset file/tag format: name-YYYYMMDD-MAJOR.MINOR.PATCH.ext
- Prompt template ID: name@MAJOR.MINOR.PATCH
- Few-shot set ID: fewshots/name@MAJOR.MINOR.PATCH
- RAG index tag: index/collection@MAJOR.MINOR.PATCH
- Experiment record fields: dataset_version, prompt_version, fewshot_version, index_version, code_commit, seed, decoding_params, eval_set_version.
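A small sketch of how these naming defaults can be generated consistently in Python; the helper names are illustrative, and you can adapt the formats to your storage layout.

```python
from datetime import date
from typing import Optional

def dataset_filename(name: str, version: str, ext: str = "jsonl",
                     on: Optional[date] = None) -> str:
    """Dataset file/tag format: name-YYYYMMDD-MAJOR.MINOR.PATCH.ext"""
    on = on or date.today()
    return f"{name}-{on:%Y%m%d}-{version}.{ext}"

def artifact_id(name: str, version: str, namespace: str = "") -> str:
    """ID format: name@MAJOR.MINOR.PATCH, optionally namespaced (fewshots/, index/)."""
    prefix = f"{namespace}/" if namespace else ""
    return f"{prefix}{name}@{version}"

print(dataset_filename("support_tickets", "1.3.0"))   # support_tickets-YYYYMMDD-1.3.0.jsonl
print(artifact_id("triage_prompt", "3.0.0"))           # triage_prompt@3.0.0
print(artifact_id("triage", "1.1.0", "fewshots"))      # fewshots/triage@1.1.0
print(artifact_id("kb", "1.4.0", "index"))             # index/kb@1.4.0
```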
Worked examples
1) Dataset bugfix vs. addition
Scenario: You found 50 mislabeled tickets and corrected them. No schema change.
- Before: support_tickets@1.2.0
- After: support_tickets@1.2.1 (PATCH)
- Changelog: "Fixed 50 mislabeled entries in category=Billing. No new rows."
Later you append 20k new tickets, same schema. That is MINOR: 1.3.0.
2) Prompt template variable rename (breaking)
Original: triage_prompt@2.1.0 uses {{ticket_text}}. You rename to {{text}} and adjust examples.
- Change type: breaking (downstream code must update) → MAJOR
- New version: triage_prompt@3.0.0
- Changelog: "Renamed variable ticket_text→text; updated 3 few-shots for clarity."
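Why the rename is MAJOR can be checked mechanically. The sketch below extracts {{variable}} names from the old and new templates and flags any removed variable as a breaking change; the regex and example templates are illustrative.

```python
import re

VAR_PATTERN = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def template_variables(template: str) -> set:
    """Return the set of {{variable}} names used in a prompt template."""
    return set(VAR_PATTERN.findall(template))

old_template = "Classify this ticket:\n{{ticket_text}}\nLabel:"   # triage_prompt@2.1.0
new_template = "Classify this ticket:\n{{text}}\nLabel:"          # triage_prompt@3.0.0

removed = template_variables(old_template) - template_variables(new_template)
if removed:
    # Callers that still pass ticket_text will break -> bump MAJOR.
    print(f"Breaking change, removed variables: {removed}")  # {'ticket_text'}
```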
3) RAG pipeline snapshot
You rebuild a knowledge base index after adding 100 new articles.
- kb_corpus@1.4.0 → added documents (MINOR)
- kb_index@1.4.0 → rebuilt with the new corpus and same retriever config
- Prompt unchanged: answer_prompt@1.1.2
- Experiment record ties these together so analysis is reproducible.
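A sketch of what that experiment record could look like for this snapshot, reusing the versions above; the key names, the eval set ID, and the retriever settings are illustrative placeholders.

```python
# One record per run; ties corpus, index, prompt, and eval set versions together.
rag_run_record = {
    "purpose": "Re-evaluate answer quality after adding 100 KB articles.",
    "corpus_version": "kb_corpus@1.4.0",
    "index_version": "kb_index@1.4.0",
    "prompt_version": "answer_prompt@1.1.2",       # unchanged
    "retriever_config": {"top_k": 5},              # same retriever config as before
    "eval_set_version": "kb_qa_eval@1.0.0",        # placeholder eval set ID
    "code_commit": "<git-sha>",
    "seed": 42,
    "decoding_params": {"temperature": 0.0, "top_p": 1.0},
}
```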
Step-by-step: Minimal versioning workflow
- Define IDs: Pick canonical names for each artifact (dataset, eval set, prompt, few-shots, index).
- Freeze snapshots: Export immutable files or references; do not overwrite; store with versioned names.
- Apply semantic versioning: Choose MAJOR/MINOR/PATCH on every change; write a one-line reason.
- Record lineage: For each run, capture dataset_version, prompt_version, fewshot_version, index_version, code commit, seed, decoding params, and eval set.
- Automate checks: Block runs if any artifact lacks a version or changelog message (see the sketch after this list).
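A minimal sketch of the freeze and check steps, assuming local files and a JSON-lines changelog; the paths and field names are illustrative, and a real setup would more likely use DVC, lakeFS, Git tags, or an experiment tracker for the same purpose.

```python
import hashlib
import json
import shutil
from datetime import date
from pathlib import Path

def freeze_snapshot(src: Path, name: str, version: str, reason: str,
                    out_dir: Path = Path("artifacts")) -> dict:
    """Copy src to an immutable, versioned file and append a changelog entry."""
    out_dir.mkdir(parents=True, exist_ok=True)
    # Follows the naming default above: name-YYYYMMDD-MAJOR.MINOR.PATCH.ext
    dest = out_dir / f"{name}-{date.today():%Y%m%d}-{version}{src.suffix}"
    if dest.exists():
        raise FileExistsError(f"{dest} already exists; bump the version instead")
    shutil.copy2(src, dest)

    entry = {
        "artifact": f"{name}@{version}",
        "path": str(dest),
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
        "reason": reason,
    }
    with (out_dir / "CHANGELOG.jsonl").open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def check_run_record(record: dict) -> None:
    """Automated check: refuse to start a run missing any required version field."""
    required = ["dataset_version", "prompt_version", "fewshot_version",
                "index_version", "eval_set_version", "code_commit",
                "seed", "decoding_params"]
    missing = [k for k in required if k not in record]
    if missing:
        raise ValueError(f"run blocked, missing fields: {missing}")
```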
Exercises
Do these now; they mirror the tasks below.
Exercise 1: Draft a Data & Prompt Versioning Policy
Create a short policy for a hypothetical "Support Tickets" classification system and its "Triage Prompt" template.
- Name the artifacts and propose semantic versioning rules.
- Write example changelog entries for a patch, minor, and major change.
- Specify storage naming conventions.
Exercise 2: Create a Reproducibility Record
Fill in a run record that someone else can use to reproduce your experiment.
- Include dataset_version, prompt_version, fewshot_version, index_version.
- Add code_commit (placeholder), seed, decoding params, and eval_set_version.
- Write a one-sentence purpose for the run.
Self-check checklist
- You used MAJOR/MINOR/PATCH consistently.
- Every artifact is immutable and independently addressable.
- Your run record can be executed by a teammate without asking questions.
- You avoided ambiguous labels like "latest" or "final".
Common mistakes and how to self-check
- Overwriting files: Instead, write a new artifact with a bumped version. Self-check: can you restore last week’s state bit-for-bit?
- Not freezing splits: Keep explicit train/dev/test manifests per version. Self-check: do your split files have stable IDs?
- Mixing eval data with training: Version eval sets separately. Self-check: eval set versions appear in runs and never overlap with training versions.
- Changing a prompt without a version bump: Even small wording changes can matter. Self-check: does every prompt in production have an explicit version?
- Forgetting seeds and decoding params: Self-check: do run records include seed, temperature, top-p, and tokenizer?
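For the first self-check, one option is to recompute each frozen artifact's hash and compare it to the hash recorded at freeze time; a mismatch means something overwrote the file. This sketch assumes the illustrative CHANGELOG.jsonl format used earlier.

```python
import hashlib
import json
from pathlib import Path

def verify_frozen(changelog: Path = Path("artifacts/CHANGELOG.jsonl")) -> list:
    """Return artifacts whose current content no longer matches the recorded hash."""
    tampered = []
    for line in changelog.read_text().splitlines():
        entry = json.loads(line)
        path = Path(entry["path"])
        current = hashlib.sha256(path.read_bytes()).hexdigest() if path.exists() else None
        if current != entry["sha256"]:
            tampered.append(entry["artifact"])
    return tampered
```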
Practical projects
- Build a versioned dataset packer: given a raw CSV, produce a frozen JSONL snapshot and a split manifest with a semantic version and changelog.
- Prompt registry: store prompt templates and few-shot sets with versions and validations for required variables.
- RAG snapshot tool: given a corpus directory, build an index and emit index_version metadata tied to corpus_version and retriever config.
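As a starting point for the prompt registry project, here is a minimal sketch of the core idea: templates keyed by name@version, with a check that required variables are present before a version is accepted. The class and method names are illustrative; a real registry would persist to Git or a database.

```python
import re

class PromptRegistry:
    """In-memory prompt registry with required-variable validation."""

    def __init__(self):
        self._templates = {}

    def register(self, name: str, version: str, template: str, required_vars: set) -> str:
        found = set(re.findall(r"\{\{\s*(\w+)\s*\}\}", template))
        missing = required_vars - found
        if missing:
            raise ValueError(f"template missing required variables: {missing}")
        key = f"{name}@{version}"
        if key in self._templates:
            raise ValueError(f"{key} already registered; bump the version")
        self._templates[key] = template
        return key

    def get(self, name: str, version: str) -> str:
        return self._templates[f"{name}@{version}"]

registry = PromptRegistry()
registry.register("triage_prompt", "3.0.0",
                  "Classify this ticket:\n{{text}}\nLabel:", {"text"})
```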
Learning path
- Adopt naming and semantic versioning rules for data and prompts.
- Freeze your current dataset, eval set, and prompts as baseline v1.0.0.
- Instrument run logging to capture all artifact versions and params.
- Add automated checks to prevent unversioned runs.
- Scale to RAG and multi-prompt systems with consistent IDs.
Next steps
- Standardize a team-wide changelog format.
- Add scheduled evaluations comparing current production vs. baseline artifact versions.
- Automate release notes that include data/prompt/index versions and key metrics.
Mini challenge
You maintain an intent classifier and a response-generation prompt. Yesterday, customer intent labels were split ("Billing" → "Billing-General" and "Refunds"). You also added two few-shot examples to the generation prompt but did not change variables.
- Propose new versions for: label schema, dataset snapshot, classifier prompt, generation prompt, and eval set.
- Write one-sentence changelog entries for each.
- Ensure your decisions follow semantic versioning logic.