A Coding Implementation for Building Robust Prompt Versioning and Regression Testing Workflows for Large Language Models using MLflow

In this tutorial, we show how we treat prompts as first-class, versioned artifacts and apply robust regression testing to the behavior of a large language model using MLflow. We design an evaluation pipeline that captures prompt versions, prompt variations, model outputs, and multiple quality metrics in a fully reproducible manner. By combining classical text metrics with semantic similarity and automatic regression flags, we show how we can systematically detect performance degradation caused by seemingly small prompt changes. Throughout, we focus on creating a workflow that mirrors real software engineering practice, applied to prompt engineering and LLM evaluation. Check out the FULL CODES here.
!pip -q install -U "openai>=1.0.0" mlflow rouge-score nltk sentence-transformers scikit-learn pandas
import os, json, time, difflib, re
from typing import List, Dict, Any, Tuple
import mlflow
import pandas as pd
import numpy as np
from openai import OpenAI
from rouge_score import rouge_scorer
import nltk
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)
if not os.getenv("OPENAI_API_KEY"):
    try:
        from google.colab import userdata  # type: ignore
        k = userdata.get("OPENAI_API_KEY")
        if k:
            os.environ["OPENAI_API_KEY"] = k
    except Exception:
        pass
if not os.getenv("OPENAI_API_KEY"):
    import getpass
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter OPENAI_API_KEY (input hidden): ").strip()
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY is required."
We set up the development environment by installing all the necessary dependencies and importing the libraries used throughout the tutorial. We securely load the OpenAI API key at runtime, ensuring that credentials are never hard-coded in the notebook. We also download the required NLTK resources so that the evaluation pipeline runs reliably across environments.
MODEL = "gpt-4o-mini"
TEMPERATURE = 0.2
MAX_OUTPUT_TOKENS = 250
ABS_SEM_SIM_MIN = 0.78
DELTA_SEM_SIM_MAX_DROP = 0.05
DELTA_ROUGE_L_MAX_DROP = 0.08
DELTA_BLEU_MAX_DROP = 0.10
mlflow.set_tracking_uri("file:/content/mlruns")
mlflow.set_experiment("prompt_versioning_llm_regression")
client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")
EVAL_SET = [
    {
        "id": "q1",
        "input": "Summarize in one sentence: MLflow tracks experiments, runs, parameters, metrics, and artifacts.",
        "reference": "MLflow helps track machine learning experiments by logging runs with parameters, metrics, and artifacts."
    },
    {
        "id": "q2",
        "input": "Rewrite professionally: 'this model is kinda slow but it works ok.'",
        "reference": "The model is somewhat slow, but it performs reliably."
    },
    {
        "id": "q3",
        "input": "Extract key fields as JSON: 'Order 5531 by Alice costs $42.50 and ships to Toronto.'",
        "reference": '{"order_id":"5531","customer":"Alice","amount_usd":42.50,"city":"Toronto"}'
    },
    {
        "id": "q4",
        "input": "Answer briefly: What is prompt regression testing?",
        "reference": "Prompt regression testing checks whether prompt changes degrade model outputs compared to a baseline."
    },
]
PROMPTS = [
    {
        "version": "v1_baseline",
        "prompt": (
            "You are a precise assistant.\n"
            "Follow the user request carefully.\n"
            "If asked for JSON, output valid JSON only.\n"
            "User: {user_input}"
        )
    },
    {
        "version": "v2_formatting",
        "prompt": (
            "You are a helpful, structured assistant.\n"
            "Respond clearly and concisely.\n"
            "Prefer clean formatting.\n"
            "User request: {user_input}"
        )
    },
    {
        "version": "v3_guardrailed",
        "prompt": (
            "You are a rigorous assistant.\n"
            "Rules:\n"
            "1) If user asks for JSON, output ONLY valid minified JSON.\n"
            "2) Otherwise, keep the answer short and factual.\n"
            "User: {user_input}"
        )
    },
]
We define the evaluation configuration, including model parameters, regression thresholds, and the MLflow tracking location. We create an evaluation dataset and explicitly declare multiple prompt versions to compare and test for regressions. By centralizing these definitions, we ensure that prompt changes and evaluation logic remain controlled and repeatable.
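As a quick sanity check (an illustrative snippet, not part of the logged pipeline; the `rendered` variable is ours), we can format one prompt version against one evaluation example to confirm that the `{user_input}` placeholder resolves correctly:

# Illustrative check (not logged to MLflow): render the v1 prompt for the first eval example
rendered = PROMPTS[0]["prompt"].format(user_input=EVAL_SET[0]["input"])
print(rendered)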
def call_llm(formatted_prompt: str) -> str:
    resp = client.responses.create(
        model=MODEL,
        input=formatted_prompt,
        temperature=TEMPERATURE,
        max_output_tokens=MAX_OUTPUT_TOKENS,
    )
    out = getattr(resp, "output_text", None)
    if out:
        return out.strip()
    try:
        texts = []
        for item in resp.output:
            if getattr(item, "type", "") == "message":
                for c in item.content:
                    if getattr(c, "type", "") in ("output_text", "text"):
                        texts.append(getattr(c, "text", ""))
        return "\n".join(texts).strip()
    except Exception:
        return ""
smooth = SmoothingFunction().method3
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
def safe_tokenize(s: str) -> List[str]:
    s = (s or "").strip().lower()
    if not s:
        return []
    try:
        return nltk.word_tokenize(s)
    except LookupError:
        return re.findall(r"\b\w+\b", s)
def bleu_score(ref: str, hyp: str) -> float:
    r = safe_tokenize(ref)
    h = safe_tokenize(hyp)
    if len(h) == 0 or len(r) == 0:
        return 0.0
    return float(sentence_bleu([r], h, smoothing_function=smooth))
def rougeL_f1(ref: str, hyp: str) -> float:
    scores = rouge.score(ref or "", hyp or "")
    return float(scores["rougeL"].fmeasure)
def semantic_sim(ref: str, hyp: str) -> float:
    embs = embedder.encode([ref or "", hyp or ""], normalize_embeddings=True)
    return float(cosine_similarity([embs[0]], [embs[1]])[0][0])
We implement the core LLM call and the evaluation metrics used to judge output quality. We compute BLEU, ROUGE-L, and semantic similarity scores to capture both surface-level overlap and semantic differences in model outputs. This lets us assess prompt changes from multiple perspectives rather than relying on a single metric.
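As a small illustration (a minimal sketch, assuming the metric helpers above are defined in the session; the toy strings are ours), we can score a reference/hypothesis pair locally without calling the API:

# Toy example to exercise the metric helpers locally
ref = "MLflow helps track machine learning experiments."
hyp = "MLflow is used to track ML experiments and their artifacts."
print("BLEU:", round(bleu_score(ref, hyp), 3))
print("ROUGE-L F1:", round(rougeL_f1(ref, hyp), 3))
print("Semantic similarity:", round(semantic_sim(ref, hyp), 3))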
def evaluate_prompt(prompt_template: str) -> Tuple[pd.DataFrame, Dict[str, float], str]:
    rows = []
    for ex in EVAL_SET:
        p = prompt_template.format(user_input=ex["input"])
        y = call_llm(p)
        ref = ex["reference"]
        rows.append({
            "id": ex["id"],
            "input": ex["input"],
            "reference": ref,
            "output": y,
            "bleu": bleu_score(ref, y),
            "rougeL_f1": rougeL_f1(ref, y),
            "semantic_sim": semantic_sim(ref, y),
        })
    df = pd.DataFrame(rows)
    agg = {
        "bleu_mean": float(df["bleu"].mean()),
        "rougeL_f1_mean": float(df["rougeL_f1"].mean()),
        "semantic_sim_mean": float(df["semantic_sim"].mean()),
    }
    outputs_jsonl = "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)
    return df, agg, outputs_jsonl
def log_text_artifact(text: str, artifact_path: str):
    mlflow.log_text(text, artifact_path)
def prompt_diff(old: str, new: str) -> str:
    a = old.splitlines(keepends=True)
    b = new.splitlines(keepends=True)
    return "".join(difflib.unified_diff(a, b, fromfile="previous_prompt", tofile="current_prompt"))
def compute_regression_flags(baseline: Dict[str, float], current: Dict[str, float]) -> Dict[str, Any]:
    d_sem = baseline["semantic_sim_mean"] - current["semantic_sim_mean"]
    d_rouge = baseline["rougeL_f1_mean"] - current["rougeL_f1_mean"]
    d_bleu = baseline["bleu_mean"] - current["bleu_mean"]
    flags = {
        "abs_semantic_fail": current["semantic_sim_mean"] < ABS_SEM_SIM_MIN,
        "drop_semantic_fail": d_sem > DELTA_SEM_SIM_MAX_DROP,
        "drop_rouge_fail": d_rouge > DELTA_ROUGE_L_MAX_DROP,
        "drop_bleu_fail": d_bleu > DELTA_BLEU_MAX_DROP,
        "delta_semantic": float(d_sem),
        "delta_rougeL": float(d_rouge),
        "delta_bleu": float(d_bleu),
    }
    flags["regression"] = any([flags["abs_semantic_fail"], flags["drop_semantic_fail"], flags["drop_rouge_fail"], flags["drop_bleu_fail"]])
    return flags
We build the evaluation and regression logic that runs each prompt version against the evaluation set and aggregates the results. We log prompt artifacts, prompt diffs, and evaluation outputs to MLflow, ensuring that every run remains fully reproducible. We also compute regression flags that automatically indicate whether a prompt version degrades performance compared to the baseline. Check out the FULL CODES here.
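To make the thresholds concrete, here is a minimal sketch of how the regression flags behave; the aggregate metric values below are hypothetical and only chosen to trip the semantic-similarity drop threshold:

# Hypothetical aggregate metrics to illustrate the flag logic (not from an actual run)
baseline_example = {"bleu_mean": 0.32, "rougeL_f1_mean": 0.48, "semantic_sim_mean": 0.86}
current_example = {"bleu_mean": 0.30, "rougeL_f1_mean": 0.45, "semantic_sim_mean": 0.79}
print(compute_regression_flags(baseline_example, current_example))
# The semantic drop (0.07) exceeds DELTA_SEM_SIM_MAX_DROP (0.05), so
# "drop_semantic_fail" and the overall "regression" flag come back True.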
print("Running prompt versioning + regression testing with MLflow...")
print(f"Tracking URI: {mlflow.get_tracking_uri()}")
print(f"Experiment: {mlflow.get_experiment_by_name('prompt_versioning_llm_regression').name}")
run_summary = []
baseline_metrics = None
baseline_prompt = None
baseline_df = None
baseline_metrics_name = None
with mlflow.start_run(run_name=f"prompt_regression_suite_{int(time.time())}") as parent_run:
mlflow.set_tag("task", "prompt_versioning_regression_testing")
mlflow.log_param("model", MODEL)
mlflow.log_param("temperature", TEMPERATURE)
mlflow.log_param("max_output_tokens", MAX_OUTPUT_TOKENS)
mlflow.log_param("eval_set_size", len(EVAL_SET))
for pv in PROMPTS:
ver = pv["version"]
prompt_t = pv["prompt"]
with mlflow.start_run(run_name=ver, nested=True) as child_run:
mlflow.log_param("prompt_version", ver)
log_text_artifact(prompt_t, f"prompts/{ver}.txt")
if baseline_prompt is not None and baseline_metrics_name is not None:
diff = prompt_diff(baseline_prompt, prompt_t)
log_text_artifact(diff, f"prompt_diffs/{baseline_metrics_name}_to_{ver}.diff")
else:
log_text_artifact("BASELINE_PROMPT (no diff)", f"prompt_diffs/{ver}.diff")
df, agg, outputs_jsonl = evaluate_prompt(prompt_t)
mlflow.log_dict(agg, f"metrics/{ver}_agg.json")
log_text_artifact(outputs_jsonl, f"outputs/{ver}_outputs.jsonl")
mlflow.log_metric("bleu_mean", agg["bleu_mean"])
mlflow.log_metric("rougeL_f1_mean", agg["rougeL_f1_mean"])
mlflow.log_metric("semantic_sim_mean", agg["semantic_sim_mean"])
if baseline_metrics is None:
baseline_metrics = agg
baseline_prompt = prompt_t
baseline_df = df
baseline_metrics_name = ver
flags = {"regression": False, "delta_bleu": 0.0, "delta_rougeL": 0.0, "delta_semantic": 0.0}
mlflow.set_tag("regression", "false")
else:
flags = compute_regression_flags(baseline_metrics, agg)
mlflow.log_metric("delta_bleu", flags["delta_bleu"])
mlflow.log_metric("delta_rougeL", flags["delta_rougeL"])
mlflow.log_metric("delta_semantic", flags["delta_semantic"])
mlflow.set_tag("regression", str(flags["regression"]).lower())
for k in ["abs_semantic_fail","drop_semantic_fail","drop_rouge_fail","drop_bleu_fail"]:
mlflow.set_tag(k, str(flags[k]).lower())
run_summary.append({
"prompt_version": ver,
"bleu_mean": agg["bleu_mean"],
"rougeL_f1_mean": agg["rougeL_f1_mean"],
"semantic_sim_mean": agg["semantic_sim_mean"],
"delta_bleu_vs_baseline": float(flags.get("delta_bleu", 0.0)),
"delta_rougeL_vs_baseline": float(flags.get("delta_rougeL", 0.0)),
"delta_semantic_vs_baseline": float(flags.get("delta_semantic", 0.0)),
"regression_flag": bool(flags["regression"]),
"mlflow_run_id": child_run.info.run_id,
})
summary_df = pd.DataFrame(run_summary).sort_values("prompt_version")
print("n=== Aggregated Results (higher is better) ===")
display(summary_df)
regressed = summary_df[summary_df["regression_flag"] == True]
if len(regressed) > 0:
print("n🚩 Regressions detected:")
display(regressed[["prompt_version","delta_bleu_vs_baseline","delta_rougeL_vs_baseline","delta_semantic_vs_baseline","mlflow_run_id"]])
else:
print("n✅ No regressions detected under current thresholds.")
if len(regressed) > 0 and baseline_df is not None:
worst_ver = regressed.sort_values("delta_semantic_vs_baseline", ascending=False).iloc[0]["prompt_version"]
worst_prompt = next(p["prompt"] for p in PROMPTS if p["version"] == worst_ver)
worst_df, _, _ = evaluate_prompt(worst_prompt)
merged = baseline_df[["id","output","bleu","rougeL_f1","semantic_sim"]].merge(
worst_df[["id","output","bleu","rougeL_f1","semantic_sim"]],
on="id",
suffixes=("_baseline", f"_{worst_ver}")
)
merged["delta_semantic"] = merged["semantic_sim_baseline"] - merged[f"semantic_sim_{worst_ver}"]
merged["delta_rougeL"] = merged["rougeL_f1_baseline"] - merged[f"rougeL_f1_{worst_ver}"]
merged["delta_bleu"] = merged["bleu_baseline"] - merged[f"bleu_{worst_ver}"]
print(f"n=== Per-example deltas: baseline vs {worst_ver} (positive delta = worse) ===")
display(
merged[["id","delta_semantic","delta_rougeL","delta_bleu","output_baseline",f"output_{worst_ver}"]]
.sort_values("delta_semantic", ascending=False)
)
print("nOpen MLflow UI (optional) by running:")
print("!mlflow ui --backend-store-uri file:/content/mlruns --host 0.0.0.0 --port 5000")
We orchestrate the full regression-testing workflow using nested MLflow runs. We compare each prompt version against the baseline, log metric deltas, and record regression results in a structured summary table. This turns ad-hoc prompt iteration into an engineering-grade prompt versioning and regression-testing process that we can scale to larger datasets and real-world applications.
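Once the suite has run, a minimal sketch like the following (assuming the same local tracking URI and experiment name used above) shows how the logged runs could be queried programmatically, for example from a CI job:

# Sketch: query logged runs and pick out those tagged as regressions
import mlflow
mlflow.set_tracking_uri("file:/content/mlruns")
runs = mlflow.search_runs(experiment_names=["prompt_versioning_llm_regression"])
if "tags.regression" in runs.columns:
    regressions = runs[runs["tags.regression"] == "true"]
    print(regressions[["run_id", "params.prompt_version", "metrics.semantic_sim_mean"]])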
In conclusion, we have built a practical framework for prompt versioning and regression testing that enables us to evaluate LLM behavior rigorously and transparently. We showed how MLflow lets us track prompt evolution, compare results across versions, and automatically flag regressions based on well-defined thresholds. This approach helps us move from ad-hoc prompt tweaking to measurable, repeatable testing. By adopting this workflow, we ensure that prompt updates improve model behavior intentionally instead of introducing hidden performance regressions.