Domain 2: Implement Machine Learning Model Lifecycle and Operations

MLflow: Track Every Experiment

If you can't track it, you can't reproduce it. Master MLflow experiment tracking: log metrics, parameters, and artifacts so every experiment is fully traceable.

What is MLflow?

Simple explanation

MLflow is like a lab notebook that writes itself.

In a science lab, you record every experiment: what you mixed, how much, what temperature, and what happened. Without notes, you can’t repeat a success or understand a failure.

MLflow does this for ML experiments, either automatically (with autologging) or through a few explicit logging calls. Every time you train a model, it can record the settings you used (parameters), how well the model performed (metrics), and the files the run produced, such as the model itself (artifacts). Weeks later, you can look up “which run got 94% accuracy?” and trace back to the exact configuration, code, and data.

MLflow concepts

| Concept | What It Is | Example |
| --- | --- | --- |
| Experiment | A named group of related runs | “churn-prediction-v2” |
| Run | A single execution of a training script | One training job with specific hyperparameters |
| Parameter | An input configuration value | learning_rate=0.01, n_estimators=100 |
| Metric | A measured output value | accuracy=0.94, loss=0.12 |
| Artifact | A file produced by the run | model.pkl, feature_importance.png, confusion_matrix.json |
| Tag | Metadata label | “team=nlp”, “sprint=q2”, “git_commit=abc123” |

Logging with MLflow in Azure ML

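When you run a training script as an Azure ML job, MLflow tracking is preconfigured: the tracking URI already points at your workspace, provided the job’s environment includes the mlflow and azureml-mlflow packages. Here’s how to use it: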

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# In an Azure ML job, MLflow auto-connects to your workspace (no server setup needed)
# X_train, X_test, y_train, y_test are assumed to be prepared earlier in the script

# Start a run
with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters (inputs)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("dataset_version", "v2")

    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log metrics (outputs)
    predictions = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, predictions))
    mlflow.log_metric("f1_score", f1_score(y_test, predictions))

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, "churn-model")

    # Log additional artifacts (the file must already exist on disk)
    mlflow.log_artifact("feature_importance.png")

What’s happening:

  • mlflow.start_run(run_name="rf-baseline"): Opens a named run; everything logged inside the with block is grouped together
  • mlflow.log_param(...): Records the input configuration so you can reproduce this exact setup
  • mlflow.log_metric(...): Records how well the model performed
  • mlflow.sklearn.log_model(...): Saves the model in MLflow’s standard format (can be deployed to any MLflow-compatible platform)
  • mlflow.log_artifact(...): Saves additional files (charts, reports) alongside the model

Scenario: Dr. Luca's reproducibility rescue

Dr. Luca Bianchi at GenomeVault ran 47 experiments over three weeks. His colleague asks: “Which run produced the best F1 score, and can we reproduce it?”

Without MLflow: “Um, I think it was the one on Tuesday… let me check my notebooks…”

With MLflow:

# Find the best run in the experiment, ranked by F1 score
runs = mlflow.search_runs(
    experiment_names=["genomics-variant-calling"],
    order_by=["metrics.f1_score DESC"],
    max_results=1
)
print(runs[["run_id", "params.model_type", "metrics.f1_score"]])

Result: Run abc123, model_type=gradient_boost, F1=0.967. Every parameter, the exact code commit (via Git tag), and the trained model are all traceable.

Prof. Sarah Lin: “This is exactly the kind of rigour we need for our publications.”

Autologging

MLflow can automatically log parameters and metrics for popular frameworks, with no manual log_param calls needed:

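import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier

# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Just train the model; MLflow captures hyperparameters, training metrics, and the model
# (X_train and y_train are assumed to be prepared earlier in the script)
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)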

Supported frameworks for autologging:

| Framework | What’s Auto-Logged |
| --- | --- |
| scikit-learn | All hyperparameters, metrics (accuracy, F1, etc.), model artifact |
| PyTorch / PyTorch Lightning | Loss per epoch, learning rate, model weights |
| TensorFlow / Keras | Epoch metrics, optimizer config, model architecture |
| XGBoost / LightGBM | Boosting params, feature importance, eval metrics |
| Spark ML | Pipeline stages, evaluator metrics |

Exam tip: Autologging vs manual logging

Autologging is convenient but logs EVERYTHING. For production pipelines, manual logging gives you control over exactly what’s tracked.

The exam may ask when to use each:

  • Autologging: exploration, prototyping, when you want comprehensive tracking with no code changes
  • Manual logging: production pipelines, when you need specific metrics or custom artifacts

Comparing runs

One of MLflow’s most powerful features is comparing runs side by side:

# Search and compare runs
import mlflow

runs = mlflow.search_runs(
    experiment_names=["churn-prediction-v2"],
    filter_string="metrics.accuracy > 0.90",
    order_by=["metrics.f1_score DESC"]
)

# View top runs
print(runs[["run_id", "params.n_estimators", "params.max_depth",
            "metrics.accuracy", "metrics.f1_score"]].head(5))

What’s happening:

  • filter_string: Keeps only runs with accuracy above 0.90 (90%)
  • order_by: Sorts by F1 score (descending), so the best runs come first
  • The final print: Shows the key parameters and metrics for comparison

In the Azure ML Studio UI, you can also visually compare runs β€” select multiple runs and view metrics in parallel charts, scatter plots, or tables.

Scenario: Kai compares 200 sweep runs

Kai just ran a hyperparameter sweep with 200 trials (covered in Module 7). Now he needs to find the best model.

# Find the top 5 runs from the sweep
best_runs = mlflow.search_runs(
    experiment_names=["churn-sweep-apr-2026"],
    order_by=["metrics.f1_score DESC"],
    max_results=5
)

# Report the winner to the team
winner = best_runs.iloc[0]
print(f"Best run: {winner.run_id}")
print(f"  F1: {winner['metrics.f1_score']:.4f}")
print(f"  Learning rate: {winner['params.learning_rate']}")
print(f"  Max depth: {winner['params.max_depth']}")

Priya (CTO): “Which model do we ship?”

Kai: “Run 7f3a2b1, with an F1 of 0.9612 at learning_rate=0.03 and max_depth=8.”

Key terms flashcards

Question: What are the three things MLflow tracks for every run?

Answer: Parameters (inputs like hyperparameters), Metrics (outputs like accuracy/loss), and Artifacts (files like model weights, charts, reports).

Question: Do you need a separate MLflow server with Azure ML?

Answer: No. Azure ML workspaces include a built-in MLflow tracking server. Your workspace IS the tracking server. No additional setup needed.

Question: What is MLflow autologging?

Answer: A feature that automatically logs parameters, metrics, and models for popular frameworks (sklearn, PyTorch, TensorFlow), with no manual log_param/log_metric calls needed.

Question: How do you find the best run across many experiments?

Answer: mlflow.search_runs() with filter_string and order_by. Example: filter accuracy > 0.90, order by F1 score descending.

Knowledge check

Dr. Luca ran 47 experiments over three weeks. His colleague asks which run produced the best F1 score. What tool should Luca use?

Kai wants comprehensive experiment tracking with minimal code changes during early prototyping. What should he enable?


Next up: AutoML & Hyperparameter Tuning, letting Azure find the best model for you.