Domain 2 β€” Module 4 of 8 50%
9 of 25 overall
Domain 2: Implement Machine Learning Model Lifecycle and Operations Free ⏱ ~12 min read

Distributed Training: Scale to Big Data

When your data or model doesn't fit on one machine, distribute it. Learn data parallelism, model parallelism, and how to configure distributed training in Azure ML.

Why distributed training?

Simple explanation

Imagine cooking dinner for 1,000 people.

One chef in one kitchen? Impossible. Instead, you split the work: one team preps vegetables (data parallel β€” same recipe, different ingredients). Or for a massive wedding cake, different teams build different layers at the same time (model parallel β€” different parts of the same thing).

Distributed training splits the work across multiple GPUs or machines so training that would take days on one machine takes hours across many.

Data parallelism vs model parallelism

Two distributed training strategies
FeatureHow It WorksWhen To UseScaling Limit
Data ParallelismCopy the model to each GPU; split the dataset across GPUs. Each GPU processes a mini-batch; gradients are averaged.Dataset is large but model fits on one GPU. Most common approach.Limited by gradient synchronization overhead at high node counts.
Model ParallelismSplit the model across GPUs β€” each GPU holds different layers or parameters.Model is too large for one GPU (large language models, foundation models).Complex setup; inter-GPU communication is the bottleneck.

Configuring distributed training in Azure ML

Data parallelism with PyTorch

from azure.ai.ml import command
from azure.ai.ml.entities import ResourceConfiguration

# Define a distributed training job
distributed_job = command(
    code="./src",
    command="python train_distributed.py --epochs 50",
    environment="azureml:pytorch-training:1",
    compute="gpu-training-cluster",
    distribution={
        "type": "PyTorch",
        "process_count_per_instance": 4,  # 4 GPUs per node
    },
    resources=ResourceConfiguration(
        instance_count=2,  # 2 nodes = 8 GPUs total
    ),
)

returned_job = ml_client.jobs.create_or_update(distributed_job)

What’s happening:

  • Lines 10-12: PyTorch distribution type β€” Azure ML sets up the communication backend automatically
  • Line 12: 4 processes per node (one per GPU)
  • Lines 14-15: 2 nodes with 4 GPUs each = 8 GPUs total processing data in parallel
  • Azure ML handles: node coordination, environment setup, distributed communication (NCCL backend)

The training script (data parallel)

# train_distributed.py
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Azure ML sets environment variables for distributed setup
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device(f"cuda:{local_rank}")

# Wrap model in DDP
model = MyModel().to(device)
model = DDP(model, device_ids=[local_rank])

# DistributedSampler splits data across GPUs
sampler = torch.utils.data.distributed.DistributedSampler(dataset)
dataloader = DataLoader(dataset, sampler=sampler, batch_size=32)

# Training loop β€” same as single GPU, DDP handles gradient sync
for epoch in range(50):
    sampler.set_epoch(epoch)
    for batch in dataloader:
        # Forward, backward, step β€” DDP synchronizes automatically
        ...

What’s happening:

  • Line 7: Initialises distributed communication (Azure ML sets the environment variables)
  • Line 8: Each process gets assigned a GPU via LOCAL_RANK
  • Line 12: DistributedDataParallel wraps the model β€” handles gradient averaging across GPUs
  • Line 15: DistributedSampler ensures each GPU gets a different portion of the dataset
Scenario: Dr. Luca trains on genomics data across 16 GPUs

GenomeVault has a variant-calling model that takes 3 days to train on a single A100 GPU. The dataset is 2TB of genomic sequences.

Dr. Luca’s distributed training setup:

  • 4 nodes, each with 4 A100 GPUs = 16 GPUs total
  • Data parallelism β€” the model fits on one GPU but the dataset is massive
  • DistributedSampler splits the 2TB across all 16 GPUs
  • Training time: 3 days β†’ ~5 hours (14x speedup β€” not perfect linear due to communication overhead)

Prof. Sarah Lin: β€œWe can now iterate on model architectures weekly instead of monthly.”

Distributed training with TensorFlow/Horovod

distributed_job = command(
    code="./src",
    command="python train_tensorflow.py",
    environment="azureml:tensorflow-training:1",
    compute="gpu-training-cluster",
    distribution={
        "type": "TensorFlow",
        "worker_count": 8,  # 8 workers total
    },
    resources=ResourceConfiguration(
        instance_count=2,
    ),
)

For MPI-based frameworks (Horovod):

distribution={
    "type": "Mpi",
    "process_count_per_instance": 4,
}
Exam tip: Distribution types

The exam expects you to know which distribution type to use:

  • PyTorch: type: "PyTorch" β€” uses torch.distributed with NCCL backend
  • TensorFlow: type: "TensorFlow" β€” uses tf.distribute.Strategy
  • MPI: type: "Mpi" β€” for Horovod or custom MPI-based frameworks

The most common exam scenario involves PyTorch data parallelism with DistributedDataParallel.

Key considerations for distributed training

FactorImpactRecommendation
Batch sizeScale batch size with GPU count (linear scaling rule)If single GPU uses batch_size=32, 8 GPUs use batch_size=256
Learning rateScale with batch size or use warm-upLinear scaling rule: 8x batch = 8x learning rate (with warm-up)
CommunicationGradient sync is the bottleneckUse InfiniBand-enabled VMs (ND-series) for multi-node
CheckpointingSave model regularly in case of failuresEssential for multi-hour jobs β€” checkpoint every N epochs

Key terms flashcards

Question

Data parallelism vs model parallelism?

Click or press Enter to reveal answer

Answer

Data parallelism: copy model to each GPU, split dataset. Most common, best when model fits on one GPU. Model parallelism: split model across GPUs. Required when model is too large for one GPU.

Click to flip back

Question

What is DistributedDataParallel (DDP) in PyTorch?

Click or press Enter to reveal answer

Answer

A wrapper that replicates the model on each GPU and automatically synchronizes gradients during backpropagation. The standard way to do data-parallel training in PyTorch.

Click to flip back

Question

Why checkpoint during distributed training?

Click or press Enter to reveal answer

Answer

Multi-hour/multi-day jobs across multiple nodes are vulnerable to failures (node crashes, pre-emption). Checkpointing saves progress so training can resume from the last checkpoint instead of starting over.

Click to flip back

Knowledge check

Knowledge Check

Dr. Luca has a model that fits on a single GPU but a 2TB dataset. Training takes 3 days on one GPU. He has a cluster with 4 nodes, each with 4 GPUs. What strategy should he use?

Knowledge Check

Kai is configuring a PyTorch distributed training job in Azure ML. He needs 2 nodes with 4 GPUs each. What distribution configuration should he use?


Next up: Model Registration & Versioning β€” from experiment to production-ready artifact.