Visualizing Parallelism in Transformers
Simplicity Buried in Abstractions
I've always loved the "Transformer Accounting" diagram from the JAX Scaling Book. It did a brilliant job of making the tensor shapes of a Transformer intuitive on a single device. But as we scale up, the complexity shifts: we stop worrying about just matrix dimensions and start worrying about the "alphabet soup" of N-D parallelism (DP, TP, SP, CP, EP). Here is the irony: the core ideas behind these parallelisms are fundamentally simple. Conceptually, we are just decomposing a global tensor operation into local compute chunks connected by communication collectives. It's like an assembly line: instead of one worker building the whole car, we have a line of workers (GPUs) passing parts (tensors) back and forth. ...
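To make that concrete, here is a minimal sketch (my own illustration, assuming PyTorch and a single process that simulates two "devices"): a global matmul is decomposed into shard-local matmuls, and the sum over partial results stands in for the collective that would connect the devices.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 8)   # activations: [batch, hidden]
W = torch.randn(8, 16)  # weight:      [hidden, out]

# Global (single-device) result.
Y_global = X @ W

# Decompose across 2 simulated devices: shard the contraction dimension,
# do local partial matmuls, then sum the partials -- that sum is exactly
# the job an all-reduce collective would do across real GPUs.
X_shards = X.chunk(2, dim=1)   # each [4, 4]
W_shards = W.chunk(2, dim=0)   # each [4, 16]
partials = [x @ w for x, w in zip(X_shards, W_shards)]  # local compute
Y_parallel = sum(partials)                              # "all-reduce"

assert torch.allclose(Y_global, Y_parallel, atol=1e-5)
```

In a real tensor-parallel layer the two shards would live on different GPUs and the `sum` would be a `torch.distributed.all_reduce`, but the arithmetic identity is the same.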
Pipeline Parallelism Demystified
In our book Efficient PyTorch, we gave a quick overview of the main sharding strategies used in large-scale distributed training: data parallelism, tensor parallelism, pipeline parallelism, and a few Transformer-specific ones like expert parallelism and context parallelism. Pipeline parallelism? It barely got a page. At the time, I thought it was too intuitive to need much detail. Then last week, I tried explaining all the existing schedules in a clean, logical way, and completely hit a wall. ...
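To show why the basic idea feels so intuitive, here is a toy sketch (my own illustration, not from the book) of a GPipe-style forward schedule: the model is split into stages, the batch into microbatches, and at clock tick t stage s works on microbatch t - s, so work fills the pipeline along a diagonal.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    def forward(self, x):
        # Stand-in for this stage's layers running on its own device.
        return f"{self.name}({x})"

stages = [Stage("stage0"), Stage("stage1"), Stage("stage2")]
microbatches = ["mb0", "mb1", "mb2", "mb3"]

# GPipe-style forward pass: at tick t, stage s processes microbatch t - s.
num_ticks = len(stages) + len(microbatches) - 1
activations = {mb: mb for mb in microbatches}
for t in range(num_ticks):
    busy = []
    for s, stage in enumerate(stages):
        m = t - s
        if 0 <= m < len(microbatches):
            mb = microbatches[m]
            activations[mb] = stage.forward(activations[mb])
            busy.append(f"{stage.name}:{mb}")
    print(f"tick {t}: " + ", ".join(busy))
```

Reading the output tick by tick, the warm-up and drain phases (ticks where fewer than all three stages are busy) are exactly the pipeline "bubble" that the more elaborate schedules try to shrink.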