In our book Efficient PyTorch, we gave a quick overview of the main sharding strategies used in large-scale distributed training: data parallelism, tensor parallelism, pipeline parallelism, and a few Transformer-specific ones like expert parallelism and context parallelism.

Pipeline parallelism? It barely got a page.

At the time, I thought it was too intuitive to need much detail. Then last week, I tried explaining all the existing schedules in a clean, logical way—and completely hit a wall.

Those neat diagrams in papers always make it look so straightforward… until you dive in. Then it turns into tangled schedules, weird idle bubbles, and code that’s tightly coupled with other components.

So here’s the chapter that probably should’ve been in the book: an intuition-first, no-formula guide to pipeline parallelism—what it is, why it’s tricky, and how we ended up with so many different flavors of it.

Bonus point: a toy PP implementation for all schedules

While writing this post, I realized it was surprisingly hard to compare pipeline schedules across papers—everyone presents their plots in slightly different setups, which makes apples-to-apples comparisons tricky. As a result I went down the rabbit hole and built a toy simulator called minPP.

It’s a lightweight, CPU-only tool that lets you:

  • Define arbitrary partitioning + assignment + execution combos
  • Simulate step-by-step microbatch flows on a timeline
  • Visualize schedules in a consistent format

It’s just under 1,000 lines of code, definitely not production grade, and it still has some rough edges I haven’t had time to fix. But it helped me a lot in understanding and comparing schedules in a clean, structured way. And more importantly: every plot in this post is generated using it.

Honestly, I probably spent more time writing the simulator than I would’ve spent drawing the schedules by hand. But hey, now it’s reusable 😅.

Key questions in pipeline parallelism

Large language models (LLMs) are basically giant directed acyclic graphs (DAGs) of layers—a.k.a. compute graphs in most ML frameworks. Data flows through these graphs step by step, like items on a conveyor belt.

When the model gets too big to fit on a single GPU but you have multiple GPUs, a natural idea is:

Break the model into chunks (a few layers each), assign each chunk to a GPU, and stream the inputs through them.

Congratulations—you’ve pipelined your model. But actually making that work in practice boils down to answering three key questions, which we’ll walk through in this section.

To keep things simple, we’ll use the following setup throughout this post:

  • 8 layers in the model
  • 4 identical GPUs

This makes it easier to compare different partitioning, assignment, and execution strategies side by side.

How should we partition the model?

A straightforward approach is to simply split the model into 4 parts:

  • [0, 1]: from layer 0 to 1
  • [2, 3]: from layer 2 to 3
  • [4, 5]: from layer 4 to 5
  • [6, 7]: from layer 6 to 7

Each group of consecutive layers like the above is called a pipeline stage. It’s a self-contained unit that handles both the forward and backward passes of its layers. We use a bracket [x, y] to represent a stage’s start and end layers (inclusive).

Note that we currently have 2 layers per stage, but you could do 1 layer per stage too, resulting in 8 stages in total. We will see in later sections how schedules leverage this finer-grained partitioning.
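As a tiny illustration (a minimal sketch, not minPP’s actual API), here is what a uniform partition into contiguous stages could look like:

```python
# Minimal sketch: split `num_layers` into `num_stages` contiguous stages,
# each represented as an inclusive (start, end) pair matching the [x, y]
# notation above. Assumes the layer count divides evenly.
def partition(num_layers, num_stages):
    assert num_layers % num_stages == 0
    per_stage = num_layers // num_stages
    return [(s * per_stage, (s + 1) * per_stage - 1) for s in range(num_stages)]

print(partition(8, 4))  # [(0, 1), (2, 3), (4, 5), (6, 7)]
print(partition(8, 8))  # [(0, 0), (1, 1), ..., (7, 7)]
```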

How do we assign stages to GPUs?

Once you’ve defined your stages, it’s time to assign them to GPUs. The simplest approach is one stage per GPU, assigned sequentially.

But what if you have more stages than GPUs? Then the assignment ordering becomes quite flexible and potentially more interesting. Looping and vshape assignment are the most commonly used ones. With 8 stages in total (1 layer per stage), they look like this:

| | Looping assignment | Vshape assignment |
| --- | --- | --- |
| GPU 0 | [0], [4] | [0], [7] |
| GPU 1 | [1], [5] | [1], [6] |
| GPU 2 | [2], [6] | [2], [5] |
| GPU 3 | [3], [7] | [3], [4] |

Note that there’s probably no single “correct” assignment: the way you assign stages to devices can impact load balance, communication patterns, and even how well you can overlap communication with compute.
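In code, the two assignments could look roughly like this (a minimal sketch, not minPP’s actual API; the vshape helper assumes exactly 2 stages per GPU):

```python
# Minimal sketch: map 8 single-layer stages onto 4 GPUs.
def looping_assignment(num_stages, num_gpus):
    # Stage s goes to GPU s % num_gpus.
    return {g: [s for s in range(num_stages) if s % num_gpus == g]
            for g in range(num_gpus)}

def vshape_assignment(num_stages, num_gpus):
    # Pair stage g with its mirror stage (num_stages - 1 - g).
    # Assumes num_stages == 2 * num_gpus.
    return {g: [g, num_stages - 1 - g] for g in range(num_gpus)}

print(looping_assignment(8, 4))  # {0: [0, 4], 1: [1, 5], 2: [2, 6], 3: [3, 7]}
print(vshape_assignment(8, 4))   # {0: [0, 7], 1: [1, 6], 2: [2, 5], 3: [3, 4]}
```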

How do we prioritize different microbatches?

To avoid idle GPUs (aka pipeline bubbles), we split the input batch into microbatches and schedule them across pipeline stages. Once stages are assigned to devices, we know exactly how each microbatch flows across the devices. But with multiple microbatches in flight, each needing a forward and a backward pass, and with each pass usually occupying the GPU exclusively, the main challenge becomes: how do we prioritize local execution on each GPU?

Say forward(i) is the forward pass of microbatch i. On a given GPU:

  • forward(i) must run before backward(i)
  • forward(i) must run before forward(i+1)

But there’s still a lot of room for ordering. For example, when both backward(i) and forward(i+1) are ready, which do you run first?

In general you have two choices here:

| Option | Description | Priority |
| --- | --- | --- |
| Depth first | Run backward(i) first | Push one microbatch through all stages |
| Breadth first | Run forward(i+1) first | Process all microbatches in one stage |

This choice defines your execution schedule.
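A minimal sketch of that local decision (the op representation here is made up for illustration):

```python
# Given the ops that are ready on a GPU, pick the next one to run.
# DFS prefers backward passes (drain microbatches through the pipeline);
# BFS prefers forward passes (keep feeding new microbatches in).
# Ties go to the oldest microbatch.
def pick_next(ready_ops, order="dfs"):
    # ready_ops: list of ("forward" | "backward", microbatch_id) tuples
    preferred = "backward" if order == "dfs" else "forward"
    return min(ready_ops, key=lambda op: (op[0] != preferred, op[1]))

print(pick_next([("backward", 0), ("forward", 1)], order="dfs"))  # ('backward', 0)
print(pick_next([("backward", 0), ("forward", 1)], order="bfs"))  # ('forward', 1)
```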

So far, we’ve assumed each stage’s forward or backward execution takes the whole GPU exclusively. But some schedules (like DualPipe) can run 1 forward and 1 backward concurrently, and that changes the picture entirely. More on that soon.

Ready for some real schedules?

Now that we’ve nailed down the three knobs (partitioning, assignment, and execution), let’s look at some real schedules.

I’ve always liked those timeline figures in papers. They’re well thought out and show how microbatches flow through devices. But they can be hard to follow when the stage-to-device mapping is hidden. That mapping is usually straightforward, but once you start doing more advanced partitioning or stage allocation, it adds an extra layer of indirection that makes the figures harder to parse.

To keep things simple, we’ll collapse the layer → stage → device mapping into a direct layer → device mapping. We will continue using the stage notation above. If multiple stages are assigned to one device, they are separated by a comma.

As every pipeline parallelism tutorial seems to do, let’s start with Gpipe and 1F1B, and see if cleanly separating these three questions (partitioning, assignment, execution) helps us better understand how it all works.

Gpipe and 1F1B

| | Gpipe | 1F1B |
| --- | --- | --- |
| num_stage_per_GPU | 1 | 1 |
| num_layers_per_stage | 2 | 2 |
| GPU 0 | [0, 1] | [0, 1] |
| GPU 1 | [2, 3] | [2, 3] |
| GPU 2 | [4, 5] | [4, 5] |
| GPU 3 | [6, 7] | [6, 7] |
| Execution order | BFS | DFS |

You can see from the table above that Gpipe and 1F1B differ only in how they prioritize execution locally.

Gpipe

Breadth-first scheduling (BFS) is great for small batch size training. It’s simple to implement and easy to reason about.

But there’s a catch: since BFS keeps all microbatches in flight before any backward pass starts, the memory used for storing intermediate activations builds up. All those tensors need to stay alive until the corresponding backward happens.

So while BFS based Gpipe is efficient in terms of throughput, it puts a lot of pressure on peak memory consumption.

You can apply activation checkpointing to reduce memory usage. But that comes at the cost of additional recomputation and slower runtime.

Gpipe schedule
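In PyTorch, wrapping a stage in activation checkpointing (as mentioned above) can look roughly like this (a minimal sketch; `stage` is a placeholder for the stage’s module):

```python
import torch
from torch.utils.checkpoint import checkpoint

def run_stage(stage, x):
    # Only the stage input is kept; intermediate activations inside the
    # stage are recomputed during the backward pass, trading extra
    # compute for lower peak memory.
    return checkpoint(stage, x, use_reentrant=False)
```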

1F1B

On the other hand, DFS is much more memory friendly. Switching Gpipe to DFS gives us the original 1F1B schedule, where each rank runs 1 forward and 1 backward in alternating fashion. DFS allows each GPU to prioritize backward passes over forward passes when both are ready, which reduces the number of activations that need to be held in memory, as shown in the figure below.

1F1B schedule

It’s worth noting: while 1F1B reduces peak memory usage, it doesn’t fix the pipeline bubble problem. You still end up with idle time in the early and late stages of the pipeline.
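For intuition, here’s the classic warmup / steady-state / cooldown structure of 1F1B on a single rank (a rough sketch; `run_forward` and `run_backward` stand in for the real stage calls and communication):

```python
def one_f_one_b(rank, num_ranks, num_microbatches, run_forward, run_backward):
    # Earlier ranks run more warmup forwards so the pipeline fills up.
    warmup = min(num_ranks - rank - 1, num_microbatches)
    fwd, bwd = 0, 0

    for _ in range(warmup):                      # warmup: forwards only
        run_forward(fwd); fwd += 1

    for _ in range(num_microbatches - warmup):   # steady state: 1F then 1B
        run_forward(fwd); fwd += 1
        run_backward(bwd); bwd += 1

    for _ in range(warmup):                      # cooldown: drain remaining backwards
        run_backward(bwd); bwd += 1
```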

Algorithm-based optimizations

Before we get to more complex schedules, there are also optimizations that leverage prior knowledge of the model architecture. These can easily be enabled on top of the 1F1B schedule.

| Schedule | Idea |
| --- | --- |
| Zero bubble 1F1B | prioritize dgrad since it’s on the critical path |
| Eager 1F1B | take data transfer time into account |
| DualPipe | run 1 fwd and 1 bwd simultaneously by overlapping comm & compute |

ZB1F1B

ZeroBubble 1F1B takes things a step further by looking inside the backward pass.

Each backward consists of two major compute ops:

  • dgrad (gradient w.r.t. inputs): needed immediately to unblock the next stage
  • wgrad (gradient w.r.t. weights): used only locally

The key insight here is that only dgrad is on the critical path. So we can deprioritize wgrad and use it to fill in the idle gaps between critical ops.

Zero bubble 1F1B schedule

This trick can be applied to most schedules—but it’s especially helpful for DFS-style ones, where backward is already prioritized.
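A minimal PyTorch-flavored sketch of the split (function and variable names here are made up for illustration):

```python
import torch

def split_backward(stage_output, stage_input, weights, grad_output):
    # dgrad: gradient w.r.t. the stage input -- on the critical path,
    # compute it first so the neighboring stage can be unblocked.
    dgrad = torch.autograd.grad(
        stage_output, stage_input, grad_output, retain_graph=True
    )
    # wgrad: gradient w.r.t. the weights -- only needed locally; in a
    # real schedule this call would be deferred to fill pipeline bubbles.
    wgrad = torch.autograd.grad(stage_output, weights, grad_output)
    return dgrad, wgrad
```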

Eager 1F1B

In all the schedules above, we’ve assumed that inter-device communication is instantaneous. As soon as stage i finishes its output, stage i+1 can start right away.

This is a useful simplification for understanding schedules, but in practice, communication time isn’t negligible.

Eager 1F1B addresses this by kicking off computations as early as possible to create opportunities for overlapping communication and computation.

  • Helps overlap forward compute and communication
  • Does not improve backward overlap
  • Increases memory usage due to more in-flight microbatches

So, like many things in pipeline parallelism, it’s a trade-off: you gain throughput by overlapping forward stages, but at the cost of higher memory consumption.

Eager 1F1B schedule
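One way to get that overlap is to post the receive for the next microbatch’s activation before running the current forward (a minimal sketch using async point-to-point ops; `stage`, `prev_rank`, and `recv_shape` are placeholders):

```python
import torch
import torch.distributed as dist

def forward_with_overlap(stage, cur_input, prev_rank, recv_shape):
    # Post an async receive for the next microbatch's activation...
    next_input = torch.empty(recv_shape)
    req = dist.irecv(next_input, src=prev_rank)

    out = stage(cur_input)   # ...so this compute overlaps with the transfer.

    req.wait()               # by now the next activation has (hopefully) arrived
    return out, next_input
```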

DualPipe: Bidirectional 1F1B

DualPipe (extended from Chimera) takes 1F1B optimization in a different direction—literally.

Instead of running a single dataflow from the first stage to the last, it runs two 1F1B schedules simultaneously in opposite directions. It overlays them into one bidirectional pipeline.

| | DualPipe |
| --- | --- |
| num_layers_per_stage | 2 |
| num_stage_per_GPU | 2 |
| GPU 0 | [0, 1], [6, 7] |
| GPU 1 | [2, 3], [4, 5] |
| GPU 2 | [4, 5], [2, 3] |
| GPU 3 | [6, 7], [0, 1] |
| Execution order | DFS |

Sounds a bit confusing? Let’s break it down.

  • DualPipe is designed to make better use of low-bandwidth interconnects between devices.
  • It does this by running a forward pass and a backward pass at the same time, but:
    • The forward and backward passes are from different microbatches
    • And from different pipeline directions (normal vs. reversed)

This trick works well in large sparse MoE models, where the All2All communication and GEMM compute take roughly the same amount of time—allowing communication and computation to overlap cleanly.

Of course, nothing comes for free.

  • DualPipe doubles the memory footprint, since each device now stores 4 layers instead of 2.
  • But in return, it achieves better compute-communication overlap and higher pipeline utilization.

I didn’t really implement this schedule: as acknowledged in the DualPipe repo, it can be greatly simplified into the DualPipeV schedule we will show below.

Advanced schedules

| | Interleaved virtual pipeline | Vshape zero bubble | DualPipeV | BFS-looping |
| --- | --- | --- | --- | --- |
| num_layers_per_stage | 1 | 1 | 1 | 1 |
| num_stage_per_GPU | 2 | 2 | 2 | 2 |
| assignment | looping | vshape | vshape | looping |
| GPU 0 | [0], [4] | [0], [7] | [0], [7] | [0], [4] |
| GPU 1 | [1], [5] | [1], [6] | [1], [6] | [1], [5] |
| GPU 2 | [2], [6] | [2], [5] | [2], [5] | [2], [6] |
| GPU 3 | [3], [7] | [3], [4] | [3], [4] | [3], [7] |
| Execution order | DFS | DFS | DFS | BFS |
| Optimization | consider data transfer time | zero bubble | EP | None |

Interleaved virtual pipeline (Megatron)

The MegatronLM paper introduced a clever way to reduce pipeline bubbles. By dividing the model into smaller stages and assigning multiple stages per device, you can shrink idle time. This schedule is called interleaved 1F1B.

In the interleaved virtual pipeline example above, we now have 1 layer per stage but 2 stages per device. To finish one microbatch’s forward pass, you need to loop over all devices twice (marked in the image). The bubble time is smaller, but that comes at the cost of more communication: with more pipeline stages, we also need to pass activations across devices more often.

Note that in interleaved 1F1B, stages are assigned to devices by looping over the devices. This creates a memory bottleneck on the first device and underutilization of memory on all the other devices. This imbalance in memory usage can be mitigated by the vshape assignment below.

Also note that the interleaved schedule takes data transfer time into account and tries to overlap communication and compute as much as possible.

Interleaved virtual pipeline schedule

Vshape-ZB

The lifespan of a stage is defined as the time between the start of its forward pass and the end of its backward pass.

In both Interleaved 1F1B and V-shape, the execution schedule is the same. The only difference lies in how stages are assigned to devices.

Since peak memory usage is proportional to the stage’s lifespan, the default interleaved schedule (which assigns stages round-robin) can lead to memory hotspots, especially on the first device.

The V-shape assignment fixes this by intentionally collocating on the same device:

  • Stages with long lifespans
  • Stages with short lifespans

This creates a more balanced memory footprint across all devices, reducing bottlenecks without changing the overall schedule behavior.

Vshape zb schedule

Note that we picked the vshape-zb schedule from the paper to show here, which also employs the zero-bubble optimization described above.
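A toy back-of-the-envelope check of the balanced-memory claim, assuming a stage’s lifespan simply shrinks with its index (earlier stages wait longest for their backward pass):

```python
num_stages, num_gpus = 8, 4
lifespan = lambda s: num_stages - s   # crude stand-in for real timings

looping = {g: [g, g + num_gpus] for g in range(num_gpus)}
vshape  = {g: [g, num_stages - 1 - g] for g in range(num_gpus)}

for name, assign in [("looping", looping), ("vshape", vshape)]:
    print(name, {g: sum(lifespan(s) for s in stages) for g, stages in assign.items()})
# looping: {0: 12, 1: 10, 2: 8, 3: 6} -- GPU 0 is the hotspot
# vshape:  {0: 9, 1: 9, 2: 9, 3: 9}   -- evenly balanced
```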

DualPipeV

As mentioned above, the 2x memory usage of the DualPipe schedule is quite expensive. DualPipeV shows that the doubled memory usage isn’t actually necessary and can be cut in half. The result can be viewed as a variant of the vshape-zb schedule above that runs 1 forward and 1 backward simultaneously.

DualpipeV schedule

BFS-looping

While DFS-based schedules (like 1F1B and its variants) dominate in practice, they come with a key limitation: they require a large number of microbatches to fully utilize the pipeline—which isn’t always feasible.

To address this, the breadth-first PP paper proposed an alternative:

  • Use virtual pipeline stages similar to the interleaved schedule, but use BFS (instead of DFS) for execution.
  • Combine it with FSDP and activation checkpointing to reduce peak memory usage.

BFS looping schedule

This combo turns out to be helpful in small batch size settings, where DFS would otherwise stall or waste resources due to insufficient microbatches.

Common Questions

Q: What if a single layer can’t fit on a GPU?

A: Then PP isn’t enough. You’ll need tensor parallelism (TP), where multiple GPUs jointly run one operation (e.g. matrix multiply).
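For a flavor of what that means (a minimal sketch of a column-parallel linear layer; not a full TP implementation):

```python
import torch
import torch.distributed as dist

def column_parallel_linear(x, w_shard, tp_group=None):
    # Each GPU holds a column shard of the weight and computes its slice
    # of the output; an all-gather reassembles the full result.
    local_out = x @ w_shard                          # [batch, out_dim // tp_size]
    shards = [torch.empty_like(local_out)
              for _ in range(dist.get_world_size(tp_group))]
    dist.all_gather(shards, local_out, group=tp_group)
    return torch.cat(shards, dim=-1)                 # [batch, out_dim]
```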

Q: Why is scheduling so hard?

A: Because you’re solving a data dependency scheduling problem under communication and compute constraints. Minimizing idle time while respecting order is… hard.

Q: Why not automate it?

A: Great idea! In fact, many teams are trying. But a proper scheduler needs to know model structure, compute cost, activation sizes, hardware bandwidths, and more. It’s a compiler problem—but compilers for distributed training are still in early days.

Closing Thoughts

I don’t really expect this post to be as detailed as the papers, so I’ve intentionally excluded all the formulas here. But hopefully it gives you enough intuition to navigate them with more confidence.

At its core, pipeline parallelism is a simple idea. But the implementation is a surprisingly deep rabbit hole.