DocFX v3: how we made the docs build up to 70× faster

TL;DR

DocFX v3 builds the same docs up to 70× faster than v2, and we didn’t get there by finding a slow function and fixing it. We made one architectural commitment, immutable data, and parallelism, caching, and streaming all fell out of it almost for free. The micro-optimizations came last and mattered least.


The naive read: “they rewrote it and optimized a lot”

If you skim the v3 changelog you’ll see exactly what you’d expect from a perf rewrite: a faster git binding, a faster HTML parser, a faster JavaScript engine, a faster glob library, and a long tail of profiler-driven micro-optimizations. The tempting conclusion is that v3 is fast because someone went through v2 with a profiler and replaced the slow parts.

That read isn’t wrong — those swaps happened and they helped — but it gets the causality backwards. You could have done every one of those component swaps in v2 and still been slow, because the thing that was slow about v2 wasn’t any particular component. It was the shape of the build: a stage-by-stage batch process built on shared mutable state. That shape caps how much of the machine you can use and how much work you can avoid, and you can’t profile your way out of it. You have to change the shape.

The one decision that changes the shape — and the one this whole note is about — is committing to immutable data.


Why immutability is the keystone

Here’s the non-obvious part. Immutability sounds like a correctness concern, a discipline you adopt to avoid race-condition bugs. It is that. But in v3 it’s load-bearing for performance, because two of the biggest speedups are things you simply cannot do safely without it.

Parallelism falls out of immutability. The goal is that any function in the codebase can be called by any thread at any time — that’s what lets a build saturate a 16-core machine instead of marching through stages one at a time. But “any thread, any function, any time” is only safe if those functions aren’t fighting over mutable state. The moment two threads share a dictionary one of them might write, you’re back to locks, and locks are where parallelism goes to die. Immutable data removes the premise of the race: there’s nothing to guard because there’s nothing to change. v3 leans on this everywhere — a dedicated ParallelUtility, plus Parallel.ForEach/Parallel.Invoke across restore, build, template loading, and TOC loading.

Thread-safety isn’t a feature you add to a parallel system. It’s a property of the data. Make the data immutable and the parallelism is almost free; leave it mutable and no amount of locking buys it back cheaply.
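
To make the "any thread, any function, any time" idea concrete, here is a minimal sketch in Python rather than DocFX's own C#. The `Document` and `Page` types and the `build_page` and `build_all` functions are hypothetical names invented for illustration; nothing here is taken from the DocFX codebase. Because both types are frozen and `build_page` only reads its argument, the worker threads need no locks.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class Document:
    """Hypothetical immutable input: fields cannot be reassigned after creation."""
    path: str
    markdown: str


@dataclass(frozen=True)
class Page:
    """Hypothetical immutable output of building one document."""
    path: str
    word_count: int


def build_page(doc: Document) -> Page:
    # Pure function: reads only its argument and returns a new value,
    # so it is safe to call from any thread without a lock.
    return Page(path=doc.path, word_count=len(doc.markdown.split()))


def build_all(docs: list[Document], workers: int = 4) -> list[Page]:
    # Executor.map returns results in input order, so the output is
    # deterministic regardless of which thread finishes first.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(build_page, docs))


docs = [Document(f"docs/{i}.md", "word " * i) for i in range(100)]
pages = build_all(docs)
```

Attempting `docs[0].markdown = "changed"` raises `FrozenInstanceError`, which is the point: there is nothing for two threads to fight over. In CPython the global interpreter lock limits how much CPU-bound work threads can overlap, so read this as a sketch of the safety property, not of the speedup.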

Caching falls out of immutability too. A cache is a bet that the thing you stored is still the thing you’d compute. With mutable objects that bet is fragile — someone holds a reference, mutates it, and now every cache entry pointing at it is silently wrong. With immutable objects the bet is free: nobody can change the value, so a cached value is correct forever. That’s why v3 caches so aggressively. Not just the obvious API responses, but most non-trivial operations that happen more than once: reading a file’s metadata, turning a markdown property into HTML. The same property that made the data safe to share across threads makes it safe to memoize across calls.
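
The memoization argument can be sketched the same way. This is an illustrative Python example, not DocFX code: `MarkdownProperty` and `to_html` are hypothetical, and the "rendering" is a trivial stand-in. A frozen dataclass is hashable by value, so it can serve directly as a cache key, and since no caller can mutate it afterwards, a cached result never goes stale.

```python
from dataclasses import dataclass
from functools import lru_cache

render_calls = 0  # counts real renders, for demonstration only


@dataclass(frozen=True)
class MarkdownProperty:
    """Hypothetical immutable value; frozen dataclasses hash by field values."""
    name: str
    markdown: str


@lru_cache(maxsize=None)
def to_html(prop: MarkdownProperty) -> str:
    # Stand-in for an expensive markdown-to-HTML conversion.
    global render_calls
    render_calls += 1
    return f"<p>{prop.markdown}</p>"


a = MarkdownProperty("summary", "Hello")
b = MarkdownProperty("summary", "Hello")  # equal value, different object
first = to_html(a)                        # computed
second = to_html(b)                       # served from the cache
```

After both calls `render_calls` is 1: the second lookup hits the cache because `a` and `b` compare equal and neither can change. A mutable class would not be usable as a key this way without extra care.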

This is the part you can’t retrofit. You can swap a glob library in an afternoon. You cannot make a mutable-by-default codebase immutable in an afternoon — every shared structure, every “I’ll just update this field” becomes a question. That’s the honest reason v3 is a rewrite and not a v2 point release.


Streaming replaces the stage-by-stage batch

v2 processed in stages: every file finished stage N before any file entered stage N+1. The problem with a batch model isn’t subtle once you see it: it pins the whole working set in memory until the build ends. For a big repo (an archive or a large reference set, say) that means spilling to swap on disk, and once you’re swapping, memory access can be orders of magnitude slower. The batch model also wastes the machine: a stage that only reads or writes files does almost no CPU work, so your cores idle through it.

v3 is a streaming processor instead. A file flows through the whole pipeline rather than waiting at stage boundaries, which means CPU-bound and IO-bound work overlap naturally — while one file is being parsed, another is being read off disk. The working set stays steady regardless of repo size, so performance is predictable and you don’t fall off the swap cliff. (This is the same reason streaming beats batch in data pipelines generally; docs builds are not special here, v2 just happened to be batch.)

          | v2: stage-by-stage batch                       | v3: streaming
Memory    | grows with repo size, spills to swap           | steady working set
CPU/IO    | serialized per stage, cores idle on IO stages  | overlapped, pipelined
Scaling   | falls off a cliff on big repos                 | predictable
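
The difference in shape between batch and streaming is easy to see in a toy pipeline. The sketch below is Python, not DocFX code; the `read` and `render` stages and both `build_*` functions are hypothetical, and an `events` log records the order in which work happens. It is single-threaded, so it illustrates the bounded working set only, not the overlap of CPU-bound and IO-bound work.

```python
from typing import Iterable, Iterator

events: list[tuple[str, str]] = []  # (stage, path) in the order work happened


def read(paths: Iterable[str]) -> Iterator[str]:
    for path in paths:
        events.append(("read", path))    # stand-in for reading a file
        yield path


def render(paths: Iterable[str]) -> Iterator[str]:
    for path in paths:
        events.append(("render", path))  # stand-in for producing output
        yield path.replace(".md", ".html")


def build_batch(paths: list[str]) -> list[str]:
    # Stage-by-stage: every file finishes "read" before any file is rendered,
    # so the whole intermediate list is held in memory at once.
    loaded = list(read(paths))
    return list(render(loaded))


def build_streaming(paths: list[str]) -> Iterator[str]:
    # Generators chain lazily: each file is pulled through both stages
    # before the next one is read, so only one file is in flight.
    return render(read(paths))


paths = ["a.md", "b.md"]

build_batch(paths)
batch_order = events.copy()
events.clear()

output = list(build_streaming(paths))
stream_order = events.copy()
```

`batch_order` shows both reads before either render; `stream_order` interleaves them as read a, render a, read b, render b. The intermediate list in `build_batch` is the part that grows with repo size.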

Separate the network from the compute

The other structural win is mundane and high-leverage: stop interleaving network calls with computation.

v2 mixed them: validating a file might mean calling a validation service, resolving a cross-reference might mean calling an xref service, and it did this per file. N files, N round-trips, each one a place to stall. v3 splits the network work into a separate restore step (the mental model is npm install): download all the validation rules once, download all the xref uids once, then run a build that is mostly pure compute and resolves everything locally. N per-file round-trips collapse to one download per service.
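
A small sketch of the hoisting, in Python and with an entirely hypothetical `FakeXrefService` standing in for the remote service (the real xref service's API is not described in this note). The service counts round-trips so the two build shapes can be compared directly.

```python
class FakeXrefService:
    """Hypothetical stand-in for a remote xref service; counts round-trips."""

    def __init__(self, table: dict[str, str]) -> None:
        self._table = table
        self.round_trips = 0

    def resolve(self, uid: str) -> str:
        self.round_trips += 1   # one network call per lookup
        return self._table[uid]

    def download_all(self) -> dict[str, str]:
        self.round_trips += 1   # one network call for everything
        return dict(self._table)


def build_interleaved(uids: list[str], service: FakeXrefService) -> list[str]:
    # v2-style: the network call sits inside the per-file loop.
    return [service.resolve(uid) for uid in uids]


def restore(service: FakeXrefService) -> dict[str, str]:
    # Restore step: fetch everything the build will need, once.
    return service.download_all()


def build_local(uids: list[str], xref_map: dict[str, str]) -> list[str]:
    # Build step: pure compute against the restored map, no network.
    return [xref_map[uid] for uid in uids]


table = {f"uid-{i}": f"/api/uid-{i}" for i in range(50)}
uids = list(table)

slow_service = FakeXrefService(table)
slow_result = build_interleaved(uids, slow_service)

fast_service = FakeXrefService(table)
fast_result = build_local(uids, restore(fast_service))
```

Both builds produce the same links, but `slow_service.round_trips` is 50 and `fast_service.round_trips` is 1. The build step in the second shape has no way to stall on the network, because it never touches it.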

For the calls you genuinely can’t hoist — validating a GitHub alias against the GitHub API, say — v3 caches them on disk (JsonDiskCache) and shares the cache across builds. That does double duty: it’s faster, and it sidesteps GitHub API rate limits that would otherwise throttle a big build. Note that this only works because of the immutability commitment from two sections up — the cached results are safe to share precisely because nothing mutates them.
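
For a feel of what a disk-backed cache shared across builds looks like, here is a minimal Python sketch. It is not the API of DocFX's JsonDiskCache: the `DiskCache` class, its `get_or_add` and `save` methods, the `lookup_alias` function, and the cache file name are all assumptions made for illustration, and the sketch has no expiry policy, which a real cache of remote results would likely need.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


class DiskCache:
    """Hypothetical JSON-backed cache; not DocFX's JsonDiskCache API."""

    def __init__(self, path: Path) -> None:
        self._path = path
        # Load entries left behind by a previous build, if any.
        self._entries = json.loads(path.read_text()) if path.exists() else {}

    def get_or_add(self, key: str, fetch: Callable[[str], object]) -> object:
        if key not in self._entries:
            self._entries[key] = fetch(key)  # the only place the network is hit
        return self._entries[key]

    def save(self) -> None:
        self._path.write_text(json.dumps(self._entries))


api_calls = 0  # counts simulated remote calls, for demonstration only


def lookup_alias(alias: str) -> dict:
    # Stand-in for a rate-limited remote API call.
    global api_calls
    api_calls += 1
    return {"alias": alias, "valid": True}


cache_file = Path(tempfile.mkdtemp()) / "alias-cache.json"

first_build = DiskCache(cache_file)
first_build.get_or_add("example-user", lookup_alias)
first_build.get_or_add("example-user", lookup_alias)  # in-memory hit
first_build.save()

second_build = DiskCache(cache_file)                   # a later build, same file
result = second_build.get_or_add("example-user", lookup_alias)
```

Across both builds `api_calls` stays at 1: the second build answers from the file the first one wrote, which is how a shared cache keeps a large build under an API rate limit.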


The component swaps (the last 20%)

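The swaps are the ones the changelog lists: a faster git binding, a faster HTML parser, a faster JavaScript engine, and a faster glob library. They are real and they helped, but notice they’re all local: each one replaces a slow part with a fast part without changing the build’s shape. That’s exactly why they’re the easy, last-mile wins rather than the headline.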

And a long tail of profiler-driven micro-optimizations, found by running VS profilers across different build scenarios. A representative few:


Honest limits

A few things this note is careful not to claim:


If you take only three things

  1. The speedup was a shape change, not a component change. v2 was a stage-by-stage batch on mutable shared state; that’s the ceiling, and you can’t profile through it.
  2. Immutability is the keystone because it pays off twice. Same property makes data safe to share across threads (parallelism) and safe to memoize across calls (caching). That double payoff is what made the architectural bet worth it.
  3. Do the easy wins, but know they’re easy. Faster git, faster HTML, faster JS, and micro-opts are worth doing — and they’re the part you can get without a rewrite. Reach for the rewrite only when the slow thing is the shape itself.