DocFX v3: how we made the docs build up to 70× faster

March 1, 2021

DocFX v3 builds the same docs up to 70× faster than v2, and we didn’t get there by finding a slow function and fixing it. We made one architectural commitment, immutable data, and parallelism and caching fell out of it almost for free; streaming and a separate restore step changed the rest of the build’s shape. The micro-optimizations came last and mattered least.

TL;DR

DocFX v3 is a ground-up rewrite of v2 that builds the same content up to 70× faster on large repos. The headline speedup is architectural, not a pile of clever tricks.
The keystone decision was immutable data from day one. It’s the one choice that simultaneously unlocks two independent wins, parallelism and caching, and it’s the one thing you cannot retrofit, which is why it needed a rewrite rather than a patch.
Parallelism falls out of immutability: if nothing mutates shared state, any thread can call any function at any time. Caching falls out too: an immutable object is safe to memoize because nobody can change it underneath you.
Streaming replaces v2’s stage-by-stage model, trading a memory-pinning batch process for a steady working set that doesn’t spill to swap on large repos.
Separating network from compute turns N calls into 1: a restore step (think npm install) downloads xrefs and validation rules once, then the build is pure compute.
Component swaps (libgit2, a streaming HTML reader, ChakraCore) and profiler-driven micro-opts are real, but they’re the last 20% — the kind of win you can also get without a rewrite.

The naive read: “they rewrote it and optimized a lot”

If you skim the v3 changelog you’ll see exactly what you’d expect from a perf rewrite: a faster git binding, a faster HTML parser, a faster JavaScript engine, a faster glob library, and a long tail of profiler-driven micro-optimizations. The tempting conclusion is that v3 is fast because someone went through v2 with a profiler and replaced the slow parts.

That read isn’t wrong — those swaps happened and they helped — but it gets the causality backwards. You could have done every one of those component swaps in v2 and still been slow, because the thing that was slow about v2 wasn’t any particular component. It was the shape of the build: a stage-by-stage batch process built on shared mutable state. That shape caps how much of the machine you can use and how much work you can avoid, and you can’t profile your way out of it. You have to change the shape.

The one decision that changes the shape — and the one this whole note is about — is committing to immutable data.

Why immutability is the keystone

Here’s the non-obvious part. Immutability sounds like a correctness concern, a discipline you adopt to avoid race-condition bugs. It is that. But in v3 it’s load-bearing for performance, because two of the biggest speedups are things you simply cannot do safely without it.

Parallelism falls out of immutability. The goal is that any function in the codebase can be called by any thread at any time — that’s what lets a build saturate a 16-core machine instead of marching through stages one at a time. But “any thread, any function, any time” is only safe if those functions aren’t fighting over mutable state. The moment two threads share a dictionary one of them might write, you’re back to locks, and locks are where parallelism goes to die. Immutable data removes the premise of the race: there’s nothing to guard because there’s nothing to change. v3 leans on this everywhere — a dedicated ParallelUtility, plus Parallel.ForEach/Parallel.Invoke across restore, build, template loading, and TOC loading.

Thread-safety isn’t a feature you add to a parallel system. It’s a property of the data. Make the data immutable and the parallelism is almost free; leave it mutable and no amount of locking buys it back cheaply.
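To make the “any thread, any function, any time” idea concrete, here is a minimal sketch in Python rather than DocFX’s C#. The names Document, BuiltPage, build_page, and build_all are hypothetical and do not come from the DocFX codebase. The point is only that a pure function over frozen values needs no lock. CPython’s global interpreter lock means this sketch shows the safety property, not a real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import FrozenInstanceError, dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical immutable input: fields cannot be reassigned after construction."""
    path: str
    markdown: str


@dataclass(frozen=True)
class BuiltPage:
    """Hypothetical immutable output of building one document."""
    path: str
    word_count: int


def build_page(doc: Document) -> BuiltPage:
    # Pure function: reads only its argument and returns a new value,
    # so there is no shared state to guard with a lock.
    return BuiltPage(path=doc.path, word_count=len(doc.markdown.split()))


def build_all(docs: tuple, workers: int = 4) -> tuple:
    # Any worker may pick up any document in any order.
    # Executor.map returns results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return tuple(pool.map(build_page, docs))


docs = tuple(Document(f"articles/{i}.md", "word " * i) for i in range(1, 6))
pages = build_all(docs)
```

Trying to assign to docs[0].markdown raises FrozenInstanceError, which is the property that lets build_page run on any worker without coordination.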

Caching falls out of immutability too. A cache is a bet that the thing you stored is still the thing you’d compute. With mutable objects that bet is fragile: someone holds a reference, mutates it, and now every cache entry pointing at it is silently wrong. With immutable objects the bet is safe: nobody can change the value, so a cached result stays correct for as long as its inputs are the same. That’s why v3 caches so aggressively, covering not just the obvious API responses but most non-trivial operations that happen more than once, such as reading a file’s metadata or turning a markdown property into HTML. The same property that made the data safe to share across threads makes it safe to memoize across calls.
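A small sketch of why immutable values memoize cleanly, again in Python as an illustration rather than DocFX’s own code. MarkdownProperty and render_property are hypothetical names, and the “rendering” is a stand-in for real markdown conversion. A frozen dataclass is hashable by value, so two equal inputs share one cache entry and nobody can mutate the key after it is stored.

```python
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class MarkdownProperty:
    """Hypothetical immutable value; frozen dataclasses hash by their fields."""
    name: str
    source: str


@lru_cache(maxsize=None)
def render_property(prop: MarkdownProperty) -> str:
    # Stand-in for an expensive markdown-to-HTML conversion.
    return f"<p>{prop.source}</p>"


first = render_property(MarkdownProperty("summary", "Hello"))
# A different object with equal fields hits the same cache entry.
second = render_property(MarkdownProperty("summary", "Hello"))
info = render_property.cache_info()  # one miss (first call), one hit (second call)
```

With a mutable key type the same pattern would either fail to hash or, worse, hash once and then drift out of sync with the stored result.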

This is the part you can’t retrofit. You can swap a glob library in an afternoon. You cannot make a mutable-by-default codebase immutable in an afternoon — every shared structure, every “I’ll just update this field” becomes a question. That’s the honest reason v3 is a rewrite and not a v2 point release.

Streaming replaces the stage-by-stage batch

v2 processed in stages: every file finished stage N before any file entered stage N+1. The problem with a batch model isn’t subtle once you see it: it pins the whole working set in memory until the build ends. For a big repo (an archive or reference-docs repo, say) that means spilling to swap, and once you’re swapping, every operation is orders of magnitude slower. The batch model also wastes the machine: a stage that only reads or writes files does almost no CPU work, so your cores idle through it.

v3 is a streaming processor instead. A file flows through the whole pipeline rather than waiting at stage boundaries, which means CPU-bound and IO-bound work overlap naturally — while one file is being parsed, another is being read off disk. The working set stays steady regardless of repo size, so performance is predictable and you don’t fall off the swap cliff. (This is the same reason streaming beats batch in data pipelines generally; docs builds are not special here, v2 just happened to be batch.)

Aspect	v2: stage-by-stage batch	v3: streaming
Memory	grows with repo size, spills to swap	steady working set
CPU/IO	serialized per stage, cores idle on IO stages	overlapped, pipelined
Scaling	falls off a cliff on big repos	predictable
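The memory difference between the two models can be sketched with plain Python generators. This is an illustration under stated assumptions, not DocFX’s pipeline: WorkingSet is hypothetical instrumentation that counts how many file bodies are held at once, and the read and parse steps are stand-ins. The sketch shows only the working-set behaviour, not the CPU/IO overlap.

```python
from typing import Iterable, Iterator, Tuple


class WorkingSet:
    """Hypothetical counter for how many file bodies are alive at once."""

    def __init__(self) -> None:
        self.live = 0
        self.peak = 0

    def acquire(self) -> None:
        self.live += 1
        self.peak = max(self.peak, self.live)

    def release(self) -> None:
        self.live -= 1


def build_batch(paths: Iterable[str], ws: WorkingSet) -> int:
    # Stage 1: load every file before any file is parsed.
    loaded = []
    for path in paths:
        ws.acquire()
        loaded.append((path, f"# {path}"))
    # Stage 2: parse everything; all bodies are still held.
    parsed = [(path, text.upper()) for path, text in loaded]
    # Stage 3: write everything, releasing bodies only at the very end.
    written = 0
    for _ in parsed:
        written += 1
        ws.release()
    return written


def read_stream(paths: Iterable[str], ws: WorkingSet) -> Iterator[Tuple[str, str]]:
    for path in paths:
        ws.acquire()
        yield path, f"# {path}"


def parse_stream(items: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    for path, text in items:
        yield path, text.upper()


def build_streaming(paths: Iterable[str], ws: WorkingSet) -> int:
    # Each file flows through read, parse, and write before the next is read.
    written = 0
    for _ in parse_stream(read_stream(paths, ws)):
        written += 1
        ws.release()
    return written


paths = [f"docs/{i}.md" for i in range(100)]
batch_ws, stream_ws = WorkingSet(), WorkingSet()
batch_written = build_batch(paths, batch_ws)
stream_written = build_streaming(paths, stream_ws)
```

In this sketch the batch build’s peak equals the number of files, while the streaming build’s peak stays at one no matter how long the path list gets.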

Separate the network from the compute

The other structural win is mundane and high-leverage: stop interleaving network calls with computation.

v2 mixed them — validating a file might mean calling a validation service, resolving a cross-reference might mean calling an xref service, and it did this per file. N files, N round-trips, each one a place to stall. v3 splits the network work into a separate restore step (the mental model is npm install): download all the validation rules once, download all the xref uids once, then run a build that is mostly pure compute and resolves everything locally. N calls collapse to 1.
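The restore-then-build split can be sketched as follows. Everything here is hypothetical: FakeXrefService stands in for a remote xref service and counts round-trips instead of doing network I/O, and restore and build_offline are illustrative names, not DocFX commands or APIs.

```python
from types import MappingProxyType
from typing import Dict, Mapping, Optional


class FakeXrefService:
    """Hypothetical remote service; counts round-trips instead of using the network."""

    def __init__(self, table: Dict[str, str]) -> None:
        self._table = dict(table)
        self.calls = 0

    def resolve(self, uid: str) -> Optional[str]:
        self.calls += 1  # one round-trip per lookup
        return self._table.get(uid)

    def download_all(self) -> Dict[str, str]:
        self.calls += 1  # one round-trip for the whole map
        return dict(self._table)


def build_interleaved(files: Mapping[str, str], service: FakeXrefService) -> Dict[str, Optional[str]]:
    # v2-style shape: a network call in the middle of processing each file.
    return {path: service.resolve(uid) for path, uid in files.items()}


def restore(service: FakeXrefService) -> Mapping[str, str]:
    # Download once and expose the result as a read-only mapping.
    return MappingProxyType(service.download_all())


def build_offline(files: Mapping[str, str], xrefs: Mapping[str, str]) -> Dict[str, Optional[str]]:
    # Pure compute: every lookup is local.
    return {path: xrefs.get(uid) for path, uid in files.items()}


table = {f"uid{i}": f"/api/type{i}.html" for i in range(50)}
files = {f"docs/{i}.md": f"uid{i}" for i in range(50)}

interleaved_service = FakeXrefService(table)
interleaved_result = build_interleaved(files, interleaved_service)

restored_service = FakeXrefService(table)
offline_result = build_offline(files, restore(restored_service))
```

Both builds produce the same output; the interleaved one makes 50 round-trips and the restored one makes 1.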

For the calls you genuinely can’t hoist — validating a GitHub alias against the GitHub API, say — v3 caches them on disk (JsonDiskCache) and shares the cache across builds. That does double duty: it’s faster, and it sidesteps GitHub API rate limits that would otherwise throttle a big build. Note that this only works because of the immutability commitment from two sections up — the cached results are safe to share precisely because nothing mutates them.
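A minimal sketch of a disk cache shared across builds, assuming a single JSON file and no expiry policy. The DiskCache class and its get_or_add and save methods are hypothetical and are not the API of DocFX’s JsonDiskCache; fake_github_lookup stands in for a real GitHub API call.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


class DiskCache:
    """Hypothetical JSON-file cache; not the real JsonDiskCache API."""

    def __init__(self, path: Path) -> None:
        self._path = path
        # Load entries left behind by a previous build, if any.
        self._entries: Dict[str, dict] = (
            json.loads(path.read_text()) if path.exists() else {}
        )

    def get_or_add(self, key: str, fetch: Callable[[str], dict]) -> dict:
        if key not in self._entries:
            self._entries[key] = fetch(key)  # only cache misses reach the network
        return self._entries[key]

    def save(self) -> None:
        self._path.write_text(json.dumps(self._entries))


api_calls: List[str] = []


def fake_github_lookup(alias: str) -> dict:
    # Stand-in for a rate-limited API call; records each invocation.
    api_calls.append(alias)
    return {"alias": alias, "valid": True}


cache_file = Path(tempfile.mkdtemp()) / "github-users.json"

# First build: a miss, so the lookup runs and the result is persisted.
first_build = DiskCache(cache_file)
first_build.get_or_add("octocat", fake_github_lookup)
first_build.save()

# Second build: a fresh cache object reading the same file, so no lookup runs.
second_build = DiskCache(cache_file)
result = second_build.get_or_add("octocat", fake_github_lookup)
```

A production cache would also need an expiry rule so stale answers eventually refresh; that is left out here to keep the sketch short.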

The component swaps (the last 20%)

These are real and they helped, but notice they’re all local — each one replaces a slow part with a fast part without changing the build’s shape. That’s exactly why they’re the easy, last-mile wins rather than the headline.

Git history: replaced shelling out to git.exe with direct libgit2 bindings, so commit-history extraction doesn’t pay process-spawn overhead.
HTML: replaced the DOM-based HtmlAgilityPack with a streaming, ref-struct HtmlReaderWriter that runs close to memory speed and allocates almost nothing.
JavaScript: replaced JINT (a managed interpreter) with ChakraCore (a native engine) — on Windows. JINT stays as the cross-platform fallback on Linux/macOS until ChakraCore ships native packages there. Worth knowing if your build agents aren’t Windows.
Glob: replaced a regex-based glob library with GlobExpressions, fronted by a hand-written KnownGlob fast path for the common shapes (**/*.md, includes/**) so the frequent cases skip the general engine. (Regex isn’t fully gone — it’s still used for pattern preprocessing — so “we removed regex” would overstate it.)
Templating: replaced Nustache with its maintained successor, Stubble.
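The glob fast-path idea from the list above can be sketched like this. It is an illustration only: compile_glob and the general matcher are hypothetical, the supported syntax is a simplified subset, and matching is assumed to be case-sensitive on normalized forward-slash paths. It is not how KnownGlob or GlobExpressions are implemented.

```python
import re
from typing import Callable

Matcher = Callable[[str], bool]

# Pattern shapes the fast path recognizes (regex used only to preprocess patterns).
_EXT_ONLY = re.compile(r"\*\*/\*(\.[A-Za-z0-9]+)")                    # e.g. **/*.md
_DIR_PREFIX = re.compile(r"([A-Za-z0-9_-]+(?:/[A-Za-z0-9_-]+)*)/\*\*")  # e.g. includes/**


def general_glob(pattern: str) -> Matcher:
    # Simplified general engine: translate the glob into one regex.
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append("(?:[^/]+/)*")  # zero or more directory segments
            i += 3
        elif pattern.startswith("**", i):
            parts.append(".*")           # anything, including slashes
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")        # anything within one segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")         # one character within a segment
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    regex = re.compile("".join(parts))
    return lambda path: regex.fullmatch(path) is not None


def compile_glob(pattern: str) -> Matcher:
    match = _EXT_ONLY.fullmatch(pattern)
    if match:
        extension = match.group(1)
        return lambda path: path.endswith(extension)  # fast path: suffix check
    match = _DIR_PREFIX.fullmatch(pattern)
    if match:
        prefix = match.group(1) + "/"
        return lambda path: path.startswith(prefix)   # fast path: prefix check
    return general_glob(pattern)                      # everything else


is_markdown = compile_glob("**/*.md")
is_include = compile_glob("includes/**")
is_toc = compile_glob("docs/*/toc.yml")
```

The design choice is that the common shapes reduce to a single string comparison per path, while unusual patterns still get correct answers from the slower general engine.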

On top of those came a long tail of profiler-driven micro-optimizations, found by running the Visual Studio profilers across different build scenarios. A representative few:

Remove unneeded data transfer between components — #5718
Pre-warm expensive tasks for better parallelism — #6130
Avoid unneeded large string allocation — #2974
Skip unneeded work smartly — #5421

Honest limits

A few things this note is careful not to claim:

Immutability isn’t literally everywhere in the code. The design principle is immutability-first, and that’s what unlocks the parallelism and caching. But in practice the codebase also leans on ConcurrentDictionary and custom builder types for the hot paths where a pure-immutable structure would be too slow to allocate. “Immutable by default, mutable where measured” is the honest description — not “no mutable state anywhere.”
ChakraCore is a Windows-only win. On non-Windows agents you’re on the JINT fallback, so the JS-heavy parts of a build won’t see the same speedup.
The rewrite cost is real. Everything here is an argument for why the rewrite paid off, not a claim that rewrites usually do. The keystone insight — that immutability was un-retrofittable — is exactly what justified the cost here. If your slow system’s shape is already fine, swap components and profile; don’t rewrite.

If you take only three things

The speedup was a shape change, not a component change. v2 was a stage-by-stage batch on mutable shared state; that’s the ceiling, and you can’t profile through it.
Immutability is the keystone because it pays off twice. Same property makes data safe to share across threads (parallelism) and safe to memoize across calls (caching). That double payoff is what made the architectural bet worth it.
Do the easy wins, but know they’re easy. Faster git, faster HTML, faster JS, and micro-opts are worth doing — and they’re the part you can get without a rewrite. Reach for the rewrite only when the slow thing is the shape itself.

Source: notes/docfx-v3-performance-rewrite.md @ 15cd198