Specs as tests: how docfx tests a build pipeline with YAML

June 18, 2020

A unit test asserts that one function returns one value. A spec test asserts that a whole pipeline — Markdown in, JSON and HTML out — produces exactly this, for hundreds of cases, written by people who don’t read C#. The test suite that reads like documentation is the one you’ll actually keep correct.

TL;DR

The tests are YAML, not code. A docfx spec test is a flat inputs: → outputs: block. The whole case is data; the C# runner is shared infrastructure you write once and rarely touch again. A reviewer reads the spec the way they’d read a feature doc, because it is one.
One attribute turns a folder of YAML into a test matrix. [YamlTest("~/docs/specs/**/*.yml")], from the team’s own Yunit library, globs the spec files and emits one xUnit case per `---`-separated document. Add a case by adding a YAML block; no new method, no [Fact], no registration.
One case runs as several builds. An ExpandTest hook re-runs the same spec under different build modes (a normal build, a --dry-run, a per-file build, a two-phase continue build) and asserts the invariants that should hold between them. You write the case once; the framework checks those invariants across modes for free.
The assertion is a smart JSON diff, not string equality. Wildcards, HTML-aware comparison, and log-line normalization let a spec pin down the fields that matter and ignore the noise — the difference between a golden-file suite that’s maintainable and one everybody learns to --update blindly.
The build is mocked at the seams, not the units. Git remotes, HTTP fetches, and credentials are swapped for in-memory fakes through a single TestQuirks indirection, so a test that exercises the real build end to end still runs offline and deterministically.

A hand-written test runs, but it isn’t a spec

You have a build pipeline. Markdown and YAML and a docfx.yml config go in; rendered JSON, HTML, a TOC, and an error log come out. You want tests. The obvious move is an xUnit test class: a [Fact] per scenario, each one news up a builder, writes some files to a temp directory, runs the build, and asserts on the output.

This works for ten cases. By a hundred it has rotted in a specific, predictable way. Every test is fifty lines of File.WriteAllText / Build() / Assert, ninety percent of it identical boilerplate, with the actual scenario — this Markdown should produce that HTML — buried in the middle.

But the boilerplate is the lesser problem. The deeper one is that the test isn’t readable as a contract. The person who best knows whether the expected output is right is often a content author or a docs PM — and you’ve written the scenario in C# they can’t review. A test like this runs, but it can’t double as a spec: only its author can read it. So the suite and the product’s actual contract quietly drift into two different things, and the test stops being the place you’d go to learn what the product promises.

The insight docfx’s test framework is built on: the scenario is data, and only the runner is code. If you can express “these inputs produce these outputs” as plain data, then one runner can execute all of them, and the data file becomes a spec a non-engineer can read, write, and review.

A test case is an `inputs`/`outputs` block

Here is a complete, real test from docs/specs/markdown.yml:

# Hello World
inputs:
  docfx.yml:
  docs/a.md: Hello `docfx`!
outputs:
  docs/a.json: |
    {
      "conceptual": "<p>Hello <code>docfx</code>!</p>",
      "wordCount": 2,
      "_op_canonicalUrlPrefix": "https://docs.com/en-us/",
      "_path": "docs/a.json",
      "_rel": "../"
    }

That’s the entire test. inputs is a virtual file tree — keys are paths, values are file contents (docfx.yml: with no value is an empty config). outputs is the expected file tree after a build. The runner materializes the inputs on disk, runs the real docfx build, and compares what landed in the output directory against outputs.
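The materialize, build, compare loop described above can be sketched in a few lines of C#. This is a minimal illustration under stated assumptions, not docfx’s runner: SpecCase, SpecRunner, and the fakeBuild delegate (which only upper-cases .md files into .txt files) are hypothetical names invented here, and a trimmed string comparison stands in for the JSON diff the real suite uses.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// A hypothetical case: one input file, one expected output file.
var spec = new SpecCase(
    Inputs: new() { ["docs/a.md"] = "Hello docfx!" },
    Outputs: new() { ["docs/a.txt"] = "HELLO DOCFX!" });

// Stand-in for the real build: upper-cases every .md file into a .txt file.
Action<string, string> fakeBuild = (inputDir, outputDir) =>
{
    foreach (var file in Directory.GetFiles(inputDir, "*.md", SearchOption.AllDirectories))
    {
        var relative = Path.GetRelativePath(inputDir, file);
        var target = Path.Combine(outputDir, Path.ChangeExtension(relative, ".txt"));
        Directory.CreateDirectory(Path.GetDirectoryName(target)!);
        File.WriteAllText(target, File.ReadAllText(file).ToUpperInvariant());
    }
};

var failures = SpecRunner.Run(spec, fakeBuild);
Console.WriteLine(failures.Count == 0 ? "pass" : string.Join("\n", failures));

// One test case as plain data: two virtual file trees (path -> content).
public record SpecCase(Dictionary<string, string> Inputs, Dictionary<string, string> Outputs);

public static class SpecRunner
{
    // Returns a list of failure messages; an empty list means the case passed.
    public static List<string> Run(SpecCase spec, Action<string, string> build)
    {
        var root = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        var inputDir = Path.Combine(root, "in");
        var outputDir = Path.Combine(root, "out");
        Directory.CreateDirectory(inputDir);
        Directory.CreateDirectory(outputDir);
        try
        {
            // Materialize the virtual input tree on disk.
            foreach (var (path, content) in spec.Inputs)
            {
                var full = Path.Combine(inputDir, path);
                Directory.CreateDirectory(Path.GetDirectoryName(full)!);
                File.WriteAllText(full, content ?? "");
            }

            build(inputDir, outputDir);

            // Check only the files the spec mentions.
            var failures = new List<string>();
            foreach (var (path, expected) in spec.Outputs)
            {
                var full = Path.Combine(outputDir, path);
                if (!File.Exists(full))
                    failures.Add($"missing output: {path}");
                else if (File.ReadAllText(full).Trim() != expected.Trim())
                    failures.Add($"mismatch: {path}");
            }
            return failures;
        }
        finally
        {
            Directory.Delete(root, recursive: true);
        }
    }
}
```

The runner never changes when a new scenario arrives; only the SpecCase data does, which is the property the rest of the design builds on.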

Read it again as a reader, not a test author. It says: a file containing Hello `docfx`! becomes a JSON document whose conceptual field is <p>Hello <code>docfx</code>!</p>, with a word count of 2. That’s not a test of an internal function — it’s a statement of what the product does, in a form a docs PM can confirm. The suite is the spec.

Cases stack in one file, separated by --- (YAML documents). One file, dozens of cases:

# title metadata overrides h1
inputs:
  docfx.yml:
  docs/a.md: |
    ---
    title: title from yaml header
    ---
    # Title from H1
    hello
outputs:
  docs/a.json: |
    {
      "conceptual": "<p>hello</p>\n",
      "rawTitle": "<h1 id=\"title-from-h1\">Title from H1</h1>",
      "title": "title from yaml header"
    }
---
# No H1
inputs:
  docfx.yml:
  docs/a.md: |
    ---
    title: Title from yaml header
    ---

    hello
outputs:
  docs/a.json: |
    { "title": "Title from yaml header" }

Notice the second case asserts on a single field. It doesn’t repeat the whole JObject — it pins title and stays silent about everything else. That’s not the runner being lenient by accident; it’s a deliberate diff policy, and it’s what we turn to after the wiring.

One attribute discovers every case

The bridge from “a folder of YAML” to “an xUnit run” is a single attribute on a single method (DocfxTest.cs):

[YamlTest("~/docs/specs/**/*.yml", ExpandTest = nameof(ExpandTest))]
[MarkdownTest("~/docs/designs/**/*.md", ExpandTest = nameof(ExpandTest))]
public static void Run(TestData test, DocfxTestSpec spec)
{
    // ...materialize inputs, run the build, verify outputs...
}

[YamlTest] and [MarkdownTest] come from Yunit, the team’s own xUnit extension for data-driven tests. At collection time Yunit expands the glob, splits each file into its `---`-separated documents, deserializes each one into the DocfxTestSpec you see as the second parameter, and emits one xUnit test case per document. The # Hello World comment becomes the case’s display name. A MarkdownTest does the same for fenced code blocks inside Markdown design docs, so the design document itself carries executable examples that can’t silently drift from the implementation.
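To make the splitting step concrete, here is a rough sketch of how one file could be cut into named cases. It is an assumption-laden illustration, not Yunit’s implementation: SpecFile and Split are hypothetical names, the separator is recognized only as a line that is exactly `---` at column 0, and the display name is taken from the first `# ` comment at column 0. A real YAML parser handles more forms of document marker than this.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var file = string.Join("\n",
    "# Hello World",
    "inputs:",
    "  docs/a.md: |",
    "    ---",
    "    title: kept inside the block scalar",
    "    ---",
    "---",
    "# No H1",
    "inputs:",
    "  docs/a.md: hello");

foreach (var (name, body) in SpecFile.Split(file))
    Console.WriteLine($"{name}: {body.Split('\n').Length} lines");

public static class SpecFile
{
    // Splits a multi-document stream on lines that are exactly "---".
    // Indented "---" lines (Markdown front matter inside a block scalar)
    // stay part of the document they belong to.
    public static List<(string Name, string Body)> Split(string text)
    {
        var cases = new List<(string Name, string Body)>();
        var current = new List<string>();

        void Flush()
        {
            // Skip empty documents, such as the one before a leading "---".
            if (current.All(string.IsNullOrWhiteSpace)) { current.Clear(); return; }

            // Use the first column-0 "# ..." comment as the display name.
            var header = current.FirstOrDefault(l => l.StartsWith("# ", StringComparison.Ordinal));
            var name = header is null ? $"case {cases.Count + 1}" : header.Substring(2).Trim();
            cases.Add((name, string.Join("\n", current)));
            current.Clear();
        }

        foreach (var line in text.Replace("\r\n", "\n").Split('\n'))
        {
            if (line == "---") Flush();
            else current.Add(line);
        }
        Flush();
        return cases;
    }
}
```

Because the front-matter markers in the first case are indented inside the block scalar, they do not end the document; only the column-0 `---` does.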

The payoff is the absence of ceremony. Adding a test is adding a YAML block. There is no new method, no [Theory] with a [MemberData] feeding it, no list to register the case in. The glob is the registry. The DocfxTestSpec it deserializes into is a plain options bag — Inputs, Outputs, plus flags like Cwd, Locale, and NoRestore that individual specs set when they need them — so the YAML surface stays small while still reaching the knobs a real build exposes.

One spec, several builds: matrix expansion

A docfx build can run in more than one mode, and the modes are supposed to agree in specific ways. A --dry-run should produce the same error log as a real build but write no artifacts. Building each file on its own and taking the union of the results should yield what a whole-docset build yields. These are exactly the cross-mode invariants that regress quietly, because nobody writes the same scenario four times by hand.

So the framework writes them for you. The ExpandTest hook named in the attribute takes one spec and returns the set of modes it should run under. In essence (paraphrased):

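// Paraphrased from the runner. IsContentFile and InputContainsText are helper
// methods that are not shown here; Any() and Count() need `using System.Linq;`.
public static IEnumerable<string> ExpandTest(DocfxTestSpec spec)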
{
    yield return "";                       // the normal build, always

    var hasError = spec.Outputs.ContainsKey(".errors.log");
    if (hasError && !spec.DryRunOnly && !spec.NoDryRun)
        yield return "DryRun";             // same errors, zero artifacts

    if (hasError && !spec.NoSingleFile && !spec.BuildFiles.Any()
        && spec.Inputs.Keys.Count(IsContentFile) > 1)
        yield return "SingleFile";         // build each file alone; union must match

    if (InputContainsText(spec, "outputType: pageJson"))
        yield return "ContinueBuild";      // two-phase json → pageJson build
}

Each yielded string becomes another xUnit case for the same YAML spec, tagged with the mode. The runner reads the tag and adjusts: DryRun adds --dry-run and then expects only the error log to match; SingleFile builds each content file in isolation and asserts the union equals the full-build output; ContinueBuild runs the pipeline in two phases (a JSON build, then a --continue pass that emits page JSON) and verifies the final page output matches the spec’s expected .json.

The conditions are as load-bearing as the modes. DryRun only expands when the spec has an .errors.log, because the invariant is “dry run reports the same errors.” SingleFile only expands when the spec has more than one content file, since there’s nothing to union otherwise, which is why a single-file case like # Hello World never runs it. And ContinueBuild is keyed off the input: it expands only for specs whose config asks for outputType: pageJson. A spec opts out of a mode it can’t satisfy with a flag like NoDryRun.

The result is leverage. The author writes one scenario — “this input produces these errors” — and the suite verifies it under three or four independent execution paths, catching the class of bug where dry-run and real-build silently diverge. The matrix is computed, not copy-pasted, so it can’t fall out of sync with the case it expands.

flowchart TD
    Y["markdown.yml<br/>(one --- document)"] --> EX["ExpandTest(spec)"]
    EX --> M0["mode: '' (normal build)"]
    EX --> M1["mode: DryRun"]
    EX --> M2["mode: SingleFile"]
    EX --> M3["mode: ContinueBuild"]
    M0 --> V["materialize inputs → docfx build → JsonDiff vs outputs"]
    M1 --> V
    M2 --> V
    M3 --> V
    classDef src fill:#dbeafe,stroke:#2563eb,color:#1e3a5f;
    classDef mode fill:#dcfce7,stroke:#16a34a,color:#14532d;
    class Y,EX src;
    class M0,M1,M2,M3 mode;
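The expansion rules can be tried end to end by filling in the pieces the paraphrased ExpandTest leaves out. In this sketch the Spec and Matrix classes and the bodies of IsContentFile and InputContainsText are assumptions made for illustration (a content file is taken to be any .md input); only the conditions themselves are carried over from the paraphrase.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// A single-file case with no error log: only the normal build applies.
var helloWorld = new Spec
{
    Inputs = { ["docfx.yml"] = "", ["docs/a.md"] = "Hello `docfx`!" },
    Outputs = { ["docs/a.json"] = "{}" },
};

// Two content files and an expected error log: dry-run and per-file builds apply too.
var twoFilesWithErrors = new Spec
{
    Inputs = { ["docfx.yml"] = "", ["docs/a.md"] = "[x](missing.md)", ["docs/b.md"] = "ok" },
    Outputs = { [".errors.log"] = "..." },
};

Console.WriteLine(string.Join(",", Matrix.Expand(helloWorld).Select(m => $"'{m}'")));
Console.WriteLine(string.Join(",", Matrix.Expand(twoFilesWithErrors).Select(m => $"'{m}'")));

// Hypothetical stand-in for DocfxTestSpec, reduced to the fields the rules read.
public class Spec
{
    public Dictionary<string, string> Inputs { get; } = new();
    public Dictionary<string, string> Outputs { get; } = new();
    public List<string> BuildFiles { get; } = new();
    public bool DryRunOnly { get; set; }
    public bool NoDryRun { get; set; }
    public bool NoSingleFile { get; set; }
}

public static class Matrix
{
    public static IEnumerable<string> Expand(Spec spec)
    {
        yield return "";                       // the normal build, always

        var hasError = spec.Outputs.ContainsKey(".errors.log");
        if (hasError && !spec.DryRunOnly && !spec.NoDryRun)
            yield return "DryRun";

        if (hasError && !spec.NoSingleFile && !spec.BuildFiles.Any()
            && spec.Inputs.Keys.Count(IsContentFile) > 1)
            yield return "SingleFile";

        if (InputContainsText(spec, "outputType: pageJson"))
            yield return "ContinueBuild";
    }

    // Assumption: any Markdown input counts as a content file.
    static bool IsContentFile(string path) =>
        path.EndsWith(".md", StringComparison.OrdinalIgnoreCase);

    // Assumption: a plain substring search over every input file's content.
    static bool InputContainsText(Spec spec, string text) =>
        spec.Inputs.Values.Any(v => v != null && v.Contains(text));
}
```

Under these assumptions the first spec expands to the normal build alone, while the second also yields DryRun and SingleFile. Setting NoDryRun on the second spec removes only the DryRun case.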

The assertion is a JSON diff with a policy

String-equal golden files are where these suites go to die. The output JSON has fields that are irrelevant to the case, ordering that isn’t semantically meaningful, and rendered HTML with attributes that change for reasons orthogonal to what you’re testing. Assert on the raw string and every case becomes brittle; authors learn to regenerate expected output without reading it, and the suite stops catching anything.

docfx replaces string equality with a configured JsonDiff (built on Yunit’s diff engine) that knows the domain:

private static JsonDiff CreateJsonDiff()
{
    var fileJsonDiff = new JsonDiffBuilder()
        .UseAdditionalProperties()   // expected may pin a subset of fields
        .UseNegate()                 // assert a field is absent
        .UseWildcard()               // match values you can't pin exactly
        .UseHtml(IsHtml)             // compare HTML structurally
        .Use(IsHtml, RemoveDataLinkType)
        .Build();

    return new JsonDiffBuilder()
        .UseAdditionalProperties(null, IsRequiredOutput)
        .UseIgnoreNull()
        .UseJson(null, fileJsonDiff)
        .UseLogFile(fileJsonDiff)    // sort log lines before comparing
        .UseHtml(IsHtml)
        .Use(IsHtml, RemoveDataLinkType)
        .Build();
}

Each Use* is a rule that makes the diff forgiving about something that shouldn’t count as a difference, while staying strict about everything that should:

UseAdditionalProperties is why the # No H1 case could assert { "title": "..." } and nothing else. Expected pins a subset; unmentioned fields in the actual output don’t fail the case. You assert what the scenario is about.
UseWildcard lets a spec write a placeholder where the exact value is unstable or uninteresting — a generated hash, a path — and still assert the surrounding structure.
UseHtml compares rendered HTML as structure rather than bytes, and the paired RemoveDataLinkType rule strips a data-linktype attribute unless the expectation explicitly asks about it — so a case that doesn’t care about link typing isn’t broken by it, but one that does can still pin it.
UseLogFile sorts the lines of .errors.log before comparing, because error order isn’t part of the contract but the set of errors is.

This is the difference between a golden-file suite that documents behavior and one that merely detects change. A good diff policy encodes what the contract actually constrains — and leaves everything else free to vary.
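Two of these rules are simple enough to sketch directly: subset matching and order-insensitive log comparison. The code below is an illustration built on System.Text.Json, not Yunit’s JsonDiff. SubsetDiff, Matches, and LogsMatch are hypothetical names, numbers are compared by their raw text (so 2 and 2.0 differ), and wildcards and HTML-aware comparison are left out entirely.

```csharp
using System;
using System.Linq;
using System.Text.Json;

var actualJson = "{\"title\":\"Title from yaml header\",\"wordCount\":1,\"conceptual\":\"<p>hello</p>\"}";

// Expected pins one field; the other actual fields are ignored.
Console.WriteLine(SubsetDiff.Matches("{\"title\":\"Title from yaml header\"}", actualJson));
// A pinned field with the wrong value still fails.
Console.WriteLine(SubsetDiff.Matches("{\"title\":\"Something else\"}", actualJson));
// Same set of log lines, different order.
Console.WriteLine(SubsetDiff.LogsMatch("error B\nerror A", "error A\nerror B"));

public static class SubsetDiff
{
    public static bool Matches(string expectedJson, string actualJson)
    {
        using var expected = JsonDocument.Parse(expectedJson);
        using var actual = JsonDocument.Parse(actualJson);
        return ElementMatches(expected.RootElement, actual.RootElement);
    }

    static bool ElementMatches(JsonElement expected, JsonElement actual)
    {
        if (expected.ValueKind != actual.ValueKind) return false;

        switch (expected.ValueKind)
        {
            case JsonValueKind.Object:
                // Every expected property must exist and match;
                // properties present only in the actual output are ignored.
                return expected.EnumerateObject().All(p =>
                    actual.TryGetProperty(p.Name, out var value) && ElementMatches(p.Value, value));

            case JsonValueKind.Array:
                // Arrays are compared element by element, in order.
                return expected.GetArrayLength() == actual.GetArrayLength()
                    && expected.EnumerateArray()
                        .Zip(actual.EnumerateArray(), (e, a) => ElementMatches(e, a))
                        .All(ok => ok);

            case JsonValueKind.String:
                return expected.GetString() == actual.GetString();

            default:
                // Numbers, true, false, null: compare the raw JSON text.
                return expected.GetRawText() == actual.GetRawText();
        }
    }

    // Compares two logs as sorted lists of trimmed, non-empty lines.
    public static bool LogsMatch(string expected, string actual)
    {
        static string[] SortedLines(string s) =>
            s.Split('\n')
             .Select(l => l.Trim())
             .Where(l => l.Length > 0)
             .OrderBy(l => l, StringComparer.Ordinal)
             .ToArray();

        return SortedLines(expected).SequenceEqual(SortedLines(actual));
    }
}
```

Even in this reduced form the trade-off is visible: every field the expectation omits is a field the case no longer guards, so the choice of what to pin is part of the spec.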

The build runs for real, against fakes at the seams

The last problem is fidelity. A test that mocks the build’s internals tests the mocks. But a build that really restores git repos and fetches URLs can’t run offline or deterministically in CI. docfx threads that needle with a single indirection layer, TestQuirks, wired up once in the runner’s static constructor:

TestQuirks.GitRemoteProxy = remote =>
    s_repos.Value is { } repos && repos.TryGetValue(remote, out var local)
        ? local : remote;

TestQuirks.HttpProxy = remote =>
    s_remoteFiles.Value is { } files && files.TryGetValue(remote, out var content)
        ? content : null;

A spec can declare repos and HTTP responses inline; the runner stages them and points the proxies at the fakes. The production code paths run unmodified — the real restore logic, the real fetch logic — but at the one seam where they’d touch the network, they get an in-memory answer instead. AsyncLocal fields hold the per-test fakes so cases stay isolated when xUnit runs them in parallel. Each case also gets its inputs materialized into a real temp docset (git-initialized when the scenario needs history), so “run the real build” means the real build, just hermetic.
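The isolation claim is easier to trust with a small experiment. The sketch below is a stand-alone illustration, not docfx code: Quirks, Downloader, and FakeNetwork are hypothetical stand-ins for TestQuirks, the production fetch path, and the test runner, and the URL is made up. It shows two simulated tests running in parallel, each installing a different fake for the same URL through one static hook.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Two "tests" run in parallel, each with its own fake for the same URL.
var url = "https://example.test/a.json";
var results = await Task.WhenAll(
    Task.Run(() => FakeNetwork.RunTest(url, "from test 1")),
    Task.Run(() => FakeNetwork.RunTest(url, "from test 2")));
Console.WriteLine(string.Join(" | ", results));

// Stand-in for the production seam: null means no test hook is installed.
public static class Quirks
{
    public static Func<string, string?>? HttpProxy;
}

// Stand-in for production fetch code: it consults the seam before the network.
public static class Downloader
{
    public static string Fetch(string url) =>
        Quirks.HttpProxy?.Invoke(url)
        ?? throw new InvalidOperationException($"would hit the network for {url}");
}

public static class FakeNetwork
{
    // AsyncLocal gives each asynchronous flow its own value, so parallel
    // tests do not see each other's fakes even though the hook is static.
    static readonly AsyncLocal<Dictionary<string, string>?> s_remoteFiles = new();

    // The hook is wired up once; it reads whatever the current flow staged.
    static FakeNetwork()
    {
        Quirks.HttpProxy = remote =>
            s_remoteFiles.Value is { } files && files.TryGetValue(remote, out var content)
                ? content : null;
    }

    public static string RunTest(string url, string fakeContent)
    {
        s_remoteFiles.Value = new Dictionary<string, string> { [url] = fakeContent };
        Thread.Sleep(50); // give the other test time to stage its own fakes
        return Downloader.Fetch(url);
    }
}
```

Each task gets back the content it staged itself. Replacing the AsyncLocal with a plain static dictionary would let the later write win for both, which is the cross-test leak the per-flow storage prevents.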

The principle generalizes past docfx: mock at the boundary to the outside world, not at the boundary between your own units. The further out you push the fakes, the more real code each test actually exercises — and an integration suite that’s also deterministic and offline is the rare combination that’s worth engineering for.

Where this pattern fits — including your API surface

The reason to care about the mechanics above is that they’re not specific to Markdown. The [YamlTest("~/docs/specs/**/*.yml")] glob in docfx points at one directory, and that directory spans nearly every domain the product has — not just markdown.yml. The public docs/specs tree includes, among others:

| Spec file or folder | What its `inputs`/`outputs` pin down |
| --- | --- |
| `markdown.yml` | Markdown → conceptual HTML/JSON |
| `metadata.yml` | page and document metadata resolution |
| `xref.yml` | cross-reference resolution (`@uid` → link/title) |
| `schema.yml` | structured (schema-driven) document validation |
| `toc/` | table-of-contents construction and linking |
| `validation/` | content/link/alias validation rules |

Every one of these is the same inputs: → outputs: shape, run by the same runner, diffed by the same policy. The framework didn’t need a new harness per domain; it needed one runner and a folder convention. That’s the real adoption story: once “scenario is data, runner is code” exists, each new subsystem joins by dropping a YAML file into the glob.

API and reference generation is the natural next adopter, and docfx already tests it this way. Turning a metadata model or source into reference pages, resolving @uid cross-references, and emitting a .yml/JSON model are all deterministic input-tree → output-tree transforms, which is exactly this pattern’s shape. xref.yml asserts that a uid resolves to a specific link and display title; metadata.yml asserts the resolved metadata for a document. The spec is simply: here is the input, here is the exact reference output it must produce. Pin the fields the contract guarantees (uid, name, signature, the resolved href), wildcard the volatile ones, and ignore the rest.

More broadly, reach for this pattern when all of these hold:

The unit under test is a transform: structured input in, structured (text/JSON/HTML/YAML) output out.
The output is deterministic given the input — or can be made deterministic by mocking the few external seams (clock, network, git, credentials), as docfx does with TestQuirks.
The interesting variation lives in the data, not the control flow — many scenarios that differ in input and expected output, not in setup logic.
The people who know whether the output is correct aren’t always the people who write C# — API doc owners, content authors, schema designers.

API/reference generators, template/renderer engines, serializers and formatters, compilers and transpilers, config resolvers, link/redirect resolvers, and lint/validation rules all fit. The next section covers where the pattern doesn’t.

Honest limits

This design is excellent for what it is, and it isn’t free.

It’s an integration suite, with integration-suite costs. Each case spins up a real build over a real temp docset. That’s slower than a unit test and harder to debug when it fails: a red case tells you the output diverged, not which of the build’s many stages caused it. The fidelity you buy is paid for in execution time and in failure localization.
A data-driven runner hides control flow. When a spec misbehaves, the stack trace lands in the generic runner, not in anything resembling the YAML you wrote. You debug by correlating a mode tag and a file path back to a --- block. The ceremony you saved at authoring time partly returns at debugging time.
The diff policy is itself code that can be wrong. Every forgiving rule — wildcards, subset matching, HTML normalization — is a place a real regression can hide. A too-lenient policy passes output it shouldn’t; tuning it is ongoing judgment, not a one-time setup.
Expected output is generated, and generated goldens invite rubber-stamping. The suite is only as honest as the reviewer who reads the expected JSON instead of regenerating it on red. The format makes careful review possible — readable YAML, pinned subsets — but it can’t make it happen.
YAML caps the expressible. Scenarios that need genuine programmatic setup don’t fit the inputs/outputs mold and still want hand-written tests. The framework is a sharp tool for one common shape, not a universal replacement for the test class.

If you take only three things

Make the test case data, and only the runner code. When a scenario is a YAML block instead of a method, the people who know whether the expected behavior is correct can read and write the suite — and the test file becomes a spec that can’t silently drift from the product. The boilerplate you delete is boilerplate that can’t rot.
Compute the matrix; don’t copy-paste it. Cross-mode invariants (dry-run matches real, per-file matches whole-docset) regress precisely because nobody re-writes the scenario for each mode. Expand one case into many at collection time and the invariants are checked for free, forever in sync with the case.
Put the policy in the diff, not in the expectation. A golden-file suite lives or dies on its comparison. Encode what the contract actually constrains — pin the fields that matter, normalize the noise that doesn’t — and you get a suite that documents behavior instead of one everybody learns to regenerate without reading.

Source: notes/specs-as-tests-docfx-yaml-testing.md @ 15cd198