
meta: profiling-driven build performance burndown (Docker Linux, NightDriverStrip, cold vs hot cache) #942

Description

@zackees

Context

fbuild's end-to-end wall clock on a real project has never been measured systematically under a controlled, reproducible environment. We have anecdotal slow spots (toolchain download/extract, library resolution, sequential install steps) but no profile data separating on-CPU work from off-CPU waiting (network, disk, subprocess, lock contention).

Existing pieces this effort builds on:

  • FBUILD_PERF_LOG=1 env-gated phase timing in crates/fbuild-build/src/perf_log.rs (from #91, which investigated a 30s warm-pass compilation stall where the cache reports <1s): coarse per-phase wall clock, emitted via tracing + stderr.
  • fbuild-packages already has a parallel download pipeline; it is unknown how well it overlaps download → extract → install → compile in practice.
  • Embedded zccache service (#789: migration from a managed wrapper binary to an embedded ZccacheService in fbuild-daemon) covers compiler-invocation caching; this effort targets everything around the compiler.
  • The NightDriverStrip benchmark fixture is referenced by the ignored test build_nightdriverstrip_demo in crates/fbuild-build/tests/esp32_build.rs (expects tests/NightDriverStrip/); the checkout is not committed, so the Docker harness must clone it.
  • zackees/soldr has the Docker + script reference infrastructure (docker/cook-shared-cache/Dockerfile, perf/, PERF.md) to model the fast-rebuild container on.

Proposal

Stand up a reproducible Linux Docker profiling harness, measure cold-cache and hot-cache builds of NightDriverStrip, and use the data to drive a burndown of optimization sub-issues until fbuild's non-compiler overhead is as close to fully overlapped/concurrent as possible.

Phase 0 — Docker profiling harness

  • Dockerfile optimized for fast image rebuilds (layer-cached toolchain/deps, source COPY last, or bind-mount + named volumes per the soldr pattern), based on zackees/soldr's docker + script infrastructure.
  • fbuild's own cache (~/.fbuild/cache) is not persisted between container runs — cold cache means genuinely cold (fresh downloads).
  • Clones NightDriverStrip as the benchmark workload (ESP32 demo env; optionally demo_c6 for a RISC-V data point).
  • Orchestration script runs: (a) cold-cache build, (b) hot-cache rebuild (same container, cache intact), each N≥3 times, and archives all profiles/logs as artifacts.
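
As a rough illustration of the orchestration step above, the sketch below plans the docker commands for N iterations, each doing one cold build and one hot rebuild in the same fresh container. The image tag, the build command, the container names, and the /artifacts path are placeholders invented for this sketch, not names from fbuild or soldr; it also assumes the image itself ships without a populated ~/.fbuild/cache and that sleep is available in the image.

```python
import subprocess
from typing import Callable, List, Sequence

Command = List[str]


def plan_iteration(image: str, build_cmd: Sequence[str], run_id: int,
                   artifacts_dir: str) -> List[Command]:
    """Commands for one cold build plus one hot rebuild in a fresh container."""
    name = f"fbuild-prof-{run_id}"  # hypothetical container name
    exec_build = ["docker", "exec", "-e", "FBUILD_PERF_LOG=1", name, *build_cmd]
    return [
        # No -v/--mount here, so nothing carries a cache over from earlier runs.
        ["docker", "run", "-d", "--name", name, image, "sleep", "infinity"],
        list(exec_build),  # cold-cache build
        list(exec_build),  # hot-cache rebuild: same container, cache intact
        # /artifacts is an assumed in-container output directory.
        ["docker", "cp", f"{name}:/artifacts", f"{artifacts_dir}/run-{run_id}"],
        ["docker", "rm", "-f", name],
    ]


def plan(image: str, build_cmd: Sequence[str], n: int = 3,
         artifacts_dir: str = "artifacts") -> List[Command]:
    """Flatten n iterations into one ordered command list."""
    if n < 3:
        raise ValueError("the issue asks for at least 3 runs per cache state")
    commands: List[Command] = []
    for run_id in range(1, n + 1):
        commands.extend(plan_iteration(image, build_cmd, run_id, artifacts_dir))
    return commands


def execute(commands: List[Command],
            runner: Callable[..., object] = subprocess.run) -> None:
    """Run each command in order; the runner is injectable for dry runs."""
    for cmd in commands:
        runner(cmd, check=True)
```

Calling `plan("fbuild-profile", ["./run-build.sh"])` with those two placeholder values yields the command list without touching docker, which makes the cold/hot sequencing reviewable before anything runs. A real script would also need cleanup on failure and a host-side artifacts directory that already exists.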

Phase 1 — Instrumentation + profiling

  • Event logging on: FBUILD_PERF_LOG=1 plus extending perf_log.rs/tracing spans wherever coverage is missing (download start/end per package, extract, install, per-TU compile dispatch, archive, link, daemon RPC round-trips).
  • On-CPU profiling: perf record -g (or samply/cargo flamegraph) over daemon + CLI → flamegraphs.
  • Off-CPU (async) profiling: off-CPU flamegraphs (perf sched / eBPF offcputime) and/or tokio-console-style async task instrumentation to expose where the pipeline waits — network, disk, subprocess, serialized stages, lock contention.
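
Before reaching for perf or eBPF, a coarse first cut at the on-CPU versus off-CPU split can come from comparing a command's wall clock with the CPU time of its child processes. The sketch below is an illustration only and does not replace the profilers listed above. It counts only children that the parent waits for, so work done inside a separately running daemon would not show up, and a multi-threaded build can legitimately report more CPU time than wall time.

```python
import os
import subprocess
import sys
import time
from typing import Sequence, Tuple


def wall_vs_cpu(cmd: Sequence[str]) -> Tuple[float, float]:
    """Return (wall_seconds, child_cpu_seconds) for one command."""
    before = os.times()
    start = time.perf_counter()
    subprocess.run(list(cmd), check=True)
    wall = time.perf_counter() - start
    after = os.times()
    # children_* only includes child processes that have been waited for.
    cpu = ((after.children_user - before.children_user)
           + (after.children_system - before.children_system))
    return wall, cpu


if __name__ == "__main__":
    # Stand-in workload that mostly waits, like a download-bound phase.
    wall, cpu = wall_vs_cpu([sys.executable, "-c", "import time; time.sleep(0.3)"])
    print(f"wall={wall:.2f}s cpu={cpu:.2f}s waiting~={max(wall - cpu, 0.0):.2f}s")
```

A large gap between wall and CPU time says the phase is mostly waiting, which is the cue to look at the off-CPU flamegraphs rather than the on-CPU ones.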

Phase 2 — Analysis → sub-issues

  • Produce a cold-vs-hot phase breakdown table (wall clock per stage, % of total).
  • For every stage that is (a) not cached when it should be, (b) serialized when it could overlap with download/install/compile/link, or (c) hot on-CPU in fbuild's own code: file a child sub-issue with the profile evidence attached.
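
The breakdown table could be produced along these lines: take the per-stage timings from each run, reduce them to a median per stage, and express each as a share of the total. This is a minimal sketch that assumes timings have already been parsed into one dict per run; it makes no claim about the actual FBUILD_PERF_LOG output format. Note that the total here is the sum of per-stage medians, which exceeds real wall clock whenever stages overlap.

```python
from statistics import median
from typing import Dict, List, Tuple

Row = Tuple[str, float, float]  # (stage, median seconds, percent of total)


def phase_breakdown(runs: List[Dict[str, float]]) -> List[Row]:
    """Median seconds per stage across runs, plus each stage's share."""
    if len(runs) < 3:
        raise ValueError("need at least 3 runs for a median")
    stages = list(runs[0])  # stage order taken from the first run
    medians = {s: median(run[s] for run in runs) for s in stages}
    total = sum(medians.values())
    return [(s, medians[s], 100.0 * medians[s] / total) for s in stages]


def render(rows: List[Row]) -> str:
    """Plain-text table suitable for pasting into an issue comment."""
    lines = [f"{'stage':<12}{'median s':>10}{'% of total':>12}"]
    for stage, seconds, percent in rows:
        lines.append(f"{stage:<12}{seconds:>10.1f}{percent:>12.1f}")
    return "\n".join(lines)
```

Running it once on the cold-cache runs and once on the hot-cache runs gives the two columns of the cold-vs-hot comparison.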

Phase 3 — Optimization burndown

  • Expected themes (to be confirmed by data, not assumed): overlap download ⇄ extract ⇄ install ⇄ first compiles; start compiling TUs whose deps are ready before the full install finishes; overlap archive/link prep with trailing compiles; cache anything recomputed on hot builds (config parse, library selection, header scan); remove sync-in-async stalls (continuing #817, the audit of sync code that could be async in fbuild-cli + fbuild-python, itself a sub-issue of #813).
  • Out of scope: compile settings. No changes to compiler/linker flags, optimization levels, or codegen — stock settings stay stock. Everything around the compiler is fair game.
  • Each optimization ships as its own PR against its sub-issue, with before/after numbers from the Phase 0 harness in the PR description.
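
To make the overlap and sync-in-async themes concrete, here is a toy model in Python's asyncio. It is not fbuild code: fbuild is a Rust project and its real pipeline is not shown in this issue, so the stage functions and sleep durations are stand-ins. Each package waits on a simulated download and then runs a simulated blocking extract, which is pushed off the event loop with asyncio.to_thread so it cannot stall other tasks.

```python
import asyncio
import time
from typing import Callable, Coroutine, List


def blocking_extract(pkg: str, seconds: float) -> str:
    time.sleep(seconds)  # stand-in for synchronous disk work
    return pkg


async def download(pkg: str, seconds: float) -> str:
    await asyncio.sleep(seconds)  # stand-in for network wait
    return pkg


async def fetch_and_extract(pkg: str, dl: float, ex: float) -> str:
    await download(pkg, dl)
    # Offload the blocking step so the event loop keeps serving other packages.
    await asyncio.to_thread(blocking_extract, pkg, ex)
    return pkg


async def serial(pkgs: List[str], dl: float, ex: float) -> None:
    for pkg in pkgs:  # one package at a time, nothing overlaps
        await fetch_and_extract(pkg, dl, ex)


async def overlapped(pkgs: List[str], dl: float, ex: float) -> None:
    # All packages progress concurrently through both stages.
    await asyncio.gather(*(fetch_and_extract(p, dl, ex) for p in pkgs))


def timed(fn: Callable[..., Coroutine], *args) -> float:
    start = time.perf_counter()
    asyncio.run(fn(*args))
    return time.perf_counter() - start


if __name__ == "__main__":
    pkgs = ["toolchain", "framework", "library"]  # illustrative names
    print(f"serial:     {timed(serial, pkgs, 0.1, 0.1):.2f}s")
    print(f"overlapped: {timed(overlapped, pkgs, 0.1, 0.1):.2f}s")
```

The serial variant takes roughly the sum of all stage times, while the overlapped one approaches the time of the slowest single package. In a tokio-based codebase the analogous tool for the blocking step would be `tokio::task::spawn_blocking`; whether fbuild should use it at any given site is for the profile data to decide.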

Phase 4 — Verification + close-out

  • Re-run the harness after each merged PR; post a final report on this issue with cold and hot wall-clock deltas vs the Phase 1 baseline.
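
The before/after numbers for a PR description or the final report reduce to a small calculation, sketched here under the assumption that each harness run has already been summarized as median wall-clock seconds per cache state. The keys "cold" and "hot" are illustrative labels, not a defined report schema.

```python
from typing import Dict, Tuple

# (baseline seconds, current seconds, absolute delta, percent change)
Delta = Tuple[float, float, float, float]


def delta_report(baseline: Dict[str, float],
                 current: Dict[str, float]) -> Dict[str, Delta]:
    """Compare current medians against the baseline, per cache state."""
    report: Dict[str, Delta] = {}
    for state, base in baseline.items():
        now = current[state]
        change = now - base  # negative means the build got faster
        report[state] = (base, now, change, 100.0 * change / base)
    return report


def render_delta(report: Dict[str, Delta]) -> str:
    """One line per cache state, ready to paste into a PR description."""
    return "\n".join(
        f"{state}: {base:.1f}s -> {now:.1f}s ({pct:+.1f}%)"
        for state, (base, now, _change, pct) in report.items()
    )
```

Reporting the percentage alongside the absolute seconds keeps small hot-build wins visible next to much larger cold-build numbers.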

Acceptance criteria

  • Docker harness merged (Dockerfile + run script) that produces cold-cache and hot-cache builds of NightDriverStrip with fbuild cache not persisted between docker runs.
  • On-CPU and off-CPU profiles + FBUILD_PERF_LOG event timelines captured for both cold and hot runs and attached to this issue.
  • Baseline numbers posted: cold and hot wall clock with per-phase breakdown (median of ≥3 runs).
  • Every identified slow/uncached/serialized path has a child sub-issue linked from a task list on this issue.
  • All child sub-issues resolved via merged PRs, each PR showing before/after harness numbers.
  • Final cold + hot wall-clock comparison vs baseline posted; all sub-issues closed; this issue closed.

Decisions

  • Benchmark workload: NightDriverStrip demo env (ESP32/Xtensa) — it's the fixture the existing ignored integration test already targets and the heaviest real-world project we've built.
  • Fixture provisioning: Docker harness clones NightDriverStrip at a pinned commit rather than committing the tree to this repo — keeps the repo lean, keeps runs reproducible.
  • Profiler choice: perf + flamegraphs for on-CPU, off-CPU flamegraphs (perf sched/eBPF) for waits — standard Linux tooling that works in a container; exact tool swap is fine if the harness PR finds something better.
  • Harness location: ci/docker-profile/ alongside the existing ci/docker-* dirs; scripts in Python via uv run per the language policy (CI scripting is the sanctioned Python use).
  • Sub-issue granularity: one issue per independently-mergeable optimization, all linked from a task list here; this matches the meta-issue pattern used by #603 (meta: broker-safe global artifact repository feature burndown).
  • Priority: P2 — significant DX win, nothing shipping is blocked on it.
  • Compile settings frozen: stock flags only; performance must come from caching, concurrency, and pipeline overlap.
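
For the fixture-provisioning decision, one way to clone at a pinned commit without pulling full history is to fetch that single commit by SHA. The sketch below only builds the git command lines; the repository URL and commit are supplied by the caller and are placeholders here, since this issue does not state either. It assumes the hosting server permits fetching a commit by its SHA, which not every git server does.

```python
import re
from typing import List


def pinned_clone_cmds(repo_url: str, commit: str, dest: str) -> List[List[str]]:
    """git commands that check out exactly one commit with depth 1."""
    # Require a full SHA-1 so a moving ref like a branch name cannot slip in.
    if not re.fullmatch(r"[0-9a-f]{40}", commit):
        raise ValueError("commit must be a full 40-character lowercase hex SHA")
    return [
        ["git", "init", dest],
        ["git", "-C", dest, "remote", "add", "origin", repo_url],
        ["git", "-C", dest, "fetch", "--depth", "1", "origin", commit],
        ["git", "-C", dest, "checkout", "--detach", "FETCH_HEAD"],
    ]
```

Rejecting anything that is not a full SHA is the point of the sketch: a branch or tag name would clone fine but would let the benchmark workload drift between runs. The destination would presumably be tests/NightDriverStrip/, the path the ignored integration test expects.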

Related issues

Burndown (Phase 2 task list)

Correctness blockers found by the harness (done):

Harness:

Optimization sub-issues (from the Phase-1 baseline):
