manifest/bazel: nested-workspace + Bazel-native Maven extraction #1342
Draft
Simon (simonhj) wants to merge 10 commits into
…ub-workspace discovery
The existing bazel-query discovery path only inspects MODULE.bazel / WORKSPACE at the invocation cwd. Ruleset repos with per-example sub-workspaces (rules_kotlin/examples, rules_js/examples, rules_rust, rules_python) declare additional Maven artifacts in nested MODULE.bazel projects with their own maven_install.json lockfiles. Those files were silently dropped, leaving the CLI's SBOM a strict subset of what the server-side depscan parser already returns from the same tree.
Add a walker that finds every checked-in maven_install.json under cwd (pruning .git, node_modules, .socket-auto-manifest, and Bazel's bazel-* convenience symlinks into <output_base>), parses each via the existing parseUnsortedDepsJson v2-lockfile path, and merges the artifacts into the SBOM after the bazel-query extraction step. Merge is keyed by mavenCoordinates so the root workspace's lockfile (which bazel-query already extracts) does not double-count; conflicting group:artifact versions across sub-workspaces continue to surface as the existing loud-failure error in normalizeToMavenInstallJson.
Verified against bazel-bench/oss/rules_kotlin: walker now surfaces all 10 examples/*/maven_install.json files and merges 393 unique artifacts into the SBOM beyond what the root @kotlin_rules_maven discovery returns. No regression on tink-java (0 lockfiles) or protobuf (1 root lockfile, deduped against bazel-query's @maven extraction).
…er walker already covers it
The CLI was walking the tree for **/maven_install.json and **/*_maven_install.json lockfiles and merging them into its output. The server-side scan walker matches the same pattern natively via getReportSupportedFiles, so the CLI re-reading these files duplicated work and produced output that was a strict subset of what the walker already saw when the scan was uploaded.
Removes:
- bazel-lockfile-discovery.mts (196 lines)
- bazel-lockfile-discovery.test.mts (241 lines)
- extract_bazel_to_maven step 5b (33 lines): the merge-back-into-allArtifacts loop
The .socket-auto-manifest/maven_install.json the CLI emits is still picked up by the same walker, so that composition stays intact. After this change the CLI emits only what running bazel produces (the complement of the walker's lockfile coverage).
…very
`findWorkspaceRoots` walks the tree from cwd and returns every directory containing MODULE.bazel / WORKSPACE / WORKSPACE.bazel. Monorepos host multiple workspace roots (e.g. examples/<name>/MODULE.bazel, mobile/MODULE.bazel under an otherwise non-Bazel root); the per-workspace algorithm in the orchestrator runs once per discovered root.
Pruning matches the previous lockfile walker: skip the usual non-workspace directories (.git, node_modules, .socket-auto-manifest, etc.), Bazel's `bazel-*` output_base symlinks (so we never recurse into tens of GiB of generated state), and `dist*` build-output directories. Caps `MAX_WALK_DEPTH` and `MAX_WORKSPACE_ROOTS` guard against pathological inputs and symlink loops.
Pure-function module with no Bazel calls; unit tests use a tmpdir fixture tree and cover the root-only, nested, prune, symlink, and sort-determinism cases.
…+ probe primitives
Drop all static parsing of MODULE.bazel / WORKSPACE / *.bzl sources.
Bazel itself sees those files via `mod show_extension` and `cquery`; the
CLI no longer needs to interpret Starlark.
`parseShowExtensionOutput` consumes the text-format report from
bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
and returns the hub repos (items annotated with `(imported by ...)`).
Generated per-artifact bullets are skipped; `DEBUG:` / `WARNING:` lines
are tolerated; the parser stops at the next `## ` section header so
multi-extension reports don't cross-contaminate.
`classifyProbeResult` turns a raw probe outcome into a tri-state status:
- populated: code=0 + non-empty stdout
- empty: code=1 + "no targets found beneath"
- not-defined: code=1 + "No repository visible" / "no such package",
or code=0 + empty stdout (WORKSPACE-mode silent miss)
The orchestrator treats `empty` and `not-defined` uniformly as skips; the
distinction is preserved for the sidecar status report.
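A minimal sketch of the tri-state rules above, for reviewers who want them in one place. It assumes the Bazel messages are matched against stderr and that any unrecognised failure is treated as `not-defined`; the commit message states neither, and `shouldExtract` is a hypothetical helper name.

```typescript
type ProbeStatus = "populated" | "empty" | "not-defined";

interface ProbeResult {
  code: number;
  stdout: string;
  stderr: string;
}

function classifyProbeResult({ code, stdout, stderr }: ProbeResult): ProbeStatus {
  if (code === 0) {
    // Success with empty stdout is the WORKSPACE-mode silent miss.
    return stdout.trim().length > 0 ? "populated" : "not-defined";
  }
  if (code === 1 && stderr.includes("no targets found beneath")) {
    return "empty";
  }
  if (code === 1 && /No repository visible|no such package/.test(stderr)) {
    return "not-defined";
  }
  // Assumption: anything unrecognised is also a skip; the source does not
  // say how undocumented outcomes are classified.
  return "not-defined";
}

// Hypothetical helper: the orchestrator skips both non-populated states.
function shouldExtract(status: ProbeStatus): boolean {
  return status === "populated";
}
```

Keeping `empty` and `not-defined` distinct in the return type costs nothing here and leaves the distinction available to any status reporting.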
`CONVENTIONAL_MAVEN_REPO_NAMES` exposes the names the legacy WORKSPACE
path probes (`maven`, `maven_install`, `maven_dev`, `unpinned_maven`,
`maven_unpinned`). `--bazel-maven-repo=` extras are appended by the
orchestrator (sibling todo).
Deleted exports: `parseMavenRepoCandidates`, `parseVisibleRepoCandidates`,
`validateMavenRepo`, `discoverMavenRepos`. Their replacements live in the
new primitives above; the orchestrator rewrite that wires them up lands
in a follow-up layer. `extract_bazel_to_maven.mts` does not typecheck
in this intermediate state — fixed in the orchestrator commit.
Tests cover the parser fixture (hub vs generated, separator variants,
multi-section reports), the tri-state classifier (every documented
input), and the verbose-logging contract for `probeCandidate`.
…tate probe
bazel-query-runner now centralises startup-flag construction so every
spawn — query, cquery, mod show_extension, mod dump_repo_mapping —
threads `--bazelrc`, `--output_user_root`, and `--output_base`
consistently. The new optional `outputUserRoot` field on
`BazelQueryOptions` is the Maven path's hook for per-invocation server
isolation; the orchestrator (next commit) mkdtemp's a fresh path and
will reap the server via `bazel shutdown` + `rm -rf` on success and on
timeout, so timed-out servers no longer leak across CLI invocations.
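The centralised startup-flag construction could look roughly like the sketch below. Only `outputUserRoot` is named in this commit message; `bazelRc`, `outputBase`, `buildStartupFlags`, and `buildBazelArgv` are hypothetical names used for illustration. The point it shows is ordering: Bazel startup options must come before the subcommand.

```typescript
interface BazelQueryOptions {
  bazelRc?: string; // hypothetical field name
  outputUserRoot?: string; // named in the commit message
  outputBase?: string; // hypothetical field name
}

// Builds the startup flags once so every spawn surface shares them.
function buildStartupFlags(opts: BazelQueryOptions): string[] {
  const flags: string[] = [];
  if (opts.bazelRc) flags.push(`--bazelrc=${opts.bazelRc}`);
  if (opts.outputUserRoot) flags.push(`--output_user_root=${opts.outputUserRoot}`);
  if (opts.outputBase) flags.push(`--output_base=${opts.outputBase}`);
  return flags;
}

// Startup flags first, then the subcommand, then its arguments.
function buildBazelArgv(
  opts: BazelQueryOptions,
  subcommand: readonly string[],
  args: readonly string[] = [],
): string[] {
  return [...buildStartupFlags(opts), ...subcommand, ...args];
}

const exampleArgv = buildBazelArgv(
  { outputUserRoot: "/tmp/socket-bazel-abc" },
  ["mod", "show_extension"],
  ["@rules_jvm_external//:extensions.bzl%maven"],
);
```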
Add `runBazelModShowMavenExtension`: invokes
bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven
to enumerate Maven hubs directly from the rules_jvm_external extension
report, replacing the over-enumerating `dump_repo_mapping` surface on
the Maven path. `runBazelModShowVisibleRepos` is kept around for the
legacy PyPI extractor, which has not been rescoped yet.
Replace the Maven-side `buildProbeFor` (which emitted a kind-only
`kind("jvm_import rule|aar_import rule", @repo//:*)` query) with
`buildMavenProbeFor`, a lightweight `cquery @<name>//... --output=label
--keep_going` presence check whose result feeds the new tri-state
classifier in bazel-repo-discovery. Kind-only filtering missed
POM-only / native / AAR-without-aar_import artefacts and any future
rules_jvm_external rule shape; the metadata filter is now applied by
the per-repo extraction cquery (next layer), not by the probe.
Update `buildPypiProbeFor`'s return shape to include stderr so it
satisfies the new `RepoProbe` type contract. Move
`parseVisibleRepoCandidates` and the `ValidationResult` type into
bazel-pypi-discovery (their only remaining consumer); the Maven module
no longer carries dump_repo_mapping-shaped code.
Tests cover the new argv shapes for every spawn surface, the
outputUserRoot startup-flag placement (before subcommand), the
Maven probe argv (cquery + @repo//... + --output=label + --keep_going),
and the full result-triple propagation (code/stdout/stderr) that the
tri-state classifier needs.
`runMetadataCqueryForRepo` executes the per-repo extraction cquery and
returns a structured outcome (`ok` / `partial` / `timeout` / `empty` /
`error`) so the orchestrator can populate sidecar status without
custom error plumbing per call site. The cquery target expression is
the union of three predicates — `attr("tags", "\bmaven_coordinates=",
...)`, `attr("maven_coordinates", ".+", ...)`, and `attr("maven_url",
".+", ...)`. That matches rules_jvm_external's `jvm_import` /
`aar_import` shapes, Bazel-native `java_library` with direct
`maven_coordinates`, and POM-only / source-jar shapes that carry only
`maven_url`. Word-boundary `\b` in the tags predicate prevents matches
on values like `pre_maven_coordinates=fake`.
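As an illustration of the target expression described above, the sketch below joins the three `attr(...)` predicates with the query language's `+` union operator. `buildMetadataTargetExpression` is a hypothetical name, and the exact string the PR emits may differ. The regex check at the end uses JavaScript's regex engine only to demonstrate the word-boundary behaviour; Bazel evaluates the pattern with its own engine.

```typescript
// Hypothetical builder for the per-repo metadata cquery target expression.
function buildMetadataTargetExpression(repoName: string): string {
  const scope = `@${repoName}//...`;
  return [
    `attr("tags", "\\bmaven_coordinates=", ${scope})`,
    `attr("maven_coordinates", ".+", ${scope})`,
    `attr("maven_url", ".+", ${scope})`,
  ].join(" + ");
}

const expression = buildMetadataTargetExpression("maven");

// `_` is a word character, so there is no boundary between `pre_` and
// `maven_coordinates`, and the lookalike tag is rejected.
const tagPattern = /\bmaven_coordinates=/;
const matchesReal = tagPattern.test("[maven_coordinates=com.google.guava:guava:33.0.0-jre]");
const matchesLookalike = tagPattern.test("[pre_maven_coordinates=fake]");
```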
`parseCqueryJsonproto` is defensive about the jsonproto encoding:
dispatches on `attribute[].type`, accepts both camelCase
(`stringValue`, `stringListValue`) and snake_case (`string_value`,
`string_list_value`) payload keys, and tolerates both the Bazel 5+
envelope shape (`{ "results": [{ "target": {...} }] }`) and the older
per-line streamed shape. Coordinate extraction prefers the direct
`maven_coordinates` attribute; falls back to scanning `tags` for
`maven_coordinates=G:A:V`. Provenance lands in `sourceRepo` as
`<workspace-rel-path>:<repoName>` (or just `<repoName>` at the root),
so the orchestrator's dedup can attribute artifacts back to their
discovery site.
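A rough sketch of the coordinate-extraction rules described above: dispatch on the attribute type, accept either key casing, prefer the direct attribute, and fall back to tags. The `JsonAttribute` shape, the `STRING` / `STRING_LIST` type names, and the helper names are assumptions for illustration, not the PR's actual parser.

```typescript
interface JsonAttribute {
  name?: string;
  type?: string;
  [key: string]: unknown;
}

const TAG_PREFIX = "maven_coordinates=";

// Accepts both camelCase and snake_case payload keys.
function readString(attr: JsonAttribute): string | undefined {
  const value = attr["stringValue"] ?? attr["string_value"];
  return typeof value === "string" && value.length > 0 ? value : undefined;
}

function readStringList(attr: JsonAttribute): string[] {
  const value = attr["stringListValue"] ?? attr["string_list_value"];
  return Array.isArray(value) ? value.filter((v): v is string => typeof v === "string") : [];
}

// Direct `maven_coordinates` wins; the tags scan is only a fallback.
function extractCoordinate(attributes: readonly JsonAttribute[]): string | undefined {
  let fromTags: string | undefined;
  for (const attr of attributes) {
    switch (attr.type) {
      case "STRING":
        if (attr.name === "maven_coordinates") {
          const direct = readString(attr);
          if (direct) return direct;
        }
        break;
      case "STRING_LIST":
        if (attr.name === "tags" && fromTags === undefined) {
          const tag = readStringList(attr).find(t => t.startsWith(TAG_PREFIX));
          if (tag) fromTags = tag.slice(TAG_PREFIX.length);
        }
        break;
    }
  }
  return fromTags;
}

// Provenance: `<workspace-rel-path>:<repoName>`, or just `<repoName>` at the root.
function sourceRepoFor(workspaceRelPath: string, repoName: string): string {
  return workspaceRelPath === "" ? repoName : `${workspaceRelPath}:${repoName}`;
}
```

Targets for which `extractCoordinate` returns `undefined` would be skipped, matching the missing-coordinate case the tests mention.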
Timeout handling: spawn rejections with `timedOut` / `killed` /
`SIGTERM` / `SIGKILL` map to `status: 'timeout'`. The runner does NOT
delete the outputUserRoot — server lifecycle (reap via
`bazel shutdown` + `rm -rf`) is the orchestrator's concern so that a
single tempdir can hold multiple per-repo runs.
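The timeout mapping might be sketched as follows. The `SpawnRejection` field names mirror the markers listed above but are an assumption about the spawn library's error shape, and mapping every other rejection to `error` is also an assumption.

```typescript
type CqueryStatus = "ok" | "partial" | "timeout" | "empty" | "error";

// Assumed shape of a rejected spawn; only these fields are inspected.
interface SpawnRejection {
  timedOut?: boolean;
  killed?: boolean;
  signal?: string | null;
}

function isTimeoutRejection(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const rejection = err as SpawnRejection;
  return (
    rejection.timedOut === true ||
    rejection.killed === true ||
    rejection.signal === "SIGTERM" ||
    rejection.signal === "SIGKILL"
  );
}

// Assumption: non-timeout rejections are reported as `error`.
function statusForRejection(err: unknown): CqueryStatus {
  return isTimeoutRejection(err) ? "timeout" : "error";
}
```

Note that nothing here touches the output_user_root; cleanup stays with the caller, as the commit message describes.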
Also widen `ExtractedArtifact.ruleKind` from the literal
`'jvm_import' | 'aar_import'` union to `string`. The legacy text-format
parsers only ever set those two values, but the metadata cquery
returns whatever `ruleClass` Bazel reports (`java_library`,
`kt_jvm_import`, any future rules_jvm_external rule). Existing
consumers only read the field diagnostically; nothing else changes.
Tests cover the parser (envelope, per-line stream, snake_case
fallback, direct-vs-tag preference, missing-coordinate skip, empty
input), the argv builder (target expression union, startup-flag
placement, `--bazel-flag` placement, invocationFlags order), and the
runner's status classification including the spawn-timeout branch.
…thm in a tree walk
`extractBazelToMaven` now walks the scan root for every workspace
(MODULE.bazel / WORKSPACE / WORKSPACE.bazel) and runs the per-workspace
extraction algorithm in each one. Monorepos like rules_kotlin
(examples/<name>/MODULE.bazel) and projects with mobile sub-workspaces
(mobile/MODULE.bazel under a non-Bazel root) are no longer
silently dropped to the root-only path.
Per workspace:
1. Detect Bzlmod vs WORKSPACE mode.
2. Discover candidate Maven hubs:
- Bzlmod: bazel mod show_extension @rules_jvm_external//:extensions.bzl%maven,
parsed via parseShowExtensionOutput.
- WORKSPACE (or Bzlmod fallback): probe the conventional names
(maven, maven_install, maven_dev, unpinned_maven, maven_unpinned)
plus any customer-supplied extras via the tri-state classifier.
3. Per populated candidate: run the metadata cquery
(`attr("tags", "\bmaven_coordinates=", @<repo>//...)` ∪ direct
`maven_coordinates` / `maven_url` attrs) and accept the parsed
artefacts.
4. Aggregate, then dedup across workspaces by full Maven coordinate.
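Step 4 can be sketched as a keyed merge. The `coordinate` field name and the first-seen-wins policy are assumptions; `sourceRepo` and `ruleKind` follow the earlier commit messages.

```typescript
interface ExtractedArtifact {
  coordinate: string; // hypothetical field holding the full Maven coordinate
  sourceRepo: string;
  ruleKind: string;
}

// Dedups across workspaces by full coordinate, keeping the first occurrence
// so the result order follows discovery order.
function dedupByCoordinate(artifacts: readonly ExtractedArtifact[]): ExtractedArtifact[] {
  const seen = new Map<string, ExtractedArtifact>();
  for (const artifact of artifacts) {
    if (!seen.has(artifact.coordinate)) {
      seen.set(artifact.coordinate, artifact);
    }
  }
  return [...seen.values()];
}
```

Because the key is the full coordinate, two versions of the same group:artifact both survive this step; version conflicts are left for the later normalisation to report.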
Server isolation is now invariant: every Bazel invocation runs under a
per-CLI-call --output_user_root=<tempdir>. On per-repo cquery timeout
the orchestrator reaps the server (`bazel shutdown`) and `rm -rf`'s the
tempdir, then mints a fresh one for subsequent repos — a single bad
hub no longer cascades into the rest of the run. The finally-block
cleanup reaps every tempdir that was minted, including the last one.
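The reap-and-remint loop could be sketched as below. It is synchronous and takes its side effects through a hypothetical `IsolationDeps` interface so it can run without Bazel; in practice `mintDir` would be an `mkdtemp`, `removeDir` an `rm -rf`, and `shutdownServer` a `bazel shutdown` against that output_user_root.

```typescript
type RepoStatus = "ok" | "timeout";

// Hypothetical injection points standing in for real side effects.
interface IsolationDeps {
  mintDir(): string;
  removeDir(dir: string): void;
  shutdownServer(dir: string): void;
  runRepo(repo: string, outputUserRoot: string): RepoStatus;
}

function runReposIsolated(repos: readonly string[], deps: IsolationDeps): Map<string, RepoStatus> {
  const live = new Set<string>();
  const mint = (): string => {
    const dir = deps.mintDir();
    live.add(dir);
    return dir;
  };
  const reap = (dir: string): void => {
    deps.shutdownServer(dir);
    deps.removeDir(dir);
    live.delete(dir);
  };

  const statuses = new Map<string, RepoStatus>();
  let current = mint();
  try {
    for (const repo of repos) {
      const status = deps.runRepo(repo, current);
      statuses.set(repo, status);
      if (status === "timeout") {
        // A timed-out server is discarded so later repos start clean.
        reap(current);
        current = mint();
      }
    }
  } finally {
    // Reaps whatever is still live, including the last minted dir.
    for (const dir of [...live]) reap(dir);
  }
  return statuses;
}
```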
Sidecar `manifest-status.json` lands beside the synthesized
`maven_install.json`. Each entry records the repo's classified status
(ok / partial / timeout / empty / error), artifact count, and duration,
so the server side can surface partial results to the customer. The
top-level `complete: false` flag fires iff any repo timed out.
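One way to type the sidecar and derive the `complete` flag is sketched below. The field names follow the sidecar example in the PR summary; `buildManifestStatus` is a hypothetical helper name.

```typescript
type RepoStatus = "ok" | "partial" | "timeout" | "empty" | "error";

interface RepoEntry {
  name: string;
  status: RepoStatus;
  artifactCount: number;
  durationMs: number;
}

interface WorkspaceEntry {
  relPath: string;
  mode: { bzlmod: boolean; workspace: boolean };
  repos: RepoEntry[];
}

interface ManifestStatus {
  complete: boolean;
  workspaces: WorkspaceEntry[];
}

// `complete` is false if and only if at least one repo timed out.
function buildManifestStatus(workspaces: WorkspaceEntry[]): ManifestStatus {
  const anyTimeout = workspaces.some(w => w.repos.some(r => r.status === "timeout"));
  return { complete: !anyTimeout, workspaces };
}
```

Under this rule an `error` or `partial` repo does not flip `complete`; only `timeout` does.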
Deleted: the unsorted_deps.json fast path (`extractFromOneRepo`,
`bazelExternalDir`, `isForceQueryFallbackEnabled` env knob) — the
metadata cquery returns the same GAVs the fast path used to recover,
without depending on bazel-out symlinks or generated artefacts.
Deleted: the lockfile merge (already done in a previous commit on this
branch); deleted: the kind-only probe and dump_repo_mapping enumeration.
The orchestrator's `ExtractBazelOptions` now accepts
`extraMavenRepoNames` (legacy WORKSPACE non-conventional hub names) and
`perRepoTimeoutMs` (per-repo cquery cap). The CLI flag wiring lands in
a sibling commit; existing call sites continue to pass the same fields
they did before.
Existing `extract_bazel_to_maven.test.mts` is pinned to the old
unsorted_deps fast path and is replaced wholesale in the next commit
(test layer).
…e pipeline
The previous tests pinned the legacy unsorted_deps.json fast path, kind-only probes, and dump_repo_mapping enumeration. The new tests mock the orchestrator's three external collaborators (findWorkspaceRoots, runBazelModShowMavenExtension, runMetadataCqueryForRepo) and assert on the contract that matters: end-to-end Bzlmod and WORKSPACE-mode flows, the per-repo cquery loop, cross-workspace coordinate dedup, the timeout → re-mint loop, sidecar `manifest-status.json` shape, and `extraMavenRepoNames` threading.
Pure-function `normalizeToMavenInstallJson` keeps a focused trio of unit tests (dedup, version-conflict, sha256-preservation). The fixture-driven .socket.facts.json non-emission assertion stays so the Maven-path-vs-facts-path invariant is exercised.
Also patch the PyPI test mock: parseVisibleRepoCandidates moved from bazel-repo-discovery to bazel-pypi-discovery in a previous commit, so the test's vi.mock now mirrors the actual export surface. The probe fixture grows a `stderr` field to match the new RepoProbe contract.
…GNORED_DIRS
`findWorkspaceRoots` no longer hardcodes the directory-prune set; callers pass `ignoreDirNames: ReadonlySet<string>` and `ignoreDirPrefixes: readonly string[]` via options. Neither defaults to anything; absent means no pruning. This keeps the walker decoupled from any particular ignore policy and avoids duplicating the codebase-wide `IGNORED_DIRS` list.
`src/utils/glob.mts` exports `IGNORED_DIRS` so the orchestrator can compose it with Bazel-specific extras. The orchestrator's composed set: `IGNORED_DIRS` plus `.hg`, `.idea`, `.pnpm-store`, `.socket-auto-manifest`, `.svn`, `.vscode`; prefixes `bazel-` and `dist`.
Also tighten `MAX_WALK_DEPTH` from 16 → 8. Deepest workspace marker observed across the surveyed OSS corpus is 9 (bazel-self test fixtures); deepest in realistic application code is 7 (checkmk's thirdparty layout). The cap gives one level of headroom over the realistic max while still guarding against pathological symlink loops that slipped past any prefix prune the caller supplied.
Walker test rewritten against the new injected API: covers the no-prune-by-default case (`node_modules/MODULE.bazel` surfaces unless the caller ignores `node_modules`), injected name and prefix prunes, and the bazel-* symlink case under the prefix injection.
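A sketch of a walker with injected prune policy, under several assumptions: directory listing is passed in as a hypothetical `listDir` callback (so the example runs against an in-memory tree rather than the filesystem), symlinks are reported as non-directories, the scan root counts as depth 0, and the `MAX_WORKSPACE_ROOTS` value is made up.

```typescript
interface DirEntry {
  name: string;
  isDirectory: boolean;
}

// Hypothetical injection point; a real walker would read the filesystem.
type ListDir = (dir: string) => DirEntry[];

interface WalkOptions {
  ignoreDirNames?: ReadonlySet<string>;
  ignoreDirPrefixes?: readonly string[];
}

const WORKSPACE_MARKERS = new Set(["MODULE.bazel", "WORKSPACE", "WORKSPACE.bazel"]);
const MAX_WALK_DEPTH = 8;
const MAX_WORKSPACE_ROOTS = 1000; // assumed value, not from the PR

function findWorkspaceRoots(root: string, listDir: ListDir, opts: WalkOptions = {}): string[] {
  // Absent options mean no pruning at all.
  const names = opts.ignoreDirNames ?? new Set<string>();
  const prefixes = opts.ignoreDirPrefixes ?? [];
  const roots: string[] = [];

  const visit = (dir: string, depth: number): void => {
    if (depth > MAX_WALK_DEPTH || roots.length >= MAX_WORKSPACE_ROOTS) return;
    const entries = listDir(dir);
    if (entries.some(e => !e.isDirectory && WORKSPACE_MARKERS.has(e.name))) {
      roots.push(dir);
    }
    for (const entry of entries) {
      if (!entry.isDirectory) continue;
      if (names.has(entry.name) || prefixes.some(p => entry.name.startsWith(p))) continue;
      visit(dir === "" ? entry.name : `${dir}/${entry.name}`, depth + 1);
    }
  };

  visit(root, 0);
  return roots.sort(); // sorted for deterministic output
}
```

With no options, a `node_modules/MODULE.bazel` would surface as a root; passing `ignoreDirNames: new Set(["node_modules"])` and `ignoreDirPrefixes: ["bazel-", "dist"]` prunes it along with Bazel's convenience symlinks.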
20957bc to
23e2f96
Compare
No consumer reads the per-workspace / per-repo status report today. The orchestrator still tracks per-repo timeouts to decide ExtractBazelResult.ok and to reap+remint the output_user_root, but no longer serialises that report to disk.
Summary
Rewrites `socket manifest bazel`'s Maven extraction pipeline so it (a) discovers every workspace under the scan root, not just cwd, and (b) relies on Bazel-native commands for repo enumeration instead of static Starlark regex parsing.
Server isolation
Every Bazel invocation runs under a per-CLI-call `--output_user_root=<tempdir>`. On per-repo cquery timeout the orchestrator reaps the server (`bazel shutdown`) and `rm -rf`s the tempdir, then mints a fresh one for subsequent repos. The finally-block cleans up every tempdir that was minted. A single hostile `repository_rule` no longer cascades into the rest of the run.
New modules
- `bazel-workspace-walk.mts`: pure-function workspace walker with injected prune policy (`ignoreDirNames`, `ignoreDirPrefixes`). `MAX_WALK_DEPTH = 8` (corpus survey: deepest realistic application layout is 7; bazel-self test fixtures hit 9). The orchestrator composes the codebase-wide `IGNORED_DIRS` from `src/utils/glob.mts` with Bazel-specific extras (`bazel-*`, `dist*`, `.socket-auto-manifest`, plus VCS/IDE dirs).
- `bazel-cquery.mts`: per-repo metadata cquery + defensive jsonproto parser (dispatches on `attribute[].type`; accepts both camelCase `stringValue` / `stringListValue` and snake_case `string_value` / `string_list_value`; tolerates Bazel 5+ envelope and older per-line streamed shapes).
Rewritten modules
- `bazel-repo-discovery.mts`: drops the entire Starlark regex parser (`USE_REPO_RE`, `MAVEN_INSTALL_NAME_RE`, `parseMavenRepoCandidates`, `listLegacyStarlarkFiles`, `safeReadFile`, `parseVisibleRepoCandidates`, `validateMavenRepo`, `discoverMavenRepos`). New primitives: `parseShowExtensionOutput`, `classifyProbeResult`, `probeCandidate`, `CONVENTIONAL_MAVEN_REPO_NAMES`.
- `bazel-query-runner.mts`: centralises startup-flag construction (`--bazelrc` / `--output_user_root` / `--output_base`). Drops `buildProbeFor` (kind-only probe). Adds `runBazelModShowMavenExtension` and `buildMavenProbeFor` (lightweight presence-check cquery feeding the tri-state classifier). `parseVisibleRepoCandidates` moved to `bazel-pypi-discovery.mts` (its only remaining consumer).
- `extract_bazel_to_maven.mts`: wraps the per-workspace algorithm in a tree walk. Drops the `unsorted_deps.json` fast path (the metadata cquery returns the same GAVs without depending on `bazel-out` symlinks or generated artefacts) and the lockfile merge-back loop (server walker handles it).
Sidecar shape
`.socket-auto-manifest/manifest-status.json`:
    {
      "complete": true,
      "workspaces": [
        {
          "relPath": "",
          "mode": { "bzlmod": true, "workspace": false },
          "repos": [
            { "name": "maven", "status": "ok", "artifactCount": 118, "durationMs": 28213 },
            { "name": "maven_dev", "status": "empty", "artifactCount": 0, "durationMs": 102 }
          ]
        }
      ]
    }
`complete: false` fires iff any repo timed out.