perf: speed up manifest JSON rendering #874
Closed
He-Pin wants to merge 7 commits into
Motivation:
std.manifestJson* still contributed to the local Scala Native gap versus source-built jrsonnet, especially in real-world object-heavy rendering.
Modification:
Add an internal StringBuilder-backed FastMaterializeJsonRenderer for std.manifestJson, std.manifestJsonMinified, and std.manifestJsonEx while preserving the public MaterializeJsonRenderer StringWriter API. Reuse an in-place codepoint key sorter backed by java.util.Arrays.sort, and fix raw-surrogate prefix ordering in compareStringsByCodepoint.
Result:
Full validation passed: ./mill --no-server --ticker false --color false __.reformat and ./mill --no-server --ticker false --color false -j 1 __.test reported 451/451 tests passing. JMH regression benchmarks: manifestJsonEx 0.055 ms/op, realistic2 43.596 ms/op, gen_big_object 0.842 ms/op. Direct hyperfine against source-built jrsonnet: manifestJsonEx sjsonnet-native 5.090 ms vs jrsonnet 4.075 ms; kube-prometheus sjsonnet-native 143.738 ms vs jrsonnet 97.385 ms.
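Why codepoint ordering needs its own comparator can be illustrated with a minimal sketch. `String.compareTo` compares UTF-16 code units, so a supplementary character (encoded as a surrogate pair starting at 0xD800 or above) sorts before BMP characters in the 0xE000-0xFFFF range, even though its code point is larger. The `CodepointOrder` object and its method names below are hypothetical and are not sjsonnet's actual `compareStringsByCodepoint`; the sketch only assumes the JDK's `codePointAt`, `Character.charCount`, and `java.util.Arrays.sort`.

```scala
object CodepointOrder {
  // Hypothetical helper, not sjsonnet's implementation: compares two strings
  // by Unicode code point instead of by UTF-16 code unit.
  def compare(a: String, b: String): Int = {
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      // For an unpaired surrogate, codePointAt returns the surrogate value itself.
      val ca = a.codePointAt(i)
      val cb = b.codePointAt(j)
      if (ca != cb) return Integer.compare(ca, cb)
      i += Character.charCount(ca)
      j += Character.charCount(cb)
    }
    // One string is a prefix of the other: the shorter one sorts first.
    Integer.compare(a.length - i, b.length - j)
  }

  // Sorts a copy of the keys with java.util.Arrays.sort and the comparator above.
  def sortKeys(keys: Array[String]): Array[String] = {
    val out = keys.clone()
    java.util.Arrays.sort(
      out,
      new java.util.Comparator[String] {
        def compare(x: String, y: String): Int = CodepointOrder.compare(x, y)
      }
    )
    out
  }
}
```

With this comparator, U+FF5E sorts before U+1F600, whereas `compareTo` orders them the other way around because 0xFF5E is greater than the leading surrogate 0xD83D.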
da92dd1 to
c3581e8
Motivation:
The JVM/char render hot path (BaseCharRenderer.visitNonNullString) ran a CharSWAR.hasEscapeChar scan on every string, even for Val.AsciiSafeStr which is statically known to need no JSON escaping (chars 0x20-0x7e, no quote/backslash). The Native ByteRenderer already had this bypass; the char path did not.
Modification:
- Add BaseCharRenderer.visitAsciiSafeString: quote + bulk getChars + quote, correct even under escapeUnicode since all chars are <= 0x7e.
- Route Val.AsciiSafeStr through it via a Materializer.visitStr helper at the three value-string sites; ujson.Value AST path falls back to visitString.
- Add AsciiSafeRenderBenchmark to isolate the render path for A/B.
Result:
JMH render-only, 335KB string-heavy output: 1.606 -> 1.441 ms/op (-10.3%, non-overlapping error bands). 450/450 tests pass.
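The "quote + bulk getChars + quote" fast path can be sketched roughly as follows. This is a minimal illustration, not the `BaseCharRenderer` code: the `AsciiSafeRender` object and both method names are hypothetical. `isAsciiSafe` stands in for the one-time classification that a type like `Val.AsciiSafeStr` would carry, and `renderAsciiSafe` shows that a string already known to be safe can be emitted with a single bulk copy and no per-character escape scan.

```scala
object AsciiSafeRender {
  // True when every char is printable ASCII (0x20-0x7e) and is neither a
  // double quote nor a backslash, i.e. JSON output needs no escaping.
  def isAsciiSafe(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < 0x20 || c > 0x7e || c == '"' || c == '\\') return false
      i += 1
    }
    true
  }

  // Fast path for strings already known to be safe:
  // opening quote, one bulk copy via String.getChars, closing quote.
  def renderAsciiSafe(s: String): Array[Char] = {
    val out = new Array[Char](s.length + 2)
    out(0) = '"'
    s.getChars(0, s.length, out, 1)
    out(out.length - 1) = '"'
    out
  }
}
```

Because every character is at most 0x7e, the same output is valid whether or not non-ASCII characters would otherwise be escaped, which matches the commit's remark about `escapeUnicode`.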
Motivation:
std.manifestTomlEx routed through java.io.StringWriter, whose backing StringBuffer pays a monitor enter/exit on every write/flush on the hot TOML manifestation path. The JSON renderer already switched to the unsynchronized StringBuilderWriter in databricks#874 (-9.3% on kube-prometheus native); TOML did not.
Modification:
- Switch TomlRenderer and the manifestTomlEx render path in ManifestModule from java.io.StringWriter to the package-private StringBuilderWriter. Output is byte-identical. std.deepJoin keeps StringWriter (separate concern).
- Add TomlRenderBenchmark to A/B the render path.
Result:
Native hyperfine, TOML-heavy workload (1.79MB output): after ran 1.11 ± 0.07x faster than before (~10%), output byte-identical. JMH (whole-pipeline) showed AFTER < BEFORE in two independent rounds. 450/450 tests pass.
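For readers unfamiliar with the distinction, an unsynchronized `Writer` over a `StringBuilder` might look like the sketch below. The class name `SketchStringBuilderWriter` and its constructor parameter are assumptions made for illustration; the actual package-private `StringBuilderWriter` in sjsonnet may differ. The point is only that each `write` appends to a `java.lang.StringBuilder` directly, without the monitor that `StringBuffer` acquires.

```scala
import java.io.Writer

// Hypothetical sketch: a Writer that appends to an unsynchronized StringBuilder.
final class SketchStringBuilderWriter(initialCapacity: Int = 16) extends Writer {
  private val sb = new java.lang.StringBuilder(initialCapacity)

  // Single char, passed as an Int per the java.io.Writer contract.
  override def write(c: Int): Unit = { sb.append(c.toChar); () }

  // Slice of a char array: len chars starting at off.
  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = {
    sb.append(cbuf, off, len); ()
  }

  override def write(str: String): Unit = { sb.append(str); () }

  // Slice of a string; StringBuilder.append takes (start, end), not (off, len).
  override def write(str: String, off: Int, len: Int): Unit = {
    sb.append(str, off, off + len); ()
  }

  // Nothing is buffered elsewhere, so flush and close are no-ops.
  override def flush(): Unit = ()
  override def close(): Unit = ()

  override def toString: String = sb.toString
}
```

A larger `initialCapacity` trades a little memory up front for fewer buffer reallocations on large outputs.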
Motivation:
Parser.Pos is invoked for nearly every AST node. It was `Index.map(off => new Position(...))`: fastparse's `Index` stores the offset as an Int in its `successValue: Any` field (boxing it), and the `.map` then unboxes it and allocates a closure, all per node. boxToInteger via SharedPackageDefs.Index was a top self-frame in the parse flamegraph on kube-prometheus.
Modification:
- Rewrite Pos to write the Position object straight into successValue via ctx.freshSuccess(new Position(fileScope, ctx.index)), skipping the Int box/unbox and the map closure. Parse output (positions/errors) is unchanged.
Result:
JMH ParserBenchmark (parse-only, all test-suite files): 1.669 -> 1.579 ms/op (5.4% faster, non-overlapping bands). Native parse_time on kube-prometheus: ~105.6 -> ~100.9 ms (4.5% faster, consistent). Output byte-identical. 450/450 tests pass.
Motivation:
exprSuffix2 was `Pos.flatMapX { i => CharIn(".[({")... }`, which allocated a
Position on EVERY attempt — including the failing attempt that terminates
`exprSuffix2.rep` after each expression. Most subexpressions have no suffix, so
that trailing failed attempt (one per expression) allocated a Position that was
immediately discarded.
Modification:
- Match the suffix char first; allocate `new Position(fileScope, ctx.index - 1)`
only inside the matching branch. No suffix -> CharIn fails fast, no Position.
Also drops the `.map(_(0))` Char step. Parse output (positions/errors) is
unchanged.
Result:
JMH ParserBenchmark (-f0, same-session): 1.560 -> 1.530 ms/op (1.9% faster). Native
parse_time on kube-prometheus: non-regressing, min/p25 ~2% lower (noise-limited
on a loaded machine). Output byte-identical. 517/517 tests pass.
Motivation:
std.manifestJson* render fully in memory via FastMaterializeJsonRenderer. The inherited flushCharBuilder spilled the CharBuilder to the output writer at every sub-tree boundary, adding buffer-to-buffer copies that are pure overhead when the whole document is built in memory and emitted once.
Modification:
- Override flushCharBuilder to write out only when depth == 0 (root finished); accumulate everything in elemBuilder until then.
- Size StringBuilderWriter's initial buffer at 4096 (was 16) to cut early reallocations, and mark it private[sjsonnet].
Result:
Fewer intermediate copies on the manifestJson* path; output byte-identical.
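The root-only flush can be illustrated with a small sketch. `RootFlushRenderer` is a hypothetical class, not the `FastMaterializeJsonRenderer` itself, and it only handles nested lists of integers; `flushCount` exists solely so the behavior can be observed. It assumes the same idea as the commit: keep appending to one in-memory buffer and copy to the output `Writer` only when the depth counter returns to zero.

```scala
import java.io.Writer

// Hypothetical sketch: buffers a nested document and copies it to the Writer
// only once the root value is complete.
final class RootFlushRenderer(out: Writer) {
  private val buf = new java.lang.StringBuilder(4096)
  private var depth = 0
  var flushCount = 0 // exposed only so the sketch can be inspected

  // Writes the buffer out only when no container is still open.
  private def flushIfRoot(): Unit =
    if (depth == 0 && buf.length() > 0) {
      out.write(buf.toString)
      buf.setLength(0)
      flushCount += 1
    }

  // Renders Ints and (nested) Lists of Ints as minified JSON.
  def render(value: Any): Unit = {
    value match {
      case xs: List[_] =>
        buf.append('[')
        depth += 1
        var first = true
        xs.foreach { x =>
          if (!first) buf.append(',')
          first = false
          render(x)
        }
        depth -= 1
        buf.append(']')
      case n: Int => buf.append(n)
      case other  => throw new IllegalArgumentException(s"unsupported: $other")
    }
    flushIfRoot()
  }
}
```

Every nested call still reaches `flushIfRoot`, but only the outermost one passes the `depth == 0` check, so the document is copied to the writer once.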
…Chars ascii mask
Adds regression coverage:
- object_remove_key_directional: objectRemoveKey interaction with super / addSuper (`a+:`) merge and inline addSuper asserts.
- strip_chars_ascii_mask_directional: stripChars over the ASCII range.
Superseded — split into focused, independently-measured PRs off current master (each output byte-identical, no benchmark code):
The manifest-JSON rendering work this PR was based on is already in master (da92dd1). Closing in favor of the smaller PRs above.
This was referenced Jun 3, 2026
stephenamar-db pushed a commit that referenced this pull request Jun 3, 2026
## Motivation

`std.manifestTomlEx` had three sources of avoidable overhead on the hot manifestation path:

1. **Synchronized writer.** `TomlRenderer` and `ManifestModule.evalRhs` rendered into a `java.io.StringWriter`, whose backing `StringBuffer` pays a monitor enter/exit on every `write`/`flush`. The `FastMaterializeJsonRenderer` already uses the unsynchronized `StringBuilderWriter` (#874); TOML did not.
2. **Redundant field lookups in `renderTableInternal`.** Each key's `Val.Obj.value(k)` was resolved twice: once to classify scalar vs section, then again to render or recurse. The cache deduplicates the result, but the lookup itself still costs.
3. **Wasted indexing work.** `visibleKeyNames` was iterated and each key binary-searched back into `sortedVisibleKeyNames`; `sortedVisibleKeyNames` can be iterated directly, skipping `O(n log n)` compares per table.

## Modification

Two commits:

- **`perf: use unsynchronized StringBuilderWriter in TomlRenderer`**: Swap `TomlRenderer` and the `manifestTomlEx` render path in `ManifestModule` from `java.io.StringWriter` to the package-private `StringBuilderWriter`. `std.deepJoin` keeps `StringWriter` (separate concern).
- **`perf: cache resolved field values and skip binary search in renderTableInternal`**: Resolve each field once into a `resolved: Array[Val]` during section classification and reuse it during render/recurse; iterate `sortedVisibleKeyNames` directly (removes the now-unused `sortedKeyIndex` binary search); hoist `childIndent = cumulatedIndent + indent` out of the section loop (was an identical allocation per sibling section); pre-size the output `StringBuilderWriter` to 1 KiB so small/medium outputs skip the first ~6 doublings.

Output is byte-identical (verified at 1,228,186 bytes on the benchmark workload).

## Result

Scala Native, hyperfine A/B against `master` (`fc292fa6`). Workload: object comprehension over 8000 small tables → ~1.2 MB TOML output (render-dominated).
Four interleaved-order passes, `--warmup 10 --min-runs 100 --shell=none`:

| pass | order | before mean | after mean | before min | after min | **min ratio** |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 59.4 ± 2.7 ms | 53.2 ± 23.4 ms | 55.4 ms | 43.8 ms | **1.27×** |
| 2 | after → before | 64.1 ± 7.7 ms | 51.8 ± 12.2 ms | 56.4 ms | 43.7 ms | **1.29×** |
| 3 | before → after | 64.1 ± 8.1 ms | 53.2 ± 14.3 ms | 56.4 ms | 42.0 ms | **1.34×** |
| 4 | after → before | 63.3 ± 14.3 ms | 49.2 ± 3.7 ms | 57.2 ms | 42.8 ms | **1.34×** |

Mean is noisy on the host (1.12×–1.29×), but **after is faster in every one of the 4 passes** and the **min values are tight at ~1.27–1.34× faster** (best observed: 42.0 ms vs 56.4 ms, ~25.5% reduction). Output byte-identical, 1,228,186 bytes both sides.

For comparison, the StringBuilderWriter swap alone (commit 1) measures ~1.08–1.14× min; the cache + binary-search elimination + childIndent hoist (commit 2) lifts that to ~1.27–1.34× min.

## Test plan

- [x] `./mill __.reformat`
- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test`: 519/519 pass
- [x] Scala Native A/B hyperfine: 4 interleaved-order passes, all positive; output byte-identical

---

> Rebased onto current `master` (`fc292fa6`). The companion commit "speed up manifest JSON rendering" was merged separately as #879, so this PR now contains only the TomlRenderer / ManifestModule changes.
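The "resolve once, iterate the sorted keys directly" idea from the second commit can be sketched as below. Everything here is a hypothetical stand-in: `CountingObj` replaces `Val.Obj` and counts lookups so the saving is observable, only integer scalars are supported, and the output is a simplified TOML-like form rather than what `TomlRenderer` emits.

```scala
object ResolveOnce {
  // Hypothetical stand-in for an object whose field lookup is not free.
  final class CountingObj(fields: Map[String, Any]) {
    var lookups = 0
    val sortedKeys: Array[String] = fields.keys.toArray.sorted
    def value(k: String): Any = { lookups += 1; fields(k) }
  }

  // Emits scalars first, then nested tables, resolving each field exactly once.
  def render(obj: CountingObj, path: String, out: java.lang.StringBuilder): Unit = {
    val keys = obj.sortedKeys // iterated directly, no search back into the array
    val resolved = new Array[Any](keys.length)
    var i = 0
    // Pass 1: resolve each field once, cache it, and emit the scalars.
    while (i < keys.length) {
      val v = obj.value(keys(i))
      resolved(i) = v
      v match {
        case _: CountingObj => () // sections are handled in pass 2
        case scalar => out.append(keys(i)).append(" = ").append(scalar.toString).append('\n')
      }
      i += 1
    }
    // Pass 2: recurse into sections, reusing the cached values.
    i = 0
    while (i < keys.length) {
      resolved(i) match {
        case child: CountingObj =>
          val childPath = if (path.isEmpty) keys(i) else path + "." + keys(i)
          out.append('[').append(childPath).append("]\n")
          render(child, childPath, out)
        case _ => ()
      }
      i += 1
    }
  }
}
```

The `resolved` array costs one allocation per table, which is the trade the commit makes to avoid a second lookup per key.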
stephenamar-db pushed a commit that referenced this pull request Jun 3, 2026
## Motivation

`std.deepJoin` writes each `Val.Str` chunk into a `java.io.StringWriter` inside a tight loop. `StringWriter`'s backing `StringBuffer` pays a monitor enter/exit on every `write`/`append` call, which on a typical deepJoin walk over a deeply nested array can be hundreds of thousands of synchronized writes, wasted overhead in single-threaded jsonnet evaluation.

`TomlRenderer` and `FastMaterializeJsonRenderer` already use the unsynchronized package-private `StringBuilderWriter` for the same reason (#874, #875). `std.deepJoin` was explicitly left as a follow-up in #875's description (*"std.deepJoin keeps StringWriter (separate concern)"*); this PR is that follow-up.

## Modification

Single change in `ManifestModule.scala`: swap the `new StringWriter()` in `DeepJoin.evalRhs` for `new StringBuilderWriter()`. No other code changes; output is byte-identical.

## Result

Scala Native, hyperfine A/B against `master` (`fc292fa6`). Workload: a 50,000-row array of 10 pre-allocated strings → 2 MB of `deepJoin` output, render-dominated.

Four interleaved-order passes, `--warmup 10 --min-runs 100 --shell=none`:

| pass | order | before mean | after mean | before min | after min | **min ratio** |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 35.1 ± 16.5 ms | 32.2 ± 19.1 ms | 23.1 ms | 18.7 ms | **1.24×** |
| 2 | after → before | 43.7 ± 30.6 ms | 29.9 ± 25.3 ms | 25.7 ms | 20.3 ms | **1.27×** |
| 3 | before → after | 30.3 ± 8.5 ms | 29.5 ± 7.1 ms | 24.6 ms | 20.8 ms | **1.18×** |
| 4 | after → before | 32.6 ± 7.6 ms | 28.0 ± 6.8 ms | 24.0 ms | 20.7 ms | **1.16×** |

After is faster in every one of the 4 passes; mean is noisy on the host but min values are tight at **1.16–1.27× faster** (best observed 18.7 vs 23.1 ms, ~19% reduction). Output byte-identical (2,000,000 bytes both sides).
## Test plan

- [x] `./mill __.reformat`
- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test`: 519/519 pass
- [x] Scala Native A/B hyperfine: 4 interleaved-order passes, all positive; output byte-identical

---

> Independent of #875; can land in either order. After both land, the `import java.io.StringWriter` in `ManifestModule.scala` can be removed in a small cleanup.
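The "min ratio" and "reduction" figures quoted for the deepJoin passes follow from simple arithmetic on the before/after minimums. The sketch below recomputes them; the `MinRatio` object and its helper names are hypothetical, and the input pairs are copied from the four deepJoin passes reported above.

```scala
object MinRatio {
  // (before min, after min) in milliseconds, copied from the four deepJoin passes.
  val passes: List[(Double, Double)] =
    List((23.1, 18.7), (25.7, 20.3), (24.6, 20.8), (24.0, 20.7))

  // Speedup factor: how many times faster "after" is than "before".
  def ratio(before: Double, after: Double): Double = before / after

  // Wall-time reduction as a percentage of the "before" time.
  def reductionPercent(before: Double, after: Double): Double =
    (before - after) / before * 100.0

  def main(args: Array[String]): Unit =
    passes.foreach { case (b, a) =>
      println(f"${ratio(b, a)}%.2fx faster, ${reductionPercent(b, a)}%.1f%% less time")
    }
}
```

Note that a 1.24× speedup corresponds to roughly a 19% reduction in time, not 24%, which is why the two figures in the description differ.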
Motivation
`std.manifestJson`, `std.manifestJsonMinified`, and `std.manifestJsonEx` still routed through `StringWriter`, paying `StringBuffer` synchronization per `write` and per `flush` on the hot manifestation path. Source-built jrsonnet comparisons showed sjsonnet trailing on object-heavy manifest workloads.
Modification
- `StringBuilderWriter`: an unsynchronized `Writer` over a `StringBuilder`.
- `FastMaterializeJsonRenderer` backed by `StringBuilderWriter`; route the three `std.manifestJson*` builtins through it. Public `MaterializeJsonRenderer` ABI/shape unchanged.
- `UnicodeHandlingTests` extended for the prefix-ordering case.
Result
Scala Native hyperfine on kube-prometheus, `-N -w 4 -m 20`, jrsonnet HEAD `2d7eed05`:
[Table: `manifestJsonEx`, sjsonnet vs. `manifestJsonEx`, jrsonnet; timings not preserved]
JMH regression post-PR: `manifestJsonEx` 0.055 ms/op, `realistic2` 43.6 ms/op, `gen_big_object` 0.842 ms/op.
Related: #666.
Test plan
- `./mill __.reformat`
- `./mill -j 1 __.test`: 517/517 pass
Follow-up stacked optimizations
Each commit below was verified for byte-identical output and measured before landing. Perf bar: JVM-positive and Native-non-regressing. Changes that measured neutral or negative on Native (a YAML-renderer swap, a binary-operator Position deferral, and a first char-deboxing attempt) were dropped.
| Commit | Change |
|---|---|
| skip escape scan for AsciiSafeStr | `Val.AsciiSafeStr` without the SWAR escape scan |
| unsynchronized StringBuilderWriter in TomlRenderer | `StringBuffer` sync on the `manifestTomlEx` path |
| capture parse Position without boxing | `Parser.Pos` writes the `Position` straight into fastparse `successValue` instead of `Index.map` (no Int box/unbox/closure per node) |
| defer Position alloc in exprSuffix2 | `Position` only on a matched suffix, not on every rep-terminating attempt |
| flush FastMaterializeJsonRenderer only at root depth | `depth == 0`; 4 KB initial buffer |

Methodology: JVM via JMH (`ParserBenchmark`, plus isolated render benches added under `bench/`); Native via the binary's `--debug-stats` phase timing and interleaved hyperfine on kube-prometheus (cooled, min/p25). Render micro-wins (AsciiSafeStr) do not transfer to Native end-to-end because parse+eval dominate there; the parse-side and TOML changes do.