perf: speed up manifest JSON rendering by He-Pin · Pull Request #874 · databricks/sjsonnet

He-Pin · 2026-05-28T06:41:23Z

Motivation

std.manifestJson, std.manifestJsonMinified, and std.manifestJsonEx still routed through StringWriter, paying StringBuffer synchronization per write and per flush on the hot manifestation path. Source-built jrsonnet comparisons showed sjsonnet trailing on object-heavy manifest workloads.

Modification

Add StringBuilderWriter: an unsynchronized Writer over a StringBuilder.
Add package-private FastMaterializeJsonRenderer backed by StringBuilderWriter; route the three std.manifestJson* builtins through it. Public MaterializeJsonRenderer ABI/shape unchanged.
Fix codepoint comparison for raw surrogate prefixes: equal surrogate UTF-16 code units must be decoded before deciding ordering. UnicodeHandlingTests extended for the prefix-ordering case.
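
As a rough illustration of the first item, an unsynchronized `Writer` over a `StringBuilder` might look like the sketch below. The class name follows the PR, but the body, the `initialCapacity` parameter, and its default are assumptions made for illustration, not the merged code.

```scala
import java.io.Writer

// Hypothetical sketch: a Writer that appends straight into a StringBuilder.
// Unlike java.io.StringWriter (backed by a synchronized StringBuffer),
// none of these methods take a lock.
final class StringBuilderWriter(initialCapacity: Int = 16) extends Writer {
  private val sb = new java.lang.StringBuilder(initialCapacity)

  override def write(c: Int): Unit = sb.append(c.toChar)

  override def write(cbuf: Array[Char], off: Int, len: Int): Unit =
    sb.append(cbuf, off, len)

  override def write(str: String): Unit = sb.append(str)

  // StringBuilder.append(CharSequence, start, end) takes an end index, not a length.
  override def write(str: String, off: Int, len: Int): Unit =
    sb.append(str, off, off + len)

  // Everything already lives in memory, so flush and close are no-ops.
  override def flush(): Unit = ()
  override def close(): Unit = ()

  override def toString: String = sb.toString
}
```

Because it is still a `java.io.Writer`, such a class could be passed wherever a renderer expects a `Writer`, which is presumably why the swap leaves the rendered output unchanged.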

Result

Scala Native hyperfine on kube-prometheus, -N -w 4 -m 20, jrsonnet HEAD 2d7eed05:

| Workload (native) | Before | After | Δ |
|---|---:|---:|---:|
| kube-prometheus, sjsonnet | 158.4 ± 16.8 ms | 143.7 ± 3.2 ms | −9.3% |
| kube-prometheus, jrsonnet | 101.2 ± 4.4 ms | 97.4 ± 8.6 ms | reference |
| `manifestJsonEx`, sjsonnet | - | 5.09 ± 1.01 ms | new |
| `manifestJsonEx`, jrsonnet | - | 4.08 ± 1.40 ms | reference |

JMH regression check post-PR: manifestJsonEx 0.055 ms/op, realistic2 43.6 ms/op, gen_big_object 0.842 ms/op.

Related: #666.

Test plan

./mill __.reformat
./mill -j 1 __.test — 517/517 pass

Follow-up stacked optimizations

Each commit below was verified for byte-identical output and measured before landing. Perf bar: JVM-positive and Native-non-regressing. Changes that measured neutral or negative on Native (a YAML-renderer swap, a binary-operator Position deferral, and a first char-deboxing attempt) were dropped.

| Commit | Change | JVM | Native |
|---|---|---|---|
| `skip escape scan for AsciiSafeStr` | char renderer emits `Val.AsciiSafeStr` without the SWAR escape scan | +10% render-only | neutral |
| `unsynchronized StringBuilderWriter in TomlRenderer` | drop `StringBuffer` sync on the `manifestTomlEx` path | +6–14% | 1.11× (≈+10%) |
| `capture parse Position without boxing` | `Parser.Pos` writes the `Position` straight into fastparse `successValue` instead of `Index.map` (no Int box/unbox/closure per node) | +5.4% parse | +4.5% parse |
| `defer Position alloc in exprSuffix2` | allocate the suffix `Position` only on a matched suffix, not on every rep-terminating attempt | +1.9% parse | neutral |
| `flush FastMaterializeJsonRenderer only at root depth` | accumulate in-memory, emit once at `depth == 0`; 4 KB initial buffer | - | - |

Methodology: JVM via JMH (ParserBenchmark, plus isolated render benches added under bench/); Native via the binary's --debug-stats phase timing and interleaved hyperfine on kube-prometheus (cooled, min/p25). Render micro-wins (AsciiSafeStr) do not transfer to Native end-to-end because parse+eval dominate there; the parse-side and TOML changes do.

Motivation:
std.manifestJson* still contributed to the local Scala Native gap versus source-built jrsonnet, especially in real-world object-heavy rendering.

Modification:
Add an internal StringBuilder-backed FastMaterializeJsonRenderer for std.manifestJson, std.manifestJsonMinified, and std.manifestJsonEx while preserving the public MaterializeJsonRenderer StringWriter API. Reuse an in-place codepoint key sorter backed by java.util.Arrays.sort, and fix raw-surrogate prefix ordering in compareStringsByCodepoint.

Result:
Full validation passed: ./mill --no-server --ticker false --color false __.reformat and ./mill --no-server --ticker false --color false -j 1 __.test reported 451/451 tests passing.
JMH regressions: manifestJsonEx 0.055 ms/op, realistic2 43.596 ms/op, gen_big_object 0.842 ms/op.
Direct hyperfine against source-built jrsonnet: manifestJsonEx sjsonnet-native 5.090 ms vs jrsonnet 4.075 ms; kube-prometheus sjsonnet-native 143.738 ms vs jrsonnet 97.385 ms.
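
To make the raw-surrogate ordering issue concrete, here is a minimal reference comparison that decodes a full code point at every step. It is a sketch under assumptions: the object and method names are hypothetical, and it is the simple decode-everything form, not the optimized `compareStringsByCodepoint` from the PR.

```scala
object CodepointOrder {
  // Compare two strings by Unicode code point rather than by UTF-16 code unit.
  // codePointAt returns the surrogate itself when it is unpaired, so lone
  // surrogates sort by their own value (0xD800-0xDFFF).
  def compare(a: String, b: String): Int = {
    var i = 0
    while (i < a.length && i < b.length) {
      val ca = a.codePointAt(i)
      val cb = b.codePointAt(i)
      if (ca != cb) return Integer.compare(ca, cb)
      // Equal code points occupy the same number of chars in both strings.
      i += Character.charCount(ca)
    }
    // One string is a prefix of the other: the shorter one sorts first.
    Integer.compare(a.length, b.length)
  }

  // U+1F600 as a valid surrogate pair: D83D DE00.
  val paired: String = new String(Array(0xd83d.toChar, 0xde00.toChar))
  // A lone high surrogate D83D followed by U+FFFF (not a low surrogate).
  val lonePrefix: String = new String(Array(0xd83d.toChar, 0xffff.toChar))
}
```

Both sample strings start with the same code unit `0xD83D`. Comparing the next code units (`0xDE00` vs `0xFFFF`) puts `paired` first, while decoding gives `0x1F600` vs `0xD83D` and puts `lonePrefix` first, which is the kind of case where an equal surrogate unit has to be decoded before ordering is decided.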

Motivation:
The JVM/char render hot path (BaseCharRenderer.visitNonNullString) ran a CharSWAR.hasEscapeChar scan on every string, even for Val.AsciiSafeStr which is statically known to need no JSON escaping (chars 0x20-0x7e, no quote/backslash). The Native ByteRenderer already had this bypass; the char path did not.

Modification:
- Add BaseCharRenderer.visitAsciiSafeString: quote + bulk getChars + quote, correct even under escapeUnicode since all chars are <= 0x7e.
- Route Val.AsciiSafeStr through it via a Materializer.visitStr helper at the three value-string sites; ujson.Value AST path falls back to visitString.
- Add AsciiSafeRenderBenchmark to isolate the render path for A/B.

Result:
JMH render-only, 335KB string-heavy output: 1.606 -> 1.441 ms/op (-10.3%, non-overlapping error bands). 450/450 tests pass.
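
The idea behind the bypass can be sketched as follows. Everything here is illustrative: `AsciiSafeSketch` and its methods are hypothetical names, the escaper is a deliberately small per-char version rather than the SWAR scan, and the real renderer writes into its own char buffer instead of a `StringBuilder`.

```scala
object AsciiSafeSketch {
  // True when every char is printable ASCII (0x20-0x7e) and is neither '"' nor '\'.
  def isAsciiSafe(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < 0x20 || c > 0x7e || c == '"' || c == '\\') return false
      i += 1
    }
    true
  }

  // Fast path: quote + bulk getChars + quote, with no per-char escape check.
  // Only valid when the caller already knows the string is ASCII-safe.
  def renderAsciiSafe(s: String, out: java.lang.StringBuilder): Unit = {
    val buf = new Array[Char](s.length + 2)
    buf(0) = '"'
    s.getChars(0, s.length, buf, 1)
    buf(buf.length - 1) = '"'
    out.append(buf)
  }

  // General path: inspect every char and escape where JSON requires it.
  def renderEscaped(s: String, out: java.lang.StringBuilder): Unit = {
    out.append('"')
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      c match {
        case '"'           => out.append("\\\"")
        case '\\'          => out.append("\\\\")
        case '\n'          => out.append("\\n")
        case _ if c < 0x20 => out.append('\\').append('u').append(f"${c.toInt}%04x")
        case _             => out.append(c)
      }
      i += 1
    }
    out.append('"')
  }
}
```

For an ASCII-safe input the two paths produce the same text, so skipping the scan changes only the cost, not the output.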

Motivation:
std.manifestTomlEx routed through java.io.StringWriter, whose backing StringBuffer pays a monitor enter/exit on every write/flush on the hot TOML manifestation path. The JSON renderer already switched to the unsynchronized StringBuilderWriter in databricks#874 (-9.3% on kube-prometheus native); TOML did not.

Modification:
- Switch TomlRenderer and the manifestTomlEx render path in ManifestModule from java.io.StringWriter to the package-private StringBuilderWriter. Output is byte-identical. std.deepJoin keeps StringWriter (separate concern).
- Add TomlRenderBenchmark to A/B the render path.

Result:
Native hyperfine, TOML-heavy workload (1.79MB output): after ran 1.11 ± 0.07x faster than before (~10%), output byte-identical. JMH (whole-pipeline) showed AFTER < BEFORE in two independent rounds. 450/450 tests pass.

Motivation:
Parser.Pos is invoked for nearly every AST node. It was `Index.map(off => new Position(...))`: fastparse's `Index` stores the offset as an Int in its `successValue: Any` field (boxing it), and the `.map` then unboxes it and allocates a closure, all per node. boxToInteger via SharedPackageDefs.Index was a top self-frame in the parse flamegraph on kube-prometheus.

Modification:
- Rewrite Pos to write the Position object straight into successValue via ctx.freshSuccess(new Position(fileScope, ctx.index)), skipping the Int box/unbox and the map closure. Parse output (positions/errors) is unchanged.

Result:
JMH ParserBenchmark (parse-only, all test-suite files): 1.669 -> 1.579 ms/op (+5.4%, non-overlapping bands). Native parse_time on kube-prometheus: ~105.6 -> ~100.9 ms (+4.5%, consistent). Output byte-identical. 450/450 tests pass.
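
The boxing cost described above can be modelled without fastparse. The sketch below uses a mock context with an untyped `successValue` slot; `Ctx`, `posViaMap`, and `posDirect` are hypothetical names and this is not fastparse's API, only a toy model of the two shapes.

```scala
object PosBoxingSketch {
  final class Position(val offset: Int)

  // Hypothetical stand-in for a parser context with an untyped result slot.
  final class Ctx(val index: Int) {
    var successValue: Any = null
  }

  // Modelled "before": store the offset as Any, then map it to a Position.
  def posViaMap(ctx: Ctx): Position = {
    ctx.successValue = ctx.index                           // Int is boxed to java.lang.Integer
    val toPos: Int => Position = off => new Position(off)  // function object for the map step
    val pos = toPos(ctx.successValue.asInstanceOf[Int])    // unboxed again
    ctx.successValue = pos
    pos
  }

  // Modelled "after": build the Position and store it directly.
  def posDirect(ctx: Ctx): Position = {
    val pos = new Position(ctx.index)
    ctx.successValue = pos
    pos
  }
}
```

Both functions leave the same `Position` in the slot; the second simply never routes the offset through an `Any`-typed field, so there is nothing to box or unbox.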

Motivation:
exprSuffix2 was `Pos.flatMapX { i => CharIn(".[({")... }`, which allocated a Position on EVERY attempt, including the failing attempt that terminates `exprSuffix2.rep` after each expression. Most subexpressions have no suffix, so that trailing failed attempt (one per expression) allocated a Position that was immediately discarded.

Modification:
- Match the suffix char first; allocate `new Position(fileScope, ctx.index - 1)` only inside the matching branch. No suffix -> CharIn fails fast, no Position. Also drops the `.map(_(0))` Char step. Parse output (positions/errors) is unchanged.

Result:
JMH ParserBenchmark (-f0, same-session): 1.560 -> 1.530 ms/op (+1.9%). Native parse_time on kube-prometheus: non-regressing, min/p25 ~2% lower (noise-limited on a loaded machine). Output byte-identical. 517/517 tests pass.
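
A toy version of the allocate-on-match change, with an allocation counter to make the difference visible. The names are hypothetical and the matching is reduced to a single-character check; the real parser works through fastparse combinators.

```scala
object SuffixSketch {
  var allocations = 0

  // Each construction bumps the counter so allocations can be observed.
  final class Position(val offset: Int) { allocations += 1 }

  private def isSuffixStart(c: Char): Boolean =
    c == '.' || c == '[' || c == '(' || c == '{'

  // Eager shape: the Position exists before we know whether a suffix follows.
  def eager(input: String, index: Int): Option[Position] = {
    val pos = new Position(index)
    if (index < input.length && isSuffixStart(input.charAt(index))) Some(pos)
    else None
  }

  // Deferred shape: test the char first, allocate only in the matching branch.
  def deferred(input: String, index: Int): Option[Position] =
    if (index < input.length && isSuffixStart(input.charAt(index)))
      Some(new Position(index))
    else None
}
```

On input with no suffix at the probed index, `eager` allocates one `Position` and throws it away, while `deferred` allocates none; on a matching suffix both allocate exactly one.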

Motivation:
std.manifestJson* render fully in memory via FastMaterializeJsonRenderer. The inherited flushCharBuilder spilled the CharBuilder to the output writer at every sub-tree boundary, adding buffer-to-buffer copies that are pure overhead when the whole document is built in memory and emitted once.

Modification:
- Override flushCharBuilder to write out only when depth == 0 (root finished); accumulate everything in elemBuilder until then.
- Size StringBuilderWriter's initial buffer at 4096 (was 16) to cut early reallocations, and mark it private[sjsonnet].

Result:
Fewer intermediate copies on the manifestJson* path; output byte-identical.
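
The root-depth flush can be illustrated with a miniature renderer for nested lists. This is a sketch under assumptions: `NestedRenderer`, its `flushes` counter, and the `flushOnlyAtRoot` switch are hypothetical, and the real renderer's buffer and visitor types are different.

```scala
import java.io.Writer

// Hypothetical miniature of a buffering renderer; names are illustrative only.
final class NestedRenderer(out: Writer, flushOnlyAtRoot: Boolean) {
  private val elemBuilder = new java.lang.StringBuilder(4096)
  private var depth = 0
  var flushes = 0 // how many times the buffer was copied into `out`

  // Copy the buffer to the writer, either always or only once the root is done.
  private def flushBuffer(): Unit =
    if (!flushOnlyAtRoot || depth == 0) {
      out.write(elemBuilder.toString)
      elemBuilder.setLength(0)
      flushes += 1
    }

  // Render nested sequences as JSON-style arrays; anything else via toString.
  def render(v: Any): Unit = v match {
    case xs: Seq[_] =>
      depth += 1
      elemBuilder.append('[')
      var first = true
      xs.foreach { x =>
        if (!first) elemBuilder.append(',')
        first = false
        render(x)
      }
      elemBuilder.append(']')
      depth -= 1
      flushBuffer()
    case other =>
      elemBuilder.append(other.toString)
      flushBuffer()
  }
}
```

Rendering `List(1, List(2, 3), 4)` yields `[1,[2,3],4]` in both modes; with `flushOnlyAtRoot = false` the buffer is copied out six times (once per value), with `flushOnlyAtRoot = true` exactly once.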

…Chars ascii mask

Adds regression coverage:
- object_remove_key_directional: objectRemoveKey interaction with super / addSuper (`a+:`) merge and inline addSuper asserts.
- strip_chars_ascii_mask_directional: stripChars over the ASCII range.

He-Pin · 2026-05-30T11:48:48Z

Superseded — split into focused, independently-measured PRs off current master (each output byte-identical, no benchmark code):

perf: use unsynchronized StringBuilderWriter in TomlRenderer #875 — TomlRenderer → unsynchronized StringBuilderWriter (Native ~1.14×)
perf: capture parse Position without boxing the offset Int #876 — Parser.Pos without boxing the offset Int (Native parse +6–8%, JVM +5.4%)
perf: defer Position alloc in exprSuffix2 to the matching branch #877 — defer Position alloc in exprSuffix2 (JVM +1.9%, Native neutral)

The manifest-JSON rendering work this PR was based on is already in master (da92dd1). Closing in favor of the smaller PRs above.

## Motivation

`std.manifestTomlEx` had three sources of avoidable overhead on the hot manifestation path:

1. **Synchronized writer.** `TomlRenderer` and `ManifestModule.evalRhs` rendered into a `java.io.StringWriter`, whose backing `StringBuffer` pays a monitor enter/exit on every `write`/`flush`. The `FastMaterializeJsonRenderer` already uses the unsynchronized `StringBuilderWriter` (#874); TOML did not.
2. **Redundant field lookups in `renderTableInternal`.** Each key's `Val.Obj.value(k)` was resolved twice: once to classify scalar vs section, then again to render or recurse. The cache deduplicates the result, but the lookup itself still costs.
3. **Wasted indexing work.** `visibleKeyNames` was iterated and each key binary-searched back into `sortedVisibleKeyNames`; `sortedVisibleKeyNames` can be iterated directly, skipping `O(n log n)` compares per table.

## Modification

Two commits:

- **`perf: use unsynchronized StringBuilderWriter in TomlRenderer`**: Swap `TomlRenderer` and the `manifestTomlEx` render path in `ManifestModule` from `java.io.StringWriter` to the package-private `StringBuilderWriter`. `std.deepJoin` keeps `StringWriter` (separate concern).
- **`perf: cache resolved field values and skip binary search in renderTableInternal`**: Resolve each field once into a `resolved: Array[Val]` during section classification and reuse it during render/recurse; iterate `sortedVisibleKeyNames` directly (removes the now-unused `sortedKeyIndex` binary search); hoist `childIndent = cumulatedIndent + indent` out of the section loop (was an identical allocation per sibling section); pre-size the output `StringBuilderWriter` to 1 KiB so small/medium outputs skip the first ~6 doublings.

Output is byte-identical (verified at 1,228,186 bytes on the benchmark workload).

## Result

Scala Native, hyperfine A/B against `master` (`fc292fa6`). Workload: object comprehension over 8000 small tables → ~1.2 MB TOML output (render-dominated).

Four interleaved-order passes, `--warmup 10 --min-runs 100 --shell=none`:

| pass | order | before mean | after mean | before min | after min | **min ratio** |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 59.4 ± 2.7 ms | 53.2 ± 23.4 ms | 55.4 ms | 43.8 ms | **1.27×** |
| 2 | after → before | 64.1 ± 7.7 ms | 51.8 ± 12.2 ms | 56.4 ms | 43.7 ms | **1.29×** |
| 3 | before → after | 64.1 ± 8.1 ms | 53.2 ± 14.3 ms | 56.4 ms | 42.0 ms | **1.34×** |
| 4 | after → before | 63.3 ± 14.3 ms | 49.2 ± 3.7 ms | 57.2 ms | 42.8 ms | **1.34×** |

Mean is noisy on the host (1.12× – 1.29×), but **after is faster in every one of the 4 passes** and the **min values are tight at ~1.27–1.34× faster** (best observed: 42.0 ms vs 56.4 ms, ~25.5% reduction). Output byte-identical, 1,228,186 bytes both sides.

For comparison, the StringBuilderWriter swap alone (commit 1) measures ~1.08–1.14× min; the cache + binary-search elimination + childIndent hoist (commit 2) lifts that to ~1.27–1.34× min.

## Test plan

- [x] `./mill __.reformat`
- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test`: 519/519 pass
- [x] Scala Native A/B hyperfine: 4 interleaved-order passes, all positive; output byte-identical

---

> Rebased onto current `master` (`fc292fa6`). The companion commit "speed up manifest JSON rendering" was merged separately as #879, so this PR now contains only the TomlRenderer / ManifestModule changes.
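
The resolve-once idea from the second commit can be sketched with a lookup counter. This is a simplified model under assumptions: `Obj`, `renderTwice`, and `renderOnce` are hypothetical names, sections are printed as a bare header instead of being recursed into, and the real code works on `Val.Obj` and `sortedVisibleKeyNames`.

```scala
object TomlLookupSketch {
  // Hypothetical object model: `value` stands in for a field lookup with real cost.
  final class Obj(fields: Map[String, Any]) {
    var lookups = 0
    val sortedKeys: Array[String] = fields.keys.toArray.sorted
    def value(k: String): Any = { lookups += 1; fields(k) }
  }

  // Modelled "before": one lookup to classify, a second one to render scalars.
  def renderTwice(o: Obj): String = {
    val sb = new StringBuilder
    for (k <- o.sortedKeys) {
      if (o.value(k).isInstanceOf[Obj])
        sb.append('[').append(k).append("]\n")
      else
        sb.append(k).append(" = ").append(o.value(k).toString).append('\n')
    }
    sb.toString
  }

  // Modelled "after": resolve each field once into an array, iterating the
  // already-sorted keys directly, then classify and render from that array.
  def renderOnce(o: Obj): String = {
    val keys = o.sortedKeys
    val resolved = new Array[Any](keys.length)
    var i = 0
    while (i < keys.length) { resolved(i) = o.value(keys(i)); i += 1 }
    val sb = new StringBuilder
    i = 0
    while (i < keys.length) {
      if (resolved(i).isInstanceOf[Obj])
        sb.append('[').append(keys(i)).append("]\n")
      else
        sb.append(keys(i)).append(" = ").append(resolved(i).toString).append('\n')
      i += 1
    }
    sb.toString
  }
}
```

For an object with two scalar fields and one section, the first shape performs five lookups and the second three, while both produce the same text. The sketch says nothing about how large the saving is in sjsonnet itself; that is what the hyperfine table above measures.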

## Motivation

`std.deepJoin` writes each `Val.Str` chunk into a `java.io.StringWriter` inside a tight loop. `StringWriter`'s backing `StringBuffer` pays a monitor enter/exit on every `write`/`append` call, which on a typical deepJoin walk over a deeply nested array can be hundreds of thousands of synchronized writes, wasted overhead in single-threaded jsonnet evaluation.

`TomlRenderer` and `FastMaterializeJsonRenderer` already use the unsynchronized package-private `StringBuilderWriter` for the same reason (#874, #875). `std.deepJoin` was explicitly left as a follow-up in #875's description (*"std.deepJoin keeps StringWriter (separate concern)"*); this PR is that follow-up.

## Modification

Single change in `ManifestModule.scala`: swap the `new StringWriter()` in `DeepJoin.evalRhs` for `new StringBuilderWriter()`. No other code changes; output is byte-identical.

## Result

Scala Native, hyperfine A/B against `master` (`fc292fa6`). Workload: a 50,000-row array of 10 pre-allocated strings → 2 MB of `deepJoin` output, render-dominated.

Four interleaved-order passes, `--warmup 10 --min-runs 100 --shell=none`:

| pass | order | before mean | after mean | before min | after min | **min ratio** |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 35.1 ± 16.5 ms | 32.2 ± 19.1 ms | 23.1 ms | 18.7 ms | **1.24×** |
| 2 | after → before | 43.7 ± 30.6 ms | 29.9 ± 25.3 ms | 25.7 ms | 20.3 ms | **1.27×** |
| 3 | before → after | 30.3 ± 8.5 ms | 29.5 ± 7.1 ms | 24.6 ms | 20.8 ms | **1.18×** |
| 4 | after → before | 32.6 ± 7.6 ms | 28.0 ± 6.8 ms | 24.0 ms | 20.7 ms | **1.16×** |

After is faster in every one of the 4 passes; mean is noisy on the host but min values are tight at **1.16–1.27× faster** (best observed 18.7 vs 23.1 ms, ~19% reduction). Output byte-identical (2,000,000 bytes both sides).

## Test plan

- [x] `./mill __.reformat`
- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test`: 519/519 pass
- [x] Scala Native A/B hyperfine: 4 interleaved-order passes, all positive; output byte-identical

---

> Independent of #875; can land in either order. After both land, the `import java.io.StringWriter` in `ManifestModule.scala` can be removed in a small cleanup.
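
The shape of a deepJoin walk, and why the sink type is the only thing that changes, can be sketched like this. `DeepJoinSketch` is a hypothetical name and the input is modelled as plain nested `Seq`s and `String`s rather than sjsonnet's `Val` types; the sink is any `java.lang.Appendable`.

```scala
object DeepJoinSketch {
  // Walk nested sequences depth-first and append every string chunk to `out`.
  // Both java.lang.StringBuilder (unsynchronized) and java.io.StringWriter
  // (backed by a synchronized StringBuffer) implement Appendable.
  def deepJoin(v: Any, out: java.lang.Appendable): Unit = v match {
    case s: String  => out.append(s)
    case xs: Seq[_] => xs.foreach(x => deepJoin(x, out))
    case other =>
      throw new IllegalArgumentException(s"unsupported element: $other")
  }
}
```

Called with `List("a", List("b", List("c")), "d")`, either sink ends up holding `abcd`. The walk performs one `append` per chunk, so on a large input the per-call cost of the sink is multiplied by the number of chunks.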

He-Pin marked this pull request as ready for review May 28, 2026 06:53

He-Pin marked this pull request as draft May 28, 2026 06:57

He-Pin marked this pull request as ready for review May 28, 2026 07:00

He-Pin marked this pull request as draft May 28, 2026 07:12

He-Pin force-pushed the perf/manifest-json-rendering-fastpath branch from da92dd1 to c3581e8 Compare May 28, 2026 07:17

He-Pin marked this pull request as ready for review May 28, 2026 07:17

He-Pin marked this pull request as draft May 29, 2026 20:41

He-Pin marked this pull request as ready for review May 29, 2026 21:25

He-Pin marked this pull request as draft May 29, 2026 22:50

He-Pin added 4 commits May 30, 2026 15:42

He-Pin closed this May 30, 2026

This was referenced Jun 3, 2026

perf: use unsynchronized StringBuilderWriter in TomlRenderer #875

Merged

perf: use unsynchronized StringBuilderWriter in std.deepJoin #889

Merged

He-Pin wants to merge 7 commits into databricks:master from He-Pin:perf/manifest-json-rendering-fastpath
