perf: speed up manifest JSON rendering#874

Closed
He-Pin wants to merge 7 commits into
databricks:masterfrom
He-Pin:perf/manifest-json-rendering-fastpath

@He-Pin He-Pin commented May 28, 2026

Motivation

std.manifestJson, std.manifestJsonMinified, and std.manifestJsonEx still routed through StringWriter, paying StringBuffer synchronization per write and per flush on the hot manifestation path. Source-built jrsonnet comparisons showed sjsonnet trailing on object-heavy manifest workloads.

Modification

  • Add StringBuilderWriter: an unsynchronized Writer over a StringBuilder.
  • Add package-private FastMaterializeJsonRenderer backed by StringBuilderWriter; route the three std.manifestJson* builtins through it. Public MaterializeJsonRenderer ABI/shape unchanged.
  • Fix codepoint comparison for raw surrogate prefixes: equal surrogate UTF-16 code units must be decoded before deciding ordering. UnicodeHandlingTests extended for the prefix-ordering case.
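
As a rough illustration of the first bullet, an unsynchronized `Writer` over a `StringBuilder` might look like the sketch below. This is an assumption about the general shape, not the PR's actual `StringBuilderWriter`; the constructor parameter and the demo object are hypothetical.

```scala
import java.io.Writer

// Hypothetical sketch: a Writer that appends to a StringBuilder with no locking.
// The real StringBuilderWriter in the PR may differ in API and details.
final class StringBuilderWriter(initialCapacity: Int = 16) extends Writer {
  private val sb = new java.lang.StringBuilder(initialCapacity)

  override def write(c: Int): Unit = sb.append(c.toChar)
  override def write(cbuf: Array[Char], off: Int, len: Int): Unit = sb.append(cbuf, off, len)
  override def write(str: String): Unit = sb.append(str)
  override def write(str: String, off: Int, len: Int): Unit = sb.append(str, off, off + len)

  // Everything lives in memory, so flush and close have nothing to do.
  override def flush(): Unit = ()
  override def close(): Unit = ()

  override def toString: String = sb.toString
}

object StringBuilderWriterDemo {
  def main(args: Array[String]): Unit = {
    val w = new StringBuilderWriter(64)
    w.write("{\"a\":")
    w.write('1'.toInt)
    w.write("}")
    println(w.toString)
  }
}
```

Unlike `java.io.StringWriter`, which appends to a `StringBuffer`, nothing here takes a monitor, which is only safe because the renderer is used from a single thread.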

Result

Scala Native hyperfine on kube-prometheus, -N -w 4 -m 20, jrsonnet HEAD 2d7eed05:

| Workload (native) | Before | After | Δ |
|---|---:|---:|---:|
| kube-prometheus, sjsonnet | 158.4 ± 16.8 ms | 143.7 ± 3.2 ms | −9.3% |
| kube-prometheus, jrsonnet | 101.2 ± 4.4 ms | 97.4 ± 8.6 ms | reference |
| manifestJsonEx, sjsonnet | | 5.09 ± 1.01 ms | new |
| manifestJsonEx, jrsonnet | | 4.08 ± 1.40 ms | reference |

JMH regression benchmarks after this PR: manifestJsonEx 0.055 ms/op, realistic2 43.6 ms/op, gen_big_object 0.842 ms/op.

Related: #666.

Test plan

  • ./mill __.reformat
  • ./mill -j 1 __.test — 517/517 pass

Follow-up stacked optimizations

Each commit below was verified for byte-identical output and measured before landing. Perf bar: JVM-positive and Native-non-regressing. Three changes that measured neutral or negative on Native (a YAML-renderer swap, a binary-operator Position deferral, and a first char-deboxing attempt) were dropped.

| Commit | Change | JVM | Native |
|---|---|---:|---:|
| skip escape scan for AsciiSafeStr | char renderer emits Val.AsciiSafeStr without the SWAR escape scan | +10% render-only | neutral |
| unsynchronized StringBuilderWriter in TomlRenderer | drop StringBuffer sync on the manifestTomlEx path | +6–14% | 1.11× (≈+10%) |
| capture parse Position without boxing | Parser.Pos writes the Position straight into fastparse successValue instead of Index.map (no Int box/unbox/closure per node) | +5.4% parse | +4.5% parse |
| defer Position alloc in exprSuffix2 | allocate the suffix Position only on a matched suffix, not on every rep-terminating attempt | +1.9% parse | neutral |
| flush FastMaterializeJsonRenderer only at root depth | accumulate in-memory, emit once at depth == 0; 4 KB initial buffer | | |

Methodology: JVM via JMH (ParserBenchmark, plus isolated render benches added under bench/); Native via the binary's --debug-stats phase timing and interleaved hyperfine on kube-prometheus (cooled, min/p25). Render micro-wins (AsciiSafeStr) do not transfer to Native end-to-end because parse+eval dominate there; the parse-side and TOML changes do.

@He-Pin He-Pin marked this pull request as ready for review May 28, 2026 06:53
@He-Pin He-Pin marked this pull request as draft May 28, 2026 06:57
@He-Pin He-Pin marked this pull request as ready for review May 28, 2026 07:00
@He-Pin He-Pin marked this pull request as draft May 28, 2026 07:12
Motivation:
std.manifestJson* still contributed to the local Scala Native gap versus source-built jrsonnet, especially in real-world object-heavy rendering.

Modification:
Add an internal StringBuilder-backed FastMaterializeJsonRenderer for std.manifestJson, std.manifestJsonMinified, and std.manifestJsonEx while preserving the public MaterializeJsonRenderer StringWriter API. Reuse an in-place codepoint key sorter backed by java.util.Arrays.sort, and fix raw-surrogate prefix ordering in compareStringsByCodepoint.

Result:
Full validation passed: ./mill --no-server --ticker false --color false __.reformat and ./mill --no-server --ticker false --color false -j 1 __.test reported 451/451 tests passing. JMH regressions: manifestJsonEx 0.055 ms/op, realistic2 43.596 ms/op, gen_big_object 0.842 ms/op. Direct hyperfine against source-built jrsonnet: manifestJsonEx sjsonnet-native 5.090 ms vs jrsonnet 4.075 ms; kube-prometheus sjsonnet-native 143.738 ms vs jrsonnet 97.385 ms.
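
To make the surrogate-ordering issue concrete, here is a minimal sketch of a codepoint comparator plugged into java.util.Arrays.sort. The helper name compareByCodepoint is hypothetical and this is not the PR's compareStringsByCodepoint; it only shows why UTF-16 code units must be decoded before ordering is decided.

```scala
object CodepointOrderSketch {
  // Hypothetical helper; the PR's compareStringsByCodepoint may be implemented differently.
  def compareByCodepoint(a: String, b: String): Int = {
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      val ca = a.codePointAt(i) // decodes a surrogate pair into one code point
      val cb = b.codePointAt(j)
      if (ca != cb) return Integer.compare(ca, cb)
      i += Character.charCount(ca)
      j += Character.charCount(cb)
    }
    // One string is a prefix of the other: the shorter one sorts first.
    Integer.compare(a.length - i, b.length - j)
  }

  def main(args: Array[String]): Unit = {
    val bmpMax = 0xFFFF.toChar.toString                 // U+FFFF, one UTF-16 unit
    val astral = new String(Character.toChars(0x10000)) // U+10000, units D800 DC00

    // Code-unit order (String.compareTo) and codepoint order disagree here.
    println(bmpMax.compareTo(astral) > 0)
    println(compareByCodepoint(bmpMax, astral) < 0)

    // In-place key sort by codepoint, via java.util.Arrays.sort.
    val keys = Array(astral, "b", bmpMax, "a")
    val byCodepoint: java.util.Comparator[String] = (x, y) => compareByCodepoint(x, y)
    java.util.Arrays.sort(keys, byCodepoint)
    println(keys.map(_.codePointAt(0).toHexString).mkString(" "))
  }
}
```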
@He-Pin He-Pin force-pushed the perf/manifest-json-rendering-fastpath branch from da92dd1 to c3581e8 Compare May 28, 2026 07:17
@He-Pin He-Pin marked this pull request as ready for review May 28, 2026 07:17
@He-Pin He-Pin marked this pull request as draft May 29, 2026 20:41
Motivation:
The JVM/char render hot path (BaseCharRenderer.visitNonNullString) ran a
CharSWAR.hasEscapeChar scan on every string, even for Val.AsciiSafeStr which
is statically known to need no JSON escaping (chars 0x20-0x7e, no quote/backslash).
The Native ByteRenderer already had this bypass; the char path did not.

Modification:
- Add BaseCharRenderer.visitAsciiSafeString: quote + bulk getChars + quote,
  correct even under escapeUnicode since all chars are <= 0x7e.
- Route Val.AsciiSafeStr through it via a Materializer.visitStr helper at the
  three value-string sites; ujson.Value AST path falls back to visitString.
- Add AsciiSafeRenderBenchmark to isolate the render path for A/B.

Result:
JMH render-only, 335KB string-heavy output: 1.606 -> 1.441 ms/op (-10.3%,
non-overlapping error bands). 450/450 tests pass.
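
The fast path described above can be sketched as follows. The object and method names are hypothetical, and isAsciiSafe only restates the precondition quoted in the commit text; in sjsonnet that property is carried by the Val.AsciiSafeStr type rather than rescanned.

```scala
object AsciiSafeSketch {
  // Assumed precondition from the commit text: chars 0x20-0x7e, no quote or backslash.
  def isAsciiSafe(s: String): Boolean = {
    var i = 0
    while (i < s.length) {
      val c = s.charAt(i)
      if (c < 0x20 || c > 0x7e || c == '"' || c == '\\') return false
      i += 1
    }
    true
  }

  // Fast path: quote + bulk getChars + quote, with no per-char escape scan.
  def renderAsciiSafe(s: String): Array[Char] = {
    val buf = new Array[Char](s.length + 2)
    buf(0) = '"'
    s.getChars(0, s.length, buf, 1)
    buf(s.length + 1) = '"'
    buf
  }

  def main(args: Array[String]): Unit = {
    val s = "kube-prometheus"
    if (isAsciiSafe(s)) println(new String(renderAsciiSafe(s)))
  }
}
```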
@He-Pin He-Pin marked this pull request as ready for review May 29, 2026 21:25
Motivation:
std.manifestTomlEx routed through java.io.StringWriter, whose backing
StringBuffer pays a monitor enter/exit on every write/flush on the hot TOML
manifestation path. The JSON renderer already switched to the unsynchronized
StringBuilderWriter in databricks#874 (-9.3% on kube-prometheus native); TOML did not.

Modification:
- Switch TomlRenderer and the manifestTomlEx render path in ManifestModule from
  java.io.StringWriter to the package-private StringBuilderWriter. Output is
  byte-identical. std.deepJoin keeps StringWriter (separate concern).
- Add TomlRenderBenchmark to A/B the render path.

Result:
Native hyperfine, TOML-heavy workload (1.79MB output): after ran 1.11 ± 0.07x
faster than before (~10%), output byte-identical. JMH (whole-pipeline) showed
AFTER < BEFORE in two independent rounds. 450/450 tests pass.
@He-Pin He-Pin marked this pull request as draft May 29, 2026 22:50
He-Pin added 4 commits May 30, 2026 15:42
Motivation:
Parser.Pos is invoked for nearly every AST node. It was `Index.map(off => new
Position(...))`: fastparse's `Index` stores the offset as an Int in its
`successValue: Any` field (boxing it), and the `.map` then unboxes it and
allocates a closure — per node. boxToInteger via SharedPackageDefs.Index was a
top self-frame in the parse flamegraph on kube-prometheus.

Modification:
- Rewrite Pos to write the Position object straight into successValue via
  ctx.freshSuccess(new Position(fileScope, ctx.index)), skipping the Int
  box/unbox and the map closure. Parse output (positions/errors) is unchanged.

Result:
JMH ParserBenchmark (parse-only, all test-suite files): 1.669 -> 1.579 ms/op
(+5.4%, non-overlapping bands). Native parse_time on kube-prometheus:
~105.6 -> ~100.9 ms (+4.5%, consistent). Output byte-identical. 450/450 tests pass.
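
The boxing cost can be illustrated without fastparse. In the sketch below, Ctx and Position are hypothetical stand-ins (not fastparse's or sjsonnet's real types); only the shape of the two code paths follows the description above.

```scala
object PosBoxingSketch {
  // Hypothetical stand-ins: these are NOT fastparse's or sjsonnet's real types.
  final class Position(val offset: Int)
  final class Ctx(var index: Int) {
    var successValue: Any = null
  }

  // "Before" shape: the offset goes into an Any slot as an Int (boxed),
  // then a mapping closure unboxes it and builds the Position.
  def posViaIndexMap(ctx: Ctx): Unit = {
    ctx.successValue = ctx.index // Int stored as Any: boxes to java.lang.Integer
    val f: Int => Position = off => new Position(off)
    ctx.successValue = f(ctx.successValue.asInstanceOf[Int]) // unbox + closure call
  }

  // "After" shape: build the Position once and store it directly.
  def posDirect(ctx: Ctx): Unit =
    ctx.successValue = new Position(ctx.index)

  def main(args: Array[String]): Unit = {
    val a = new Ctx(42)
    posViaIndexMap(a)
    val b = new Ctx(42)
    posDirect(b)
    println(a.successValue.asInstanceOf[Position].offset ==
      b.successValue.asInstanceOf[Position].offset)
  }
}
```

Both paths leave the same Position in the result slot, which is consistent with the claim that parse output is unchanged.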
Motivation:
exprSuffix2 was `Pos.flatMapX { i => CharIn(".[({")... }`, which allocated a
Position on EVERY attempt — including the failing attempt that terminates
`exprSuffix2.rep` after each expression. Most subexpressions have no suffix, so
that trailing failed attempt (one per expression) allocated a Position that was
immediately discarded.

Modification:
- Match the suffix char first; allocate `new Position(fileScope, ctx.index - 1)`
  only inside the matching branch. No suffix -> CharIn fails fast, no Position.
  Also drops the `.map(_(0))` Char step. Parse output (positions/errors) is
  unchanged.

Result:
JMH ParserBenchmark (-f0, same-session): 1.560 -> 1.530 ms/op (+1.9%). Native
parse_time on kube-prometheus: non-regressing, min/p25 ~2% lower (noise-limited
on a loaded machine). Output byte-identical. 517/517 tests pass.
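
A toy model of the deferral, assuming a simplified matcher in place of the real fastparse combinators: Position here is a hypothetical class that counts its own constructions, and the offsets are simplified relative to the commit's `ctx.index - 1`.

```scala
object DeferredPositionSketch {
  // Hypothetical stand-in for sjsonnet's Position; counts constructions.
  var allocations = 0
  final class Position(val offset: Int) { allocations += 1 }

  private def isSuffixStart(input: String, index: Int): Boolean =
    index < input.length && ".[({".indexOf(input.charAt(index).toInt) >= 0

  // "Before" shape: allocate first, then discover there is no suffix.
  def eager(input: String, index: Int): Option[Position] = {
    val pos = new Position(index)
    if (isSuffixStart(input, index)) Some(pos) else None
  }

  // "After" shape: test the character first, allocate only on a match.
  def deferred(input: String, index: Int): Option[Position] =
    if (isSuffixStart(input, index)) Some(new Position(index)) else None

  def main(args: Array[String]): Unit = {
    val input = "a.b + c"
    allocations = 0
    input.indices.foreach(i => eager(input, i))
    println(s"eager allocations: $allocations")
    allocations = 0
    input.indices.foreach(i => deferred(input, i))
    println(s"deferred allocations: $allocations")
  }
}
```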
Motivation:
std.manifestJson* render fully in memory via FastMaterializeJsonRenderer. The
inherited flushCharBuilder spilled the CharBuilder to the output writer at every
sub-tree boundary, adding buffer-to-buffer copies that are pure overhead when the
whole document is built in memory and emitted once.

Modification:
- Override flushCharBuilder to write out only when depth == 0 (root finished);
  accumulate everything in elemBuilder until then.
- Size StringBuilderWriter's initial buffer at 4096 (was 16) to cut early
  reallocations, and mark it private[sjsonnet].

Result:
Fewer intermediate copies on the manifestJson* path; output byte-identical.
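
The effect of flushing only at the root can be sketched with a toy renderer. Everything below is hypothetical (the Json ADT, SketchRenderer, and the flush counter); it borrows only the names flushCharBuilder, elemBuilder and depth from the commit text and assumes a plain StringBuilder as the output sink.

```scala
object RootFlushSketch {
  sealed trait Json
  final case class Num(n: Int) extends Json
  final case class Arr(items: List[Json]) extends Json

  // Hypothetical renderer: buffers into elemBuilder and copies to `out` on flush.
  final class SketchRenderer(out: java.lang.StringBuilder, flushOnlyAtRoot: Boolean) {
    private val elemBuilder = new java.lang.StringBuilder(4096)
    private var depth = 0
    var flushes = 0

    private def flushCharBuilder(): Unit =
      if ((!flushOnlyAtRoot || depth == 0) && elemBuilder.length() > 0) {
        out.append(elemBuilder, 0, elemBuilder.length()) // one buffer-to-buffer copy
        elemBuilder.setLength(0)
        flushes += 1
      }

    def render(j: Json): Unit = j match {
      case Num(n) =>
        elemBuilder.append(n)
        flushCharBuilder()
      case Arr(items) =>
        elemBuilder.append('[')
        depth += 1
        var first = true
        items.foreach { item =>
          if (!first) elemBuilder.append(',')
          first = false
          render(item)
        }
        depth -= 1
        elemBuilder.append(']')
        flushCharBuilder()
    }
  }

  def main(args: Array[String]): Unit = {
    val doc = Arr(List(Arr(List(Num(1), Num(2))), Num(3)))
    for (rootOnly <- List(false, true)) {
      val out = new java.lang.StringBuilder
      val r = new SketchRenderer(out, rootOnly)
      r.render(doc)
      println(s"rootOnly=$rootOnly output=$out flushes=${r.flushes}")
    }
  }
}
```

The rendered text is the same in both modes; only the number of intermediate copies changes.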
…Chars ascii mask

Adds regression coverage:
- object_remove_key_directional: objectRemoveKey interaction with super /
  addSuper (`a+:`) merge and inline addSuper asserts.
- strip_chars_ascii_mask_directional: stripChars over the ASCII range.

He-Pin commented May 30, 2026

Superseded — split into focused, independently-measured PRs off current master (each output byte-identical, no benchmark code):

The manifest-JSON rendering work this PR was based on is already in master (da92dd1). Closing in favor of the smaller PRs above.

@He-Pin He-Pin closed this May 30, 2026
stephenamar-db pushed a commit that referenced this pull request Jun 3, 2026
## Motivation

`std.manifestTomlEx` had three sources of avoidable overhead on the hot
manifestation path:

1. **Synchronized writer.** `TomlRenderer` and `ManifestModule.evalRhs`
rendered into a `java.io.StringWriter`, whose backing `StringBuffer`
pays a monitor enter/exit on every `write`/`flush`. The
`FastMaterializeJsonRenderer` already uses the unsynchronized
`StringBuilderWriter` (#874); TOML did not.
2. **Redundant field lookups in `renderTableInternal`.** Each key's
`Val.Obj.value(k)` was resolved twice — once to classify scalar vs
section, then again to render or recurse. The cache deduplicates the
result, but the lookup itself still costs.
3. **Wasted indexing work.** `visibleKeyNames` was iterated and each key
binary-searched back into `sortedVisibleKeyNames` —
`sortedVisibleKeyNames` can be iterated directly, skipping `O(n log n)`
compares per table.

## Modification

Two commits:

- **`perf: use unsynchronized StringBuilderWriter in TomlRenderer`** —
Swap `TomlRenderer` and the `manifestTomlEx` render path in
`ManifestModule` from `java.io.StringWriter` to the package-private
`StringBuilderWriter`. `std.deepJoin` keeps `StringWriter` (separate
concern).
- **`perf: cache resolved field values and skip binary search in
renderTableInternal`** — Resolve each field once into a `resolved:
Array[Val]` during section classification and reuse it during
render/recurse; iterate `sortedVisibleKeyNames` directly (removes the
now-unused `sortedKeyIndex` binary search); hoist `childIndent =
cumulatedIndent + indent` out of the section loop (was an identical
allocation per sibling section); pre-size the output
`StringBuilderWriter` to 1 KiB so small/medium outputs skip the first ~6
doublings.

Output is byte-identical (verified at 1,228,186 bytes on the benchmark
workload).
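
A minimal sketch of the second commit's idea, assuming a toy value model rather than sjsonnet's `Val.Obj`: each field is resolved once into a `resolved` array while scalars are emitted, and the section pass reuses that array while iterating the sorted keys directly. The types `Scalar` and `Table`, the `lookups` counter, and the simplified TOML output are all hypothetical.

```scala
object TomlResolveOnceSketch {
  sealed trait V
  final case class Scalar(text: String) extends V
  final case class Table(fields: Map[String, V]) extends V

  // Counts field resolutions so the single-lookup property is observable.
  var lookups = 0
  private def value(t: Table, k: String): V = { lookups += 1; t.fields(k) }

  def render(t: Table, path: String, out: java.lang.StringBuilder): Unit = {
    val keys = t.fields.keys.toArray.sorted // iterate sorted keys directly
    val resolved = new Array[V](keys.length)
    var i = 0
    // Pass 1: resolve every field once; emit scalars, remember sections.
    while (i < keys.length) {
      val v = value(t, keys(i))
      resolved(i) = v
      v match {
        case Scalar(text) => out.append(keys(i)).append(" = ").append(text).append('\n')
        case _: Table     => ()
      }
      i += 1
    }
    // Pass 2: recurse into sections using the cached values, no second lookup.
    i = 0
    while (i < keys.length) {
      resolved(i) match {
        case sub: Table =>
          val subPath = if (path.isEmpty) keys(i) else path + "." + keys(i)
          out.append('[').append(subPath).append("]\n")
          render(sub, subPath, out)
        case _: Scalar => ()
      }
      i += 1
    }
  }

  def main(args: Array[String]): Unit = {
    val doc = Table(Map("b" -> Scalar("1"), "a" -> Table(Map("x" -> Scalar("2")))))
    val out = new java.lang.StringBuilder(1024) // pre-sized, as in the commit
    render(doc, "", out)
    print(out)
    println(s"lookups: $lookups")
  }
}
```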

## Result

Scala Native, hyperfine A/B against `master` (`fc292fa6`). Workload:
object comprehension over 8000 small tables → ~1.2 MB TOML output
(render-dominated). Four interleaved-order passes, `--warmup 10
--min-runs 100 --shell=none`:

| pass | order | before mean | after mean | before min | after min | **min ratio** |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 59.4 ± 2.7 ms | 53.2 ± 23.4 ms | 55.4 ms | 43.8 ms | **1.27×** |
| 2 | after → before | 64.1 ± 7.7 ms | 51.8 ± 12.2 ms | 56.4 ms | 43.7 ms | **1.29×** |
| 3 | before → after | 64.1 ± 8.1 ms | 53.2 ± 14.3 ms | 56.4 ms | 42.0 ms | **1.34×** |
| 4 | after → before | 63.3 ± 14.3 ms | 49.2 ± 3.7 ms | 57.2 ms | 42.8 ms | **1.34×** |

Mean is noisy on the host (1.12× – 1.29×), but **after is faster in
every one of the 4 passes** and the **min values are tight at
~1.27–1.34× faster** (best observed: 42.0 ms vs 56.4 ms, ~25.5%
reduction). Output byte-identical, 1,228,186 bytes both sides.

For comparison, the StringBuilderWriter swap alone (commit 1) measures
~1.08–1.14× min; the cache + binary-search elimination + childIndent
hoist (commit 2) lifts that to ~1.27–1.34× min.

## Test plan

- [x] `./mill __.reformat`
- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test` — 519/519 pass
- [x] Scala Native A/B hyperfine — 4 interleaved-order passes, all
positive; output byte-identical

---

> Rebased onto current `master` (`fc292fa6`). The companion commit
"speed up manifest JSON rendering" was merged separately as #879, so
this PR now contains only the TomlRenderer / ManifestModule changes.
stephenamar-db pushed a commit that referenced this pull request Jun 3, 2026
## Motivation

`std.deepJoin` writes each `Val.Str` chunk into a `java.io.StringWriter`
inside a tight loop. `StringWriter`'s backing `StringBuffer` pays a
monitor enter/exit on every `write`/`append` call, which on a typical
deepJoin walk over a deeply nested array can be hundreds of thousands of
synchronized writes — wasted overhead in single-threaded jsonnet
evaluation.

`TomlRenderer` and `FastMaterializeJsonRenderer` already use the
unsynchronized package-private `StringBuilderWriter` for the same reason
(#874, #875). `std.deepJoin` was explicitly left as a follow-up in
#875's description (*"std.deepJoin keeps StringWriter (separate
concern)"*) — this PR is that follow-up.

## Modification

Single change in `ManifestModule.scala`: swap the `new StringWriter()`
in `DeepJoin.evalRhs` for `new StringBuilderWriter()`. No other code
changes; output is byte-identical.
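
A sketch of why the swap is output-neutral, under the assumption that the walk depends only on the `java.io.Writer` interface. The `Node` ADT and `deepJoin` function below are hypothetical simplifications, not the code in `ManifestModule.scala`, and the demo uses `StringWriter` because `StringBuilderWriter` is package-private to sjsonnet.

```scala
import java.io.{StringWriter, Writer}

object DeepJoinSketch {
  // Hypothetical stand-in for nested jsonnet arrays of strings.
  sealed trait Node
  final case class Str(s: String) extends Node
  final case class Arr(items: Seq[Node]) extends Node

  // The walk only needs Writer.write, so the concrete writer is swappable.
  def deepJoin(n: Node, out: Writer): Unit = n match {
    case Str(s)     => out.write(s) // one write call per string chunk
    case Arr(items) => items.foreach(deepJoin(_, out))
  }

  def main(args: Array[String]): Unit = {
    val doc = Arr(Seq(Str("a"), Arr(Seq(Str("b"), Arr(Seq(Str("c")))))))
    val out = new StringWriter() // the PR constructs a StringBuilderWriter here instead
    deepJoin(doc, out)
    println(out.toString)
  }
}
```

Because every chunk goes through one `write` call, the per-call monitor cost of `StringWriter` scales with the number of strings, which matches the motivation given above.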

## Result

Scala Native, hyperfine A/B against `master` (`fc292fa6`). Workload: a
50,000-row array of 10 pre-allocated strings → 2 MB of `deepJoin`
output, render-dominated. Four interleaved-order passes, `--warmup 10
--min-runs 100 --shell=none`:

| pass | order | before mean | after mean | before min | after min | **min ratio** |
|---|---|---:|---:|---:|---:|---:|
| 1 | before → after | 35.1 ± 16.5 ms | 32.2 ± 19.1 ms | 23.1 ms | 18.7 ms | **1.24×** |
| 2 | after → before | 43.7 ± 30.6 ms | 29.9 ± 25.3 ms | 25.7 ms | 20.3 ms | **1.27×** |
| 3 | before → after | 30.3 ± 8.5 ms | 29.5 ± 7.1 ms | 24.6 ms | 20.8 ms | **1.18×** |
| 4 | after → before | 32.6 ± 7.6 ms | 28.0 ± 6.8 ms | 24.0 ms | 20.7 ms | **1.16×** |

After is faster in every one of the 4 passes; mean is noisy on the host
but min values are tight at **1.16–1.27× faster** (best observed 18.7 vs
23.1 ms, ~19% reduction). Output byte-identical (2,000,000 bytes both
sides).

## Test plan

- [x] `./mill __.reformat`
- [x] `./mill 'sjsonnet.jvm[3.3.7]'.test` — 519/519 pass
- [x] Scala Native A/B hyperfine — 4 interleaved-order passes, all
positive; output byte-identical

---

> Independent of #875; can land in either order. After both land, the
`import java.io.StringWriter` in `ManifestModule.scala` can be removed
in a small cleanup.