PDF: separate transparent selection layer for text#577
Open
andiwand wants to merge 13 commits into
PDF text was emitted as one absolutely-positioned span per show-text segment, in paint order with no whitespace between runs. Browser text selection, copy and find-in-page all suffered: cross-run phrases were unmatchable, a kerning-split word was several spans, and selection drag jumped between unrelated boxes.

Split text into two layers in `html/pdf_file.cpp`:

- Visual layer (paint order, `user-select:none`): uniformly PUA glyphs in the embedded font; fallback runs render Unicode in a system font. Invisible runs (Tr 3/7) emit no visual span at all.
- Selection layer (content/reading order, transparent `.i`, selectable): one span per run carrying the real Unicode, anchored at the run origin, emitted after the visual content so the spans are contiguous and selection flows cleanly.

A content-order grouping sweep inserts a separator space on a line/column break or a wide intra-line gap (guarded against double spaces, which break literal find-in-page). Anchoring each run at its known origin keeps the highlight aligned without runtime measurement, so the output stays fully static (no JS).

Because all selectable text now lives in the selection layer, the visible layer no longer renders real Unicode, so the "collapse" machinery and real-Unicode `cmap` baking are removed and fonts are re-encoded PUA-only. Visual output is pixel-identical (PUA maps to the same glyphs); net −68 lines.

De-hyphenation and gap-based intra-line word separators are tracked as future work in TEXT_SELECTION_PLAN.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
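The grouping sweep described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in `html/pdf_file.cpp`: the `Run` struct, the function name `join_in_content_order`, and both thresholds are assumptions made for the example, and a line/column break is approximated here by a baseline change.

```cpp
#include <cmath>
#include <string>
#include <vector>

// Hypothetical run record; field names are illustrative only.
struct Run {
  double x;          // run origin, x
  double baseline;   // run origin, y
  double width;      // advance of the whole run
  double font_size;  // em size, same units as x and width
  std::string text;  // real Unicode, UTF-8
};

// Assumed thresholds in ems; the values used by the PR are not stated.
constexpr double kLineTolerance = 0.3;
constexpr double kWideGap = 0.25;

// One linear pass in content-stream order (no re-sorting). A single space is
// added on a baseline change or a wide intra-line gap, unless a space already
// sits at the boundary.
std::string join_in_content_order(const std::vector<Run>& runs) {
  std::string out;
  const Run* prev = nullptr;
  for (const Run& cur : runs) {
    if (cur.text.empty()) continue;
    if (prev != nullptr) {
      const double em = prev->font_size;
      const bool line_break =
          std::abs(cur.baseline - prev->baseline) > kLineTolerance * em;
      const double gap = cur.x - (prev->x + prev->width);
      const bool wide_gap = gap > kWideGap * em;
      // Guard against double spaces, which would break literal find-in-page.
      const bool has_space = out.back() == ' ' || cur.text.front() == ' ';
      if ((line_break || wide_gap) && !has_space) out += ' ';
    }
    out += cur.text;
    prev = &cur;
  }
  return out;
}
```

Passing the runs of a page in content-stream order yields the string that copy and find-in-page would see; the sketch ignores span placement entirely.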
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2bc6cf6ad
The selection-layer extent used `axis = hypot(m.a, m.b)`, the placement transform's x-axis length, which already folds in horizontal scaling (Tz) via the `params` factor. But `text.width` was also advanced with Tz in `segment_advances`, so `extent = text.width * axis` applied Tz twice. For condensed text (Tz < 100) this underestimated the run end and could inject separator spaces inside continuous words; for expanded text it could suppress real gaps. Divide the matrix x-axis by the Tz factor so `axis` is the bare text-matrix -> box scale: `extent` carries Tz exactly once, and `font_pt` tracks the Tz-free em the gap thresholds compare against. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
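The double-counting fix can be checked with a small worked example. The sketch below assumes a six-element placement matrix whose x-axis already includes the Tz factor, and a `text_width` that was also advanced with Tz; `Matrix` and `run_extent` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical affine matrix in PDF order [a b c d e f].
struct Matrix {
  double a, b, c, d, e, f;
};

// tz_factor is the horizontal scaling as a fraction (Tz 50 -> 0.5).
// text_width is assumed to already include Tz, as the commit describes.
double run_extent(const Matrix& m, double text_width, double tz_factor) {
  assert(tz_factor > 0.0);
  // Strip Tz from the x-axis length so it is the bare text-matrix -> box scale.
  const double axis = std::hypot(m.a, m.b) / tz_factor;
  // Tz is now carried exactly once, by text_width.
  return text_width * axis;
}
```

For a 10-unit font at Tz 50, the matrix x-axis is 5 and a run of 4 unscaled text units has `text_width` 2. Multiplying 2 by 5 directly would give 10, half the real extent; dividing the axis by 0.5 first gives 20.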
Wrap each page's two text layers in their own parents: a `vis` parent holding the graphics and unselectable glyph spans, and a `sel` parent holding the transparent selectable Unicode. The `vis` parent carries `aria-hidden="true"` so a screen reader reads only the real text in the selection layer instead of the PUA glyph code points. Both wrappers are unpositioned and zero-height (children are absolutely positioned), so the spans still anchor to the `.p` page box and stacking is unchanged.

In the selection layer, merge a run into the previous span when it is a tight same-baseline continuation with no whitespace at the boundary. This is the case where PDF splits a single word into several runs at a TJ kerning adjustment. The whole word then lives in one text node, so double-click selects it as a unit rather than stopping at the run boundary. A boundary that already carries a space stays a separate span, so word breaks remain word breaks and double-click still selects a single word.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
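The merge rule from this commit might look like the sketch below. The `Run` struct, the helper names, and the two tolerances are assumptions for illustration; the actual predicate in `html/pdf_file.cpp` is not shown in this PR text.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical run record; field names are illustrative only.
struct Run {
  double x, baseline, width, font_size;
  std::string text;
};

// Assumed tolerances in ems.
constexpr double kBaselineTolerance = 0.05;
constexpr double kTightGap = 0.1;

// True when cur is a tight same-baseline continuation of prev with no
// whitespace on either side of the boundary.
bool continues_word(const Run& prev, const Run& cur) {
  if (prev.text.empty() || cur.text.empty()) return false;
  const double em = prev.font_size;
  const bool same_baseline =
      std::abs(cur.baseline - prev.baseline) <= kBaselineTolerance * em;
  const double gap = cur.x - (prev.x + prev.width);
  const bool tight = std::abs(gap) <= kTightGap * em;
  const bool space_at_boundary =
      prev.text.back() == ' ' || cur.text.front() == ' ';
  return same_baseline && tight && !space_at_boundary;
}

// Returns the text of each selection span after merging kerning-split runs.
std::vector<std::string> merge_spans(const std::vector<Run>& runs) {
  std::vector<std::string> spans;
  for (std::size_t i = 0; i < runs.size(); ++i) {
    if (i > 0 && continues_word(runs[i - 1], runs[i])) {
      spans.back() += runs[i].text;  // same word: extend the previous span
    } else {
      spans.push_back(runs[i].text);
    }
  }
  return spans;
}
```

A word split by a small negative kerning adjustment collapses into one span, while a run whose boundary already carries a space stays separate.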
Use a dedicated `escape_selection_text` for the transparent selection spans instead of `html::escape_text`. The general helper rewrites leading, trailing and doubled spaces to `&nbsp;` and also rewrites tabs, which is wrong for this layer: the spans carry `white-space:pre` so every space already renders, and a non-breaking space neither matches a normal space in find-in-page nor lets double-click break between words.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
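A helper of this kind could be as small as the sketch below. Only the name `escape_selection_text` comes from the commit; the signature and body are assumptions that escape the HTML-significant characters and leave all whitespace untouched.

```cpp
#include <string>
#include <string_view>

// Sketch only: escapes markup characters and passes spaces and tabs through
// unchanged, relying on white-space:pre to render them.
std::string escape_selection_text(std::string_view text) {
  std::string out;
  out.reserve(text.size());
  for (const char c : text) {
    switch (c) {
      case '&': out += "&amp;"; break;
      case '<': out += "&lt;"; break;
      case '>': out += "&gt;"; break;
      default: out += c; break;  // ordinary U+0020 stays a normal space
    }
  }
  return out;
}
```

With this shape, `" a  b "` comes back byte-for-byte identical, so find-in-page still matches a normal space.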
The selection spans render real Unicode in the browser's system font, whose advances differ from the embedded glyphs, so an active highlight was noticeably wider or narrower than the visible run. Each selection span now carries its true advance in `data-w` (CSS px), and a small on-load script corrects the box with a horizontal `scaleX` = target / measured about the run's left origin.

The script runs per page, lazily, via IntersectionObserver: a large document only pays for the pages actually scrolled into view rather than one whole-document pass on load. Within a page it reads every width first and writes every transform second to avoid a per-span reflow.

Upright runs only: a run carrying a rotation/skew matrix is left untouched, since its on-screen box is a rotated bounding box, not the local advance. The page stays fully usable without JS; this only refines the highlight rectangle, so it degrades gracefully where scripts are blocked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
The inferred inter-word separator (and the sweep's own gap separator) was prepended to the new selection span, so with white-space:pre it rendered a space before the first glyph at the run origin. A double-click excludes surrounding whitespace, selecting the word but leaving that leading-space cell, so the highlight started a space-width left of the text. Hang the separator off the trailing end of the previous span instead, peeling any leading space off the new run's text. Every span now starts at its first glyph; the separator is deduped so copy/find-in-page still get exactly one space across the boundary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WzLBJNxSU8rLosZgoEBiyM
Emit the inter-word/line-break separator as its own selection span with no fit width (data-w) instead of folding it into the previous glyph span's text. A trailing separator space has no visible glyph to map onto, so the on-load scaleX fit could not both land the word and collapse the space, squeezing the word. The fit script skips spans without data-w, so glyph spans now scale cleanly. The separator reuses the previous run's placement and is deduped against a space already ending the previous run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF
The merge/no-merge boundary comment claimed words stay in separate spans to keep double-click from grabbing the whole phrase. That is false here: the separators are ordinary U+0020, which the browser breaks on across span boundaries regardless, so merging would not affect double-click. The actual reason is placement — each span is positioned at its own run origin and gets one uniform scaleX fit about its left edge, which a positional inter-word gap cannot survive. Rewrite the comment accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF
The selection-layer plan is implemented (separate transparent selection layer, content-order split/merge sweep, per-run origin anchoring, on-load scaleX fit). Compress its decisions into the pdf AGENTS.md design notes — including the two reversals from the plan's "fixed" decisions (output is no longer fully JS-free; scaleX is kept, not dropped) — and record the deferred items (de-hyphenation, gap-based word separators, semantic structure) under Other known gaps. Remove the now-redundant plan file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF
Group the transparent selection runs into one absolutely-positioned container per PDF line whose run spans flow inline, instead of one absolutely-positioned span per run. Native within-line selection, double-click and find-in-page now work and the run boxes are real.

Horizontal placement within a line is purely cumulative: each inter-run separator span's `data-w` is the gap width, so word advances and gaps telescope to each run's true x-offset (wide table-column gaps reproduced, not collapsed). The on-load fit switches from `scaleX` to `letter-spacing` = (target - measured) / glyph_count (negative to squeeze), which is consumed during layout so the box grows and the next run flows from the corrected edge. The selection placement reuses the glyph layer's origin minus the Tc/Tw spacing classes (the fit subsumes them).

Rotated/skewed (matrix) runs cannot flow or be fit, so each keeps its own single-run line block positioned by its matrix with no `data-w`, reproducing the old per-run absolute placement. Visual glyph layer unchanged. First rung toward native selection; paragraph-level grouping is deferred.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
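The arithmetic behind the cumulative placement and the `letter-spacing` fit can be illustrated as below. The real fit runs in the browser's on-load script; this C++ sketch only reproduces the formulas stated in the commit, and both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Per-glyph spacing that stretches (positive) or squeezes (negative) a run
// from its measured width to the target width carried in data-w.
double letter_spacing_fit(double target, double measured, int glyph_count) {
  assert(glyph_count > 0);
  return (target - measured) / glyph_count;
}

// Inline flow: each span starts where the previous one ended, so word
// advances and separator gap widths telescope to each span's x-offset.
std::vector<double> inline_offsets(const std::vector<double>& widths) {
  std::vector<double> offsets;
  offsets.reserve(widths.size());
  double x = 0.0;
  for (const double w : widths) {
    offsets.push_back(x);
    x += w;
  }
  return offsets;
}
```

For a 5-glyph run measured at 90 px with a 100 px target, the fit is 2 px per glyph. For a line made of a 30 px word, a 5 px separator and a 20 px word, the spans start at 0, 30 and 35 px.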
The per-line container carries `white-space:pre` (from `.t`), so the newlines + indentation the HtmlWriter emits between its now-inline run spans rendered as real whitespace and shoved the runs onto a new line / indented them. Mark the container `set_inline` so the writer emits the whole line tight; this is a formatting flag only and does not change the element's CSS display. Also correct the stale `SpanOut` comment that still described both layers as flat. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
🤖 Generated with Claude Code
What
Reworks PDF→HTML text so that selection, find-in-page, and copy work natively — without any JavaScript and without changing the visual rendering.
PDFs draw text as many absolutely-positioned runs (one per `Tj`/`TJ` string) at arbitrary page coordinates. That is great for pixel-perfect imaging but hostile to the browser's text engine: phrases that cross a run boundary can't be found, drag-selection jumps between unrelated boxes, and copy order is unreliable.
How
Split the output into two static layers:
- Visual layer (paint order, `user-select:none`): the existing absolutely-positioned PUA-glyph spans, untouched, so rendering stays pixel-identical.
- Selection layer (content/reading order, transparent `.i`): one span per run carrying the real Unicode, anchored at the run origin, emitted contiguously per page. A linear O(n) sweep in content-stream order (not a global re-sort, which would scramble columns/tables) inserts a single space separator on a line/column break or a wide intra-line gap, guarded against double spaces so literal find-in-page keeps working.
Because all common-case selectable text now lives in the selection layer, the visible layer no longer needs to bake real Unicode into the embedded font. The "collapse" machinery is removed and fonts are re-encoded PUA-only; visual output stays identical (PUA maps to the same glyphs), while the DOM and font subset get simpler. Net `pdf_file.cpp` change is −68 lines.
De-hyphenation is intentionally deferred (genuinely ambiguous without a soft-hyphen signal) and tracked as future work.
Plan
Includes `src/odr/internal/pdf/TEXT_SELECTION_PLAN.md` documenting the approach, the rejected alternatives (`scaleX` needs JS runtime measurement; `position:relative` can't reclaim flow space), and future work.
Verification
`odr` + `odr_test` build clean; all 26 PDF HTML output tests pass. `compare-html` validates visual equivalence; regenerating it would only compare new-vs-new.