PDF: separate transparent selection layer for text by andiwand · Pull Request #577 · opendocument-app/OpenDocument.core

andiwand · 2026-06-29T17:15:48Z

🤖 Generated with Claude Code

What

Reworks PDF→HTML text so that selection, find-in-page, and copy work natively — without any JavaScript and without changing the visual rendering.

PDFs draw text as many absolutely-positioned runs (one per Tj / TJ string) at arbitrary page coordinates. That is great for pixel-perfect imaging but hostile to the browser's text engine: phrases that cross a run boundary can't be found, drag-selection jumps between unrelated boxes, and copy order is unreliable.

How

Split the output into two static layers:

Visual layer (paint order, user-select:none): the existing absolutely-positioned PUA-glyph spans, untouched, so rendering stays pixel-identical.
Selection layer (content/reading order, transparent .i): one span per run carrying the real Unicode, anchored at the run origin, emitted contiguously per page. A linear O(n) sweep in content-stream order (not a global re-sort, which would scramble columns/tables) inserts a single space separator on a line/column break or a wide intra-line gap, guarded against double spaces so literal find-in-page keeps working.
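The separator sweep described above can be sketched as follows. This is a minimal illustration, not the code in `pdf_file.cpp`: the `Run` struct, the `page_text` function and both threshold constants are assumptions made for the example. It walks the runs once in content-stream order and adds one space on a baseline change or a wide same-line gap, unless a space is already present at the boundary.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical run record; field names are illustrative only.
struct Run {
  std::string text;  // real Unicode text of the run (UTF-8)
  double x;          // run origin x, page units
  double y;          // baseline y, page units
  double width;      // advance of the run, page units
  double font_size;  // em size, page units
};

// Assumed thresholds, expressed as fractions of the em.
constexpr double kBaselineTolerance = 0.5;
constexpr double kGapThreshold = 0.25;

// One linear pass in content-stream order (no re-sorting).
// Returns the page text as copy and find-in-page would see it.
std::string page_text(const std::vector<Run> &runs) {
  std::string out;
  for (std::size_t i = 0; i < runs.size(); ++i) {
    const Run &cur = runs[i];
    if (cur.text.empty()) {
      continue;
    }
    if (i > 0 && !out.empty()) {
      const Run &prev = runs[i - 1];
      // A baseline jump is treated as a line or column break.
      bool new_line =
          std::abs(cur.y - prev.y) > kBaselineTolerance * cur.font_size;
      // A wide horizontal gap on the same baseline is treated as a word gap.
      double gap = cur.x - (prev.x + prev.width);
      bool wide_gap = !new_line && gap > kGapThreshold * cur.font_size;
      // Guard against double spaces, which would break literal find-in-page.
      bool has_space = out.back() == ' ' || cur.text.front() == ' ';
      if ((new_line || wide_gap) && !has_space) {
        out += ' ';
      }
    }
    out += cur.text;
  }
  return out;
}
```

With runs `"Hello"`, `"world"` (after a wide gap) and `"next"` (on a lower baseline), the sketch yields `Hello world next`. A run that already ends in a space does not get a second one.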

Because all common-case selectable text now lives in the selection layer, the visible layer no longer needs to bake real Unicode into the embedded font. The "collapse" machinery is removed and fonts are re-encoded PUA-only — visual output stays identical (PUA maps to the same glyphs), while the DOM and font subset get simpler. Net pdf_file.cpp change is −68 lines.

De-hyphenation is intentionally deferred (genuinely ambiguous without a soft-hyphen signal) and tracked as future work.

Plan

Includes src/odr/internal/pdf/TEXT_SELECTION_PLAN.md documenting the approach, the rejected alternatives (scaleX needs JS runtime measurement; position:relative can't reclaim flow space), and future work.

Verification

odr + odr_test build clean; all 26 PDF HTML output tests pass.
Reference output is intentionally unchanged — CI's perceptual compare-html validates visual equivalence; regenerating it would only compare new-vs-new.
Output remains JS-free.

PDF text was emitted as one absolutely-positioned span per show-text segment, in paint order with no whitespace between runs. Browser text selection, copy and find-in-page all suffered: cross-run phrases were unmatchable, a kerning-split word was several spans, and selection drag jumped between unrelated boxes.

Split text into two layers in `html/pdf_file.cpp`:

- Visual layer (paint order, `user-select:none`): uniformly PUA glyphs in the embedded font; fallback runs render Unicode in a system font. Invisible runs (Tr 3/7) emit no visual span at all.
- Selection layer (content/reading order, transparent `.i`, selectable): one span per run carrying the real Unicode, anchored at the run origin, emitted after the visual content so the spans are contiguous and selection flows cleanly.

A content-order grouping sweep inserts a separator space on a line/column break or a wide intra-line gap (guarded against double spaces, which break literal find-in-page). Anchoring each run at its known origin keeps the highlight aligned without runtime measurement, so the output stays fully static (no JS).

Because all selectable text now lives in the selection layer, the visible layer no longer renders real Unicode, so the "collapse" machinery and real-Unicode `cmap` baking are removed and fonts are re-encoded PUA-only. Visual output is pixel-identical (PUA maps to the same glyphs); net −68 lines.

De-hyphenation and gap-based intra-line word separators are tracked as future work in TEXT_SELECTION_PLAN.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2bc6cf6ad


The selection-layer extent used `axis = hypot(m.a, m.b)`, the placement transform's x-axis length, which already folds in horizontal scaling (Tz) via the `params` factor. But `text.width` was also advanced with Tz in `segment_advances`, so `extent = text.width * axis` applied Tz twice. For condensed text (Tz < 100) this underestimated the run end and could inject separator spaces inside continuous words; for expanded text it could suppress real gaps. Divide the matrix x-axis by the Tz factor so `axis` is the bare text-matrix -> box scale: `extent` carries Tz exactly once, and `font_pt` tracks the Tz-free em the gap thresholds compare against. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
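The double-counting fix can be illustrated with a small sketch. The `Matrix` struct, the `run_extent` function and its parameter names are assumptions for this example. It also assumes that the placement matrix's x-axis already includes the Tz factor and that `text_width` was advanced with Tz, as the commit message describes.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical 2D affine matrix in PDF order [a b c d e f].
struct Matrix {
  double a, b, c, d, e, f;
};

// text_width: run advance in text space, already scaled by Tz.
// tz_percent: the Tz operand (100 means unscaled).
double run_extent(const Matrix &m, double text_width, double tz_percent) {
  double tz = tz_percent / 100.0;
  assert(tz != 0.0);
  // Dividing out Tz leaves the bare text-matrix -> box scale, so the
  // product below carries Tz exactly once (through text_width).
  double axis = std::hypot(m.a, m.b) / std::abs(tz);
  return text_width * axis;
}
```

For a 10-unit font at Tz = 50, a run whose unscaled advance is 2 em has `text_width` = 1 and `m.a` = 5. The uncorrected product is 5, while the corrected extent is 10, which matches 2 em × 10 × 0.5.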

Wrap each page's two text layers in their own parents: a `vis` parent holding the graphics and unselectable glyph spans, and a `sel` parent holding the transparent selectable Unicode. The `vis` parent carries `aria-hidden="true"` so a screen reader reads only the real text in the selection layer instead of the PUA glyph code points. Both wrappers are unpositioned and zero-height (children are absolutely positioned), so the spans still anchor to the `.p` page box and stacking is unchanged. In the selection layer, merge a run into the previous span when it is a tight same-baseline continuation with no whitespace at the boundary — the case where PDF splits a single word into several runs at a TJ kerning adjustment. The whole word then lives in one text node, so double-click selects it as a unit rather than stopping at the run boundary. A boundary that already carries a space stays a separate span, so word breaks remain word breaks and double-click still selects a single word. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
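A rough sketch of the merge rule, under stated assumptions: `SelRun`, `should_merge`, `merge_runs` and the two tolerance constants are hypothetical names and values chosen for illustration. A run is folded into the previous span only when it sits on the same baseline, starts where the previous run ends (within a small tolerance), and neither side of the boundary carries a space.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical selection-layer run; field names are illustrative only.
struct SelRun {
  std::string text;
  double x;          // run origin x
  double y;          // baseline y
  double width;      // run advance
  double font_size;  // em size
};

// Assumed tolerances, as fractions of the em.
constexpr double kBaselineTol = 0.1;
constexpr double kTightGap = 0.1;

bool should_merge(const SelRun &prev, const SelRun &cur) {
  if (prev.text.empty() || cur.text.empty()) {
    return false;
  }
  // A boundary that already carries a space stays a separate span.
  if (prev.text.back() == ' ' || cur.text.front() == ' ') {
    return false;
  }
  bool same_baseline =
      std::abs(cur.y - prev.y) <= kBaselineTol * cur.font_size;
  double gap = cur.x - (prev.x + prev.width);
  // A TJ kerning adjustment moves the origin only slightly.
  bool tight = std::abs(gap) <= kTightGap * cur.font_size;
  return same_baseline && tight;
}

// Returns the text of each resulting selection span.
std::vector<std::string> merge_runs(const std::vector<SelRun> &runs) {
  std::vector<std::string> spans;
  for (std::size_t i = 0; i < runs.size(); ++i) {
    if (i > 0 && !spans.empty() && should_merge(runs[i - 1], runs[i])) {
      spans.back() += runs[i].text;  // the whole word shares one text node
    } else {
      spans.push_back(runs[i].text);
    }
  }
  return spans;
}
```

With this sketch, `"Ta"` followed by a slightly kerned `"ble"` becomes one span `"Table"`, while `" of"` stays separate because its boundary carries a space.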

Use a dedicated escape_selection_text for the transparent selection spans instead of html::escape_text. The general helper rewrites leading, trailing and doubled spaces to &nbsp; and tabs to &emsp;, which is wrong for this layer: the spans carry white-space:pre so every space already renders, and a non-breaking space neither matches a normal space in find-in-page nor lets double-click break between words. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
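As a hedged illustration of the idea, a whitespace-preserving escaper might look like the sketch below. The function name comes from the commit message, but the signature and body here are assumptions, not the project's implementation: only the HTML-significant characters are replaced, and spaces and tabs pass through unchanged.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Sketch only: escapes markup characters and leaves all whitespace as-is,
// relying on white-space:pre in the selection spans to render it.
std::string escape_selection_text(std::string_view text) {
  std::string out;
  out.reserve(text.size());
  for (char c : text) {
    switch (c) {
    case '&':
      out += "&amp;";
      break;
    case '<':
      out += "&lt;";
      break;
    case '>':
      out += "&gt;";
      break;
    default:
      out += c;  // includes ' ' and '\t', never rewritten to entities
      break;
    }
  }
  return out;
}
```

Leading, trailing and doubled spaces survive as ordinary U+0020, so a literal find-in-page query containing spaces can still match.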

The selection spans render real Unicode in the browser's system font, whose advances differ from the embedded glyphs, so an active highlight was noticeably wider or narrower than the visible run. Each selection span now carries its true advance in `data-w` (CSS px), and a small on-load script corrects the box with a horizontal `scaleX` = target / measured about the run's left origin. The script runs per page, lazily, via IntersectionObserver: a large document only pays for the pages actually scrolled into view rather than one whole-document pass on load. Within a page it reads every width first and writes every transform second to avoid a per-span reflow. Upright runs only — a run carrying a rotation/skew matrix is left untouched, since its on-screen box is a rotated bounding box, not the local advance. The page stays fully usable without JS; this only refines the highlight rectangle, so it degrades gracefully where scripts are blocked. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
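The fit itself runs as a script in the browser. The sketch below only restates its arithmetic in C++, with hypothetical names (`fit_scale_x`, `has_matrix`): the factor is target / measured, and spans without a target width or with a rotation/skew matrix are left alone.

```cpp
#include <cassert>
#include <optional>

// target_px: the span's true advance (data-w), absent if none was emitted.
// measured_px: the width the browser laid out in the system font.
// has_matrix: true for rotated/skewed runs, which are not fitted.
std::optional<double> fit_scale_x(std::optional<double> target_px,
                                  double measured_px, bool has_matrix) {
  if (has_matrix || !target_px || measured_px <= 0.0) {
    return std::nullopt;  // leave the span untouched
  }
  return *target_px / measured_px;
}
```

A span measured at 40 px with a 50 px target gets a factor of 1.25. Because the scale is applied about the run's left origin, the highlight keeps its start and only its right edge moves.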

The inferred inter-word separator (and the sweep's own gap separator) was prepended to the new selection span, so with white-space:pre it rendered a space before the first glyph at the run origin. A double-click excludes surrounding whitespace, selecting the word but leaving that leading-space cell, so the highlight started a space-width left of the text. Hang the separator off the trailing end of the previous span instead, peeling any leading space off the new run's text. Every span now starts at its first glyph; the separator is deduped so copy/find-in-page still get exactly one space across the boundary. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01WzLBJNxSU8rLosZgoEBiyM

Emit the inter-word/line-break separator as its own selection span with no fit width (data-w) instead of folding it into the previous glyph span's text. A trailing separator space has no visible glyph to map onto, so the on-load scaleX fit could not both land the word and collapse the space, squeezing the word. The fit script skips spans without data-w, so glyph spans now scale cleanly. The separator reuses the previous run's placement and is deduped against a space already ending the previous run. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF
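The separator handling from this commit and the previous one can be sketched together. Everything named here (`SelSpan`, `append_run`, `needs_separator`) is hypothetical. The sketch peels leading spaces off the incoming run, emits at most one separator span with no fit width, and skips it when the previous span already ends in a space. The run's width is passed through unchanged, which is a simplification.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical selection span: separators carry no fit width (no data-w).
struct SelSpan {
  std::string text;
  std::optional<double> fit_width_px;
};

void append_run(std::vector<SelSpan> &spans, std::string text,
                double width_px, bool needs_separator) {
  // Peel leading spaces so the glyph span starts at its first glyph.
  bool had_leading_space = false;
  while (!text.empty() && text.front() == ' ') {
    text.erase(text.begin());
    had_leading_space = true;
  }
  bool prev_ends_with_space = !spans.empty() &&
                              !spans.back().text.empty() &&
                              spans.back().text.back() == ' ';
  // Exactly one space across the boundary, as its own unfitted span.
  if (!spans.empty() && (needs_separator || had_leading_space) &&
      !prev_ends_with_space) {
    spans.push_back(SelSpan{" ", std::nullopt});
  }
  if (!text.empty()) {
    spans.push_back(SelSpan{std::move(text), width_px});
  }
}
```

Appending `"Hello"`, then `" world"`, then `"again"` with `needs_separator` set produces five spans whose concatenation is `Hello world again`. Only the three word spans carry a fit width.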

The merge/no-merge boundary comment claimed words stay in separate spans to keep double-click from grabbing the whole phrase. That is false here: the separators are ordinary U+0020, which the browser breaks on across span boundaries regardless, so merging would not affect double-click. The actual reason is placement — each span is positioned at its own run origin and gets one uniform scaleX fit about its left edge, which a positional inter-word gap cannot survive. Rewrite the comment accordingly. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF

The selection-layer plan is implemented (separate transparent selection layer, content-order split/merge sweep, per-run origin anchoring, on-load scaleX fit). Compress its decisions into the pdf AGENTS.md design notes — including the two reversals from the plan's "fixed" decisions (output is no longer fully JS-free; scaleX is kept, not dropped) — and record the deferred items (de-hyphenation, gap-based word separators, semantic structure) under Other known gaps. Remove the now-redundant plan file. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF

Group the transparent selection runs into one absolutely-positioned container per PDF line whose run spans flow inline, instead of one absolutely-positioned span per run. Native within-line selection, double-click and find-in-page now work and the run boxes are real. Horizontal placement within a line is purely cumulative: each inter-run separator span's data-w is the gap width, so word advances and gaps telescope to each run's true x-offset (wide table-column gaps reproduced, not collapsed). The on-load fit switches from scaleX to letter-spacing = (target - measured) / glyph_count (negative to squeeze), which is consumed during layout so the box grows and the next run flows from the corrected edge. The selection placement reuses the glyph layer's origin minus the Tc/Tw spacing classes (the fit subsumes them). Rotated/skewed (matrix) runs cannot flow or be fit, so each keeps its own single-run line block positioned by its matrix with no data-w, reproducing the old per-run absolute placement. Visual glyph layer unchanged. First rung toward native selection; paragraph-level grouping is deferred. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
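The two pieces of arithmetic in this commit, the letter-spacing fit and the telescoping of widths into x-offsets, can be written out as a short sketch. The names `letter_spacing_px`, `LineItem` and `item_offsets` are assumptions for illustration. The real fit runs in the browser script, so this only restates the formulas.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// letter-spacing = (target - measured) / glyph_count; negative squeezes.
double letter_spacing_px(double target_px, double measured_px,
                         std::size_t glyph_count) {
  if (glyph_count == 0) {
    return 0.0;  // nothing to distribute the correction over
  }
  return (target_px - measured_px) / static_cast<double>(glyph_count);
}

// Hypothetical inline item of a line container: a word run or a separator.
// For a separator, data_w is the gap width.
struct LineItem {
  double data_w;
};

// Because items flow inline, each item's x-offset from the line origin is
// the running sum of the widths before it.
std::vector<double> item_offsets(const std::vector<LineItem> &items) {
  std::vector<double> offsets;
  offsets.reserve(items.size());
  double x = 0.0;
  for (const LineItem &item : items) {
    offsets.push_back(x);
    x += item.data_w;
  }
  return offsets;
}
```

A five-glyph run measured at 60 px against a 50 px target gets a letter-spacing of -2 px. A line made of a 30 px word, a 200 px column gap and a 40 px word places the second word at 230 px, so the wide gap is reproduced rather than collapsed.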

The per-line container carries `white-space:pre` (from `.t`), so the newlines + indentation the HtmlWriter emits between its now-inline run spans rendered as real whitespace and shoved the runs onto a new line / indented them. Mark the container `set_inline` so the writer emits the whole line tight; this is a formatting flag only and does not change the element's CSS display. Also correct the stale `SpanOut` comment that still described both layers as flat. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

chatgpt-codex-connector Bot reviewed Jun 29, 2026


Comment thread src/odr/internal/html/pdf_file.cpp

andiwand and others added 12 commits June 29, 2026 20:11

update refs

dc86f71

try again with lfs checkout

dbaa18b

andiwand mentioned this pull request Jul 1, 2026: PDF text: dual-layer + single-layer rendering with PdfTextMode option #579 (open, 5 tasks)

andiwand wants to merge 13 commits into main from pdf-text-selection-layer
