PDF: separate transparent selection layer for text#577

Open
andiwand wants to merge 13 commits into main from pdf-text-selection-layer

@andiwand andiwand commented Jun 29, 2026

🤖 Generated with Claude Code

What

Reworks PDF→HTML text so that selection, find-in-page, and copy work natively — without any JavaScript and without changing the visual rendering.

PDFs draw text as many absolutely-positioned runs (one per Tj / TJ string) at arbitrary page coordinates. That is great for pixel-perfect imaging but hostile to the browser's text engine: phrases that cross a run boundary can't be found, drag-selection jumps between unrelated boxes, and copy order is unreliable.

How

Split the output into two static layers:

  • Visual layer (paint order, user-select:none): the existing absolutely-positioned PUA-glyph spans, untouched, so rendering stays pixel-identical.
  • Selection layer (content/reading order, transparent .i): one span per run carrying the real Unicode, anchored at the run origin, emitted contiguously per page. A linear O(n) sweep in content-stream order (not a global re-sort, which would scramble columns/tables) inserts a single space separator on a line/column break or a wide intra-line gap, guarded against double spaces so literal find-in-page keeps working.
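
The separator sweep described above can be sketched roughly as follows. This is a minimal illustration, not the code in `pdf_file.cpp`: the `Run` struct, `join_runs`, and both thresholds are hypothetical names and values chosen for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical run record; field names are illustrative only.
struct Run {
  std::string text; // real Unicode (UTF-8) of the run
  double x;         // run origin, page units
  double y;         // baseline, page units
  double width;     // advance of the run, page units
  double font_size; // em size, page units
};

// Single pass in content-stream order (no re-sort). A separator space is
// added between two runs on a baseline change or a wide horizontal gap,
// unless a space already sits on either side of the boundary.
std::string join_runs(const std::vector<Run> &runs) {
  constexpr double line_tolerance = 0.5; // fraction of em (assumed value)
  constexpr double gap_threshold = 0.25; // fraction of em (assumed value)
  std::string out;
  for (std::size_t i = 0; i < runs.size(); ++i) {
    const Run &cur = runs[i];
    assert(cur.font_size > 0.0);
    if (i > 0) {
      const Run &prev = runs[i - 1];
      const bool new_line =
          std::abs(cur.y - prev.y) > line_tolerance * prev.font_size;
      const double gap = cur.x - (prev.x + prev.width);
      const bool wide_gap = !new_line && gap > gap_threshold * prev.font_size;
      // Guard against double spaces so literal find-in-page keeps matching.
      const bool has_space = (!out.empty() && out.back() == ' ') ||
                             (!cur.text.empty() && cur.text.front() == ' ');
      if ((new_line || wide_gap) && !has_space) {
        out += ' ';
      }
    }
    out += cur.text;
  }
  return out;
}
```

Because each run is compared only with its predecessor, the pass stays linear and preserves content-stream order, which is what keeps columns and tables from being interleaved.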

Because all common-case selectable text now lives in the selection layer, the visible layer no longer needs to bake real Unicode into the embedded font. The "collapse" machinery is removed and fonts are re-encoded PUA-only — visual output stays identical (PUA maps to the same glyphs), while the DOM and font subset get simpler. Net pdf_file.cpp change is −68 lines.

De-hyphenation is intentionally deferred (genuinely ambiguous without a soft-hyphen signal) and tracked as future work.

Plan

Includes src/odr/internal/pdf/TEXT_SELECTION_PLAN.md documenting the approach, the rejected alternatives (scaleX needs JS runtime measurement; position:relative can't reclaim flow space), and future work.

Verification

  • odr and odr_test build cleanly; all 26 PDF HTML output tests pass.
  • Reference output is intentionally unchanged — CI's perceptual compare-html validates visual equivalence; regenerating it would only compare new-vs-new.
  • Output remains JS-free.

PDF text was emitted as one absolutely-positioned span per show-text
segment, in paint order with no whitespace between runs. Browser text
selection, copy and find-in-page all suffered: cross-run phrases were
unmatchable, a kerning-split word was several spans, and selection drag
jumped between unrelated boxes.

Split text into two layers in `html/pdf_file.cpp`:

- Visual layer (paint order, `user-select:none`): uniformly PUA glyphs in
  the embedded font; fallback runs render Unicode in a system font.
  Invisible runs (Tr 3/7) emit no visual span at all.
- Selection layer (content/reading order, transparent `.i`, selectable):
  one span per run carrying the real Unicode, anchored at the run origin,
  emitted after the visual content so the spans are contiguous and selection
  flows cleanly. A content-order grouping sweep inserts a separator space on
  a line/column break or a wide intra-line gap (guarded against double
  spaces, which break literal find-in-page).

Anchoring each run at its known origin keeps the highlight aligned without
runtime measurement, so the output stays fully static (no JS). Because all
selectable text now lives in the selection layer, the visible layer no
longer renders real Unicode, so the "collapse" machinery and real-Unicode
`cmap` baking are removed and fonts are re-encoded PUA-only. Visual output
is pixel-identical (PUA maps to the same glyphs); net −68 lines.

De-hyphenation and gap-based intra-line word separators are tracked as
future work in TEXT_SELECTION_PLAN.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2bc6cf6ad

Comment thread src/odr/internal/html/pdf_file.cpp
andiwand and others added 12 commits June 29, 2026 20:11
The selection-layer extent used `axis = hypot(m.a, m.b)`, the placement
transform's x-axis length, which already folds in horizontal scaling (Tz)
via the `params` factor. But `text.width` was also advanced with Tz in
`segment_advances`, so `extent = text.width * axis` applied Tz twice. For
condensed text (Tz < 100) this underestimated the run end and could inject
separator spaces inside continuous words; for expanded text it could
suppress real gaps.

Divide the matrix x-axis by the Tz factor so `axis` is the bare
text-matrix -> box scale: `extent` carries Tz exactly once, and `font_pt`
tracks the Tz-free em the gap thresholds compare against.
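
A small sketch of the arithmetic behind this fix, under stated assumptions: the `Matrix` struct and the function names are hypothetical, `tz` is the Tz factor as a fraction (Tz 80 becomes 0.8), and `text_width` is assumed to already include Tz, as the commit message says of `segment_advances`.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the placement transform; only the x-axis
// components (a, b) matter for this calculation.
struct Matrix {
  double a, b, c, d, e, f;
};

// Length of the placement x-axis with horizontal scaling divided back out,
// i.e. the bare text-matrix -> box scale.
double bare_axis(const Matrix &m, double tz) {
  assert(tz > 0.0);
  return std::hypot(m.a, m.b) / tz;
}

// text_width already carries Tz, so multiplying by the bare axis applies
// Tz exactly once.
double run_extent(double text_width, const Matrix &m, double tz) {
  return text_width * bare_axis(m, tz);
}
```

For example, with a box scale of 2 and Tz 80 the matrix x-axis is 1.6. A run whose Tz-free width is 10 has `text_width` 8, so the corrected extent is 8 × (1.6 / 0.8) = 16, whereas the old product 8 × 1.6 = 12.8 placed the run end too early.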

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
Wrap each page's two text layers in their own parents: a `vis` parent
holding the graphics and unselectable glyph spans, and a `sel` parent
holding the transparent selectable Unicode. The `vis` parent carries
`aria-hidden="true"` so a screen reader reads only the real text in the
selection layer instead of the PUA glyph code points. Both wrappers are
unpositioned and zero-height (children are absolutely positioned), so the
spans still anchor to the `.p` page box and stacking is unchanged.

In the selection layer, merge a run into the previous span when it is a
tight same-baseline continuation with no whitespace at the boundary —
the case where PDF splits a single word into several runs at a TJ kerning
adjustment. The whole word then lives in one text node, so double-click
selects it as a unit rather than stopping at the run boundary. A boundary
that already carries a space stays a separate span, so word breaks remain
word breaks and double-click still selects a single word.
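
The merge rule can be illustrated with the sketch below. It is an assumption-laden example rather than the actual implementation: `SelRun`, `is_tight_continuation`, `merge_runs`, and the 0.1 em tolerance are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical selection-layer run; names are illustrative only.
struct SelRun {
  std::string text;
  double x;         // run origin
  double y;         // baseline
  double width;     // advance
  double font_size; // em size
};

// True when cur continues prev on the same baseline with almost no gap and
// no whitespace on either side of the boundary.
bool is_tight_continuation(const SelRun &prev, const SelRun &cur) {
  assert(prev.font_size > 0.0);
  if (prev.text.empty() || cur.text.empty()) {
    return false;
  }
  const double tol = 0.1 * prev.font_size; // assumed tolerance
  const bool same_baseline = std::abs(cur.y - prev.y) < tol;
  const bool tight = std::abs(cur.x - (prev.x + prev.width)) < tol;
  const bool no_space = prev.text.back() != ' ' && cur.text.front() != ' ';
  return same_baseline && tight && no_space;
}

// Fold tight continuations into the previous span; anything else starts a
// new span.
std::vector<SelRun> merge_runs(const std::vector<SelRun> &runs) {
  std::vector<SelRun> spans;
  for (const SelRun &run : runs) {
    if (!spans.empty() && is_tight_continuation(spans.back(), run)) {
      SelRun &last = spans.back();
      last.text += run.text;
      last.width = run.x + run.width - last.x; // extend to the new run end
    } else {
      spans.push_back(run);
    }
  }
  return spans;
}
```

A word split as "Wor" + "ld" by a kerning adjustment ends up in one span, while a following " peace" run stays separate because the boundary already carries a space.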

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
Use a dedicated escape_selection_text for the transparent selection
spans instead of html::escape_text. The general helper rewrites leading,
trailing and doubled spaces to &nbsp; and tabs to &emsp;, which is wrong
for this layer: the spans carry white-space:pre so every space already
renders, and a non-breaking space neither matches a normal space in
find-in-page nor lets double-click break between words.
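
As a rough sketch of what such an escaper might look like (the body here is an assumption, only the name `escape_selection_text` comes from the commit message): it rewrites only markup-significant characters and leaves spaces and tabs alone.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Sketch only: escape '&', '<' and '>' and pass every other byte through,
// including ordinary spaces and tabs. The spans carry white-space:pre, so
// no &nbsp; or &emsp; substitution is needed.
std::string escape_selection_text(std::string_view text) {
  std::string out;
  out.reserve(text.size());
  for (const char c : text) {
    switch (c) {
    case '&':
      out += "&amp;";
      break;
    case '<':
      out += "&lt;";
      break;
    case '>':
      out += "&gt;";
      break;
    default:
      out += c;
      break;
    }
  }
  assert(out.size() >= text.size()); // escaping never shortens the text
  return out;
}
```

Keeping U+0020 intact is the point: `"a  b"` stays `"a  b"`, so a literal find-in-page query containing normal spaces still matches.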

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
The selection spans render real Unicode in the browser's system font,
whose advances differ from the embedded glyphs, so an active highlight
was noticeably wider or narrower than the visible run. Each selection
span now carries its true advance in `data-w` (CSS px), and a small
on-load script corrects the box with a horizontal `scaleX` = target /
measured about the run's left origin.

The script runs per page, lazily, via IntersectionObserver: a large
document only pays for the pages actually scrolled into view rather than
one whole-document pass on load. Within a page it reads every width
first and writes every transform second to avoid a per-span reflow.
Upright runs only — a run carrying a rotation/skew matrix is left
untouched, since its on-screen box is a rotated bounding box, not the
local advance. The page stays fully usable without JS; this only refines
the highlight rectangle, so it degrades gracefully where scripts are
blocked.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NkoJuS4jaPGvUs1eVb8UbM
The inferred inter-word separator (and the sweep's own gap separator) was
prepended to the new selection span, so with white-space:pre it rendered a
space before the first glyph at the run origin. A double-click excludes
surrounding whitespace, selecting the word but leaving that leading-space
cell, so the highlight started a space-width left of the text.

Hang the separator off the trailing end of the previous span instead, peeling
any leading space off the new run's text. Every span now starts at its first
glyph; the separator is deduped so copy/find-in-page still get exactly one
space across the boundary.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01WzLBJNxSU8rLosZgoEBiyM
Emit the inter-word/line-break separator as its own selection span with
no fit width (data-w) instead of folding it into the previous glyph
span's text. A trailing separator space has no visible glyph to map
onto, so the on-load scaleX fit could not both land the word and
collapse the space, squeezing the word. The fit script skips spans
without data-w, so glyph spans now scale cleanly. The separator reuses
the previous run's placement and is deduped against a space already
ending the previous run.
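
One way to picture the separator handling is the sketch below. `SelSpan` and `append_run` are hypothetical, and the sketch assumes that leading spaces are still peeled off the incoming run (as in the earlier commit) and that `width_px` is the advance of the visible glyphs only.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical selection span; data_w models the data-w attribute and is
// empty for separator spans, which the fit script is said to skip.
struct SelSpan {
  std::string text;
  std::optional<double> data_w;
};

// Append one run. Leading spaces are peeled off and turned into a request
// for a separator; the separator becomes its own width-less span and is
// dropped when the previous span already ends in a space.
void append_run(std::vector<SelSpan> &spans, std::string text, double width_px,
                bool needs_separator) {
  assert(width_px >= 0.0);
  std::size_t first = text.find_first_not_of(' ');
  if (first == std::string::npos) {
    first = text.size(); // run consists of spaces only
  }
  if (first > 0) {
    needs_separator = true;
    text.erase(0, first);
  }
  if (needs_separator && !spans.empty() && spans.back().text.back() != ' ') {
    spans.push_back(SelSpan{" ", std::nullopt});
  }
  if (!text.empty()) {
    spans.push_back(SelSpan{std::move(text), width_px});
  }
}
```

Every glyph span then starts at its first glyph and carries a width the fit can target, while exactly one space survives across each boundary for copy and find-in-page.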

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF
The merge/no-merge boundary comment claimed words stay in separate spans
to keep double-click from grabbing the whole phrase. That is false here:
the separators are ordinary U+0020, which the browser breaks on across
span boundaries regardless, so merging would not affect double-click. The
actual reason is placement — each span is positioned at its own run origin
and gets one uniform scaleX fit about its left edge, which a positional
inter-word gap cannot survive. Rewrite the comment accordingly.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF
The selection-layer plan is implemented (separate transparent selection
layer, content-order split/merge sweep, per-run origin anchoring, on-load
scaleX fit). Compress its decisions into the pdf AGENTS.md design notes —
including the two reversals from the plan's "fixed" decisions (output is no
longer fully JS-free; scaleX is kept, not dropped) — and record the deferred
items (de-hyphenation, gap-based word separators, semantic structure) under
Other known gaps. Remove the now-redundant plan file.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_013sNCgZ9CyD6jRkd4tGNxRF
Group the transparent selection runs into one absolutely-positioned
container per PDF line whose run spans flow inline, instead of one
absolutely-positioned span per run. Native within-line selection,
double-click and find-in-page now work, and each run has a real inline
layout box.

Horizontal placement within a line is purely cumulative: each inter-run
separator span's data-w is the gap width, so word advances and gaps
telescope to each run's true x-offset (wide table-column gaps reproduced,
not collapsed). The on-load fit switches from scaleX to
letter-spacing = (target - measured) / glyph_count (negative to squeeze),
which is consumed during layout so the box grows and the next run flows
from the corrected edge. The selection placement reuses the glyph layer's
origin minus the Tc/Tw spacing classes (the fit subsumes them).
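
The two calculations in this paragraph can be written out as a short sketch. The function names are hypothetical, and the real fit runs in the on-load script rather than in C++; this only restates the arithmetic.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Per-glyph letter-spacing that stretches or squeezes a run from its
// measured width to its target width (negative values squeeze).
double letter_spacing_fit(double target, double measured,
                          std::size_t glyph_count) {
  assert(glyph_count > 0);
  return (target - measured) / static_cast<double>(glyph_count);
}

// With inline flow, each span starts where the previous ones end, so the
// x-offset of span i is the sum of the widths (word advances and gap
// separators alike) that precede it.
std::vector<double> run_offsets(const std::vector<double> &widths) {
  std::vector<double> offsets;
  offsets.reserve(widths.size());
  double x = 0.0;
  for (const double w : widths) {
    offsets.push_back(x);
    x += w;
  }
  return offsets;
}
```

For instance, a run measured at 52 px with a 48 px target and 8 glyphs gets a letter-spacing of (48 − 52) / 8 = −0.5 px, and a line with widths 40, 6, 30 (word, gap, word) places the second word at x = 46.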

Rotated/skewed (matrix) runs cannot flow or be fit, so each keeps its own
single-run line block positioned by its matrix with no data-w, reproducing
the old per-run absolute placement. Visual glyph layer unchanged.

First rung toward native selection; paragraph-level grouping is deferred.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
The per-line container carries `white-space:pre` (from `.t`), so the
newlines and indentation the HtmlWriter emits between its now-inline run
spans rendered as real whitespace, pushing the runs onto a new line and
indenting them. Mark the container `set_inline` so the writer emits the
whole line without whitespace between elements; this is a formatting flag
only and does not change the element's CSS display. Also correct the stale
`SpanOut` comment that still described both layers as flat.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq