PDF: single-layer text selection with gen-time margins #578

Open: andiwand wants to merge 2 commits into main from pdf-single-layer-selection

andiwand (Member) commented Jun 30, 2026

🤖 Generated with Claude Code

Summary

Implements a single-layer PDF text model (à la pdf2htmlEX) as an alternative to the current dual-layer approach on pdf-text-selection-layer. The full design rationale is in src/odr/internal/pdf/SINGLE_LAYER_SELECTION_PLAN.md.

How it works:

  • One absolutely-positioned <div> per PDF line; runs inside flow inline, each nudged by a gen-time margin-left = (pdf_x − prev_end) × pt_to_px.
  • Clean glyphs (unambiguous Unicode↔glyph) get real-Unicode cmap entries → the visible text is the DOM text → natively findable, selectable, copyable. No JS.
  • Unclean glyphs (ligatures, ambiguous cmap, no_unicode) are painted via CSS ::before{content:attr(data-g)} generated content — the PUA glyph is outside the DOM text stream, so it never breaks Ctrl+F or double-click mid-word. A zero-width .ov overlay carries the real Unicode alongside.
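The gen-time margin rule in the first bullet can be sketched as follows. This is a minimal illustration, not the branch's code: the `Run` struct, `margins_px`, and `line_x_pt` are hypothetical names, and `round2` is assumed to round to two decimals because the diff uses a helper of that name without showing its definition.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical run record: start x and total advance, both in PDF points.
struct Run {
  double x_pt;
  double advance_pt;
};

// Assumed behaviour of the `round2` helper named in the diff: two decimals.
double round2(double v) { return std::round(v * 100.0) / 100.0; }

// margin-left for each run on one line: (pdf_x - prev_end) * pt_to_px.
// `line_x_pt` is the left edge of the absolutely-positioned line <div>.
std::vector<double> margins_px(const std::vector<Run> &runs, double line_x_pt,
                               double pt_to_px) {
  assert(pt_to_px > 0.0);
  std::vector<double> out;
  out.reserve(runs.size());
  double prev_end = line_x_pt;
  for (const Run &run : runs) {
    out.push_back(round2((run.x_pt - prev_end) * pt_to_px));
    // Assumes the browser advances the run by exactly its PDF width.
    prev_end = run.x_pt + run.advance_pt;
  }
  return out;
}
```

A negative margin simply pulls a run left over the previous one, which is how small overlaps in the PDF would be reproduced under this sketch.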

Comparison with pdf-text-selection-layer:

| | pdf-text-selection-layer | this branch |
|---|---|---|
| Layers | 2 (visual PUA + transparent selection) | 1 (single selectable layer) |
| X-position fit | runtime JS letter-spacing | gen-time margin-left |
| Font for selection | unknown system font → JS fit needed | embedded font (advances known) |
| Ligature/no_unicode | PUA in DOM text + user-select:none | CSS generated content (out of text stream) |
| JS dependency | yes (on-load fit pass) | none |
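
For the Ligature/no_unicode row, the markup for one unclean glyph might be emitted roughly like this. It is a sketch under assumptions: `emit_unclean_glyph` and the wrapper class `g` are hypothetical, only `data-g` and `.ov` come from the description above, and the glyph is assumed to sit in the BMP private use area. A rule such as `::before{content:attr(data-g)}` on the wrapper would then paint the glyph.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Emit markup for one unclean glyph. `pua` is the private-use code point the
// embedded font maps to the glyph; `unicode_text` is the real text (UTF-8,
// assumed to be HTML-escaped already). The class name "g" is hypothetical.
std::string emit_unclean_glyph(std::uint32_t pua,
                               const std::string &unicode_text) {
  assert(pua >= 0xE000 && pua <= 0xF8FF);  // BMP private use area only
  std::ostringstream out;
  // The PUA character lives only in the attribute, so it stays out of the
  // DOM text stream; the nested .ov span carries the searchable Unicode.
  out << "<span class=\"g\" data-g=\"&#x" << std::hex << std::uppercase << pua
      << ";\"><span class=\"ov\">" << unicode_text << "</span></span>";
  return out.str();
}
```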

Test plan

  • Full test suite (odr_test): 658 passed, 8 skipped (pre-existing), 0 failed
  • Visual comparison of PDF output in a browser vs pdf-text-selection-layer
  • Verify Ctrl+F / double-click / triple-click on a PDF with ligature glyphs
  • Verify .ov overlay find-contiguity across inline-block boundary (known trade-off; see plan §3)

andiwand and others added 2 commits June 30, 2026 19:58
Records the design discussion for a mostly single-layer text model (à la
pdf2htmlEX) as an alternative to the current dual-layer selection approach
on pdf-text-selection-layer. Clean glyphs carry real Unicode in one
findable/selectable layer positioned by gen-time margins; unclean glyphs
(ligatures / no_unicode) are painted via CSS generated content with an
overlapping transparent real-Unicode overlay. Design only; not implemented.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

Replaces the dual-layer (visual PUA + transparent selection) scheme with a
single absolutely-positioned line block per PDF line whose runs flow inline.
Clean, unambiguous glyphs carry real Unicode cmap entries and render directly
as selectable/findable DOM text. Unclean glyphs (ligatures, ambiguous cmaps,
no_unicode) are painted via CSS `content:attr(data-g)` generated content —
keeping the PUA glyph out of the DOM text stream so it can never break a
word mid-sequence for Ctrl+F or double-click — with a zero-width transparent
`.ov` overlay carrying the real Unicode alongside.

Inter-run x-position corrections are computed at generation time as
`margin-left = (pdf_x − prev_end) × pt_to_px`; the browser renders the
embedded font whose advances are known, so no runtime JS measurement is
needed. Baseline drift > 0.6 em and backward jumps > 0.5 em open a new
block. A path or image item flushes the open block first to preserve paint
order.

See src/odr/internal/pdf/SINGLE_LAYER_SELECTION_PLAN.md for the full design
discussion (committed separately as the branch's first commit).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
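
The block-break thresholds quoted in the second commit message (baseline drift > 0.6 em, backward jump > 0.5 em) can be illustrated with a small predicate. This is a sketch only: `OpenBlock` and `starts_new_block` are hypothetical names, and it assumes that "em" means the open block's font size in PDF points, which the commit message does not state.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical state of the currently open line block, in PDF points.
struct OpenBlock {
  double baseline_y;
  double prev_end_x;
  double font_size;  // assumed to define 1 em
};

// Thresholds quoted from the commit message.
constexpr double kMaxBaselineDriftEm = 0.6;
constexpr double kMaxBackwardJumpEm = 0.5;

// True when the next run should open a new block instead of flowing inline.
bool starts_new_block(const OpenBlock &block, double run_x, double run_y) {
  assert(block.font_size > 0.0);
  const double drift_em =
      std::abs(run_y - block.baseline_y) / block.font_size;
  // Positive when the run starts left of the previous run's right edge.
  const double backward_em = (block.prev_end_x - run_x) / block.font_size;
  return drift_em > kMaxBaselineDriftEm || backward_em > kMaxBackwardJumpEm;
}
```

The flush on path or image items is not shown; in this sketch it would be a separate step that closes the open block before the non-text item is emitted.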

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 0ed0ee3e8f

Comment on lines +1036 to +1040
// Gen-time gap to the previous run's right edge (signed). This is
// the run's `margin-left`: the browser flows the previous run by
// its embedded-font advance, so this reproduces the PDF x-position
// (exact when the font's `hmtx` matches the PDF `/Widths`).
margin_px = round2((ox - prev_end) * pt_to_px);

P2: Base flowed-run margins on rendered advances

When multiple PDF text segments are flowed into one line, this margin only accounts for the PDF gap measured from the previous segment's parsed /Widths advance. The browser, however, advances the previous inline run using the emitted font's own metrics (or an arbitrary fallback font when font == 0), and this change no longer absolutely positions each segment. For PDFs with non-embedded or unsupported fonts, or with embedded fonts whose hmtx differs from the PDF widths, every following segment on the same baseline is shifted by that metric mismatch. Compute the margin from the rendered/emitted advance, or keep those runs separately positioned.
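
One way the first suggestion could look, as a rough sketch rather than a patch for this PR: measure the previous run with the emitted font's `hmtx` advances and derive the margin from that end position. `EmittedFont`, `rendered_advance_pt`, and `margin_from_rendered_px` are hypothetical names, and the sketch assumes no kerning, letter-spacing, or word-spacing is applied by the browser.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical view of the emitted font's horizontal metrics.
struct EmittedFont {
  std::vector<std::uint16_t> hmtx_advance;  // per glyph id, in font units
  std::uint16_t units_per_em;
};

// Width the browser would give a run: sum of hmtx advances scaled to points.
double rendered_advance_pt(const EmittedFont &font,
                           const std::vector<std::uint16_t> &glyph_ids,
                           double font_size_pt) {
  assert(font.units_per_em > 0);
  double units = 0.0;
  for (std::uint16_t gid : glyph_ids) {
    units += font.hmtx_advance.at(gid);  // throws on an unknown glyph id
  }
  return units / font.units_per_em * font_size_pt;
}

// Margin measured from where the browser actually ends the previous run,
// rather than from the PDF /Widths advance.
double margin_from_rendered_px(double run_x_pt, double prev_x_pt,
                               double prev_rendered_advance_pt,
                               double pt_to_px) {
  return (run_x_pt - (prev_x_pt + prev_rendered_advance_pt)) * pt_to_px;
}
```

This does not help when font == 0, since a fallback font's advances are unknown at generation time; those runs would still need separate positioning.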
