PDF text: dual-layer + single-layer rendering with PdfTextMode option#579

Open
andiwand wants to merge 5 commits into main from pdf-text-selection

@andiwand commented Jul 1, 2026

Member

🤖 Generated with Claude Code

Summary

Combines the prototypes from #577 and #578 into a single implementation with a user-selectable mode.

  • Adds PdfTextMode enum to HtmlConfig (dual_layer default, single_layer opt-in)
  • Both modes use line blocks (position:absolute on the line <div>, margin-left on inline run <span>s) rather than per-glyph absolute positioning — forward-compatible with future paragraph grouping
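
A minimal sketch of how the option might look, assuming only what this description and the commit messages state: an enum with the values `dual_layer` and `single_layer`, exposed through `HtmlConfig::pdf_text_mode` with `dual_layer` as the default. The struct below is a stand-in with a single field, not the real `HtmlConfig`, and `mode_name` is a hypothetical helper added for illustration.

```cpp
#include <string_view>

// Stand-in for the enum described in this PR; the real declaration may differ.
enum class PdfTextMode {
  dual_layer,   // visual PUA layer plus transparent Unicode selection layer
  single_layer, // one combined layer with frequency-based Unicode mapping
};

// Stand-in for HtmlConfig: only the field named in the commit message is shown.
struct HtmlConfig {
  PdfTextMode pdf_text_mode{PdfTextMode::dual_layer};
};

// Hypothetical helper (not part of the PR): readable name for logging.
constexpr std::string_view mode_name(PdfTextMode mode) {
  switch (mode) {
  case PdfTextMode::dual_layer:
    return "dual_layer";
  case PdfTextMode::single_layer:
    return "single_layer";
  }
  return "unknown";
}
```

A caller would opt in with `config.pdf_text_mode = PdfTextMode::single_layer;` and leave the field untouched to keep the default.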

Dual-layer mode (PdfTextMode::dual_layer, default)

Similar approach to pdf.js:

  • Visual layer (<div class="vis" aria-hidden="true">): paint-order glyph rendering using fonts re-encoded to the Private Use Area. Invisible text (Tr 3/7) omitted.
  • Selection/search layer (<div class="sel">): transparent real-Unicode text in reading order. Runs grouped into per-baseline line blocks; gap detection inserts display:inline-block spacer spans. Each run span uses CSS text-align:justify; text-align-last:justify; text-justify:inter-character to spread characters to match the PDF advance — no JavaScript.
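
To make the selection-layer markup concrete, here is a rough sketch of how one run span could be serialized. It assumes the span is given an explicit width equal to the PDF advance so that the justify rules have something to fill; the function names, the `width` declaration, and the pixel units are illustrative assumptions, not the code in `pdf_file.cpp`.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: escape the characters that would break HTML text.
std::string escape_text(const std::string &text) {
  std::string out;
  for (char c : text) {
    switch (c) {
    case '&': out += "&amp;"; break;
    case '<': out += "&lt;"; break;
    case '>': out += "&gt;"; break;
    default: out += c;
    }
  }
  return out;
}

// Hypothetical helper: one inline run span of the selection layer.
// margin_left_px is the gap to the previous run on the same line block;
// width_px is the advance the run occupies in the PDF (assumed input).
std::string selection_run_span(const std::string &text, double margin_left_px,
                               double width_px) {
  std::ostringstream out;
  out << "<span style=\"display:inline-block;"
      << "margin-left:" << margin_left_px << "px;"
      << "width:" << width_px << "px;"
      << "text-align:justify;text-align-last:justify;"
      << "text-justify:inter-character\">"
      << escape_text(text) << "</span>";
  return out.str();
}
```

Because the browser distributes the slack between characters, the fallback font does not need to match the embedded font's metrics for the selection highlight to line up with the visual layer.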

Single-layer mode (PdfTextMode::single_layer)

Similar approach to pdf2htmlEX:

  • Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences per font across all pages, then picks the most-frequent glyph for each Unicode character as the cmap winner (the common case wins, rather than first come, first served).
  • Clean runs (all uchar→glyph pairs match the winner): real Unicode rendered directly in the embedded font — natively selectable and findable.
  • Unclean runs: glyphs painted via ::before{content:attr(data-g)} CSS generated content with a zero-width display:inline-block; overflow:hidden overlay <span> carrying the real Unicode for selection.
  • PUA-only characters (no Unicode mapping): remain visible but unselectable.
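
The pre-pass and the clean/unclean split can be sketched as follows for a single font. This is an illustration under stated assumptions rather than the PR's code: the type aliases and function names are hypothetical, a run is modeled as a list of (uchar, glyph) pairs, and ties are broken in favor of the lowest glyph id, which the description does not specify.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Unicode = char32_t;
using GlyphId = std::uint32_t;
using Run = std::vector<std::pair<Unicode, GlyphId>>; // hypothetical run model
using Winners = std::map<Unicode, GlyphId>;

// Count (uchar, glyph) co-occurrences over all runs of one font and keep the
// most frequent glyph per Unicode character. Ties go to the lowest glyph id
// (an assumption made for this sketch).
Winners pick_winners(const std::vector<Run> &runs) {
  std::map<Unicode, std::map<GlyphId, std::size_t>> counts;
  for (const Run &run : runs) {
    for (const auto &[uchar, glyph] : run) {
      ++counts[uchar][glyph];
    }
  }
  Winners winners;
  for (const auto &[uchar, per_glyph] : counts) {
    GlyphId best = 0;
    std::size_t best_count = 0;
    for (const auto &[glyph, count] : per_glyph) {
      if (count > best_count) {
        best = glyph;
        best_count = count;
      }
    }
    winners[uchar] = best;
  }
  return winners;
}

// A run is clean when every (uchar, glyph) pair matches the winner, so the
// real Unicode text can be rendered directly in the embedded font.
bool is_clean(const Run &run, const Winners &winners) {
  for (const auto &[uchar, glyph] : run) {
    const auto it = winners.find(uchar);
    if (it == winners.end() || it->second != glyph) {
      return false;
    }
  }
  return true;
}
```

Runs for which `is_clean` returns false would take the generated-content path with the overlay span, as described in the list above.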

Test plan

  • Build passes, all 658 tests pass
  • Dual-layer output (style-various-1.pdf): class="vis" aria-hidden + class="sel" divs present; visual spans contain PUA bytes; selection spans contain readable Unicode
  • Single-layer output (--single flag on CLI): gl + ov classes present; data-g attributes contain PUA bytes; inline text contains readable Unicode
  • Both modes render correctly in the browser
  • Text selection and find-in-page work in both modes
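
The marker checks in the test plan amount to substring assertions on the generated HTML. A small sketch of such a check is shown below; `contains_all` is a hypothetical helper, and the exact attribute strings in the usage note are assumptions about how the classes are serialized.

```cpp
#include <initializer_list>
#include <string>
#include <string_view>

// Hypothetical helper: true when every marker occurs somewhere in the HTML.
bool contains_all(const std::string &html,
                  std::initializer_list<std::string_view> markers) {
  for (std::string_view marker : markers) {
    if (html.find(marker) == std::string::npos) {
      return false;
    }
  }
  return true;
}
```

For dual-layer output one might call `contains_all(html, {"class=\"vis\"", "class=\"sel\""})`; for single-layer output the markers would be the `gl` and `ov` classes and the `data-g` attribute.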

andiwand and others added 2 commits July 1, 2026 19:06
Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode
  selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode
  mapping, similar to pdf2htmlEX.

The active mode is controlled by `HtmlConfig::pdf_text_mode`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

Replaces the single-glyph-per-absolute-span approach with two modes,
both using line blocks (position:absolute on the line div, margin-left
on inline run spans) instead of per-glyph absolute positioning.

Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph
  rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text.
  Runs grouped into line blocks by baseline; space detection inserts
  gap spans. Each run span is display:inline-block with CSS justify
  (text-align:justify; text-align-last:justify; text-justify:inter-
  character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.

Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences
  per font, then picks the most-frequent glyph as the cmap entry —
  so the common case wins, not first-come-first-serve.
- Clean runs (all uchar→glyph pairs match the winner) render the real
  Unicode directly in the embedded font — natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with
  a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74f51ee76f

Comment thread src/odr/internal/html/pdf_file.cpp
Comment thread src/odr/internal/html/pdf_file.cpp Outdated
andiwand and others added 3 commits July 1, 2026 22:29
Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`,
`escape_markup`) and a template `handle_graphic_element` replace the
copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs).
The single-layer `add_class` captures `styles` from scope to match the
dual-layer signature; `AtomicStyles styles` is moved up before the pre-
pass so the capture is valid.

Two dual-layer correctness fixes (from code-review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero,
  so embedded glyphs space correctly for PDFs with custom char/word
  spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so
  invisible/clip-mode runs do not shift the next visible run's position.
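
As an illustration of the first fix, the spacing declarations could be emitted only when the values are non-zero. The sketch assumes Tc and Tw have already been converted from text-space units to CSS pixels; `spacing_decls` and its parameters are hypothetical names, not the ones used in the commit.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: CSS declarations for a visual run's character and word
// spacing. Inputs are assumed to be pre-scaled to CSS pixels.
std::string spacing_decls(double tc_px, double tw_px) {
  std::ostringstream out;
  if (tc_px != 0.0) {
    out << "letter-spacing:" << tc_px << "px;";
  }
  if (tw_px != 0.0) {
    out << "word-spacing:" << tw_px << "px;";
  }
  return out.str();
}
```

Skipping the zero case keeps the common output unchanged, so only PDFs that actually set Tc or Tw get the extra declarations.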

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

Adds a standalone test that translates style-various-1.pdf through both
dual_layer and single_layer modes and asserts the output document.html
contains the expected marker classes (vis+sel for dual, line-block t
for single). Prevents silent regressions if a mode is broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq