PDF text: dual-layer + single-layer rendering with PdfTextMode option by andiwand · Pull Request #579 · opendocument-app/OpenDocument.core

andiwand · 2026-07-01T19:26:10Z

🤖 Generated with Claude Code

Summary

Combines the prototypes from #577 and #578 into a single implementation with a user-selectable mode.

- Adds a PdfTextMode enum to HtmlConfig (dual_layer default, single_layer opt-in)
- Both modes use line blocks (position:absolute on the line <div>, margin-left on inline run <span>s) rather than per-glyph absolute positioning, which stays forward-compatible with future paragraph grouping
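For readers who have not opened the diff, the option could look roughly like the sketch below. Only `PdfTextMode`, its two values, the `dual_layer` default, and the field name `pdf_text_mode` come from this PR; `HtmlConfigSketch` and `to_string` are hypothetical stand-ins, and the real `HtmlConfig` has many more fields.

```cpp
#include <string>

// Assumed shape of the enum; the PR may declare it differently.
enum class PdfTextMode {
  dual_layer,   // visual PUA layer plus transparent Unicode selection layer
  single_layer, // one combined layer with frequency-based Unicode mapping
};

// Hypothetical stand-in for HtmlConfig, reduced to the one new field.
struct HtmlConfigSketch {
  PdfTextMode pdf_text_mode{PdfTextMode::dual_layer}; // dual_layer is the default
};

// Illustration-only helper that names a mode, e.g. for log output.
inline std::string to_string(PdfTextMode mode) {
  switch (mode) {
  case PdfTextMode::dual_layer:
    return "dual_layer";
  case PdfTextMode::single_layer:
    return "single_layer";
  }
  return "unknown";
}
```

A caller would opt in with `config.pdf_text_mode = PdfTextMode::single_layer;` and leave the field untouched to get dual-layer output.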

Dual-layer mode (`PdfTextMode::dual_layer`, default)

Similar approach to pdf.js:

- Visual layer (<div class="vis" aria-hidden="true">): paint-order glyph rendering using fonts re-encoded to the Private Use Area. Invisible text (Tr 3/7) omitted.
- Selection/search layer (<div class="sel">): transparent real-Unicode text in reading order. Runs grouped into per-baseline line blocks; gap detection inserts display:inline-block spacer spans. Each run span uses CSS text-align:justify; text-align-last:justify; text-justify:inter-character to spread characters to match the PDF advance, with no JavaScript.
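The line grouping and gap detection for the selection layer can be sketched as follows. This is a minimal illustration, not the code in `pdf_file.cpp`: the `Run` and `Line` structs, the function names, the baseline tolerance, and the gap threshold are all assumptions, and coordinates are taken to be in px with y growing downward.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct Run {        // hypothetical: one text run with page geometry
  std::string text; // real Unicode (UTF-8)
  double x;         // left edge in px
  double baseline;  // baseline y in px
  double width;     // PDF advance in px
};

struct Line {
  double baseline;
  std::vector<Run> runs;
};

// Groups runs whose baseline is within `tolerance` of the line's first run,
// then orders each line left to right.
std::vector<Line> group_lines(std::vector<Run> runs, double tolerance = 0.5) {
  std::stable_sort(runs.begin(), runs.end(), [](const Run &a, const Run &b) {
    return a.baseline < b.baseline;
  });
  std::vector<Line> lines;
  for (Run &run : runs) {
    if (lines.empty() ||
        std::abs(run.baseline - lines.back().baseline) > tolerance) {
      lines.push_back(Line{run.baseline, {}});
    }
    lines.back().runs.push_back(std::move(run));
  }
  for (Line &line : lines) {
    std::stable_sort(line.runs.begin(), line.runs.end(),
                     [](const Run &a, const Run &b) { return a.x < b.x; });
  }
  return lines;
}

// Returns, for each pair of neighbouring runs, the spacer width in px,
// or 0 when the gap is too small to need a spacer span.
std::vector<double> spacer_widths(const Line &line, double min_gap) {
  std::vector<double> widths;
  for (std::size_t i = 1; i < line.runs.size(); ++i) {
    const Run &prev = line.runs[i - 1];
    const double gap = line.runs[i].x - (prev.x + prev.width);
    widths.push_back(gap > min_gap ? gap : 0.0);
  }
  return widths;
}
```

Each non-zero entry would correspond to one display:inline-block spacer span between two run spans in the emitted line <div>.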

Single-layer mode (`PdfTextMode::single_layer`)

Similar approach to pdf2htmlEX:

- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences per font across all pages, then picks the most-frequent glyph for each Unicode character as the cmap winner (common case wins, not first-come-first-serve).
- Clean runs (all uchar→glyph pairs match the winner): real Unicode rendered directly in the embedded font, natively selectable and findable.
- Unclean runs: glyphs painted via ::before{content:attr(data-g)} CSS generated content with a zero-width display:inline-block; overflow:hidden overlay <span> carrying the real Unicode for selection.
- PUA-only characters (no Unicode mapping): remain visible but unselectable.
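The pre-pass and the clean-run test can be sketched like this for a single font. The type alias, the function names, and the tie-break (lower glyph id wins) are assumptions made for the illustration; the PR does not state how ties are resolved.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// (uchar, glyph id) as observed in the content stream; hypothetical types.
using CharGlyph = std::pair<char32_t, std::uint16_t>;
using Winners = std::map<char32_t, std::uint16_t>;

// Counts co-occurrences and keeps, per Unicode character, the glyph that
// was seen most often. Ties go to the lower glyph id in this sketch.
Winners pick_winners(const std::vector<CharGlyph> &observed) {
  std::map<CharGlyph, std::size_t> counts;
  for (const CharGlyph &pair : observed) {
    ++counts[pair];
  }
  Winners winners;
  std::map<char32_t, std::size_t> best;
  for (const auto &entry : counts) {
    const char32_t uchar = entry.first.first;
    const auto it = best.find(uchar);
    if (it == best.end() || entry.second > it->second) {
      best[uchar] = entry.second;
      winners[uchar] = entry.first.second;
    }
  }
  return winners;
}

// A run is clean when every (uchar, glyph) pair matches the winner, so the
// real Unicode can be emitted directly in the embedded font.
bool is_clean(const std::vector<CharGlyph> &run, const Winners &winners) {
  for (const CharGlyph &pair : run) {
    const auto it = winners.find(pair.first);
    if (it == winners.end() || it->second != pair.second) {
      return false;
    }
  }
  return true;
}
```

With first-come-first-serve, a single early occurrence of a rare glyph variant would claim the cmap slot and push every later common-case run onto the overlay path; counting first avoids that.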

Test plan

- Build passes, all 658 tests pass
- Dual-layer output (style-various-1.pdf): class="vis" aria-hidden + class="sel" divs present; visual spans contain PUA bytes; selection spans contain readable Unicode
- Single-layer output (--single flag on CLI): gl + ov classes present; data-g attributes contain PUA bytes; inline text contains readable Unicode
- Both modes render visually correct in browser
- Text selection and find-in-page work in both modes

Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode mapping, similar to pdf2htmlEX.

The active mode is controlled by `HtmlConfig::pdf_text_mode`.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

Replaces the single-glyph-per-absolute-span approach with two modes, both using line blocks (position:absolute on the line div, margin-left on inline run spans) instead of per-glyph absolute positioning.

Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text. Runs grouped into line blocks by baseline; space detection inserts gap spans. Each run span is display:inline-block with CSS justify (text-align:justify; text-align-last:justify; text-justify:inter-character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.

Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences per font, then picks the most-frequent glyph as the cmap entry, so the common case wins, not first-come-first-serve.
- Clean runs (all uchar→glyph pairs match the winner) render the real Unicode directly in the embedded font and are natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 74f51ee76f

Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`, `escape_markup`) and a template `handle_graphic_element` replace the copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs).

The single-layer `add_class` captures `styles` from scope to match the dual-layer signature; `AtomicStyles styles` is moved up before the pre-pass so the capture is valid.

Two dual-layer correctness fixes (from code-review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero, so embedded glyphs space correctly for PDFs with custom char/word spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so invisible/clip-mode runs do not shift the next visible run's position.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
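The first correctness fix can be pictured with the sketch below. It is not the PR's code: `px_decl_sketch` only guesses at what a helper like `px_decl` might do, `spacing_style` is a hypothetical name, and both spacing values are assumed to be already converted from PDF text-space units to px.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: formats one CSS declaration as "name:<value>px;".
std::string px_decl_sketch(const std::string &name, double value_px) {
  std::ostringstream out;
  out << name << ':' << value_px << "px;";
  return out.str();
}

// Emits letter-spacing / word-spacing only when the PDF character spacing
// (Tc) or word spacing (Tw) is non-zero, so default runs carry no extra CSS.
std::string spacing_style(double tc_px, double tw_px) {
  std::string style;
  if (tc_px != 0.0) {
    style += px_decl_sketch("letter-spacing", tc_px);
  }
  if (tw_px != 0.0) {
    style += px_decl_sketch("word-spacing", tw_px);
  }
  return style;
}
```

Skipping the declarations for zero values keeps the common case free of per-run inline style noise.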

Adds a standalone test that translates style-various-1.pdf through both dual_layer and single_layer modes and asserts the output document.html contains the expected marker classes (vis+sel for dual, line-block t for single). Prevents silent regressions if a mode is broken.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
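A marker check of that kind could be as small as the sketch below, shown for the dual-layer case only. The function names are hypothetical and the exact attribute strings are assumed from the PR description; the actual test may match differently.

```cpp
#include <string_view>

// True when `marker` occurs anywhere in the generated HTML.
bool contains(std::string_view html, std::string_view marker) {
  return html.find(marker) != std::string_view::npos;
}

// Dual-layer output is expected to carry both the visual and the
// selection layer; the strings matched here are assumptions.
bool looks_dual_layer(std::string_view html) {
  return contains(html, "class=\"vis\"") && contains(html, "class=\"sel\"");
}
```

A substring check is coarse, but it is enough to catch a mode that silently emits nothing.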

andiwand and others added 2 commits July 1, 2026 19:06

chatgpt-codex-connector Bot reviewed Jul 1, 2026

Comment thread src/odr/internal/html/pdf_file.cpp

Comment thread src/odr/internal/html/pdf_file.cpp Outdated

andiwand and others added 3 commits July 1, 2026 22:29

revert test

d1b4527

andiwand wants to merge 5 commits into main from pdf-text-selection
