PDF text: dual-layer + single-layer rendering with PdfTextMode option#579
Open
andiwand wants to merge 5 commits into
Introduces a `PdfTextMode` enum with two values:
- `dual_layer`: visual (PUA glyphs, paint order) + transparent Unicode selection/search layer. Default.
- `single_layer`: single combined layer with frequency-based Unicode mapping, similar to pdf2htmlEX.
The active mode is controlled by `HtmlConfig::pdf_text_mode`.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
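For orientation, a minimal sketch of what the option might look like. The enum values and the `pdf_text_mode` member name come from the commit message; the rest of `HtmlConfig` is omitted, and `to_string` is a hypothetical helper added only for illustration.

```cpp
#include <string_view>

// Stand-in declarations: the real HtmlConfig carries more fields than this.
enum class PdfTextMode {
  dual_layer,    // visual PUA layer + transparent Unicode selection layer
  single_layer,  // one combined layer with frequency-based Unicode mapping
};

struct HtmlConfig {
  // dual_layer is the default according to the commit message.
  PdfTextMode pdf_text_mode = PdfTextMode::dual_layer;
};

// Hypothetical helper (not from the PR): name a mode for logs or CLI output.
constexpr std::string_view to_string(PdfTextMode mode) {
  switch (mode) {
    case PdfTextMode::dual_layer: return "dual_layer";
    case PdfTextMode::single_layer: return "single_layer";
  }
  return "unknown";
}
```

A caller would opt in to the single-layer path by setting `config.pdf_text_mode = PdfTextMode::single_layer` before translating.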
Replaces the single-glyph-per-absolute-span approach with two modes,
both using line blocks (position:absolute on the line div, margin-left
on inline run spans) instead of per-glyph absolute positioning.
Dual-layer mode (default, PdfTextMode::dual_layer):
- Visual layer (<div class="vis" aria-hidden>): paint-order glyph
rendering. Fonts re-encoded to PUA. Invisible text omitted.
- Selection layer (<div class="sel">): transparent real-Unicode text.
Runs grouped into line blocks by baseline; space detection inserts
gap spans. Each run span is display:inline-block with CSS justify
(text-align:justify; text-align-last:justify; text-justify:inter-
character) so characters fill the PDF advance without JavaScript.
- Similar approach to pdf.js.
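The line-block grouping and gap detection described above could be sketched roughly as follows. `Run`, `LineBlock`, `group_into_lines`, `gap_before`, and the 0.5 px baseline tolerance are all assumptions made for this illustration, not names or values taken from the PR.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical run record: left edge, baseline and advance width in px.
struct Run {
  double x;
  double baseline;
  double width;
  std::string text;
};

struct LineBlock {
  double baseline;        // baseline of the first run placed in this line
  std::vector<Run> runs;  // sorted left to right
};

// Group runs whose baseline lies within `tolerance` px of the line's first
// run, then order each line's runs by their left edge.
std::vector<LineBlock> group_into_lines(std::vector<Run> runs,
                                        double tolerance = 0.5) {
  std::stable_sort(runs.begin(), runs.end(), [](const Run &a, const Run &b) {
    return a.baseline < b.baseline;
  });
  std::vector<LineBlock> lines;
  for (Run &run : runs) {
    if (lines.empty() ||
        std::abs(run.baseline - lines.back().baseline) > tolerance) {
      lines.push_back(LineBlock{run.baseline, {}});
    }
    lines.back().runs.push_back(std::move(run));
  }
  for (LineBlock &line : lines) {
    std::stable_sort(line.runs.begin(), line.runs.end(),
                     [](const Run &a, const Run &b) { return a.x < b.x; });
  }
  return lines;
}

// Horizontal gap between run `i` and the end of the previous run in the
// line. The first run has no gap; the line div carries its offset instead.
double gap_before(const LineBlock &line, std::size_t i) {
  if (i == 0) return 0.0;
  const Run &prev = line.runs[i - 1];
  return line.runs[i].x - (prev.x + prev.width);
}
```

A positive `gap_before` value is where a spacer span (or a `margin-left` on the run span) would go; how large a gap must be before it counts as a space is left open here.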
Single-layer mode (PdfTextMode::single_layer):
- One combined layer per page in paint order.
- Pre-pass frequency analysis: counts (uchar, glyph) co-occurrences
per font, then picks the most frequent glyph for each Unicode
character as the cmap entry, so the common case wins rather than
the first mapping seen.
- Clean runs (all uchar→glyph pairs match the winner) render the real
Unicode directly in the embedded font — natively selectable.
- Unclean runs paint glyphs via ::before{content:attr(data-g)} with
a zero-width display:inline-block overlay span for selectability.
- PUA-only chars (no Unicode mapping) remain visible but unselectable.
- Similar approach to pdf2htmlEX.
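The pre-pass frequency analysis and the clean/unclean run check described above can be sketched as below. `GlyphFrequency`, `is_clean_run`, and the type aliases are hypothetical names for this illustration, and the tie-break (lowest glyph id wins) is an arbitrary choice of the sketch, not something the PR states.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using FontId = int;           // hypothetical font handle
using Uchar = char32_t;       // Unicode code point
using Glyph = std::uint32_t;  // glyph index in the embedded font
using Pair = std::pair<Uchar, Glyph>;

class GlyphFrequency {
 public:
  // Record one (uchar, glyph) co-occurrence for a font.
  void count(FontId font, Uchar uchar, Glyph glyph) {
    ++counts_[font][uchar][glyph];
  }

  // Most frequent glyph per Unicode character for one font.
  // Ties resolve to the lowest glyph id because of the strict comparison.
  std::map<Uchar, Glyph> winners(FontId font) const {
    std::map<Uchar, Glyph> result;
    auto it = counts_.find(font);
    if (it == counts_.end()) return result;
    for (const auto &[uchar, glyphs] : it->second) {
      std::size_t best = 0;
      for (const auto &[glyph, n] : glyphs) {
        if (n > best) {
          best = n;
          result[uchar] = glyph;
        }
      }
    }
    return result;
  }

 private:
  std::map<FontId, std::map<Uchar, std::map<Glyph, std::size_t>>> counts_;
};

// A run is clean when every (uchar, glyph) pair matches the winner mapping.
bool is_clean_run(const std::map<Uchar, Glyph> &winners,
                  const std::vector<Pair> &run) {
  for (const auto &[uchar, glyph] : run) {
    auto it = winners.find(uchar);
    if (it == winners.end() || it->second != glyph) return false;
  }
  return true;
}
```

Counting first and choosing afterwards is what separates this from a first-come mapping: a rare alternate glyph seen early in the document cannot claim the cmap slot.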
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 74f51ee76f
Shared static methods (`px_decl`, `ascent_em`, `glyph_run_str`, `escape_markup`) and a template `handle_graphic_element` replace the copy-pasted lambdas in both rendering modes (-60 lines, cleaner diffs).
The single-layer `add_class` captures `styles` from scope to match the dual-layer signature; `AtomicStyles styles` is moved up before the pre-pass so the capture is valid.
Two dual-layer correctness fixes (from code review):
- Add letter-spacing/word-spacing to visual runs when Tc/Tw are non-zero, so embedded glyphs space correctly for PDFs with custom char/word spacing.
- Move vis_prev_* state updates inside the `if (!invisible)` block so invisible/clip-mode runs do not shift the next visible run's position.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
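The two dual-layer fixes can be illustrated with a small sketch. `spacing_decls`, `VisualCursor`, and `place_run` are hypothetical names, and the spacing values are assumed to be already converted from Tc/Tw to CSS pixels; the PR's actual helpers and unit handling may differ.

```cpp
#include <sstream>
#include <string>

// Emit CSS spacing declarations only for non-zero values, so runs without
// custom character/word spacing carry no extra style.
std::string spacing_decls(double char_spacing_px, double word_spacing_px) {
  std::ostringstream out;
  if (char_spacing_px != 0.0) {
    out << "letter-spacing:" << char_spacing_px << "px;";
  }
  if (word_spacing_px != 0.0) {
    out << "word-spacing:" << word_spacing_px << "px;";
  }
  return out.str();
}

// Tracks where the previous visible run ended within the current line.
struct VisualCursor {
  double prev_end_x = 0.0;
};

// Returns the margin-left for a run. The cursor advances only for visible
// runs, so an invisible run cannot shift the next visible one.
double place_run(VisualCursor &cursor, double x, double width,
                 bool invisible) {
  if (invisible) return 0.0;  // omitted from the visual layer; state untouched
  const double margin = x - cursor.prev_end_x;
  cursor.prev_end_x = x + width;
  return margin;
}
```

If the cursor update sat outside the visibility check, the third run in a visible/invisible/visible sequence would be positioned relative to text that was never emitted.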
Adds a standalone test that translates style-various-1.pdf through both dual_layer and single_layer modes and asserts the output document.html contains the expected marker classes (vis+sel for dual, line-block t for single). Prevents silent regressions if a mode is broken.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01Mq2d2eFjjCL8cHpU9pHugq
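The marker check in such a test might boil down to something like the following. `contains_all` is a hypothetical helper, and the marker strings a caller passes are its own choice; the PR's test may be structured differently.

```cpp
#include <initializer_list>
#include <string_view>

// True when every marker substring occurs somewhere in the HTML output.
bool contains_all(std::string_view html,
                  std::initializer_list<std::string_view> markers) {
  for (std::string_view marker : markers) {
    if (html.find(marker) == std::string_view::npos) return false;
  }
  return true;
}
```

For the dual-layer output, a test could load `document.html` into a string and call `contains_all(html, {"class=\"vis\"", "class=\"sel\""})`.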
🤖 Generated with Claude Code
Summary
Combines the prototypes from #577 and #578 into a single implementation with a user-selectable mode.
- Adds a `PdfTextMode` enum to `HtmlConfig` (`dual_layer` default, `single_layer` opt-in)
- Both modes use line blocks (`position:absolute` on the line `<div>`, `margin-left` on inline run `<span>`s) rather than per-glyph absolute positioning, which is forward-compatible with future paragraph grouping
Dual-layer mode (`PdfTextMode::dual_layer`, default)
Similar approach to pdf.js:
- Visual layer (`<div class="vis" aria-hidden="true">`): paint-order glyph rendering using fonts re-encoded to the Private Use Area. Invisible text (Tr 3/7) omitted.
- Selection layer (`<div class="sel">`): transparent real-Unicode text in reading order. Runs grouped into per-baseline line blocks; gap detection inserts `display:inline-block` spacer spans. Each run span uses CSS `text-align:justify; text-align-last:justify; text-justify:inter-character` to spread characters to match the PDF advance, with no JavaScript.
Single-layer mode (`PdfTextMode::single_layer`)
Similar approach to pdf2htmlEX:
- Pre-pass frequency analysis: counts `(uchar, glyph)` co-occurrences per font across all pages, then picks the most frequent glyph for each Unicode character as the cmap winner (common case wins, not first-come-first-serve).
- Unclean runs paint glyphs via `::before{content:attr(data-g)}` CSS generated content, with a zero-width `display:inline-block; overflow:hidden` overlay `<span>` carrying the real Unicode for selection.
Test plan
- Dual-layer mode (`style-various-1.pdf`): `class="vis" aria-hidden` + `class="sel"` divs present; visual spans contain PUA bytes; selection spans contain readable Unicode
- Single-layer mode (`--single` flag on CLI): `gl` + `ov` classes present; `data-g` attributes contain PUA bytes; inline text contains readable Unicode