perf(bbox-worker): refine a multi-document extraction's documents concurrently (26.6.14) #39

Merged
casc84ab merged 1 commit into main from feat/parallel-bbox-refine-worker on Jun 17, 2026

@casc84ab


Fans out the per-document bbox refine of a multi-document extraction with a bounded asyncio.Semaphore + gather instead of refining one document at a time.

  • New setting bbox_refine_doc_concurrency (env FLYDOCS_BBOX_REFINE_DOC_CONCURRENCY, default 4) bounds the fan-out; size it to the bbox-worker pod CPU limit.
  • gather is the barrier, so the completion + webhook fire exactly once, after every document settles; it runs with return_exceptions=True and the first error is then re-raised (preserving retry/permanent-failure semantics).
  • Validated end-to-end on a 7-document bastanteo: max concurrency 4, ~3.85x speedup on the refine leg, all 7 documents grounded.
  • Unit tests cover overlap, semaphore cap, completion barrier and single-document failure isolation.

Bumps version to 26.6.14.

…currently (26.6.14)

The bbox-refine post-processing leg refined documents in a strictly
sequential `for` loop, so a cross-document extraction paid the sum of
every document's OCR + per-page-matcher latency. A 5-document, ~290-page
job ran ~700s and blew the 600s `bbox_refine_timeout_s`, timing out on
all three attempts and re-doing every document from scratch each retry.

Documents are independent (each mutates its own `field_groups`) and the
refiner already offloads its CPU-bound word collection to a thread, so the
leg now fans out one task per document through `asyncio.gather`, bounded by
an `asyncio.Semaphore`. Wall-clock drops from the sum of per-doc latencies
to roughly that of the slowest document when the document count fits within
the concurrency cap.

- New `FLYDOCS_BBOX_REFINE_DOC_CONCURRENCY` (default 4) caps the fan-out:
  OCR is CPU-bound and each doc multiplies in-flight LLM calls.
- `gather` is the barrier that keeps "the last document finished"
  well-defined — the completion transition + post-processing webhook fire
  exactly once, after every document has settled.
- Error semantics unchanged: one document's failure still fails the leg
  (via `return_exceptions=True` + re-raise), leaving no sibling task
  running mid-flight.

Tests: concurrency happens, is capped by the setting, completion waits for
every document, and a single-document failure fails the job without
stranding siblings.
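
One way the "capped by the setting" and "completion waits for every document" checks could look, as a sketch only: `measure_peak` and its counters are hypothetical and stand in for the real refiner and test suite, which are not shown here.

```python
import asyncio


async def measure_peak(n_docs: int, cap: int):
    """Run n_docs fake refines under a semaphore; return (peak in-flight, completed)."""
    semaphore = asyncio.Semaphore(cap)
    in_flight = 0
    peak = 0
    done = 0

    async def fake_refine() -> None:
        nonlocal in_flight, peak, done
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stands in for OCR + matcher latency
            in_flight -= 1
            done += 1

    # gather returns only after every fake refine has finished.
    await asyncio.gather(*(fake_refine() for _ in range(n_docs)))
    return peak, done


if __name__ == "__main__":
    print(asyncio.run(measure_peak(7, 4)))
```

With 7 documents and a cap of 4, the observed peak equals the cap and all 7 complete before `gather` returns.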
@casc84ab casc84ab merged commit f0c015c into main Jun 17, 2026
8 checks passed
@casc84ab casc84ab deleted the feat/parallel-bbox-refine-worker branch June 17, 2026 16:09