Don't store resources content in memory - stream to storage (fixes #386)#643

Open
aivus wants to merge 6 commits into master from issue-386-streams

@aivus aivus commented Jul 3, 2026

Member

Fixes #386

Reworks the scraper around streams so resource content is no longer stored in memory. Memory usage is now bounded by requestConcurrency instead of total website size.
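
To illustrate why holding a queue slot for the whole transfer bounds the number of bodies in flight, here is a minimal sketch. `createLimiter` and `simulate` are hypothetical helpers written for this example; they are not the scraper's actual request queue, and the timer merely stands in for streaming a body.

```javascript
// Hypothetical helper (not the library's queue): runs at most `limit` tasks at once.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= limit || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    // The slot is released only when the task's promise settles, i.e. after
    // the whole body has been transferred, not when the headers arrive.
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };

  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}

// Simulated transfers: records the highest number of transfers in flight at once.
async function simulate(total, limit) {
  const run = createLimiter(limit);
  let inFlight = 0;
  let peak = 0;
  const transfer = async () => {
    inFlight++;
    peak = Math.max(peak, inFlight);
    await new Promise((r) => setTimeout(r, 5)); // stands in for streaming a body
    inFlight--;
  };
  await Promise.all(Array.from({ length: total }, () => run(transfer)));
  return peak;
}
```

With 20 simulated transfers and a limit of 8, `simulate(20, 8)` never observes more than 8 transfers in flight, regardless of how many resources are queued.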

Benchmark

npm run benchmark:memory (new script) — local server, 100 files × 20 MB (2 GB total), concurrency 8:

| | v6.0.0 | this PR |
| --- | --- | --- |
| peak RSS | 2274 MB | 160 MB |
| wall time | 2.6 s | 1.6 s |

How it works

  • lib/request.js uses got.stream and resolves at response-headers time with {url, statusCode, mimeType, encoding, metadata, stream}. A retry-event wrapper keeps got's transient-error retries working for streams, but only until the headers arrive: once the body is streaming, a connection that dies mid-body cannot be retried.
  • lib/scraper.js: resources which need no modification (images, fonts, media, js — the bulk of bytes) are piped straight to storage at response time, while still holding the request-queue slot so requestConcurrency keeps bounding concurrent transfers. HTML/CSS still need full buffering for link rewriting (cheerio re-serializes the whole document), but their text is freed immediately after save.
  • This answers the open question in the issue about references to not-yet-downloaded pages: getReference only needs the child's filename, and filenames are derivable from response headers (mime → type → generateFilename) before the body is consumed.
  • Redirect dedup resolves to the already-requested resource outside the queue task (awaiting inside would deadlock at requestConcurrency: 1; covered by a regression test).
  • The FS plugin pipes resource.getContentStream() to disk and removes partially written files on mid-stream failure, so ignoreErrors: true never leaves partial garbage.
  • For multiple saveResource actions on a streamed resource, content is buffered once so every storage can read it (a tee with in-series actions would deadlock on backpressure). Still strictly better than v6, which always buffered everything.
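
The "pipe to disk and remove partially written files" behavior described for the FS plugin can be sketched roughly as follows. This is not the plugin's actual code: `saveStreamToFile` and `failingBody` are hypothetical names, and the sketch only assumes Node's built-in `pipeline` from `node:stream/promises`.

```javascript
const fs = require('node:fs');
const path = require('node:path');
const { Readable } = require('node:stream');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: stream `source` into `filename`. If the transfer fails,
// delete whatever was written so far and rethrow the original error.
async function saveStreamToFile(source, filename) {
  await fs.promises.mkdir(path.dirname(filename), { recursive: true });
  try {
    await pipeline(source, fs.createWriteStream(filename));
  } catch (err) {
    // force: true makes this a no-op if the file was never created.
    await fs.promises.rm(filename, { force: true });
    throw err;
  }
}

// A body that fails after its first chunk, standing in for a dropped connection.
function failingBody() {
  return Readable.from((async function* () {
    yield Buffer.from('partial data');
    await new Promise((r) => setTimeout(r, 10));
    throw new Error('connection lost mid-body');
  })());
}
```

Calling `saveStreamToFile(failingBody(), target)` rejects with the stream's error and leaves no file at `target`, which is the property that matters when `ignoreErrors: true` lets the scrape continue.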

Breaking changes (v7, documented in MIGRATION.md)

  • saveResource actions: resource.getText() → resource.getContentStream() (single-use for streamed resources; await buffer(...) from node:stream/consumers for whole-content storages)
  • afterResponse receives {response: {url, statusCode, headers, getBody()}} instead of the full got response. String returns removed; returning an object without body keeps the resource streaming (zero buffering)
  • scrape() result no longer carries content — getText() is null after save (also in onResourceSaved)
  • generateFilename's responseData no longer contains body
  • Binary files are written byte-for-byte from the network (previously round-tripped through a latin1 string) — bug-fix-class difference
  • responseType in the request option is ignored
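
For a storage that needs the whole content at once, a saveResource action under the new contract might look like the sketch below. `InMemoryStoragePlugin` and `fakeResource` are hypothetical names for illustration; the fake object only mimics the two methods used here and is not the real Resource class.

```javascript
const { Readable } = require('node:stream');
const { buffer } = require('node:stream/consumers');

// Hypothetical stand-in for a streamed resource. Only getFilename() and
// getContentStream() are modeled; the real Resource class may differ.
function fakeResource(filename, chunks) {
  const stream = Readable.from(chunks);
  return {
    getFilename: () => filename,
    getContentStream: () => stream, // single-use: the same stream on every call
  };
}

// A whole-content storage, modeled here as a Map from filename to Buffer.
class InMemoryStoragePlugin {
  constructor() {
    this.saved = new Map();
  }

  apply(registerAction) {
    registerAction('saveResource', async ({ resource }) => {
      // buffer() drains the stream once and resolves with its bytes.
      const content = await buffer(resource.getContentStream());
      this.saved.set(resource.getFilename(), content);
    });
  }
}
```

Because the stream is single-use, the action reads it exactly once and keeps the resulting Buffer if it needs the bytes again.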

website-scraper-puppeteer / website-scraper-phantom will need mechanical updates for the new afterResponse contract (response.body → await response.getBody()).
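
As a rough sketch of that mechanical update, an afterResponse handler could be adapted as shown below. Several details are assumptions rather than confirmed API: that returning null still skips the resource as in earlier versions, that getBody() resolves to a string or Buffer, and the metadata shape. `SkipNotFoundPlugin` is a hypothetical name.

```javascript
// Sketch of an afterResponse handler under the new contract described above.
// Assumptions: returning null skips the resource, getBody() resolves to a
// string or Buffer, and the metadata object is purely illustrative.
class SkipNotFoundPlugin {
  apply(registerAction) {
    registerAction('afterResponse', async ({ response }) => {
      if (response.statusCode === 404) {
        return null;
      }

      const contentType = response.headers['content-type'] || '';
      if (contentType.startsWith('text/html')) {
        // Before: const body = response.body;
        const body = await response.getBody();
        // Example transformation: strip HTML comments before saving.
        const cleaned = String(body).replace(/<!--[\s\S]*?-->/g, '').trim();
        return { body: cleaned, metadata: { contentType } };
      }

      // No body in the result: getBody() is never called, so the resource
      // can keep streaming without being buffered.
      return { metadata: { contentType } };
    });
  }
}
```

The handler only calls getBody() for responses it actually needs to modify, which keeps everything else on the zero-buffering path.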

Testing

  • 262 unit + functional tests passing (npm test), eslint clean
  • New test/functional/streaming suite: byte-fidelity of multi-MB binaries, content freed after save, both afterResponse paths, multiple saveResource storages, unconsumed streams, mid-stream failures (via a real local HTTP server — nock can't simulate a connection dying mid-body), redirect-to-already-requested-URL at requestConcurrency: 1
  • New test/unit/plugins.test.js for SaveResourceToFileSystemPlugin
  • E2E (npm run test-e2e): 20+ real sites pass; the only failure is antipenko.pp.ua, which is currently stuck in a server-side 301 self-redirect loop (fails on master too)
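
A mid-stream failure of the kind mentioned above can be reproduced with a real local server roughly as follows. This is an illustrative sketch, not the PR's test code: `startFlakyServer` and `download` are hypothetical helpers built on `node:http` only.

```javascript
const http = require('node:http');

// Starts a server that promises 1000 bytes, sends 100, then destroys the socket.
function startFlakyServer() {
  const server = http.createServer((req, res) => {
    res.writeHead(200, {
      'Content-Type': 'application/octet-stream',
      'Content-Length': '1000',
    });
    res.write(Buffer.alloc(100), () => res.socket.destroy());
  });
  return new Promise((resolve) => {
    server.listen(0, '127.0.0.1', () => resolve(server));
  });
}

// Downloads the URL and resolves with { statusCode, received, complete }.
// `complete` is false when the connection closed before the full body arrived.
function download(url) {
  return new Promise((resolve) => {
    const req = http.get(url, (res) => {
      let received = 0;
      res.on('data', (chunk) => { received += chunk.length; });
      res.on('error', () => {}); // the failure is reported through res.complete
      res.on('close', () => {
        resolve({ statusCode: res.statusCode, received, complete: res.complete });
      });
    });
    // Failures before any response arrives (e.g. connection refused).
    req.on('error', () => resolve({ statusCode: null, received: 0, complete: false }));
  });
}
```

The client sees a normal 200 response at headers time and only learns about the failure while reading the body, which is the situation the streaming code path has to handle.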

🤖 Generated with Claude Code

aivus and others added 6 commits July 3, 2026 10:18
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolves at headers time with the body as a Readable stream. New
afterResponse contract: {url, statusCode, headers, getBody()}.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Binary resources are saved immediately at response time while holding
the request queue slot; html/css are buffered for link rewriting and
their content is freed after save. Default FS plugin pipes streams to
disk.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Benchmark (100 files x 20 MB, concurrency 8):
- v6.0.0: peak RSS 2274 MB
- v7.0.0: peak RSS 160 MB

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Don't store resources content in memory