Don't store resources content in memory - stream to storage (fixes #386)#643
Open
aivus wants to merge 6 commits into
Open
Don't store resources content in memory - stream to storage (fixes #386)#643aivus wants to merge 6 commits into
aivus wants to merge 6 commits into
Conversation
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Resolves at headers time with the body as a Readable stream. New
afterResponse contract: {url, statusCode, headers, getBody()}.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Binary resources are saved immediately at response time while holding the request queue slot; html/css are buffered for link rewriting and their content is freed after save. Default FS plugin pipes streams to disk. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts: # README.md
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #386
Reworks the scraper around streams so resource content is no longer stored in memory. Memory usage is now bounded by
requestConcurrencyinstead of total website size.Benchmark
npm run benchmark:memory(new script) — local server, 100 files × 20 MB (2 GB total), concurrency 8:How it works
lib/request.jsusesgot.streamand resolves at response-headers time with{url, statusCode, mimeType, encoding, metadata, stream}. Aretry-event wrapper keeps got's transient-error retries working for streams (up to headers time; a connection dying mid-body can't be retried with streams).lib/scraper.js: resources which need no modification (images, fonts, media, js — the bulk of bytes) are piped straight to storage at response time, while still holding the request-queue slot sorequestConcurrencykeeps bounding concurrent transfers. HTML/CSS still need full buffering for link rewriting (cheerio re-serializes the whole document), but their text is freed immediately after save.getReferenceonly needs the child's filename, and filenames are derivable from response headers (mime → type → generateFilename) before the body is consumed.requestConcurrency: 1; covered by a regression test).resource.getContentStream()to disk and removes partially written files on mid-stream failure, soignoreErrors: truenever leaves partial garbage.saveResourceactions on a streamed resource, content is buffered once so every storage can read it (a tee with in-series actions would deadlock on backpressure). Still strictly better than v6, which always buffered everything.Breaking changes (v7, documented in MIGRATION.md)
saveResourceactions:resource.getText()→resource.getContentStream()(single-use for streamed resources;await buffer(...)fromnode:stream/consumersfor whole-content storages)afterResponsereceives{response: {url, statusCode, headers, getBody()}}instead of the full got response. String returns removed; returning an object withoutbodykeeps the resource streaming (zero buffering)scrape()result no longer carries content —getText()isnullafter save (also inonResourceSaved)generateFilename'sresponseDatano longer containsbodyresponseTypein therequestoption is ignoredwebsite-scraper-puppeteer / website-scraper-phantom will need mechanical updates for the new
afterResponsecontract (response.body→await response.getBody()).Testing
npm test), eslint cleantest/functional/streamingsuite: byte-fidelity of multi-MB binaries, content freed after save, bothafterResponsepaths, multiplesaveResourcestorages, unconsumed streams, mid-stream failures (via a real local HTTP server — nock can't simulate a connection dying mid-body), redirect-to-already-requested-URL atrequestConcurrency: 1test/unit/plugins.test.jsforSaveResourceToFileSystemPluginnpm run test-e2e): 20+ real sites pass; the only failures areantipenko.pp.ua, which is currently stuck in a server-side 301 self-redirect loop (fails on master too)🤖 Generated with Claude Code