Don't store resources content in memory - stream to storage (fixes #386) by aivus · Pull Request #643 · website-scraper/node-website-scraper

aivus · 2026-07-03T11:09:30Z

Fixes #386

Reworks the scraper around streams so resource content is no longer stored in memory. Memory usage is now bounded by requestConcurrency instead of total website size.
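To illustrate why a concurrency limit can bound memory once bodies are streamed, here is a minimal sketch. It is not the PR's code: `createLimiter` and `transferAll` are hypothetical names, and the scraper's real request queue may be implemented differently. The point is only that a task keeps its slot until its stream has been fully written, so at most `limit` bodies are in flight at once.

```javascript
const { pipeline } = require('node:stream/promises');

// Hypothetical limiter: at most `limit` tasks run at the same time.
function createLimiter(limit) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active >= limit || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--; // the slot is released only after the task has settled
        next();
      });
  };
  return (task) =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}

// Each item supplies open() (a Readable source) and sink() (a Writable target).
// The pipeline promise settles after the whole body is written, so the slot
// is held for the full duration of the transfer.
async function transferAll(items, limit) {
  const run = createLimiter(limit);
  await Promise.all(
    items.map(({ open, sink }) => run(() => pipeline(open(), sink())))
  );
}
```

With `limit` set to the value of requestConcurrency, only that many streams (plus their internal stream buffers) exist at any moment, regardless of how many items are queued.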

Benchmark

npm run benchmark:memory (new script) — local server, 100 files × 20 MB (2 GB total), concurrency 8:

	v6.0.0	this PR
peak RSS	2274 MB	160 MB
wall time	2.6 s	1.6 s
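
For readers who want to collect comparable numbers themselves, the following is a rough sketch of how peak RSS and wall time could be sampled around an async task. It is an assumption-laden illustration, not the `benchmark:memory` script from this PR; `measurePeakRss` and the sampling interval are hypothetical.

```javascript
// Hypothetical helper: samples process RSS while `task` runs and reports
// the highest value seen plus the elapsed wall time.
async function measurePeakRss(task, intervalMs = 10) {
  let peak = process.memoryUsage().rss;
  const timer = setInterval(() => {
    peak = Math.max(peak, process.memoryUsage().rss);
  }, intervalMs);
  const startedAt = process.hrtime.bigint();
  try {
    await task();
  } finally {
    clearInterval(timer); // stop sampling even if the task fails
  }
  peak = Math.max(peak, process.memoryUsage().rss);
  const wallMs = Number(process.hrtime.bigint() - startedAt) / 1e6;
  return { peakRssMb: peak / 1024 / 1024, wallMs };
}
```

Sampling on a timer can miss short spikes, so the reported peak is a lower bound on the true peak.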

How it works

lib/request.js uses got.stream and resolves at response-headers time with {url, statusCode, mimeType, encoding, metadata, stream}. A retry-event wrapper keeps got's transient-error retries working for streams (up to headers time; a connection dying mid-body can't be retried with streams).
lib/scraper.js: resources which need no modification (images, fonts, media, js — the bulk of bytes) are piped straight to storage at response time, while still holding the request-queue slot so requestConcurrency keeps bounding concurrent transfers. HTML/CSS still need full buffering for link rewriting (cheerio re-serializes the whole document), but their text is freed immediately after save.
This answers the open question in the issue about references to not-yet-downloaded pages: getReference only needs the child's filename, and filenames are derivable from response headers (mime → type → generateFilename) before the body is consumed.
Redirect dedup resolves to the already-requested resource outside the queue task (awaiting inside would deadlock at requestConcurrency: 1; covered by a regression test).
The FS plugin pipes resource.getContentStream() to disk and removes partially written files on mid-stream failure, so ignoreErrors: true never leaves partial garbage.
For multiple saveResource actions on a streamed resource, content is buffered once so every storage can read it (a tee with in-series actions would deadlock on backpressure). Still strictly better than v6, which always buffered everything.
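
The FS plugin behaviour described above (pipe to disk, remove the partial file on mid-stream failure) can be sketched roughly as follows. This is an illustration using only Node.js built-ins, not the plugin's actual source; `saveStreamToFile` is a hypothetical name.

```javascript
const fs = require('node:fs');
const { pipeline } = require('node:stream/promises');

// Hypothetical helper: stream `source` into `filePath` and delete whatever
// was written if the transfer fails part-way.
async function saveStreamToFile(source, filePath) {
  try {
    // pipeline() rejects if either stream errors and destroys both streams.
    await pipeline(source, fs.createWriteStream(filePath));
  } catch (err) {
    // force: true makes rm a no-op if the file does not exist.
    await fs.promises.rm(filePath, { force: true });
    throw err;
  }
}
```

Rethrowing after cleanup leaves the decision about ignoring the error to the caller, which matches the idea that ignoreErrors: true should skip the resource without leaving a truncated file behind.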

Breaking changes (v7, documented in MIGRATION.md)

saveResource actions: resource.getText() → resource.getContentStream() (single-use for streamed resources; await buffer(...) from node:stream/consumers for whole-content storages)
afterResponse receives {response: {url, statusCode, headers, getBody()}} instead of the full got response. String returns removed; returning an object without body keeps the resource streaming (zero buffering)
scrape() result no longer carries content — getText() is null after save (also in onResourceSaved)
generateFilename's responseData no longer contains body
Binary files are written byte-for-byte from the network (previously round-tripped through a latin1 string) — bug-fix-class difference
responseType in the request option is ignored

website-scraper-puppeteer / website-scraper-phantom will need mechanical updates for the new afterResponse contract (response.body → await response.getBody()).
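
As a rough migration sketch for plugin authors, the block below shows both changed actions in one plugin. It follows the contract described in this PR (`resource.getContentStream()`, `response.getBody()`, no `body` in the result to keep streaming), but the details are assumptions: `WholeContentStoragePlugin`, the `store` object with its `put` method, and the `checked` metadata key are hypothetical, and accepting `metadata` in the afterResponse result is assumed to work as in v6.

```javascript
const { buffer } = require('node:stream/consumers');

// Hypothetical plugin for a storage that needs the whole content at once.
// `store` is any object with an async put(key, bytes) method (an assumption).
class WholeContentStoragePlugin {
  constructor(store) {
    this.store = store;
  }

  apply(registerAction) {
    // v6: const text = resource.getText();
    // v7: the content stream is single-use, so collect it once into a Buffer.
    registerAction('saveResource', async ({ resource }) => {
      const bytes = await buffer(resource.getContentStream());
      await this.store.put(resource.getFilename(), bytes);
    });

    // v6: response.body was already populated.
    // v7: the body is read only when getBody() is called.
    registerAction('afterResponse', async ({ response }) => {
      const type = String(response.headers['content-type'] || '');
      if (!type.includes('text/html')) {
        // No `body` in the result, so the resource should keep streaming.
        return { metadata: { checked: false } };
      }
      const body = await response.getBody();
      return { body, metadata: { checked: true } };
    });
  }
}
```

Calling `getBody()` only for the responses that need inspection keeps the zero-buffering path for everything else.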

Testing

262 unit + functional tests passing (npm test), eslint clean
New test/functional/streaming suite: byte-fidelity of multi-MB binaries, content freed after save, both afterResponse paths, multiple saveResource storages, unconsumed streams, mid-stream failures (via a real local HTTP server — nock can't simulate a connection dying mid-body), redirect-to-already-requested-URL at requestConcurrency: 1
New test/unit/plugins.test.js for SaveResourceToFileSystemPlugin
E2E (npm run test-e2e): 20+ real sites pass; the only failure is antipenko.pp.ua, which is currently stuck in a server-side 301 self-redirect loop (it fails on master too)

🤖 Generated with Claude Code

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Binary resources are saved immediately at response time while holding the request queue slot; html/css are buffered for link rewriting and their content is freed after save. Default FS plugin pipes streams to disk. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

aivus and others added 6 commits July 3, 2026 10:18

Add Resource content stream model and stream/fs utils (#386)

6deb6cf

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Rewrite request layer around got.stream with retry support (#386)

9585992

Resolves at headers time with the body as a Readable stream. New afterResponse contract: {url, statusCode, headers, getBody()}. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add streaming test suites and FS plugin unit tests (#386)

74bc674

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Add memory benchmark, document v7 streaming changes, bump to 7.0.0 (#386)

a3c6754

Benchmark (100 files x 20 MB, concurrency 8):
- v6.0.0: peak RSS 2274 MB
- v7.0.0: peak RSS 160 MB
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/master' into issue-386-streams

ace35a4

# Conflicts: # README.md

aivus wants to merge 6 commits into master from issue-386-streams