Skip to content

Fix Apache HttpClient 5 reactor recovery#2034

Open
sharp-pixel wants to merge 3 commits into
opensearch-project:mainfrom
sharp-pixel:fix/1969-httpclient5-reactor-recovery
Open

Fix Apache HttpClient 5 reactor recovery#2034
sharp-pixel wants to merge 3 commits into
opensearch-project:mainfrom
sharp-pixel:fix/1969-httpclient5-reactor-recovery

Conversation

@sharp-pixel

Copy link
Copy Markdown

Fixes #1969

Problem

ApacheHttpClient5Transport could enter an unrecoverable state after Apache HttpClient 5 reported IOReactorShutdownException. Once the underlying I/O reactor was shut down, the transport continued using the same dead CloseableHttpAsyncClient, so every later request failed until the application recreated the transport.

One common trigger is memory pressure while buffering large or highly concurrent responses. The existing per-response buffer limit capped a single response, but it did not cap aggregate heap usage across concurrent responses, so many in-flight responses could still exhaust the heap and kill the reactor.

Solution

This change makes ApacheHttpClient5Transport recover from reactor shutdowns when it owns the client lifecycle:

  • Rebuilds and starts a fresh Apache HC5 async client when IOReactorShutdownException is detected.
  • Handles both synchronous execute(...) failures and callback-delivered reactor shutdown failures.
  • Retries the affected request once on the rebuilt client.
  • Adds rebuild backoff to avoid rebuild storms if the reactor keeps dying.
  • Prevents recovery after the transport has been explicitly closed.
  • Keeps externally supplied clients unchanged; recovery is disabled when the transport does not own the client lifecycle.

It also reduces the chance of reactor-killing memory failures:

  • Adds an optional aggregate response-buffer memory budget.
  • Adds bounded response-consumer capacity increments when a budget is active.
  • Converts OutOfMemoryError during response buffering into a request-level IOException so the reactor thread is not killed.
  • Fails fast without node denylisting/retry for deterministic client-side failures such as shared-budget exhaustion, oversized single responses, and OOM-caused buffering failures.
  • Resets leaked response-buffer budget reservations when the client is rebuilt, including request-specific budgets.

Testing

Ran:

./gradlew :java-client:test \
  --tests org.opensearch.client.transport.httpclient5.ApacheHttpClient5TransportRecoveryTest \
  --tests org.opensearch.client.transport.httpclient5.internal.ResponseMemoryBudgetTest \
  --tests org.opensearch.client.transport.httpclient5.internal.HeapBufferedAsyncEntityConsumerTest

./gradlew :java-client:test \
  --tests 'org.opensearch.client.transport.httpclient5.*' \
  --tests 'org.opensearch.client.transport.httpclient5.internal.*'

./gradlew :java-client:spotlessJavaCheck

git diff --check

Also verified the locally published modified client in a standalone sample app using the normal existing transport builder path against OpenSearch 3.7.0 in Docker:

  • ping and info/version
  • create index
  • index and search a document
  • concurrent indexing/search smoke test with one client instance

Comment on lines +219 to +220
* Releases the (potentially very large) response buffer and converts a fatal {@link OutOfMemoryError} into a
* recoverable {@link IOException}.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is safe to do this. The contract of java.lang.Error generally is that it "indicates serious problems that a reasonable application should not try to catch". I don't think you can assume the JVM is in a workable state after observing an error and the right thing to do is let the process crash. In this case, you don't actually know if the response buffer in question here caused the OOM, versus something else in the application unrelated to this client that consumed all the heap.

@andrross

Copy link
Copy Markdown
Member

Would it make sense to break this into 2 PRs? One with the fix to allow recovery from reactor shutdowns, and another PR with the changes to reduce the chance of reactor-killing memory failures.

Signed-off-by: Cédric Pelvet <cedric.pelvet@gmail.com>
Signed-off-by: Cédric Pelvet <cedric.pelvet@gmail.com>
Signed-off-by: Cédric Pelvet <cedric.pelvet@gmail.com>
@sharp-pixel sharp-pixel force-pushed the fix/1969-httpclient5-reactor-recovery branch from 3004731 to a6d50d0 Compare July 1, 2026 08:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]ApacheHttpClient5Transport enters unrecoverable state after IOReactorShutdownException

2 participants