Worker Deployment Models for High Availability (first ½ ready for review) #4703
lukeknep wants to merge 15 commits into
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Deployment failed with the following error: Learn More: https://vercel.com/docs/concepts/projects/project-configuration |
| - **Active / Passive** — Workflows process in one region at a time, the "active" region. The other region is "passive" and ready for failover. This pattern has two variants: | ||
| - **[Active / Passive (Cold)](#active-cold)** — a.k.a. Active / Cold — Workers run in only one region at a time. After a failover, Workers start in the secondary region. The region where Workers run == the region where Workflows process. To fail over, Workers need a "cold start" in the other region. | ||
| - **[Active / Passive (Hot)](#active-hot)** — a.k.a. Active / Hot — Workers run in **both regions** simultaneously, but Workflows still process in only one region at any given time. The other region's Workers are on "hot" standby. | ||
| - **[Active / Active](#active-active)** — Workflows process in both regions at the same time. Necessarily, Workers run in both regions at all times. |
nit: "Necessarily" is an odd word to use here. I'd just remove it.
| Active / Cold Pattern: **On failover** | ||
|
|
||
| - **The Namespace fails over automatically.** Temporal Cloud promotes the secondary region's replica to active. No action is needed to fail over the Namespace itself. | ||
| - **You bring the Workers up in the secondary region.** Because no Workers were running there, they start from nothing — a "cold" start. Starting and scaling that fleet is your responsibility, ideally through tested automation. Until the Workers are running, no Workflows make progress. |
I feel like the question everyone reading this is going to ask is: how do we detect a failover?
I know we have plans to answer this in H2, but is there something we want to tell them now? For example, should they have some sort of system that constantly queries which region is active to detect a failover? Or do we just want to wait for the question and address it then?
It could just be one of those things where we fix the problem before we expect to be asked about it.
Another thing I thought about is how they know when to scale down those Workers and do their own failback.
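To make the detection question in this thread concrete, here is a minimal sketch of the polling idea: keep asking which region is active and react when the answer changes. Everything named here is hypothetical (`get_active_region`, `on_failover`, and the region names); the PR does not say how the active region should be looked up or how Workers should be started and stopped, so those parts are left as caller-supplied callables.

```python
import time
from typing import Callable, Optional


class FailoverDetector:
    """Remembers the last active region seen and reports when it changes.

    Both callables are hypothetical placeholders supplied by the caller:
    - get_active_region() returns the name of the currently active region.
    - on_failover(old, new) is invoked when the active region changes.
    """

    def __init__(
        self,
        get_active_region: Callable[[], str],
        on_failover: Callable[[str, str], None],
    ) -> None:
        self._get_active_region = get_active_region
        self._on_failover = on_failover
        self._last_seen: Optional[str] = None

    def poll_once(self) -> bool:
        """Check the active region once; return True if it changed."""
        current = self._get_active_region()
        previous = self._last_seen
        self._last_seen = current
        # The first poll only records a baseline; it is never a failover.
        if previous is not None and current != previous:
            self._on_failover(previous, current)
            return True
        return False

    def run(self, interval_seconds: float, max_polls: int) -> None:
        """Poll a fixed number of times, sleeping between checks."""
        for _ in range(max_polls):
            self.poll_once()
            time.sleep(interval_seconds)
```

Because the callback fires on any change of active region, the same hook would also fire on failback, which is one place the scale-down logic mentioned above could live.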
| Active / Cold Pattern: **Tradeoffs** | ||
|
|
||
| - Highest overall recovery time of the three patterns, due to cold starting the Worker fleet after failover. | ||
| - Depends on tested automation to bring up the secondary-region fleet quickly. |
"Tested automation": I see this 3 times, and as a user I'd have no idea what it means.
|
|
||
| - **Use the Namespace Endpoint.** | ||
| - Connect Workers through the [Namespace Endpoint](/cloud/namespaces#access-namespaces), which always connects to the Namespace in its active region and automatically fails over to the new region. | ||
| - **Rationale:** If a Temporal Cloud incident requires the Namespace to fail over while the rest of the primary region is healthy, the Workers in the primary region can still connect through the Namespace Endpoint and process Workflows. If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region. |
"If the Workers use the Regional Endpoint for the primary region, they will not reliably connect to the Namespace during a Temporal Cloud incident in the primary region."
Won't they be forwarded?
Ah, I see the part lower down about turning off forwarding. This seems like it would be a really good feature to have in the Worker: let it pass up the flag. If you know you are connecting to a regional endpoint and you don't want forwarding, seeing it all in one spot in the code is much clearer than having to set the regional endpoint in the Worker and make a CLI call externally.
Just a thought.
| - **Codec Servers and proxies** — run in both regions continuously. | ||
| - **Databases and queues** — accessed from both regions; cross-region consistency must be designed for. | ||
|
|
||
| ### Dual Active (Multi-Active) {/* #dual-active */} |
I'm a little confused about this one. Isn't this just taking the Active/Passive pattern and applying it to 2 Namespaces? I guess I'm confused about why this is here when we already have Active/Passive.
Is this pattern really just saying "you can have different Namespaces in different regions"?
|
|
||
| | Pattern | Best for | Major benefits | Major tradeoffs | | ||
| | --- | --- | --- | --- | | ||
| | **[Active/Passive (Cold)](#active-cold)** | Easy initial deployment | Acts like a single region; no special setup required | Failing over Workers is the user's responsibility | |
Isn't failing over workers always the user's responsibility? The biggest tradeoff here is the cold start right?
|
|
||
| ```mermaid |
The other diagrams look great. I don't think these 6 are super helpful, though. You already have basically the same diagrams in the later sections to illustrate each pattern in detail. By including them here you also lose the comparative benefit of the table, because users can no longer easily look at the rows side by side. I'd remove these mermaid diagrams here and keep a simple table.
|
|
||
| ### Active/Passive (Cold) {/* #active-cold */} | ||
|
|
||
| _Also known as "Active/Cold Standby", "Active/Cold", or simply "Active/Passive"._ |
Instead of listing all these alternate names, I think it's less confusing to just use one consistent name throughout.
|
|
||
| ### Which pattern has the lowest recovery time (RTO)? {/* #faq-lowest-rto */} | ||
|
|
||
| **Active/Passive (Hot)** achieves the lowest recovery time, because a standby Worker fleet already runs in the secondary region and begins processing the moment it becomes active — no cold start. See [Active/Passive (Hot)](#active-hot) and [RPO and RTO](/cloud/rpo-rto). |
Wouldn't Active/Active have the lowest recovery time?
We say this earlier on the page:
"Active/Active Pattern: Benefits
- Low overall recovery time. The surviving region keeps processing while capacity scales up."
|
|
||
| To understand the recovery objectives each pattern is measured against, see [RPO and RTO](/cloud/rpo-rto). | ||
|
|
||
| ## Frequently asked questions {/* #faq */} |
How attached are you to this FAQ section? It looks like it is repeating things we already addressed before in the page and in greater detail.
…tation into ha-worker-deployments
What does this PR do?
When using multi-region High Availability, Temporal Cloud customers often ask us how to decide where to deploy their Workers and other systems.
This page gives recommendations on common patterns for an overall High Availability strategy that a Temporal Cloud user can adopt in their architecture.
Notes to reviewers
┆Attachments: EDU-6522 [draft] High Availability Deployment Models page