Parallelize team-provisioner Google Admin SDK calls#130
Merged
Conversation
The Provision Groups CI in registry has been failing with a read-timeout on `aws lambda invoke`. The Lambda itself was running ~90s end-to-end (well under its 300s timeout) but the AWS CLI's default 60s read-timeout was tripping and triggering retries. Root cause: per-hero and per-group Google Admin Directory API calls were issued sequentially. With 55 heroes and 35 groups, the ~230 round-trips at ~400ms each sum to ~90s. Fix in this PR: - Extract per-hero (`_sync_hero_account`, `_sync_hero_alias`), per-group (`_sync_one_group`) and per-access-group helpers from `handle_sync_groups_and_heros`. - Dispatch each step's per-item work through a ThreadPoolExecutor with `GROUP_SYNC_WORKERS` workers (default 5, env-tunable). - Pre-fetch the OAuth token once before parallel work so workers don't race to refresh it. - Worker count sized to stay under Google's per-user quota (1,500 queries / 100s ≈ 15 QPS). 5 workers × ~3 QPS each sits at the budget ceiling, well below the 40 QPS project quota. Belt-and-suspenders in the same PR: - Add `--cli-read-timeout 0` to `scripts/provision-groups.py` and `scripts/provision-teams.sh` so a future regression past 60s won't surface as a false CI failure. Expected effect: total Lambda duration drops from ~90s to ~20-25s, comfortably under the 60s CLI default (which we've also removed) and the 300s Lambda timeout.
Terraform Plan🚧 Changes detected — Plan: 0 to add, 2 to change, 0 to destroy. Plan outputLLM ReviewRisk: 🟢 LOW Routine Lambda function updates with source code hash changes for compliance_reporter and team_provisioner functions.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The registry's Provision Groups workflow has been failing with a
Read timeout on endpoint URL ... /javabin-team-provisioner/invocations.The Lambda itself isn't timing out — its config has
Timeout=300sandreal executions complete in ~90s. What's tripping is the AWS CLI's
default 60s read-timeout, which then triggers client-side retries
and three concurrent Lambda invocations per CI run before exiting with
a false-positive failure.
CloudWatch confirms: invocation
96299b07-…on 2026-05-27 ran cleanlyend-to-end with
Duration: 90,978 ms— the data was synced but CI wasalready red.
Root cause: per-hero and per-group Google Admin Directory API calls
were sequential. With 55 heroes and 35 groups, the ~230 round-trips at
~400 ms each sum to ~90 s.
Changes
_sync_hero_account,_sync_hero_alias,_sync_one_group,and
_sync_one_access_groupfromhandle_sync_groups_and_heros.ThreadPoolExecutorwithGROUP_SYNC_WORKERSworkers (default5, env-tunable on the Lambdawithout redeploying code).
workers don't race to refresh it.
--cli-read-timeout 0toscripts/provision-groups.pyandscripts/provision-teams.shas a safety net against future regressions.Sizing rationale
Google Directory API per-user quota: 1,500 queries / 100 s (≈15 QPS).
We impersonate a single admin via domain-wide delegation, so per-user
applies. 5 workers × ~3 QPS each ≈ 15 QPS — at the budget ceiling, well
below the 40 QPS project quota. If we hit 429s in practice, drop
GROUP_SYNC_WORKERSto 3 via env-var (no redeploy needed).Expected effect
Test plan
javaBin/registry'sgroups/**triggers theProvision Groups workflow; verify it completes green and the
Lambda
Durationmetric is well under 60 s.API — if observed, drop
GROUP_SYNC_WORKERSenv var to 3.Related: false-positive timeout in Provision Groups #26418541309.