Skip to content

e2e flake: instance auto restart #3266

Description

@david-crespo

Noticed when I checked CI on a different e2e flake fix in #3265. It's not bad enough to fail the test, but it does fail twice in the same way (we retry twice, so it passed on the third try):

  1) [firefox] › test/e2e/instance-auto-restart.e2e.ts:122:1 › Auto restart popover, cooled, policy never, cooled 

    Error: expect(locator).toBeVisible() failed

    Locator: getByText(/Cooldown expiration.+, 202\d.+/)
    Expected: visible
    Timeout: 10000ms
    Error: element(s) not found

    Call log:
      - Expect "toBeVisible" with timeout 10000ms
      - waiting for getByText(/Cooldown expiration.+, 202\d.+/)


      134 |   await page.getByRole('link', { name: 'never', exact: true }).click()
      135 |
    > 136 |   await expect(page.getByText(/Cooldown expiration.+, 202\d.+/)).toBeVisible()
          |                                                                  ^
      137 |   await expect(page.getByText('restarting soon')).toBeHidden()
      138 |   await expect(page.getByText(/Last auto-restarted.+, 202\d/)).toBeVisible()
      139 |
        at /Users/runner/work/console/console/test/e2e/instance-auto-restart.e2e.ts:136:66

https://github.com/oxidecomputer/console/actions/runs/28254435147/job/83713855809

The robot did not nail it conclusively IMO but here are its conclusions:

🤖 Claude's somewhat dubious theory

Diagnosis

The flake is in test/e2e/instance-auto-restart.e2e.ts:122 — "Auto restart popover, cooled, policy never, cooled". In run 28254435147 (firefox) it failed the initial attempt and retry #1 identically, then passed retry #2, so it surfaced as flaky rather than a failure.

Failing assertion (line 136):

expect(getByText(/Cooldown expiration.+, 202\d.+/)).toBeVisible() → element(s) not found (10s timeout)

Root cause (confirmed from the trace artifact, not inferred): The "Cooldown expiration" field lives only on the instance Settings tab. At line 134 the test clicks the popover's "never" link to navigate there. The error-context snapshot shows the page was still on the Storage tab at failure time:

- tab "Storage" [selected]
- tab "Networking" / "Metrics" / "Connect" / "Settings"

The trace's frame URLs only ever show the initial goto and the /storage redirect — /settings is never reached. So the click closed the popover but did not navigate. With the page stuck on Storage, the settings-only field never renders and the assertion times out. Both attempts failed the same way because the navigation simply didn't fire either time.

Mechanism: the link is a headlessui CloseButton as={Link} (app/components/InstanceAutoRestartPopover.tsx:60-77). headlessui's CloseButton renders its Button, which layers react-aria press/hover interaction hooks (useActivePress, useHover) plus its own click→close() onto react-router's <Link>. This CloseButton as={Link} + react-router composition is a known-fragile interaction zone (headlessui discussion #3414 documents the related failure where the panel won't close because Link preventDefaults; our case is the inverse — close fires, navigation doesn't). headlessui's own recommended pattern is to avoid CloseButton as={Link} and instead use the panel's close render prop on a plain Link.

It reproduces only under CI load — 55/55 local runs on firefox passed (reduced motion is forced in all projects, so the slide-down animation isn't the variable; it's genuinely a rare event-timing race).

Proposed fix

Replace CloseButton as={Link} with a plain react-router <Link> that closes the popover via the PopoverPanel close render prop. This drops the headlessui Button/react-aria press layer wrapping the anchor, leaving react-router's click→navigate path unobstructed, while still closing the popover:

<PopoverPanel ...>
  {({ close }) => (
    // ...
    <Link
      to={pb.instanceSettings(instanceSelector)}
      onClick={() => close()}
      className="group -m-1 flex w-full items-center justify-between rounded-md px-1"
    >
      {/* policy badge + arrow */}
    </Link>
    // ...
  )}
</PopoverPanel>

Caveat worth being explicit about: because I can't reproduce locally, this fix is reasoned from the trace plus headlessui's documented guidance, not empirically verified to eliminate the flake. It's low-risk and aligns the code with the recommended pattern regardless.

Want me to implement it? I'd also add a post-click assertion in the test that the Settings tab is selected before checking the cooldown field, so any future navigation regression fails with a clear "didn't navigate" signal instead of a confusing field-not-found timeout.

Sources:

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions