`AgentCoreMemorySessionManager` silently drops conversation history when the metadata-filtered `ListEvents` (`read_agent`/`read_session`) is not yet consistent

## Summary

`AgentCoreMemorySessionManager` (in `bedrock_agentcore.memory.integrations.strands`) restores a session's conversation history **only if** its `read_agent()` / `read_session()` calls find the prior `AGENT` / `SESSION` marker events. Those two reads use a **metadata-filtered `ListEvents`** query, which appears to be **eventually consistent** on the service side. When the filter transiently returns nothing (even though the matching marker event exists and the raw conversation events are fully persisted), the session manager **silently treats the turn as a brand-new agent, creates a fresh agent record, and replays no history**. The agent then runs with an empty `agent.messages`, so the model loses all prior context for that turn.

The conversation data itself is never lost — the **unfiltered** `ListEvents` (used by `list_messages()`) is strongly consistent and always returns the full history. Only the *gate* (the metadata-filtered read) is flaky, and the failure is intermittent and silent.

## Environment

- `bedrock-agentcore` 1.14.1 (also reproduced/confirmed on 1.15.1 — same logic)
- `strands-agents` 1.42.0
- Runtime: Amazon Bedrock AgentCore Runtime; Memory resource with `USER_PREFERENCE` + `SEMANTIC` long-term strategies
- Region: ap-southeast-1

## Expected vs. Actual

- **Expected:** within a stable `session_id` + `actor_id`, every follow-up turn is given the prior conversation as context (the documented "short-term memory" behavior).
- **Actual:** *intermittently*, a follow-up turn runs with no history; the model behaves as if the conversation just started. Persists across turns (each affected turn is independent). No error is raised or logged.

## Root cause

Message restoration in `RepositorySessionManager.initialize()` is gated on `read_agent()` returning a non-`None` `SessionAgent` (and the session is resolved via `read_session()` in `__init__`). Both reads are metadata-filtered `ListEvents` calls with `max_results=1`:

- `read_agent()` — filters `stateType == AGENT AND agentId == <id>` (`session_manager.py`, ~L461-467)
- `read_session()` — filters `stateType == SESSION` (`session_manager.py`, ~L318-324)
- both go through `MemoryClient.list_events(..., event_metadata=[...])`, which sends `filter={"eventMetadata":[...]}` (`memory/client.py`, ~L861-885)

The service-side **metadata filter is eventually consistent**: shortly (and sometimes not-so-shortly) after the marker events for a turn are written, a metadata-filtered query may not return them yet. When `read_agent()` returns `None`, `initialize()` takes the "new agent" branch → `create_agent()` (the `Created agent: default in session: ...` log line) → **no call to `list_messages()`** → history is not replayed.

Crucially, `list_messages()` itself does **not** use the metadata filter — it reads raw events with a plain `list_events()` and is strongly consistent. So the data needed to restore is available; the manager just never reads it because the gate failed.

There is **no retry and no fallback** from the metadata-filtered read to the strongly-consistent unfiltered read, and the miss is **silent**.

## Reproduction

1. Use a `AgentCoreMemorySessionManager` with a real Memory resource. Send turn 1 in a fresh `session_id` (e.g. "remember my number is 73").
2. Send turn 2 in the **same** `session_id` after a short delay (we saw it with gaps from ~150s up to ~2h).
3. Intermittently, turn 2's `agent.messages` is empty and the model has no memory of turn 1, while the runtime logs `Created agent: default in session: <same id>`.
4. Query the session afterwards: the raw events (both turns) are all present, and constructing a fresh `Agent(session_manager=sm)` against the same session **does** restore the full history — confirming the data was always there and the failure was a transient read at invoke time.

## Experiments we ran (to isolate it)

1. **Dumped the session via unfiltered `list_events`** → all conversation events for both turns present under the same `(memory, actor, session)`. (Rules out "data not written".)
2. **Ran the installed SDK's `read_agent()` / `read_session()` / `list_messages()`** against the live session → all returned the events (after the index had caught up). (Shows the read path is correct *when consistent*.)
3. **Constructed a real `Agent(session_manager=sm)`** exactly as our runtime does → `agent.messages` restored the full history. (Rules out "restore is broken".)
4. **Live 2-turn rapid test** against the deployed runtime (seconds apart) → turn 2 recalled the fact. (Works.)
5. **Live 2-turn test with a 150s gap** → turn 2 had no history (model: "I have no cross-conversation memory"). (Reproduced the failure.)
6. **Live 3rd turn (cold, ~15 min later)** on the rapid session → recalled the fact. (Works.)
7. **Post-hoc restore of the failed (gap) session** via the SDK → `read_session`/`read_agent` FOUND, `list_messages` returned all turns. (Proves the failure was a transient read at invoke time, not data/version.)
8. **Compared `bedrock-agentcore` 1.14.1 vs 1.15.1** → identical `list_events` and `session_manager` restore logic. (Rules out "fixed by upgrade".)

Net: 2 reproduced failures, multiple successes, same code/data/version → the only variable is the consistency of the metadata-filtered read at invoke time.

## Impact

Silent, intermittent loss of conversation context in production multi-turn agents using the documented short-term-memory integration. Hard to detect (no error, no log), and not fixable by upgrading.

## Recommended fix

Restoration should not depend on an eventually-consistent metadata-filtered read. Options, in order of preference:

1. **Restore messages from the strongly-consistent unfiltered read.** In `initialize()`, after `read_agent()`, if it returns `None` but `list_messages(session_id, agent_id)` (unfiltered) returns a non-empty history, treat the agent as existing and replay that history instead of creating a new agent. (Decouples message restore from the flaky agent/session marker lookup.)
2. **Add bounded retry with backoff** to `read_agent()` / `read_session()` for the read-after-write window when a marker event is expected.
3. **At minimum, make it observable:** log a warning when `initialize()` takes the "new agent" branch while unfiltered events for the session already exist (i.e. a likely false "new session").

A minimal, behavior-preserving version of (1): when the metadata-filtered `read_agent` misses, fall back to the unfiltered `list_messages` to decide existence + restore.

## Minimal reproduction script

Self-contained; exercises the exact SDK code path (a fresh `AgentCoreMemorySessionManager` + `Agent` per turn, as a stateless runtime invoke would). The failure is consistency/timing dependent, so the script loops over fresh sessions until it catches one; on failure it immediately proves the data exists via the strongly-consistent unfiltered read.

```python
#!/usr/bin/env python3
"""Repro: AgentCoreMemorySessionManager intermittently skips history restore.

Requires: pip install "bedrock-agentcore==1.14.1" "strands-agents==1.42.0"
          AWS credentials for an account with an AgentCore Memory resource.
Usage:    python repro.py <MEMORY_ID> [region] [gap_seconds]
"""
import sys, time, uuid

from bedrock_agentcore.memory.integrations.strands.config import AgentCoreMemoryConfig
from bedrock_agentcore.memory.integrations.strands.session_manager import AgentCoreMemorySessionManager
from strands import Agent
from strands.models import BedrockModel

MEMORY_ID = sys.argv[1]
REGION    = sys.argv[2] if len(sys.argv) > 2 else "ap-southeast-1"
GAP       = int(sys.argv[3]) if len(sys.argv) > 3 else 150
ACTOR     = "repro-actor"
MODEL     = "apac.amazon.nova-lite-v1:0"   # any cheap in-region model

def new_sm(session_id: str) -> AgentCoreMemorySessionManager:
    cfg = AgentCoreMemoryConfig(memory_id=MEMORY_ID, session_id=session_id, actor_id=ACTOR)
    return AgentCoreMemorySessionManager(cfg, REGION)

for attempt in range(1, 21):
    sid = f"repro-{uuid.uuid4()}"                      # >= 33 chars, fresh session

    # ---- turn 1: real model call so user+assistant events are persisted
    agent1 = Agent(model=BedrockModel(model_id=MODEL, region_name=REGION),
                   system_prompt="Reply in five words or fewer.",
                   session_manager=new_sm(sid), callback_handler=None)
    agent1("Remember: my lucky number is 73.")

    time.sleep(GAP)                                     # simulate the next user turn arriving later

    # ---- turn 2: brand-new manager + agent on the SAME session
    #      (exactly what a stateless runtime does on the next invoke)
    sm2 = new_sm(sid)
    agent2 = Agent(model=BedrockModel(model_id=MODEL, region_name=REGION),
                   system_prompt="Reply in five words or fewer.",
                   session_manager=sm2, callback_handler=None)

    restored   = len(agent2.messages)                          # what initialize() replayed
    unfiltered = len(sm2.list_messages(sid, agent2.agent_id))  # strongly-consistent ground truth
    print(f"[{attempt}] session={sid} restored={restored} unfiltered={unfiltered}")

    if restored == 0 and unfiltered > 0:
        print(">>> REPRODUCED: initialize() replayed nothing, yet the unfiltered "
              f"list_messages returns {unfiltered} messages for the same session — "
              "the metadata-filtered read_agent/read_session missed the marker events.")
        break
else:
    print("Not reproduced in 20 attempts — the miss window depends on service-side "
          "index consistency; retry, vary the gap, or run at higher write rates.")
```

Observed signal when it hits: `restored=0 unfiltered=4` (turn 1's user+assistant are on disk, but nothing was replayed), matching the production `Created agent: default in session: <same id>` log line on the second turn.

## Suggested patch (sketch)

Untested sketch of recommendation (1) — decouple restore from the eventually-consistent marker lookup by falling back to the strongly-consistent unfiltered read inside `read_agent()`:

```python
# bedrock_agentcore/memory/integrations/strands/session_manager.py
def read_agent(self, session_id: str, agent_id: str, **kwargs: Any) -> Optional[SessionAgent]:
    agent = self._read_agent_filtered(session_id, agent_id)   # existing metadata-filtered lookup
    if agent is not None:
        return agent

    # Fallback: the metadata-filtered ListEvents is eventually consistent and can
    # miss a just-written AGENT marker. The unfiltered read is strongly consistent —
    # if conversational events exist for this session, the agent DOES exist.
    if self.list_messages(session_id, agent_id, limit=1):
        logger.warning(
            "read_agent: metadata filter returned nothing but session %s has events; "
            "treating agent %s as existing to avoid dropping history", session_id, agent_id)
        return SessionAgent(
            agent_id=agent_id,
            state={},
            conversation_manager_state=NullConversationManager().get_state(),  # or the configured default
        )
    return None
```

Notes: the reconstructed `SessionAgent` loses any persisted `state` / `conversation_manager_state` for that one turn (they re-sync on the next write), which is strictly better than silently dropping the entire conversation. The same fallback shape applies to `read_session()`. Alternatively (recommendation 2), a bounded retry with backoff on the filtered read also closes most of the window, at the cost of latency.

## Current workaround (in our runtime)

After constructing the agent, if `session_manager` is attached but `agent.messages` is empty, we re-load history via the strongly-consistent `session_manager.list_messages(session_id, agent.agent_id)` and assign it to `agent.messages`:

```python
agent = Agent(**agent_kwargs)   # agent_kwargs includes session_manager=sm
if sm is not None and not agent.messages:
    restored = sm.list_messages(session_id, agent.agent_id)
    if restored:
        agent.messages = [m.to_message() for m in restored]
        log.warning("Memory restore fallback hit: managed restore was empty, "
                    "re-loaded %d messages via list_messages", len(agent.messages))
```

Direct assignment does not enqueue writes (verified `pending_message_count()` unchanged), so it does not re-persist or duplicate. This fully and reliably eliminates the symptom.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`AgentCoreMemorySessionManager` silently drops conversation history when the metadata-filtered `ListEvents` (`read_agent`/`read_session`) is not yet consistent #564

Summary

Environment

Expected vs. Actual

Root cause

Reproduction

Experiments we ran (to isolate it)

Impact

Recommended fix

Minimal reproduction script

Suggested patch (sketch)

Current workaround (in our runtime)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

AgentCoreMemorySessionManager silently drops conversation history when the metadata-filtered ListEvents (read_agent/read_session) is not yet consistent #564

Description

Summary

Environment

Expected vs. Actual

Root cause

Reproduction

Experiments we ran (to isolate it)

Impact

Recommended fix

Minimal reproduction script

Suggested patch (sketch)

Current workaround (in our runtime)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

`AgentCoreMemorySessionManager` silently drops conversation history when the metadata-filtered `ListEvents` (`read_agent`/`read_session`) is not yet consistent #564