Summary
AgentCoreMemorySessionManager (in bedrock_agentcore.memory.integrations.strands) restores a session's conversation history only if its read_agent() / read_session() calls find the prior AGENT / SESSION marker events. Those two reads use a metadata-filtered ListEvents query, which appears to be eventually consistent on the service side. When the filter transiently returns nothing (even though the matching marker event exists and the raw conversation events are fully persisted), the session manager silently treats the turn as a brand-new agent, creates a fresh agent record, and replays no history. The agent then runs with an empty agent.messages, so the model loses all prior context for that turn.
The conversation data itself is never lost — the unfiltered ListEvents (used by list_messages()) is strongly consistent and always returns the full history. Only the gate (the metadata-filtered read) is flaky, and the failure is intermittent and silent.
Environment
- bedrock-agentcore 1.14.1 (also reproduced/confirmed on 1.15.1; same logic)
- strands-agents 1.42.0
- Runtime: Amazon Bedrock AgentCore Runtime; Memory resource with USER_PREFERENCE + SEMANTIC long-term strategies
- Region: ap-southeast-1
Expected vs. Actual
- Expected: within a stable session_id + actor_id, every follow-up turn is given the prior conversation as context (the documented "short-term memory" behavior).
- Actual: intermittently, a follow-up turn runs with no history and the model behaves as if the conversation just started. The problem recurs across turns, and each affected turn fails independently. No error is raised or logged.
Root cause
Message restoration in RepositorySessionManager.initialize() is gated on read_agent() returning a non-None SessionAgent (and the session is resolved via read_session() in __init__). Both reads are metadata-filtered ListEvents calls with max_results=1:
- read_agent(): filters stateType == AGENT AND agentId == <id> (session_manager.py, ~L461-467)
- read_session(): filters stateType == SESSION (session_manager.py, ~L318-324)
- Both go through MemoryClient.list_events(..., event_metadata=[...]), which sends filter={"eventMetadata":[...]} (memory/client.py, ~L861-885)
The service-side metadata filter appears to be eventually consistent: for some time after the marker events for a turn are written (we observed misses at gaps from ~150s up to ~2h), a metadata-filtered query may not return them. When read_agent() returns None, initialize() takes the "new agent" branch → create_agent() (the Created agent: default in session: ... log line) → no call to list_messages() → history is not replayed.
Crucially, list_messages() itself does not use the metadata filter — it reads raw events with a plain list_events() and is strongly consistent. So the data needed to restore is available; the manager just never reads it because the gate failed.
There is no retry and no fallback from the metadata-filtered read to the strongly-consistent unfiltered read, and the miss is silent.
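The gate described above can be illustrated with a toy model. This is only a sketch: ToyEventStore, its lagging index, and the initialize() function below are hypothetical stand-ins written for this issue, not the SDK's classes or the service's actual behavior. It shows how the same persisted data yields "new agent" or "restored" depending solely on whether the filtered read sees the marker.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEventStore:
    """Hypothetical stand-in for the Memory service: a raw event log plus a lagging metadata index."""
    events: list = field(default_factory=list)  # raw log, visible to unfiltered reads at once
    indexed: int = 0                            # number of events the metadata filter can see

    def write(self, event: dict) -> None:
        self.events.append(event)

    def catch_up(self) -> None:
        # The metadata index becomes consistent with the raw log.
        self.indexed = len(self.events)

    def list_filtered(self, state_type: str) -> list:
        # Metadata-filtered read: only sees events the index has caught up on.
        return [e for e in self.events[: self.indexed] if e.get("stateType") == state_type]

    def list_unfiltered(self) -> list:
        return list(self.events)


def initialize(store: ToyEventStore) -> tuple:
    """Mirror the gate: replay history only if the AGENT marker is found."""
    if not store.list_filtered("AGENT"):
        return "new agent", []                  # history is never read
    history = [e for e in store.list_unfiltered() if "message" in e]
    return "restored", history


store = ToyEventStore()
store.write({"stateType": "AGENT"})
store.write({"message": "Remember: my lucky number is 73."})
store.write({"message": "Noted."})

stale = initialize(store)   # index has not caught up yet
store.catch_up()
fresh = initialize(store)   # same data, consistent index
print(stale[0], len(stale[1]), "|", fresh[0], len(fresh[1]))
# prints: new agent 0 | restored 2
```

In both calls the unfiltered log holds all three events; only the visibility of the marker differs.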
Reproduction
- Use an AgentCoreMemorySessionManager with a real Memory resource. Send turn 1 in a fresh session_id (e.g. "remember my number is 73").
- Send turn 2 in the same session_id after a delay (we saw it with gaps from ~150s up to ~2h).
- Intermittently, turn 2's agent.messages is empty and the model has no memory of turn 1, while the runtime logs Created agent: default in session: <same id>.
- Query the session afterwards: the raw events (both turns) are all present, and constructing a fresh Agent(session_manager=sm) against the same session does restore the full history, confirming the data was always there and the failure was a transient read at invoke time.
Experiments we ran (to isolate it)
- Dumped the session via unfiltered list_events → all conversation events for both turns present under the same (memory, actor, session). (Rules out "data not written".)
- Ran the installed SDK's read_agent() / read_session() / list_messages() against the live session → all returned the events (after the index had caught up). (Shows the read path is correct when consistent.)
- Constructed a real Agent(session_manager=sm) exactly as our runtime does → agent.messages restored the full history. (Rules out "restore is broken".)
- Live 2-turn rapid test against the deployed runtime (seconds apart) → turn 2 recalled the fact. (Works.)
- Live 2-turn test with a 150s gap → turn 2 had no history (model: "I have no cross-conversation memory"). (Reproduced the failure.)
- Live 3rd turn (cold, ~15 min later) on the rapid session → recalled the fact. (Works.)
- Post-hoc restore of the failed (gap) session via the SDK → read_session/read_agent FOUND, list_messages returned all turns. (Proves the failure was a transient read at invoke time, not data/version.)
- Compared bedrock-agentcore 1.14.1 vs 1.15.1 → identical list_events and session_manager restore logic. (Rules out "fixed by upgrade".)
Net: 2 reproduced failures, multiple successes, same code/data/version → the only variable is the consistency of the metadata-filtered read at invoke time.
Impact
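Silent, intermittent loss of conversation context in production multi-turn agents using the documented short-term-memory integration. Hard to detect: no error or warning is emitted, and the only trace is the routine Created agent: default in session: ... log line. Not fixable by upgrading (1.15.1 has the same logic).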
Recommended fix
Restoration should not depend on an eventually-consistent metadata-filtered read. Options, in order of preference:
1. Restore messages from the strongly-consistent unfiltered read. In initialize(), after read_agent(), if it returns None but list_messages(session_id, agent_id) (unfiltered) returns a non-empty history, treat the agent as existing and replay that history instead of creating a new agent. (Decouples message restore from the flaky agent/session marker lookup.)
2. Add bounded retry with backoff to read_agent() / read_session() for the read-after-write window when a marker event is expected.
3. At minimum, make it observable: log a warning when initialize() takes the "new agent" branch while unfiltered events for the session already exist (i.e. a likely false "new session").
A minimal, behavior-preserving version of (1): when the metadata-filtered read_agent misses, fall back to the unfiltered list_messages to decide existence + restore.
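The bounded retry with backoff mentioned among the options could look roughly like the following. This is a sketch under assumptions: read_with_retry and flaky_read_agent are hypothetical names introduced here, the attempt count and delays are arbitrary, and the reader is a stand-in rather than the SDK's read_agent().

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def read_with_retry(
    read: Callable[[], Optional[T]],
    attempts: int = 4,
    base_delay: float = 0.2,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call `read` until it returns non-None or the attempts run out.

    The delay doubles after each miss: base_delay, 2*base_delay, 4*base_delay, ...
    """
    for attempt in range(attempts):
        result = read()
        if result is not None:
            return result
        if attempt < attempts - 1:              # no sleep after the final miss
            sleep(base_delay * (2 ** attempt))
    return None


# Usage with a stand-in reader that misses twice, then finds the marker.
calls = {"n": 0}


def flaky_read_agent() -> Optional[dict]:
    calls["n"] += 1
    return {"agent_id": "default"} if calls["n"] >= 3 else None


delays: list = []
found = read_with_retry(flaky_read_agent, sleep=delays.append)  # record delays instead of sleeping
print(found, calls["n"], delays)
# prints: {'agent_id': 'default'} 3 [0.2, 0.4]
```

A retry alone cannot tell a genuinely new session from a transient miss, so a brand-new session would pay the full retry budget before falling through to the "new agent" branch. That is why the option above restricts it to cases where a marker event is expected.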
Minimal reproduction script
Self-contained; exercises the exact SDK code path (a fresh AgentCoreMemorySessionManager + Agent per turn, as a stateless runtime invoke would). The failure is consistency/timing dependent, so the script loops over fresh sessions until it catches one; on failure it immediately proves the data exists via the strongly-consistent unfiltered read.
#!/usr/bin/env python3
"""Repro: AgentCoreMemorySessionManager intermittently skips history restore.

Requires: pip install "bedrock-agentcore==1.14.1" "strands-agents==1.42.0"
          AWS credentials for an account with an AgentCore Memory resource.
Usage:    python repro.py <MEMORY_ID> [region] [gap_seconds]
"""
import sys
import time
import uuid

from bedrock_agentcore.memory.integrations.strands.config import AgentCoreMemoryConfig
from bedrock_agentcore.memory.integrations.strands.session_manager import AgentCoreMemorySessionManager
from strands import Agent
from strands.models import BedrockModel

MEMORY_ID = sys.argv[1]
REGION = sys.argv[2] if len(sys.argv) > 2 else "ap-southeast-1"
GAP = int(sys.argv[3]) if len(sys.argv) > 3 else 150
ACTOR = "repro-actor"
MODEL = "apac.amazon.nova-lite-v1:0"  # any cheap in-region model


def new_sm(session_id: str) -> AgentCoreMemorySessionManager:
    cfg = AgentCoreMemoryConfig(memory_id=MEMORY_ID, session_id=session_id, actor_id=ACTOR)
    return AgentCoreMemorySessionManager(cfg, REGION)


for attempt in range(1, 21):
    sid = f"repro-{uuid.uuid4()}"  # >= 33 chars, fresh session

    # ---- turn 1: real model call so user+assistant events are persisted
    agent1 = Agent(model=BedrockModel(model_id=MODEL, region_name=REGION),
                   system_prompt="Reply in five words or fewer.",
                   session_manager=new_sm(sid), callback_handler=None)
    agent1("Remember: my lucky number is 73.")

    time.sleep(GAP)  # simulate the next user turn arriving later

    # ---- turn 2: brand-new manager + agent on the SAME session
    #      (exactly what a stateless runtime does on the next invoke)
    sm2 = new_sm(sid)
    agent2 = Agent(model=BedrockModel(model_id=MODEL, region_name=REGION),
                   system_prompt="Reply in five words or fewer.",
                   session_manager=sm2, callback_handler=None)

    restored = len(agent2.messages)                             # what initialize() replayed
    unfiltered = len(sm2.list_messages(sid, agent2.agent_id))   # strongly-consistent ground truth
    print(f"[{attempt}] session={sid} restored={restored} unfiltered={unfiltered}")

    if restored == 0 and unfiltered > 0:
        print(">>> REPRODUCED: initialize() replayed nothing, yet the unfiltered "
              f"list_messages returns {unfiltered} messages for the same session: "
              "the metadata-filtered read_agent/read_session missed the marker events.")
        break
else:
    # for/else: runs only when the loop finished without hitting `break`
    print("Not reproduced in 20 attempts; the miss window depends on service-side "
          "index consistency. Retry, vary the gap, or run at higher write rates.")
Observed signal when it hits: restored=0 unfiltered=4 (turn 1's user+assistant are on disk, but nothing was replayed), matching the production Created agent: default in session: <same id> log line on the second turn.
Suggested patch (sketch)
Untested sketch of recommendation (1) — decouple restore from the eventually-consistent marker lookup by falling back to the strongly-consistent unfiltered read inside read_agent():
# bedrock_agentcore/memory/integrations/strands/session_manager.py
# (sketch assumes NullConversationManager is imported, e.g. from strands.agent.conversation_manager)

def read_agent(self, session_id: str, agent_id: str, **kwargs: Any) -> Optional[SessionAgent]:
    agent = self._read_agent_filtered(session_id, agent_id)  # existing metadata-filtered lookup
    if agent is not None:
        return agent

    # Fallback: the metadata-filtered ListEvents is eventually consistent and can
    # miss a just-written AGENT marker. The unfiltered read is strongly consistent:
    # if conversational events exist for this session, the agent DOES exist.
    if self.list_messages(session_id, agent_id, limit=1):
        logger.warning(
            "read_agent: metadata filter returned nothing but session %s has events; "
            "treating agent %s as existing to avoid dropping history", session_id, agent_id)
        return SessionAgent(
            agent_id=agent_id,
            state={},
            conversation_manager_state=NullConversationManager().get_state(),  # or the configured default
        )
    return None
Notes: the reconstructed SessionAgent loses any persisted state / conversation_manager_state for that one turn (they re-sync on the next write), which is strictly better than silently dropping the entire conversation. The same fallback shape applies to read_session(). Alternatively (recommendation 2), a bounded retry with backoff on the filtered read also closes most of the window, at the cost of latency.
Current workaround (in our runtime)
After constructing the agent, if session_manager is attached but agent.messages is empty, we re-load history via the strongly-consistent session_manager.list_messages(session_id, agent.agent_id) and assign it to agent.messages:
agent = Agent(**agent_kwargs)  # agent_kwargs includes session_manager=sm
if sm is not None and not agent.messages:
    restored = sm.list_messages(session_id, agent.agent_id)
    if restored:
        agent.messages = [m.to_message() for m in restored]
        log.warning("Memory restore fallback hit: managed restore was empty, "
                    "re-loaded %d messages via list_messages", len(agent.messages))
Direct assignment does not enqueue writes (verified pending_message_count() unchanged), so it does not re-persist or duplicate. This fully and reliably eliminates the symptom.
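For anyone adopting the same workaround, it can be factored into a helper and checked offline. The following is a sketch only: restore_if_empty and the Fake* classes are hypothetical names introduced here, and the fakes mimic just the three attributes the workaround touches (list_messages, to_message, agent.messages). They are not the SDK's or Strands' real types.

```python
import logging
from typing import Any, Optional

log = logging.getLogger("memory_restore")


def restore_if_empty(agent: Any, sm: Optional[Any], session_id: str) -> int:
    """Re-load history via sm.list_messages() when the managed restore left
    agent.messages empty. Returns the number of messages re-loaded (0 = no-op)."""
    if sm is None or agent.messages:
        return 0
    restored = sm.list_messages(session_id, agent.agent_id)
    if not restored:
        return 0                                # genuinely new session
    agent.messages = [m.to_message() for m in restored]
    log.warning("Memory restore fallback hit: re-loaded %d messages", len(agent.messages))
    return len(agent.messages)


# --- Hypothetical stand-ins, for offline checking only ---
class FakeSessionMessage:
    def __init__(self, role: str, text: str) -> None:
        self.role, self.text = role, text

    def to_message(self) -> dict:
        return {"role": self.role, "content": [{"text": self.text}]}


class FakeSessionManager:
    def __init__(self, stored: list) -> None:
        self.stored = stored

    def list_messages(self, session_id: str, agent_id: str) -> list:
        return list(self.stored)


class FakeAgent:
    agent_id = "default"

    def __init__(self) -> None:
        self.messages: list = []


sm = FakeSessionManager([FakeSessionMessage("user", "My lucky number is 73."),
                         FakeSessionMessage("assistant", "Noted.")])
agent = FakeAgent()
first = restore_if_empty(agent, sm, "session-1")    # empty, so the fallback runs
second = restore_if_empty(agent, sm, "session-1")   # already populated, so it is a no-op
print(first, second, len(agent.messages))
# prints: 2 0 2
```

The second call returning 0 is the property that matters: the helper never overwrites a history that the managed restore already populated.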