improvement(mcp): bound MCP memory and lifecycle concurrency#4751
improvement(mcp): bound MCP memory and lifecycle concurrency#4751icecrasher321 wants to merge 2 commits into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
PR SummaryHigh Risk Overview The workflow MCP serve route ( Workflow execute applies the same body limits, treats only trusted internal JWT + bridge headers as MCP bridge traffic (ignoring spoofed bridge headers on API keys), rejects large inline MCP outputs without large-value refs, and improves client-cancel handling for sync and SSE paths. Management MCP routes use shared Reviewed by Cursor Bugbot for commit 54af13e. Configure here. |
Greptile SummaryThis PR addresses high memory consumption on ECS tasks caused by unbounded MCP tool metadata and connection lifecycles. It introduces a comprehensive set of memory and concurrency controls across the MCP subsystem.
Confidence Score: 4/5Safe to merge; all findings are non-blocking design notes rather than correctness failures in the happy path. The core locking and budget enforcement logic is well-structured and thoroughly tested. The two areas needing attention — a stale nextCursor derived from a pre-flight size query rather than actual returned data, and silent skipping of tools on servers added after lock acquisition — are edge cases bounded by MAX_MCP_TOOLS_PER_SERVER enforcement and short transaction windows. apps/sim/app/api/mcp/serve/[serverId]/route.ts (two-phase query and nextCursor derivation) and apps/sim/lib/mcp/workflow-mcp-sync.ts (silent skip for tools on servers added after lock acquisition) Important Files Changed
Sequence DiagramsequenceDiagram
participant Client as MCP Client
participant Serve as serve/[serverId]/route
participant Lifecycle as workflow-mcp-lifecycle
participant Sync as workflow-mcp-sync
participant Locks as server-locks
participant DB as Postgres
Client->>Serve: POST tools/list (cursor?)
Serve->>DB: SELECT sizes WHERE serverId + pageCondition (query 1)
DB-->>Serve: tool sizes (pre-flight budget check)
Serve->>DB: SELECT tool data WHERE serverId + pageCondition (query 2)
DB-->>Serve: tool rows
Serve-->>Client: ListToolsResult + nextCursor (from query 1)
Client->>Serve: POST tools/call
Serve->>DB: SELECT tool by name
Serve->>DB: POST /api/workflows/:id/execute
DB-->>Serve: execution result
Serve-->>Client: CallToolResult
Note over Lifecycle,DB: Create/Update/Delete tool
Lifecycle->>DB: BEGIN TRANSACTION
Lifecycle->>Locks: SET LOCAL lock_timeout
Lifecycle->>DB: SELECT workflow FOR UPDATE
Lifecycle->>Locks: pg_advisory_xact_lock(serverId hash)
Lifecycle->>DB: validateServerToolMetadataBudget
Lifecycle->>DB: INSERT/UPDATE/DELETE workflowMcpTool
Lifecycle->>DB: COMMIT
Note over Sync,DB: Schema sync on deploy
Sync->>DB: BEGIN TRANSACTION
Sync->>DB: collectWorkflowMcpToolServerIds (paginated)
loop per server
Sync->>Locks: pg_advisory_xact_lock(serverId hash)
end
Sync->>DB: getMcpServerToolMetadataUsageRows per server
loop per tool page
Sync->>DB: UPDATE workflowMcpTool SET parameterSchema (budget-aware)
end
Sync->>DB: COMMIT
Reviews (1): Last reviewed commit: "update db mock" | Re-trigger Greptile |
| @@ -302,10 +624,32 @@ async function handleToolsList(id: RequestId, serverId: string): Promise<NextRes | |||
| }, | |||
| } | |||
| }), | |||
| ...(nextCursor && { nextCursor }), | |||
| } | |||
|
|
|||
| return NextResponse.json(createResponse(id, result)) | |||
| return createJsonRpcResponseWithLimit( | |||
| id, | |||
| result, | |||
| MAX_MCP_TOOLS_LIST_RESPONSE_BYTES, | |||
| 'MCP tools/list response' | |||
| ) | |||
There was a problem hiding this comment.
Two-phase DB query creates TOCTOU gap in pagination cursor
handleToolsList performs two separate DB queries with the same filter: the first fetches sizes for the pre-flight budget check and derives nextCursor, while the second fetches the full tool data. If tools are inserted or deleted between queries, pageTools (second query) can include different rows than pageSizes (first query). In the insertion case, rows added between queries will appear in the second query but weren't counted in the first, so pageTools.slice(0, MAX_MCP_TOOLS_LIST_COUNT) could silently drop the last tool of the intended page, and since nextCursor = undefined was determined by the first query, the client never learns there are more results. Given that MAX_MCP_TOOLS_PER_SERVER is 100, this window is small, but deriving nextCursor from the second query's results (the data actually returned) would be safer.
|
|
||
| for (const [serverId, serverTools] of [...toolsByServer].sort(([left], [right]) => | ||
| left.localeCompare(right) | ||
| )) { | ||
| const usageState = usageStateByServer.get(serverId) | ||
| if (!usageState) continue |
There was a problem hiding this comment.
Silent skip for tools on servers added after lock acquisition
collectWorkflowMcpToolServerIds is called first to collect server IDs and acquire advisory locks; usageStateByServer is then populated for exactly those servers. The subsequent paged sync loop checks const usageState = usageStateByServer.get(serverId) and continues if absent. Because Postgres READ COMMITTED gives each statement a fresh snapshot, a tool on a newly-added server could appear in the sync loop that wasn't present during lock collection, causing its schema to be silently skipped this sync cycle. The tool will be corrected on the next sync, but the miss is not logged — making it hard to diagnose stale schemas when debugging.
| export async function setWorkflowMcpTransactionLockTimeout(tx: DbOrTx): Promise<void> { | ||
| await tx.execute(sql.raw(`SET LOCAL lock_timeout = '${MCP_SERVER_LOCK_TIMEOUT_MS}ms'`)) | ||
| } |
There was a problem hiding this comment.
sql.raw with a template literal for a constant value
SET LOCAL lock_timeout cannot be parameterized in Postgres so sql.raw is technically correct here, but using a JS template literal with sql.raw is generally an antipattern in Drizzle ORMs. Since MCP_SERVER_LOCK_TIMEOUT_MS is a numeric constant there is no injection risk today, but the pattern makes it easy to accidentally introduce one if this is later refactored to accept a dynamic timeout.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
Summary
MCP memory load caused high memory on ecs task and almost killed it
Enforce memory bounds similar to rest of platform
Type of Change
Testing
Tested manually
Checklist