improvement(mcp): bound MCP memory and lifecycle concurrency#4751

Open
icecrasher321 wants to merge 2 commits into staging from improvement/mcp-mem-perf

Conversation

@icecrasher321
Collaborator

Summary

MCP memory load caused high memory usage on an ECS task and almost killed it.

Enforce memory bounds similar to the rest of the platform.

Type of Change

  • Other: Performance

Testing

Tested manually

Checklist

  • Code follows project style guidelines
  • Self-reviewed my changes
  • Tests added/updated and passing
  • No new warnings introduced
  • I confirm that I have read and agree to the terms outlined in the Contributor License Agreement (CLA)

@vercel

vercel Bot commented May 27, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| docs | Skipped | Skipped | May 27, 2026 3:02am |

@cursor

cursor Bot commented May 27, 2026

PR Summary

High Risk
Touches authentication bridging, workflow execute, and large payload paths across MCP serve and management APIs. Mistakes could break tool calls or leak across trust boundaries.

Overview
This PR hardens MCP and workflow execution against unbounded memory and races by adding 10MB-class limits on request/response bodies, tool metadata, and serialized tool results, with streaming reads that cancel on oversize or client disconnect (413 / 499).
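The "streaming reads that cancel on oversize" behavior can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `readStreamWithLimit` and `PayloadTooLargeError` are hypothetical names, and a real route handler would presumably translate the error into the 413 response mentioned above.

```typescript
// Hypothetical error type; the PR's helpers may signal oversize differently.
class PayloadTooLargeError extends Error {
  limitBytes: number

  constructor(limitBytes: number) {
    super(`Body exceeded ${limitBytes} bytes`)
    this.name = 'PayloadTooLargeError'
    this.limitBytes = limitBytes
  }
}

// Reads a byte stream chunk by chunk and stops as soon as the running
// total passes limitBytes, so an oversized body is never fully buffered.
async function readStreamWithLimit(
  stream: ReadableStream<Uint8Array>,
  limitBytes: number
): Promise<Uint8Array> {
  const reader = stream.getReader()
  const chunks: Uint8Array[] = []
  let total = 0
  try {
    while (true) {
      const { done, value } = await reader.read()
      if (done) break
      total += value.byteLength
      if (total > limitBytes) {
        // Cancel the source so the producer stops sending data.
        await reader.cancel('payload too large')
        throw new PayloadTooLargeError(limitBytes)
      }
      chunks.push(value)
    }
  } finally {
    reader.releaseLock()
  }
  // Concatenate the accepted chunks into one buffer.
  const out = new Uint8Array(total)
  let offset = 0
  for (const chunk of chunks) {
    out.set(chunk, offset)
    offset += chunk.byteLength
  }
  return out
}
```

The limit is checked before a chunk is stored, so peak memory stays near the limit plus one chunk rather than growing with the sender's payload.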

The workflow MCP serve route (/api/mcp/serve/[serverId]) gains bounded JSON-RPC parsing, paginated tools/list, duplicate tool name detection (409), clearer transport/auth behavior (public GET metadata, SSE 405, DELETE always authenticated), and internal JWT bridge calls to workflow execute instead of forwarding client API keys. tools/call propagates upstream HTTP statuses, preserves falsy outputs, and aborts in-flight workflow fetches when the MCP client disconnects.

Workflow execute applies the same body limits, treats only trusted internal JWT + bridge headers as MCP bridge traffic (ignoring spoofed bridge headers on API keys), rejects large inline MCP outputs without large-value refs, and improves client-cancel handling for sync and SSE paths.

Management MCP routes use shared readMcpJsonBodyWithLimit / mcpBodyReadErrorResponse; refresh discovery is capped and concurrency-limited. Workflow MCP lifecycle adds per-server/per-workflow tool counts, metadata budgets, transactional advisory locks, and stricter validation on create/update/delete. Connection manager adds connect timeouts, jittered reconnect, and fixes cleanup so intentional disconnects do not reconnect. Execution payload compaction can reject oversized values instead of spilling when configured for MCP responses.
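As an illustration of the "jittered reconnect" mentioned above, here is a small sketch of exponential backoff with full jitter. The function name, option names, and the numbers in the usage note are assumptions for the example; the thread does not state the values or the jitter strategy the connection manager actually uses.

```typescript
// Hypothetical option names; not taken from connection-manager.ts.
interface BackoffOptions {
  baseMs: number
  maxMs: number
  maxAttempts: number
}

// Returns the delay before reconnect attempt `attempt` (0-based),
// or null once the attempt budget is used up.
function reconnectDelayMs(
  attempt: number,
  options: BackoffOptions,
  random: () => number = Math.random
): number | null {
  if (attempt >= options.maxAttempts) return null
  // The ceiling doubles per attempt and is capped at maxMs.
  const ceiling = Math.min(options.maxMs, options.baseMs * 2 ** attempt)
  // Full jitter: pick uniformly in [0, ceiling) so clients do not retry in lockstep.
  return Math.floor(random() * ceiling)
}
```

With `baseMs: 1000`, `maxMs: 30000`, and `maxAttempts: 10`, the ceiling grows 1 s, 2 s, 4 s, and so on up to 30 s, and the tenth call onward returns `null`. Injecting `random` keeps the function deterministic under test.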

Reviewed by Cursor Bugbot for commit 54af13e.

@greptile-apps
Contributor

greptile-apps Bot commented May 27, 2026

Greptile Summary

This PR addresses high memory consumption on ECS tasks caused by unbounded MCP tool metadata and connection lifecycles. It introduces a comprehensive set of memory and concurrency controls across the MCP subsystem.

  • Memory bounds: New constants.ts defines hard limits (10 MB total metadata, 2 MB parameter schemas, 64 KB descriptions, 256 B tool names per server); tool-limits.ts enforces these at write time and during schema sync; middleware.ts caps management request bodies at 10 MB.
  • Lifecycle concurrency: server-locks.ts adds Postgres advisory locking around all workflow MCP tool mutations; workflow-mcp-lifecycle.ts is a new orchestration layer that validates budgets, acquires locks, and does optimistic-concurrency snapshot checks before committing.
  • Connection manager: connection-manager.ts caps persistent connections at 50, evicts idle connections after 30 minutes, and retries disconnects with exponential backoff.
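The write-time budget checks in the first bullet might look something like the sketch below. The limit values come from the summary above, but everything else is an assumption: the names (`LIMITS`, `checkToolMetadata`), the reading of "MB" and "KB" as binary units, and the choice to measure the schema as serialized JSON are illustrative and not taken from tool-limits.ts.

```typescript
interface ToolMetadata {
  name: string
  description: string
  parameterSchema: unknown
}

// Values from the summary; binary units are an assumption.
const LIMITS = {
  totalMetadataBytes: 10 * 1024 * 1024,
  parameterSchemaBytes: 2 * 1024 * 1024,
  descriptionBytes: 64 * 1024,
  toolNameBytes: 256,
} as const

const encoder = new TextEncoder()

// Size in UTF-8 bytes, which can exceed the string's character count.
function utf8Bytes(text: string): number {
  return encoder.encode(text).byteLength
}

type BudgetResult = { ok: true; bytes: number } | { ok: false; reason: string }

// Checks per-field limits first, then the per-server total.
function checkToolMetadata(tool: ToolMetadata, serverUsedBytes: number): BudgetResult {
  const nameBytes = utf8Bytes(tool.name)
  if (nameBytes > LIMITS.toolNameBytes) return { ok: false, reason: 'tool name too large' }

  const descriptionBytes = utf8Bytes(tool.description)
  if (descriptionBytes > LIMITS.descriptionBytes) {
    return { ok: false, reason: 'description too large' }
  }

  const schemaBytes = utf8Bytes(JSON.stringify(tool.parameterSchema ?? {}))
  if (schemaBytes > LIMITS.parameterSchemaBytes) {
    return { ok: false, reason: 'parameter schema too large' }
  }

  const bytes = nameBytes + descriptionBytes + schemaBytes
  if (serverUsedBytes + bytes > LIMITS.totalMetadataBytes) {
    return { ok: false, reason: 'server metadata budget exceeded' }
  }
  return { ok: true, bytes }
}
```

Measuring bytes rather than characters matters for non-ASCII names: a 129-character name made of two-byte characters already exceeds a 256-byte cap.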

Confidence Score: 4/5

Safe to merge; all findings are non-blocking design notes rather than correctness failures in the happy path.

The core locking and budget enforcement logic is well-structured and thoroughly tested. The two areas needing attention — a stale nextCursor derived from a pre-flight size query rather than actual returned data, and silent skipping of tools on servers added after lock acquisition — are edge cases bounded by MAX_MCP_TOOLS_PER_SERVER enforcement and short transaction windows.

apps/sim/app/api/mcp/serve/[serverId]/route.ts (two-phase query and nextCursor derivation) and apps/sim/lib/mcp/workflow-mcp-sync.ts (silent skip for tools on servers added after lock acquisition)

Important Files Changed

| Filename | Overview |
| --- | --- |
| apps/sim/app/api/mcp/serve/[serverId]/route.ts | Adds size-bounded tools/list pagination and DELETE endpoint; two-phase DB query creates a TOCTOU gap where nextCursor can be stale relative to actual response data |
| apps/sim/lib/mcp/workflow-mcp-sync.ts | Adds paged sync with per-server metadata budget enforcement; tools on servers added after lock acquisition are silently skipped without logging |
| apps/sim/lib/mcp/orchestration/workflow-mcp-lifecycle.ts | New orchestration layer with advisory locking, snapshot comparison, and tool metadata budget enforcement on create/update/delete; logic is well-guarded |
| apps/sim/lib/mcp/server-locks.ts | New advisory lock helper using pg_advisory_xact_lock with lock_timeout via sql.raw template literal; safe for current constant but antipattern |
| apps/sim/lib/mcp/tool-limits.ts | New utility for per-tool and per-server metadata byte budget calculations; logic is clean with proper DB-side and application-side size checks |
| apps/sim/lib/mcp/constants.ts | New file establishing memory limit constants for MCP tools (10 MB metadata, 2 MB schemas, 256 B names, 64 KB descriptions) |
| apps/sim/lib/mcp/connection-manager.ts | New persistent-connection manager with 50-connection cap, idle-timeout eviction (30 min), exponential backoff reconnect (10 attempts), and pub/sub integration |
| apps/sim/lib/mcp/middleware.ts | Adds 10 MB request body limit to MCP management endpoints with proper abort/payload-too-large/invalid-JSON error handling |
| apps/sim/lib/core/utils/stream-limits.ts | Extended with readResponseToBufferWithLimit and readResponseJsonWithLimit helpers; abort and size-limit paths are well handled |

Sequence Diagram

sequenceDiagram
    participant Client as MCP Client
    participant Serve as serve/[serverId]/route
    participant Lifecycle as workflow-mcp-lifecycle
    participant Sync as workflow-mcp-sync
    participant Locks as server-locks
    participant DB as Postgres

    Client->>Serve: POST tools/list (cursor?)
    Serve->>DB: SELECT sizes WHERE serverId + pageCondition (query 1)
    DB-->>Serve: tool sizes (pre-flight budget check)
    Serve->>DB: SELECT tool data WHERE serverId + pageCondition (query 2)
    DB-->>Serve: tool rows
    Serve-->>Client: ListToolsResult + nextCursor (from query 1)

    Client->>Serve: POST tools/call
    Serve->>DB: SELECT tool by name
    Serve->>DB: POST /api/workflows/:id/execute
    DB-->>Serve: execution result
    Serve-->>Client: CallToolResult

    Note over Lifecycle,DB: Create/Update/Delete tool
    Lifecycle->>DB: BEGIN TRANSACTION
    Lifecycle->>Locks: SET LOCAL lock_timeout
    Lifecycle->>DB: SELECT workflow FOR UPDATE
    Lifecycle->>Locks: pg_advisory_xact_lock(serverId hash)
    Lifecycle->>DB: validateServerToolMetadataBudget
    Lifecycle->>DB: INSERT/UPDATE/DELETE workflowMcpTool
    Lifecycle->>DB: COMMIT

    Note over Sync,DB: Schema sync on deploy
    Sync->>DB: BEGIN TRANSACTION
    Sync->>DB: collectWorkflowMcpToolServerIds (paginated)
    loop per server
        Sync->>Locks: pg_advisory_xact_lock(serverId hash)
    end
    Sync->>DB: getMcpServerToolMetadataUsageRows per server
    loop per tool page
        Sync->>DB: UPDATE workflowMcpTool SET parameterSchema (budget-aware)
    end
    Sync->>DB: COMMIT

Reviews (1): Last reviewed commit: "update db mock"

Comment on lines 530 to +635
@@ -302,10 +624,32 @@ async function handleToolsList(id: RequestId, serverId: string): Promise<NextRes
},
}
}),
...(nextCursor && { nextCursor }),
}

return NextResponse.json(createResponse(id, result))
return createJsonRpcResponseWithLimit(
id,
result,
MAX_MCP_TOOLS_LIST_RESPONSE_BYTES,
'MCP tools/list response'
)

P2 Two-phase DB query creates TOCTOU gap in pagination cursor

handleToolsList performs two separate DB queries with the same filter: the first fetches sizes for the pre-flight budget check and derives nextCursor, while the second fetches the full tool data. If tools are inserted or deleted between queries, pageTools (second query) can include different rows than pageSizes (first query). In the insertion case, rows added between queries will appear in the second query but weren't counted in the first, so pageTools.slice(0, MAX_MCP_TOOLS_LIST_COUNT) could silently drop the last tool of the intended page, and since nextCursor = undefined was determined by the first query, the client never learns there are more results. Given that MAX_MCP_TOOLS_PER_SERVER is 100, this window is small, but deriving nextCursor from the second query's results (the data actually returned) would be safer.
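One way to read the suggestion is to derive both the page and the cursor from a single result set. The sketch below assumes the query fetches `pageSize + 1` rows ordered by a stable key and uses the last returned row's `id` as the cursor. `buildPageFromRows` and the `id`-based cursor are hypothetical, and the sketch does not cover the pre-flight size check that motivates the first query in the PR.

```typescript
interface Page<T> {
  items: T[]
  nextCursor?: string
}

// `rows` is expected to hold up to pageSize + 1 rows from ONE query,
// ordered by id. The extra row only signals that another page exists.
function buildPageFromRows<T extends { id: string }>(rows: T[], pageSize: number): Page<T> {
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError('pageSize must be a positive integer')
  }
  const items = rows.slice(0, pageSize)
  const last = items[items.length - 1]
  // The cursor comes from the data actually returned, so it cannot
  // disagree with the page contents.
  if (rows.length > pageSize && last !== undefined) {
    return { items, nextCursor: last.id }
  }
  return { items }
}
```

Because the cursor and the items share one snapshot, a row inserted between requests shows up on a later page instead of being dropped without a cursor pointing to it.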

Comment on lines +220 to +225

for (const [serverId, serverTools] of [...toolsByServer].sort(([left], [right]) =>
  left.localeCompare(right)
)) {
  const usageState = usageStateByServer.get(serverId)
  if (!usageState) continue

P2 Silent skip for tools on servers added after lock acquisition

collectWorkflowMcpToolServerIds is called first to collect server IDs and acquire advisory locks; usageStateByServer is then populated for exactly those servers. The subsequent paged sync loop checks const usageState = usageStateByServer.get(serverId) and continues if absent. Because Postgres READ COMMITTED gives each statement a fresh snapshot, a tool on a newly-added server could appear in the sync loop that wasn't present during lock collection, causing its schema to be silently skipped this sync cycle. The tool will be corrected on the next sync, but the miss is not logged — making it hard to diagnose stale schemas when debugging.
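A minimal sketch of making the skip visible, assuming a logger-like callback is available. `toolsByServer` and `usageStateByServer` follow the names in the snippet above; `UsageState`, `WarnFn`, `syncServersInOrder`, and the log message are hypothetical, and the real loop body (the budget-aware schema update) is reduced to recording the server ID.

```typescript
// Hypothetical shape; the real usage state is not shown in the thread.
interface UsageState {
  usedBytes: number
}

type WarnFn = (message: string, meta: Record<string, unknown>) => void

// Iterates servers in a stable order and logs any server that has tools
// but no usage state (i.e. no lock was taken for it in this transaction).
function syncServersInOrder(
  toolsByServer: Map<string, string[]>,
  usageStateByServer: Map<string, UsageState>,
  warn: WarnFn
): string[] {
  const synced: string[] = []
  for (const [serverId, serverTools] of [...toolsByServer].sort(([left], [right]) =>
    left.localeCompare(right)
  )) {
    const usageState = usageStateByServer.get(serverId)
    if (!usageState) {
      warn('Skipping MCP schema sync for server without a held lock', {
        serverId,
        toolCount: serverTools.length,
      })
      continue
    }
    // Placeholder for the real per-server sync work.
    synced.push(serverId)
  }
  return synced
}
```

The behavior is unchanged (the server is still deferred to the next sync), but the skipped server ID and tool count now appear in logs.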

Comment on lines +8 to +10
export async function setWorkflowMcpTransactionLockTimeout(tx: DbOrTx): Promise<void> {
  await tx.execute(sql.raw(`SET LOCAL lock_timeout = '${MCP_SERVER_LOCK_TIMEOUT_MS}ms'`))
}

P2 sql.raw with a template literal for a constant value

SET LOCAL lock_timeout cannot be parameterized in Postgres, so sql.raw is technically correct here, but interpolating a JS template literal into sql.raw is generally an antipattern with Drizzle ORM. Since MCP_SERVER_LOCK_TIMEOUT_MS is a numeric constant, there is no injection risk today, but the pattern makes it easy to accidentally introduce one if this is later refactored to accept a dynamic timeout.
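One possible guard, sketched under the assumption that the timeout stays in milliseconds: validate the value as a non-negative integer before it is interpolated, so a later refactor to a dynamic timeout cannot smuggle SQL into the statement. `buildLockTimeoutStatement` is a hypothetical helper, not part of the PR.

```typescript
// Builds the SET LOCAL statement only from a validated integer.
// The returned string is what would be handed to sql.raw.
function buildLockTimeoutStatement(timeoutMs: number): string {
  if (!Number.isSafeInteger(timeoutMs) || timeoutMs < 0) {
    throw new RangeError(`lock_timeout must be a non-negative integer, got ${String(timeoutMs)}`)
  }
  return `SET LOCAL lock_timeout = '${timeoutMs}ms'`
}
```

For example, `buildLockTimeoutStatement(5000)` yields `SET LOCAL lock_timeout = '5000ms'`, while `NaN`, fractions, and negative values throw. Postgres also provides `set_config(setting_name, new_value, is_local)`, which takes the value as an ordinary function argument; whether that fits this codebase better is a judgment call the thread does not settle.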
