Multi-agent framework for Claude Code: 58 agents, 46 methodology skills, 16 + 3 Python orchestration scripts, 2 Workflow orchestration engines + 2 LangGraph human-gate engines, tier-aware acceptance (S/M/L), filesystem-isolated adversary review, cross-family second opinion via Codex MCP, human as supreme judge at critical transitions.
v0.3 (2026-06-05): engagement orchestration unified under the engagement-workflow Workflow. The main loop conducts a single pre-gate cascade (plan → deliver in isolated git-worktree waves → validate → handoff → gate) and stops at the handoff seam; the LangGraph human-gate (consilium → directive → manager) remains the acceptance path after the seam. Domain leads are planning-only; specialist coordination is structural: waves in the lead's plan. See CHANGELOG.md.

v0.2.4 (2026-05-28): Windows compatibility. Three latent issues surfaced on Max-subscription claude CLI: the claude.CMD npm-wrapper truncates multiline argv at the first newline (CMD line-parsing), subprocess.run(text=True) decodes UTF-8 Russian as cp1251 on Russian-locale Windows, and the consilium_synth_completed ledger emit was passing raw natural verdict to a schema expecting ACCEPT/REJECT/DIRECTED. All three fixed across 4 scripts (find_claude_cmd() resolves .CMD → claude.exe; 10 subprocess sites got encoding="utf-8", errors="replace"; inline VERDICT_MAP mirror in _make_finalize_node). All --invoker mock tests passed pre-fix; the latent risk lived in real subscription mode, untested on Windows until now.

v0.2.3 (2026-05-28): engagement_lg.py end-to-end across all 11 nodes in three execution modes. The new --mock mode runs the real graph paths but with canned-artefact subprocess wrappers: full end-to-end smoke testing without the claude CLI. Send fan-out to specialists, validator_lg.py + adversary_lg.py subprocess integration, claude -p --agent {domain}-manager for acceptance, REJECT_NOW short-circuit, engagement-archive on ACCEPT. 7 end-to-end smoke paths verified on synthetic engagements (S/M/L tiers + REJECT loop + REJECT terminal + dry-run + claude-CLI-absent fail-fast).

v0.2.2 (2026-05-28): modular precheck refactor (handoff-precheck.py 1264 → 423 lines + new scripts/lib/precheck/ package, 8 topic modules) and the engagement_lg.py skeleton (3rd LangGraph engine owning the engagement-level lifecycle from intake to archive, EngagementState with 8 node placeholders, 3 HITL pause points, intake/plan nodes wired to size-detect.py --auto-promote + claude -p --agent {domain}-lead subprocess). 3 new ledger payload types. WHITELIST drift fix.

v0.2.1 (2026-05-28): refinement release. Adversary per-role ledger events (consilium_started / consilium_role_completed), SkillOpt golden-set parity across all 3 domains (dev/design/marketing, 9 scenarios), hot-path optimization via references/ split in 3 heavily-loaded skills (engagement-protocol / ui-ux-methodology / dev-methodology, −572 lines per engagement load).

v0.2 (2026-05-28): acceptor/optimizer split: *-manager per-engagement acceptor + *-director system-optimizer (SkillOpt loop). Authority invariant, event ledger (engagement/events.jsonl), canonical validator schema, per-engagement reflections. See CHANGELOG.md for the full delta.
Multi-agent pipelines on a single model family suffer from three systemic failure modes:
| Problem | What goes wrong | How the system handles it |
|---|---|---|
| Framing contamination | The same Claude across multiple roles shares the same blind spots | Adversary runs in a fresh subprocess with a filesystem-curated view — sees only what an external process places there |
| Goodhart on validators | Validators degenerate into format-gates, checking fields instead of thinking | Tier-aware dispatch + cross-family second opinion via Codex (different model lineage = different blind spots) |
| Undifferentiated rigour | A button tweak and a landing redesign go through the same pipeline | S — light human-glance; M — adversary + judge; L — consilium of 5 reviewers + cross-family adjudication |
flowchart TB
H["Human layer<br/>Trigger phrase + supreme judge on M/L acceptance + SkillOpt commons-maintainer"]
A["Agents layer · 58 agents<br/>managers / directors / leads / specialists / validators"]
S["Skills layer · 46 skills<br/>methodologies, protocols, tool guides"]
O["Orchestration layer · 14 + 3 Python scripts<br/>mechanical gates, adversary, consilium, archival, event ledger"]
St["State layer<br/>engagement/ directory · whitelist · append-only logs"]
H <--> A
A <--> S
A <--> O
O <--> St
A <--> St
classDef human fill:#fef3c7,stroke:#d97706,color:#000
classDef agents fill:#dbeafe,stroke:#2563eb,color:#000
classDef skills fill:#dcfce7,stroke:#16a34a,color:#000
classDef orch fill:#fce7f3,stroke:#db2777,color:#000
classDef state fill:#e9d5ff,stroke:#9333ea,color:#000
class H human
class A agents
class S skills
class O orch
class St state
Each layer has a clear scope of responsibility. Layers don't substitute for each other: agents don't write scripts, scripts don't make judgments, humans don't do routine validation.
For a detailed description of each layer and their interactions, see ARCHITECTURE.md.
Tier-aware acceptance. Each engagement is classified at intake into one of three tiers:
| Tier | Use case | Adversary | Manager (acceptor) | Mechanical checks |
|---|---|---|---|---|
| S | Hotfix, button tweak, single deliverable | None — human glance | None | 6 |
| M | Feature, landing, dashboard, multi-specialist | 1× peer-opus | Judge mode | 13 |
| L | Rebrand, multi-wave, cross-domain | 5× consilium | Judge + adjudication | 21 |
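The tier table can be read as a small lookup structure. The sketch below is illustrative only: `TierPolicy`, `TIER_POLICIES` and `policy_for` are hypothetical names, not taken from the repository, and only the numbers come from the table above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierPolicy:
    """Hypothetical container for one row of the tier table."""
    adversary_reviewers: int      # 0 means "human glance only"
    manager_mode: Optional[str]   # None means "no manager phase"
    mechanical_checks: int


# Values copied from the tier table; the mode labels are assumed spellings.
TIER_POLICIES = {
    "S": TierPolicy(adversary_reviewers=0, manager_mode=None, mechanical_checks=6),
    "M": TierPolicy(adversary_reviewers=1, manager_mode="judge", mechanical_checks=13),
    "L": TierPolicy(adversary_reviewers=5, manager_mode="judge+adjudication", mechanical_checks=21),
}


def policy_for(tier: str) -> TierPolicy:
    """Return the policy for a tier letter, case-insensitively."""
    try:
        return TIER_POLICIES[tier.upper()]
    except KeyError:
        raise ValueError(f"unknown tier: {tier!r}") from None
```

Keeping the policy in one frozen table means a dispatcher can branch on data rather than on scattered `if tier == ...` checks.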
Adversary in filesystem-isolated subprocess. Two-pass design against framing contamination:
- Pass 1 (Blind). Adversary sees a curated copy of engagement/ without handoff.md, without acceptance-log, without other reviewers. Forms preliminary findings without contamination.
- Pass 2 (Informed). Adversary receives full state plus its own preliminary findings injected via prompt. Confirms, refines, or retracts findings. Delta preliminary→final is a contamination signal.
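One way to picture the two passes is a filtered directory copy followed by a set comparison. This is a minimal sketch, not the logic of adversary_lg.py: the function names and the exact list of hidden files (`acceptance-log.md` in particular) are assumptions.

```python
import shutil
import tempfile
from pathlib import Path

# Assumed file names; the text names handoff.md and the acceptance log only.
HIDDEN_FROM_BLIND_PASS = ("handoff.md", "acceptance-log.md")


def build_blind_view(engagement_dir: Path) -> Path:
    """Copy engagement_dir into a fresh temp directory, omitting hidden files."""
    view_root = Path(tempfile.mkdtemp(prefix="blind-view-"))
    target = view_root / "engagement"
    shutil.copytree(
        engagement_dir,
        target,
        ignore=shutil.ignore_patterns(*HIDDEN_FROM_BLIND_PASS),
    )
    return target


def finding_delta(preliminary: set, final: set) -> dict:
    """Compare pass-1 and pass-2 finding ids; large deltas hint at contamination."""
    return {
        "retracted": preliminary - final,
        "added": final - preliminary,
        "kept": preliminary & final,
    }
```

The blind pass would then run with the returned directory as its only view of the engagement, so isolation comes from what is on disk rather than from prompt instructions.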
L-tier consilium. 5 reviewers in parallel: Anthropic Opus + 2× OpenAI GPT-5 (Codex) + Anthropic Sonnet + Anthropic Haiku. Cross-family disagreements are detected automatically and flagged for manual review.
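Automatic detection of cross-family disagreement could look roughly like the following. The review schema (`family`, `finding_id`, `verdict`) and the rule "families differ if their verdict sets differ" are assumptions made for illustration; the repository may define disagreement differently.

```python
from collections import defaultdict


def cross_family_disagreements(reviews):
    """Return finding ids on which model families reached different verdicts.

    reviews: list of dicts with hypothetical keys 'family', 'finding_id', 'verdict'.
    """
    by_finding = defaultdict(lambda: defaultdict(set))
    for review in reviews:
        by_finding[review["finding_id"]][review["family"]].add(review["verdict"])

    flagged = []
    for finding_id, families in by_finding.items():
        # One frozenset of verdicts per family; more than one distinct set
        # means at least two families saw the finding differently.
        verdict_sets = {frozenset(verdicts) for verdicts in families.values()}
        if len(families) > 1 and len(verdict_sets) > 1:
            flagged.append(finding_id)
    return sorted(flagged)
```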
Manager as judge, not sweep-runner. On M/L the manager (per-engagement
acceptor — *-manager agent, ex-director) issues a verdict per directive
with explicit adjudication on every disagreement between adversary and
author. Doesn't dispatch, doesn't edit content, doesn't re-run validators.
Adjudication completeness is enforced mechanically — every finding must
have a decision marker.
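A mechanical completeness check of this kind can be sketched as below. The finding-id format (`F-001`) and the marker vocabulary are hypothetical; the actual format consumed by director-verdict-check.py is not described here.

```python
import re

# Hypothetical marker vocabulary and id format, for illustration only.
DECISION_MARKERS = ("UPHELD", "DISMISSED", "DEFERRED")
_DECISION_RE = re.compile(
    r"^\s*(?P<id>F-\d+)\s*:\s*(?P<marker>[A-Z_]+)\b", re.MULTILINE
)


def missing_adjudications(finding_ids, verdict_text):
    """Return finding ids that have no recognised decision marker in the verdict."""
    decided = {
        match.group("id")
        for match in _DECISION_RE.finditer(verdict_text)
        if match.group("marker") in DECISION_MARKERS
    }
    return sorted(set(finding_ids) - decided)
```

A gate script built on this would exit non-zero whenever the returned list is non-empty, which keeps the check independent of any model judgment.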
Director as system-optimizer (out-of-band). The *-director role
(repurposed in v0.2) runs a SkillOpt-style skill-evolution loop on
accumulated REJECT / rework signals from skill-evolution-log.md. Fires
only at ≥3 same-class signals clustered by target × class
(rule_missing / rule_wrong / rule_ignored). Cycle:
- Reflect. Director clusters manager-emitted signals by target + class, reads skill-rejected-edits.md (negative memory).
- Codex proposes bounded edits. Cross-family (kills defend-bias); budget L: 4–6 patches per cycle, ≤10 lines each.
- Golden-set gate. Director verifies the edit doesn't regress any scenario in system-optimization-protocol/golden/{domain}/ (3 scenarios per domain × 3 domains = 9 total).
- Promote or reject. Passing edits land in the corpus; rejected edits append to skill-rejected-edits.md with reason (read before next cycle).
Judge-only — never authors edits itself. Never per-engagement. The human is commons-maintainer for cross-domain promotions.
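The firing rule (at least 3 same-class signals for the same target) reduces to counting pairs. A minimal sketch, assuming signals have already been parsed from skill-evolution-log.md into `(target, signal_class)` tuples; the parsing step and the function name are not from the repository.

```python
from collections import Counter

THRESHOLD = 3
SIGNAL_CLASSES = {"rule_missing", "rule_wrong", "rule_ignored"}


def due_clusters(signals, threshold=THRESHOLD):
    """Return (target, class) clusters that reached the firing threshold.

    signals: iterable of (target, signal_class) pairs; unknown classes are skipped.
    """
    counts = Counter(
        (target, signal_class)
        for target, signal_class in signals
        if signal_class in SIGNAL_CLASSES
    )
    return sorted(key for key, count in counts.items() if count >= threshold)
```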
Authority invariant. When sources of behavior disagree, a written
7-rule precedence resolves it (CLAUDE.md > judge decision > criteria.md >
PROTOCOL > METHODOLOGY > agent body > frontmatter). Unresolved conflicts
become blocking authority_conflict events.
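The precedence chain lends itself to a rank lookup. In this sketch the order comes from the list above, while the returned dict shape and the choice to treat an unrecognised source as an unresolved conflict are assumptions.

```python
# Order taken from the precedence list; lower index wins.
PRECEDENCE = [
    "CLAUDE.md",
    "judge decision",
    "criteria.md",
    "PROTOCOL",
    "METHODOLOGY",
    "agent body",
    "frontmatter",
]
RANK = {source: index for index, source in enumerate(PRECEDENCE)}


def resolve(claims):
    """Pick the instruction from the highest-precedence source.

    claims: dict mapping source name -> instruction text.
    Assumption: a source outside the chain cannot be ranked, so the
    conflict is reported as a blocking authority_conflict event.
    """
    if not claims:
        raise ValueError("no claims to resolve")
    unknown = sorted(source for source in claims if source not in RANK)
    if unknown:
        return {"type": "authority_conflict", "blocking": True, "sources": unknown}
    winner = min(claims, key=RANK.__getitem__)
    return {"type": "resolved", "source": winner, "instruction": claims[winner]}
```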
Event ledger. Every M/L engagement appends lifecycle events to
engagement/events.jsonl (append-only, per-engagement). Schema v1
captures phase transitions, validator runs, interrupts, verdicts,
reflections, authority conflicts. Read at any time via
scripts/lib/ledger.py.
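An append-only JSONL ledger needs very little machinery. The following is a standalone sketch, not the API of scripts/lib/ledger.py: the function names and the envelope fields (`schema`, `ts`, `type`, `payload`) are assumed for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def append_event(ledger_path: Path, event_type: str, payload: dict) -> dict:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    event = {
        "schema": 1,  # assumed field name for "Schema v1"
        "ts": datetime.now(timezone.utc).isoformat(),
        "type": event_type,
        "payload": payload,
    }
    ledger_path.parent.mkdir(parents=True, exist_ok=True)
    with ledger_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, ensure_ascii=False) + "\n")
    return event


def read_events(ledger_path: Path) -> list:
    """Read all events in append order; a missing ledger reads as empty."""
    if not ledger_path.exists():
        return []
    with ledger_path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Opening with mode `"a"` and an explicit UTF-8 encoding keeps the file append-only and avoids locale-dependent decoding on Windows.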
Human as supreme judge. Between consilium synthesis and the manager
verdict the human gets a chat-ready summary (≤2 minutes to read) and
responds in one of three forms: PROCEED / REJECT: <reason> /
DIRECTED: <what to change>. There is no need to write 200 lines of
markdown: the system formats and expands the reply.
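The three reply forms are simple enough to parse with one regular expression. This is a hedged sketch: `Directive` and `parse_directive` are hypothetical names, and requiring a non-empty reason for REJECT and DIRECTED is an assumption rather than documented behaviour of human-directive.py.

```python
import re
from typing import NamedTuple


class Directive(NamedTuple):
    verdict: str  # "PROCEED", "REJECT" or "DIRECTED"
    detail: str   # text after the colon, empty for a bare PROCEED


_PATTERN = re.compile(
    r"^\s*(PROCEED|REJECT|DIRECTED)\s*(?::\s*(.*))?$",
    re.IGNORECASE | re.DOTALL,
)


def parse_directive(reply: str) -> Directive:
    """Parse a short human reply into a verdict and its free-text detail."""
    match = _PATTERN.match(reply.strip())
    if not match:
        raise ValueError("expected PROCEED, 'REJECT: <reason>' or 'DIRECTED: <what to change>'")
    verdict = match.group(1).upper()
    detail = (match.group(2) or "").strip()
    # Assumption: REJECT and DIRECTED are only actionable with a reason.
    if verdict in ("REJECT", "DIRECTED") and not detail:
        raise ValueError(f"{verdict} needs text after the colon")
    return Directive(verdict, detail)
```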
Mechanical safety baseline. Exit-code gates run at every transition:
danger-scan (DROP / force-push / prod-deploy registry),
handoff-precheck (tier-aware structural verification),
handoff-paths-check (phantom path detection),
director-verdict-check (adjudication completeness),
preflight (tools availability).
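An exit-code gate chain can be sketched as a loop over subprocess calls that stops at the first non-zero return code. The command lines of the real scripts are not given here, so `run_gates` takes arbitrary argv lists and the usage example substitutes placeholder commands.

```python
import subprocess
import sys


def run_gates(gates):
    """Run gates in order and stop at the first failure.

    gates: list of (name, argv) pairs.
    Returns (failed_gate_name, exit_code), or (None, 0) if every gate passed.
    """
    for name, argv in gates:
        result = subprocess.run(
            argv,
            capture_output=True,
            encoding="utf-8",   # explicit decoding, independent of the OS locale
            errors="replace",
        )
        if result.returncode != 0:
            return name, result.returncode
    return None, 0


if __name__ == "__main__":
    # Placeholder commands standing in for the real gate scripts.
    passing = [sys.executable, "-c", "raise SystemExit(0)"]
    failing = [sys.executable, "-c", "raise SystemExit(2)"]
    print(run_gates([("preflight", passing), ("danger-scan", failing)]))
```

Because the gate decision is only the exit code, each script stays replaceable and no output parsing is needed to block a transition.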
Audit trail by FS state. Engagement = directory. State is read from
files: iteration, validation-log.md, validation-outputs/*.json,
consilium-summary.md, human-directive.md, acceptance-log.md.
No databases, no external logs — cat reconstructs the picture
completely.
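Reading state straight from the directory might look like this. The file names are the ones listed above; `snapshot` and the shape of its return value are hypothetical.

```python
from pathlib import Path

# File names as listed in the text above.
STATE_FILES = (
    "iteration",
    "validation-log.md",
    "consilium-summary.md",
    "human-directive.md",
    "acceptance-log.md",
)


def snapshot(engagement_dir: Path) -> dict:
    """Report which state files exist and which validator outputs were written."""
    present = {name: (engagement_dir / name).is_file() for name in STATE_FILES}
    outputs_dir = engagement_dir / "validation-outputs"
    outputs = sorted(p.name for p in outputs_dir.glob("*.json")) if outputs_dir.is_dir() else []
    return {"files": present, "validator_outputs": outputs}
```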
sequenceDiagram
autonumber
participant U as Human
participant ML as Main loop · agency-intake
participant WF as engagement-workflow · Workflow
participant SP as Specialists · waves
participant V as Validators
participant SC as LangGraph + scripts
participant M as Manager · acceptor
U->>ML: trigger phrase
ML->>ML: classify → criteria.md (S/M/L)
ML->>WF: invoke engagement-workflow
WF->>WF: discovery · lead:plan → tasks / waves / validators
WF->>SP: deliver — specialist waves in git worktrees (per-task review→rework)
SP-->>WF: executor-reports/ + consolidated work
WF->>V: validate — validators in parallel + adversarial-verify
V-->>WF: validation-outputs/*.json (canonical envelope)
WF->>WF: handoff.md + handoff-precheck (gate)
WF-->>ML: readyForAcceptance — handoff seam
Note over ML,SC: seam · pre-gate = Workflow | human-gate = LangGraph
alt M/L tier
ML->>SC: adversary_lg.py --consilium {M|L} --interrupt
SC->>U: consilium summary (chat, ≤2 min)
U->>SC: PROCEED / REJECT / DIRECTED → human-directive.md
ML->>M: invoke {domain}-manager (judge mode)
M->>M: acceptance-log.md + 0–3 reflections
else S tier
Note over U: human glance — accept directly
end
ML->>SC: engagement-archive.py (on ACCEPT)
S-tier skips adversary, consilium and manager phase: producer self-attests, mechanical checks gate, human accepts directly.
| Category | Count | Roles |
|---|---|---|
| Managers | 3 | dev-manager, design-manager, marketing-manager — per-engagement acceptor (judge between producer + adversary) |
| Directors | 3 | dev-director, design-director, marketing-director — out-of-band system-optimizer (SkillOpt loop) |
| Leads | 3 | dev-lead, design-lead, marketing-lead — planning-only (the engagement-workflow's lead:plan step; they plan waves, the Workflow dispatches specialists) |
| Specialists | 20 | backend, frontend, fullstack, devops, qa, tech-architect, product-analyst, technical-writer; ux, ui, visual, brand-strategist, presentation; copywriter, banner-designer, seo, ppc, keyword-researcher, web-analyst, ai-visibility |
| Validators | 29 | code-reviewer, security-auditor, accessibility, performance, migration, test-reviewer, reality-checker, skeptic, completeness, task/tech-spec/user-spec validators, infra/deploy reviewers, pre/post-deploy QA, anti-pattern detector, ux-review, skill-checker, 3 researchers (code/brand/design-system), product-context-validator, etc. |
| Category | Count | What's in it |
|---|---|---|
| Agency protocol | 8 | agency-intake, engagement-protocol, engagement-contract (specialist subset), acceptance-protocol (per-engagement acceptor methodology), system-optimization-protocol (SkillOpt loop), validation-pipeline, docs-pipeline, codex-bridge |
| Dev methodology | 16 | TDD, code review, spec planning (user/tech), task decomposition, deploy, security, infrastructure, prompt engineering, persistent tasks, pre/post-deploy QA |
| Design methodology | 8 | brand, design system, UI/UX, presentation, banner, design tokens |
| Marketing methodology | 5 | SEO auditing, semantic drift, AI visibility, task decomposition, benchmark research (industry reverse-engineering, standalone entry-point) |
| Regional SEO/PPC stack | 6 | API integrations for Russian-market analytics platforms (Webmaster, Metrika, Direct, Wordstat, Search) |
| Skill development | 3 | skill authoring, test design, testing |
Frontmatter tags for the router: [PROTOCOL], [METHODOLOGY], [TOOL].
Two Workflow orchestration engines (workflows/):
- engagement-workflow.js: the pre-gate cascade the main loop conducts. Discovery (lead:plan) → decompose (gated) → deliver (specialist waves in isolated git worktrees, per-task review→rework, per-wave consolidation: code = octopus-merge / artefact = manifest-verify) → validate (validators in parallel + adversarial-verify each finding) → handoff → gate. Stops at the handoff seam; a wave hard-stops if a task is blocked / fails review / the plan is malformed (no silent proceed). Resumes via the Workflow run journal (resumeFromRunId).
- skillopt-workflow.js: the director SkillOpt cycle as a Workflow (harvest due signals → Codex proposes bounded edits → golden-set gate → promote / reject).
Two LangGraph engines (the human-gate, after the seam):
- adversary_lg.py: LangGraph adversary bridge with 5 reviewer roles, two-pass curated-view isolation, Send-based parallel fan-out, SQLite-checkpointed --resume, native HITL via interrupt(), event ledger wired
- validator_lg.py: LangGraph atomic-validator fan-out via Send; retry edge, auto-plan from criteria.md predicates, --resume, native HITL via --interrupt-on-critical, canonical validator envelope, event ledger wired
Mechanical gates and synthesis:
- consilium-synth.py: adversary output aggregation, two-stage dedup
- consilium-present.py: chat-ready format with decision menu
- director-verdict-check.py: mechanical adjudication completeness (legacy name; targets manager verdict in v0.2)
- handoff-precheck.py: hard-gate tier dispatch (S=6 / M=13 / L=21 checks), event ledger wired
- human-directive.py: scaffold human-directive.md from CLI args
- preflight.py: tools availability check
- danger-scan.py: registry of dangerous operations
- handoff-paths-check.py: phantom path detection
- cross-val-check.py: verbatim quote verification
- trace-schema-check.py: trace JSON schema + staleness
- size-detect.py: tier detection at intake / runtime, with --auto-promote
- engagement-archive.py: idempotent archival
Shared libraries:
- lib/ledger.py: append-only event ledger (engagement/events.jsonl); 28 known payload types; thin shim; smoke-tested
- lib/precheck/: modular precheck package (v0.2.2) with 8 topic modules (common, criteria, handoff, iteration, validators, acceptance, danger + __init__ re-exports). handoff-precheck.py (1264 → 423 lines, CLI/dispatch only) imports from this package. Byte-identical JSON output to the pre-refactor monolith.
Plus optional/ — opt-in utilities outside the core protocol
(engagement-doctor.py, engagement-migrate.py, token-budget.py;
see scripts/optional/README.md).
The director-optimizer uses golden scenarios as a regression gate before promoting any Codex-proposed edit. One set per domain, 3 scenarios each covering the three failure classes:
| Domain | Scenarios | Failure classes |
|---|---|---|
| golden/dev/ | spec-code-drift / flaky-test-masking / security-gap | rule_ignored / rule_missing / rule_wrong |
| golden/design/ | design-token-drift / accessibility-aria-missing / dark-mode-contrast-fail | rule_ignored / rule_missing / rule_wrong |
| golden/marketing/ | keyword-count-underdelivery / seo-claim-unsupported / brand-voice-pronoun-violation | rule_ignored / rule_missing / rule_wrong |
A real SkillOpt cycle fires only when ≥3 real same-class signals
accumulate in skill-evolution-log.md. A synthetic dry-run on the dev
domain (Codex proposed 3 edits, the judge accepted 2, 1 entered
skill-rejected-edits.md) is documented in v0.2 and validates the loop
mechanics end-to-end.
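The regression gate itself can be expressed as a comparison of scenario results before and after a proposed edit. A minimal sketch, assuming each scenario yields a pass/fail boolean; how scenarios are actually executed and scored is not covered here, and `golden_gate` is a hypothetical name.

```python
def golden_gate(baseline, candidate):
    """Decide whether a proposed edit may be promoted.

    baseline / candidate: dicts mapping scenario name -> True if it passed.
    A scenario that passed before and fails (or is missing) now is a regression.
    """
    regressions = sorted(
        scenario
        for scenario, passed in baseline.items()
        if passed and not candidate.get(scenario, False)
    )
    return {"promote": not regressions, "regressions": regressions}
```

For example, with the three dev scenarios passing at baseline, a candidate that breaks `security-gap` would be rejected with that scenario listed as the reason.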
- Claude Code
- Codex
- Python 3.10+
- (Optional) Yandex API tokens — for marketing skills (Webmaster, Metrika, Direct, Wordstat, Search)
- Clone the repository:

      git clone https://github.com/AgentShekel/agentic-workflow.git
      cd agentic-workflow

- Copy contents to ~/.claude/:

      cp -r agents/* ~/.claude/agents/
      cp -r skills/* ~/.claude/skills/
      cp -r scripts/* ~/.claude/scripts/

  (On Windows, use the corresponding paths in %USERPROFILE%\.claude\.)

- Configure Codex MCP:

      cp .mcp.json.example .mcp.json

  Set the absolute path to the codex CLI.

- (Optional) Configure Yandex API:

      cp .env.example .env

  Fill in tokens if you use marketing skills.

- Restart Claude Code and verify that MCP tools are visible.
Entry point — trigger phrase in chat. Both English and Russian are recognized out of the box:
agency task: <description>
or
мне надо агенси задачу <description>
Standalone capabilities have separate triggers:
- мне надо провести исследование / benchmark research: invokes the benchmark-research skill (industry reverse-engineering).
- прогнать skill-evolution / skill evolution cycle: invokes the matching domain director to run the SkillOpt cycle on accumulated signals.
Add or adjust phrasings in the agency-intake skill's Use when:
list to match your team's vocabulary.
The system then autonomously runs the engagement through all layers. On M/L you get a chat summary with a decision menu — respond with a short verdict.
For the detailed flow and the role of each layer, see ARCHITECTURE.md.
MIT (see LICENSE)