feat(prompts): Kiro-style agent identity + explicit edit contract #566
Open
JessicaMulein wants to merge 1 commit into
Rewrite the agent-family system prompts so models follow the editing
contract up front instead of learning it after a failure:
- agent.yml main_system: expert-engineer identity, investigate-before-claiming,
scope discipline, failure-loop recognition, and an explicit editing contract
(ContextManager -> ReadRange -> EditText, one file per EditText call,
@000/000@ markers for empty files). Reference {final_reminders} so
overeager_prompt and the MCP tool_prompt actually reach the agent coder.
- subagent.yml: inherit the agent identity/contract instead of re-overriding
main_system with stale directives; keep only sub-agent-specific finishing
guidance (verbose Yield summary for the parent).
- ask.yml / architect.yml: same direct voice and ground-answers-in-code
discipline; architect plans now name verification steps and edge cases.
Adds tests/basic/test_agent_prompt_contract.py: deterministic, no-LLM checks
that the prompts render via str.format with no stray braces, that
{final_reminders} appears exactly once, that the edit contract is stated, and
that the sub-agent inherits the agent identity.
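A minimal sketch of the kind of deterministic check described above, assuming the prompts are plain `str.format` templates. The helper names (`placeholder_names`, `check_prompt_contract`) and the `language` key are hypothetical illustrations, not the contents of tests/basic/test_agent_prompt_contract.py.

```python
import string


def placeholder_names(template: str) -> list[str]:
    """Return every named replacement field in a str.format template.

    string.Formatter().parse raises ValueError on a stray '{' or '}',
    so a malformed template fails here before any rendering happens.
    """
    return [name for _, name, _, _ in string.Formatter().parse(template) if name]


def check_prompt_contract(
    main_system: str, system_reminder: str, known_keys: set[str]
) -> None:
    """Hypothetical contract check over two prompt templates."""
    combined = main_system + system_reminder
    names = placeholder_names(combined)

    # No placeholder may reference a key the renderer does not supply.
    unknown = set(names) - known_keys
    assert not unknown, f"unknown keys: {sorted(unknown)}"

    # {final_reminders} must appear exactly once across both templates.
    assert names.count("final_reminders") == 1

    # Rendering with every known key must succeed.
    combined.format(**{key: "" for key in known_keys})


# Example with made-up templates; 'language' is an assumed key name.
check_prompt_contract(
    "You are an expert engineer. {final_reminders}",
    "Reply in {language}.",
    {"final_reminders", "language"},
)
```

Because the check only parses and renders templates, it needs no model and gives the same result on every run.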
886097a to
be997ee
JessicaMulein added a commit to Digital-Defiance/BrightVision that referenced this pull request Jun 8, 2026
…ntract
Pin cecli submodule to a653ce9f0 (dev-integration), which carries the
Kiro-style agent prompt rewrite + explicit edit contract and the
{final_reminders}/sub-agent-inheritance fixes (upstream PR cecli-dev/cecli#566).
Adds the BrightVision-side prompt-quality eval harness:
- bright_vision_core/agent_eval.py: objective behavioral scorer reusing the
agent_turn.py signal parsers (edit failures, ReadRange-before-edit, ls-spam,
token limit, rounds) + tests/core/test_agent_eval.py.
- bright_vision_core/agent_judge.py: opt-in LLM-as-judge rubric (scope,
directness, investigation, summary quality) with robust JSON parsing +
tests/core/test_agent_judge.py.
- tests/core/test_agent_prompt_eval.py + 'eval:prompts' script: real-Ollama
behavioral eval scoring one scoped edit turn (E2E_LLM, BV_PROMPT_JUDGE).
- docs: ROADMAP #54, TESTING 'Measuring prompt quality' section; .gitignore for
the regenerated eval workspace.
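As an illustration of what "robust JSON parsing" for a judge reply can look like, here is a minimal sketch. It assumes the judge returns a JSON object somewhere in free-form text, possibly wrapped in prose or a code fence. The function name `extract_json_object` and the rubric keys in the example are hypothetical; this is not the code in bright_vision_core/agent_judge.py.

```python
import json
from typing import Any, Optional


def extract_json_object(text: str) -> Optional[dict[str, Any]]:
    """Return the first decodable JSON object found in text, or None.

    Tries each '{' in turn as a possible start, so leading chatter,
    code fences, and earlier malformed fragments are skipped.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            obj, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


# Example reply; the keys and scores are made up for illustration.
reply = 'Here is my verdict:\n{"scope": 4, "directness": 5}\nHope that helps.'
scores = extract_json_object(reply)
```

Returning `None` instead of raising lets the caller treat an unparseable verdict as "no score" rather than as a failed eval run.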
Summary
Rewrites the agent-family system prompts so models follow the editing contract up front instead of learning it only after a failed edit. The previous agent prompt was a thin directive list that (a) encouraged long exploration loops, (b) never stated the #1 cause of failed turns (edit ordering), and (c) had no identity/voice.
Changes
- prompts/agent.yml main_system: expert-engineer identity; investigate-before-claiming; scope discipline; failure-loop recognition (stop after two failures and diagnose); and an explicit editing contract: ContextManager -> ReadRange -> EditText, one file per EditText call, @000/000@ markers for empty files. Drops the loop-encouraging 'no task takes too long' line.
- {final_reminders} was never referenced in the agent prompt, so overeager_prompt and the MCP tool_prompt were silently dropped for the agent coder. Now referenced exactly once.
- prompts/subagent.yml: now inherits the agent identity/contract instead of re-overriding main_system with stale directives; keeps only the sub-agent-specific verbose Yield summary.
- prompts/ask.yml / prompts/architect.yml: same direct voice and ground-answers-in-code discipline; architect plans now name verification steps and edge cases.
Test plan
- tests/basic/test_agent_prompt_contract.py (deterministic, no LLM): prompts render via str.format with no stray braces and no unknown keys; {final_reminders} appears exactly once across main_system + system_reminder; the edit contract is stated (ReadRange before EditText, one file, empty-file markers); the sub-agent inherits the agent main_system; the legacy 'no task takes too long' directive is gone.
- pytest tests/basic/test_prompts.py: 27 pass (inheritance chains unchanged).
- pytest tests/coders/test_copypaste_coder.py tests/subagents/: pass.
Notes for reviewers (scope & how to split)
These changes fall into two buckets, and I'm happy to split if you'd prefer to land them separately:
Objective fixes (recommend landing regardless):
- {final_reminders} was never referenced in the agent prompt, so overeager_prompt and the MCP tool_prompt were silently dropped for the agent coder. Now referenced exactly once.
- subagent.yml re-overrode main_system with a stale copy, so sub-agents diverged from the main agent every time it improved. Now inherits.
- … edit_text.py / read_range.py error messages (ReadRange-before-EditText, one file per call, @000/000@), so it should reduce failed turns broadly, most on smaller/local models.
Opinionated / negotiable:
- main_system got longer (~2x); for tiny-context local models that's a real token cost. Happy to trim.
If you want only the objective bucket, I can move the persona/voice + length changes into a follow-up and keep this PR to the two bugfixes + the contract block.
Honest scope of validation: the new test proves the prompt states the right things; behavioral improvement was validated end-to-end on one model (qwen3-coder:30b) on one scoped edit task (completed, zero edit failures, ReadRange preceded the edit). It is not a multi-model benchmark.
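For reviewers who want to see what a "ReadRange preceded the edit" signal amounts to, here is a minimal sketch over a recorded tool-call trace. The `ToolCall` record and `edits_without_prior_read` helper are hypothetical, and the sketch assumes one read of a file covers all later edits to it in the same turn; the real signal parsers may use different data and stricter rules.

```python
from typing import Iterable, NamedTuple


class ToolCall(NamedTuple):
    """Hypothetical record of one tool invocation in an agent turn."""
    tool: str
    path: str


def edits_without_prior_read(calls: Iterable[ToolCall]) -> list[ToolCall]:
    """Return every EditText call on a file that was not read first.

    Assumption: a single ReadRange on a path is enough for all later
    EditText calls on that path within the same trace.
    """
    read_paths: set[str] = set()
    violations: list[ToolCall] = []
    for call in calls:
        if call.tool == "ReadRange":
            read_paths.add(call.path)
        elif call.tool == "EditText" and call.path not in read_paths:
            violations.append(call)
    return violations


# Example trace with made-up paths: the second edit skips the read.
trace = [
    ToolCall("ReadRange", "src/app.py"),
    ToolCall("EditText", "src/app.py"),
    ToolCall("EditText", "src/util.py"),
]
violations = edits_without_prior_read(trace)
```

An empty result means every edit in the trace followed the ordering rule, which is the condition the single-model validation reports.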