[core] agentic: on_failure: start working on agentic troubleshooting#70
[core] agentic: on_failure: start working on agentic troubleshooting#70kpouget wants to merge 19 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a LangChain-backed "on failure" agent that discovers ChangesOn Failure Analysis Agent
FAILURE generation, notifications, and CI wiring
Sequence DiagramsequenceDiagram
rect rgba(173, 216, 230, 0.5)
note over CICommand,ArtifactDir: CI execution and FAILURE artifact generation
CICommand->>runtime_py: execute_tasks()
runtime_py->>ArtifactDir: write FAILURE file (_generate_failure_file_for_agent)
end
rect rgba(255, 218, 185, 0.5)
note over agent_review_on_failure,ChatOpenAI: On-failure LLM analysis
CICommand->>agent_review_on_failure: non-zero exit or exception
agent_review_on_failure->>run_on_failure_agent: run_on_failure_agent(base_artifact_dir)
run_on_failure_agent->>Vault: load_model_config(psap-models-corp-rh)
Vault-->>run_on_failure_agent: model credentials
run_on_failure_agent->>ChatOpenAI: create_llm_client(model_config)
run_on_failure_agent->>process_failure_analysis: process_failure_analysis(dir, llm)
process_failure_analysis->>ArtifactDir: find_failure_files()
ArtifactDir-->>process_failure_analysis: FAILURE paths
process_failure_analysis->>execute_query_sequence: FailureAnalysisQueries + llm
execute_query_sequence->>ChatOpenAI: 5x llm.invoke(HumanMessage)
ChatOpenAI-->>execute_query_sequence: analysis responses
execute_query_sequence-->>process_failure_analysis: structured results
process_failure_analysis->>ArtifactDir: write FAILURE_REVIEW_* + failure_analysis_report.html
end
rect rgba(144, 238, 144, 0.5)
note over submit_py,Slack: Notification delivery
submit_py->>ArtifactDir: write NOTIFICATION.html + NOTIFICATION.md
send_py->>ArtifactDir: read FAILURE_REVIEW_* + NOTIFICATION.html + NOTIFICATION.md
send_py->>Slack: post message with failure analysis and MLflow links
end
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1005-1077: The code currently trusts LLM-provided paths and
patterns (resolved_files, resolved_path, safe_filename, rglob) which allows
directory traversal and glob injection; fix by resolving each candidate path to
an absolute Path via resolved_path.resolve(strict=False) and verifying it is
inside base_artifact_dir.resolve() (e.g., compare parents or use
is_relative_to), and reject/skip any path that is outside the artifact tree
(return an error or mark as not found). Stop passing model-controlled patterns
into rglob: instead build a safe index once from base_artifact_dir (enumerate
base_artifact_dir.rglob("*") to collect actual artifact file paths and names)
and only allow fallback matches by exact filename lookup against that enumerated
list; do not perform wildcard rglob with user input. Ensure any
rejected/unauthorized requests are logged and not read/sent back to the LLM.
- Around line 802-815: Update the prompt construction in on_failure.py so the
initial user-facing prompt always requests the five-field schema (Root Cause,
Failed Step, Trigger, Fix, Prevent) on the very first turn; modify the logic in
the functions that build or call get_failure_explanations() (and the related
prompt text around lines referenced near 802-815 and 878-884) to prepend or
include an explicit instruction asking for the 5-field schema and the
NEED_MORE_FILES line format whenever FAILURE + _ansible.log are present,
ensuring extract_structured_analysis() receives those structured fields
populated.
- Around line 741-746: process_failure_analysis() currently writes every HTML
report to the constant name html_filename = "failure_analysis_report.html",
causing subsequent reports to overwrite previous ones; change the filename
generation to be unique per failure (e.g., include a unique failure identifier
or a timestamp/uuid) when building html_filename/html_path so each call writes
to a distinct file (reference html_filename, html_path, and the
process_failure_analysis() call site to ensure the returned/stored html_report
path matches the new unique name).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a9bedd50-d492-43e7-b8a3-7c8641712fc9
📒 Files selected for processing (1)
projects/core/agentic/on_failure.py
b117cb0 to
476e171
Compare
|
/test fournos llm_d |
There was a problem hiding this comment.
Actionable comments posted: 6
♻️ Duplicate comments (4)
projects/core/agentic/on_failure.py (4)
982-1007:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAsk for the 5-field schema on the first turn too.
When the model can answer from
FAILURE+ log content without requesting additional files, this prompt never asks forRoot Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed intoextract_structured_analysis()at line 1093. In that path,get_failure_explanations()will usually return empty structured fields even though the public API promises structured output.💡 Suggested fix
Add the 5-field schema request to the initial prompt:
## TASK: 1. Find failure patterns: - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks - In execution logs: Look for "----- FAILED ----" patterns or error indicators - Identify which specific tasks failed and why 2. What step broke. Why. 3. What action caused failure. -4. Need more files? Use EXACTLY this format: +4. Provide structured analysis in this format: + +1. **Root Cause**: One sentence. What broke. +2. **Failed Step**: Which step. Why it failed. +3. **Trigger**: Specific action/config that caused failure. +4. **Fix**: What to change. +5. **Prevent**: How to avoid next time. + +5. Need more files? Use EXACTLY this format:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial prompt assigned to prompt_content should explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so downstream extract_structured_analysis() and get_failure_explanations() receive structured fields; modify the string built in prompt_content inside on_failure.py to append a short line asking for the five named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact field names match what extract_structured_analysis() and get_failure_explanations() expect.
1219-1299:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winConstrain model-requested paths to the artifact tree before reading them.
The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into
rglob(). A prompt-injected request such as../../../../etc/passwdor*.pemcan therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outsidebase_artifact_dir, and fall back only to exact matches from the enumerated artifact files.🔒 Recommended fix
+ # Resolve and validate path is within artifact tree + try: + resolved_abs = resolved_path.resolve() + artifact_abs = base_artifact_dir.resolve() + if not resolved_abs.is_relative_to(artifact_abs): + logger.warning(f"❌ Rejected path outside artifact tree: {filename}") + file_contents[filename] = "❌ Unauthorized: File path outside artifact directory" + continue + except Exception as e: + logger.warning(f"Path resolution failed for {filename}: {e}") + file_contents[filename] = f"❌ Path resolution error: {e}" + continue + # First try the resolved path if resolved_path.exists(): found_path = resolved_path else: - # Search in the base artifact directory recursively (silent search) + # Search only in pre-enumerated artifact files (no glob patterns) try: - # Use a safe search pattern - escape any problematic characters - safe_filename = filename.replace("**", "*") # Replace double asterisks - for search_path in base_artifact_dir.rglob(safe_filename): - if search_path.is_file(): - found_path = search_path - break - - # If no match with pattern, try exact filename search - if not found_path: - for search_path in base_artifact_dir.rglob("*"): - if search_path.is_file() and search_path.name == filename: - found_path = search_path - break + # Only allow exact filename matches from available_files list + for available_file in available_files: + if Path(available_file).name == filename: + candidate = base_artifact_dir / available_file + if candidate.exists() and candidate.is_file(): + found_path = candidate + break except Exception as search_error: logger.warning(f"Search error for '{filename}': {search_error}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The LLM-controlled filenames (handled in resolved_files, resolved_path, and later rglob search) can escape the artifact tree; fix by canonicalizing and constraining every candidate: when building resolved_path (and before any rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is within base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or skipping any path that does not lie under base_artifact_dir; remove use of user-controlled patterns in rglob (do not pass raw filename patterns to base_artifact_dir.rglob), instead pre-enumerate all files under base_artifact_dir into a safe map (path.resolve() values and names) and fall back only to exact matches from that enumerated set; log and return an error or skip entries that are outside the artifact tree and never read files outside base_artifact_dir into file_contents.
914-921:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a unique report filename per failure.
process_failure_analysis()calls this once perFAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlierhtml_reportpaths end up pointing at the last one written.📝 One simple fix
- html_filename = "failure_analysis_report.html" + safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/")) + html_filename = f"failure_analysis_report_{safe_failure_dir}.html"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 914 - 921, process_failure_analysis currently writes every failure report to the constant "failure_analysis_report.html" (variables html_filename/html_path), causing later reports to overwrite earlier ones; change it to generate a unique filename per invocation (for example include the failure identifier, timestamp, or a uuid in the filename) before constructing html_path in the same block so each call creates a distinct file and return the new unique path string from process_failure_analysis.
205-218:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDisabling TLS verification exposes the LLM client to man-in-the-middle attacks.
Setting
verify=Falseand suppressingInsecureRequestWarningdisables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configurehttpx.Client(verify="/path/to/ca-bundle.crt")instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.🔒 Recommended fix to support custom CA bundle
+ # Allow custom CA bundle or disable verification only if explicitly opted in + ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE") + verify_tls = ca_bundle if ca_bundle else True # Default to secure + + if verify_tls is False: + logger.warning("⚠️ TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure") + - with warnings.catch_warnings(): - warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning) - - # Create HTTP client with proper headers for litellm endpoint - http_client = httpx.Client( - verify=False, + http_client = httpx.Client( + verify=verify_tls, headers={🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client currently disables TLS verification by passing verify=False in the httpx.Client call in on_failure.py; replace this with a secure approach: read a configurable CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path) when present, and only fall back to verify=False when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing InsecureRequestWarning to the insecure branch and emit a clear log warning when verify is intentionally disabled. Ensure changes target the http_client construction site and associated warnings.catch_warnings() usage so the default behavior enforces certificate validation.
🧹 Nitpick comments (2)
projects/core/agentic/on_failure.py (1)
1559-1564: ⚡ Quick winConsider parameterizing the hardcoded model key.
The model key
"qwen-3-6-35b"is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.♻️ Suggested change
- MODEL_KEY = "qwen-3-6-35b" + MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the hardcoded MODEL_KEY assignment by reading a configurable environment variable: instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default string; then use that variable when calling models_config.get(MODEL_KEY) and raise the same ValueError if missing. Ensure you import os if not present and keep the existing models_config lookup and error handling intact.projects/fournos_launcher/orchestration/submit.py (1)
59-161: ⚡ Quick winConsider adding artifact directory validation.
The function uses
pathlib.Path(env.ARTIFACT_DIR)without verifying thatARTIFACT_DIRis set or that the path exists. IfARTIFACT_DIRisNoneor an invalid path, the subsequent glob operations could fail silently or raise exceptions.🛡️ Suggested defensive check
def generate_notification_files(): """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information.""" artifact_dir = pathlib.Path(env.ARTIFACT_DIR) + + if not artifact_dir or not artifact_dir.exists(): + logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}") + return + mlflow_urls = []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The generate_notification_files function uses artifact_dir = pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the path exists; add a defensive check at the start of generate_notification_files to validate env.ARTIFACT_DIR is set (not None/empty) and that artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear warning via logger (e.g., logger.warning) and return early to skip the rest of the function, preventing runtime errors during artifact_dir.glob calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.
In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.
In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.
In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.
---
Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.
---
Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.
In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877
📒 Files selected for processing (8)
projects/core/agentic/on_failure.pyprojects/core/dsl/runtime.pyprojects/core/dsl/toolbox.pyprojects/core/notifications/send.pyprojects/fournos_launcher/orchestration/submit.pyprojects/llm_d/orchestration/ci.pyprojects/llm_d/orchestration/test_phase.pyvaults/psap-models-corp-rh.yaml
There was a problem hiding this comment.
Caution
Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.
Actionable comments posted: 6
♻️ Duplicate comments (4)
projects/core/agentic/on_failure.py (4)
982-1007:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAsk for the 5-field schema on the first turn too.
When the model can answer from
FAILURE+ log content without requesting additional files, this prompt never asks forRoot Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed intoextract_structured_analysis()at line 1093. In that path,get_failure_explanations()will usually return empty structured fields even though the public API promises structured output.💡 Suggested fix
Add the 5-field schema request to the initial prompt:
## TASK: 1. Find failure patterns: - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks - In execution logs: Look for "----- FAILED ----" patterns or error indicators - Identify which specific tasks failed and why 2. What step broke. Why. 3. What action caused failure. -4. Need more files? Use EXACTLY this format: +4. Provide structured analysis in this format: + +1. **Root Cause**: One sentence. What broke. +2. **Failed Step**: Which step. Why it failed. +3. **Trigger**: Specific action/config that caused failure. +4. **Fix**: What to change. +5. **Prevent**: How to avoid next time. + +5. Need more files? Use EXACTLY this format:🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial prompt assigned to prompt_content should explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so downstream extract_structured_analysis() and get_failure_explanations() receive structured fields; modify the string built in prompt_content inside on_failure.py to append a short line asking for the five named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact field names match what extract_structured_analysis() and get_failure_explanations() expect.
1219-1299:⚠️ Potential issue | 🔴 Critical | ⚡ Quick winConstrain model-requested paths to the artifact tree before reading them.
The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into
rglob(). A prompt-injected request such as../../../../etc/passwdor*.pemcan therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outsidebase_artifact_dir, and fall back only to exact matches from the enumerated artifact files.🔒 Recommended fix
+ # Resolve and validate path is within artifact tree + try: + resolved_abs = resolved_path.resolve() + artifact_abs = base_artifact_dir.resolve() + if not resolved_abs.is_relative_to(artifact_abs): + logger.warning(f"❌ Rejected path outside artifact tree: {filename}") + file_contents[filename] = "❌ Unauthorized: File path outside artifact directory" + continue + except Exception as e: + logger.warning(f"Path resolution failed for {filename}: {e}") + file_contents[filename] = f"❌ Path resolution error: {e}" + continue + # First try the resolved path if resolved_path.exists(): found_path = resolved_path else: - # Search in the base artifact directory recursively (silent search) + # Search only in pre-enumerated artifact files (no glob patterns) try: - # Use a safe search pattern - escape any problematic characters - safe_filename = filename.replace("**", "*") # Replace double asterisks - for search_path in base_artifact_dir.rglob(safe_filename): - if search_path.is_file(): - found_path = search_path - break - - # If no match with pattern, try exact filename search - if not found_path: - for search_path in base_artifact_dir.rglob("*"): - if search_path.is_file() and search_path.name == filename: - found_path = search_path - break + # Only allow exact filename matches from available_files list + for available_file in available_files: + if Path(available_file).name == filename: + candidate = base_artifact_dir / available_file + if candidate.exists() and candidate.is_file(): + found_path = candidate + break except Exception as search_error: logger.warning(f"Search error for '{filename}': {search_error}")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The LLM-controlled filenames (handled in resolved_files, resolved_path, and later rglob search) can escape the artifact tree; fix by canonicalizing and constraining every candidate: when building resolved_path (and before any rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is within base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or skipping any path that does not lie under base_artifact_dir; remove use of user-controlled patterns in rglob (do not pass raw filename patterns to base_artifact_dir.rglob), instead pre-enumerate all files under base_artifact_dir into a safe map (path.resolve() values and names) and fall back only to exact matches from that enumerated set; log and return an error or skip entries that are outside the artifact tree and never read files outside base_artifact_dir into file_contents.
914-921:⚠️ Potential issue | 🟠 Major | ⚡ Quick winUse a unique report filename per failure.
process_failure_analysis()calls this once perFAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlierhtml_reportpaths end up pointing at the last one written.📝 One simple fix
- html_filename = "failure_analysis_report.html" + safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/")) + html_filename = f"failure_analysis_report_{safe_failure_dir}.html"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 914 - 921, process_failure_analysis currently writes every failure report to the constant "failure_analysis_report.html" (variables html_filename/html_path), causing later reports to overwrite earlier ones; change it to generate a unique filename per invocation (for example include the failure identifier, timestamp, or a uuid in the filename) before constructing html_path in the same block so each call creates a distinct file and return the new unique path string from process_failure_analysis.
205-218:⚠️ Potential issue | 🟠 Major | ⚡ Quick winDisabling TLS verification exposes the LLM client to man-in-the-middle attacks.
Setting
verify=Falseand suppressingInsecureRequestWarningdisables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configurehttpx.Client(verify="/path/to/ca-bundle.crt")instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.🔒 Recommended fix to support custom CA bundle
+ # Allow custom CA bundle or disable verification only if explicitly opted in + ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE") + verify_tls = ca_bundle if ca_bundle else True # Default to secure + + if verify_tls is False: + logger.warning("⚠️ TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure") + - with warnings.catch_warnings(): - warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning) - - # Create HTTP client with proper headers for litellm endpoint - http_client = httpx.Client( - verify=False, + http_client = httpx.Client( + verify=verify_tls, headers={🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client currently disables TLS verification by passing verify=False in the httpx.Client call in on_failure.py; replace this with a secure approach: read a configurable CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path) when present, and only fall back to verify=False when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing InsecureRequestWarning to the insecure branch and emit a clear log warning when verify is intentionally disabled. Ensure changes target the http_client construction site and associated warnings.catch_warnings() usage so the default behavior enforces certificate validation.
🧹 Nitpick comments (2)
projects/core/agentic/on_failure.py (1)
1559-1564: ⚡ Quick winConsider parameterizing the hardcoded model key.
The model key
"qwen-3-6-35b"is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.♻️ Suggested change
- MODEL_KEY = "qwen-3-6-35b" + MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the hardcoded MODEL_KEY assignment by reading a configurable environment variable: instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default string; then use that variable when calling models_config.get(MODEL_KEY) and raise the same ValueError if missing. Ensure you import os if not present and keep the existing models_config lookup and error handling intact.projects/fournos_launcher/orchestration/submit.py (1)
59-161: ⚡ Quick winConsider adding artifact directory validation.
The function uses
pathlib.Path(env.ARTIFACT_DIR)without verifying thatARTIFACT_DIRis set or that the path exists. IfARTIFACT_DIRisNoneor an invalid path, the subsequent glob operations could fail silently or raise exceptions.🛡️ Suggested defensive check
def generate_notification_files(): """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information.""" artifact_dir = pathlib.Path(env.ARTIFACT_DIR) + + if not artifact_dir or not artifact_dir.exists(): + logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}") + return + mlflow_urls = []🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The generate_notification_files function uses artifact_dir = pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the path exists; add a defensive check at the start of generate_notification_files to validate env.ARTIFACT_DIR is set (not None/empty) and that artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear warning via logger (e.g., logger.warning) and return early to skip the rest of the function, preventing runtime errors during artifact_dir.glob calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.
In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.
In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.
In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.
---
Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.
---
Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.
In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877
📒 Files selected for processing (8)
projects/core/agentic/on_failure.pyprojects/core/dsl/runtime.pyprojects/core/dsl/toolbox.pyprojects/core/notifications/send.pyprojects/fournos_launcher/orchestration/submit.pyprojects/llm_d/orchestration/ci.pyprojects/llm_d/orchestration/test_phase.pyvaults/psap-models-corp-rh.yaml
🛑 Comments failed to post (5)
projects/core/agentic/on_failure.py (1)
34-49:
⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFix the import sorting to resolve the pipeline failure.
The Ruff linter reports that the import block is unsorted or unformatted (I001). This is causing the CI pipeline to fail.
Run
ruff check projects/core/agentic/on_failure.py --fixto automatically organize the imports, or manually sort them according to PEP 8 conventions (standard library, third-party, local imports).🧰 Tools
🪛 GitHub Actions: Ruff / 0_lint (3.12).txt
[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with
ruff check --fix).🪛 GitHub Actions: Ruff / lint (3.12)
[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with
ruff check projects/ bin/ --fix).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The import block in projects/core/agentic/on_failure.py is unsorted causing Ruff I001; fix it by running "ruff check projects/core/agentic/on_failure.py --fix" or manually reorder the imports in the module-level import section so that standard library imports (datetime, functools, json, logging, os, re, sys, warnings, pathlib.Path) come first, followed by third-party imports (click, httpx, urllib3, yaml, langchain_core.messages.HumanMessage), and then any local imports, ensure blank lines between groups and remove any unused imports; after reordering, run the linter to confirm the I001 is resolved.
⚠️ Potential issue | 🟠 Major | ⚡ Quick win🧩 Analysis chain
🏁 Script executed:
#!/bin/bash # Check if httpx, urllib3, langchain-core, langchain-openai are in pyproject.toml echo "=== Checking pyproject.toml for required dependencies ===" rg -n 'httpx|urllib3|langchain' pyproject.toml || echo "❌ Missing dependencies detected" echo "" echo "=== Dependencies imported in on_failure.py ===" rg -n '^import (httpx|urllib3)|^from langchain' projects/core/agentic/on_failure.pyRepository: openshift-psap/forge
Length of output: 288
🏁 Script executed:
#!/bin/bash set -euo pipefail echo "=== Imports in on_failure.py (full scan for relevant packages) ===" rg -n "^(import (httpx|urllib3)|from langchain(_core|_openai)|from langchain|import langchain|import .*langchain|from langchain_openai|import .*openai|from .*openai)" projects/core/agentic/on_failure.py || true echo echo "=== pyproject.toml: look for dependency declarations (more precise) ===" rg -n "(^|\\W)(httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain)(\\W|$)" pyproject.toml || true echo echo "=== pyproject.toml: show dependency sections around project.dependencies and optional-dependencies ===" # Print the first ~200 lines to locate [project] sections quickly (bounded output) sed -n '1,220p' pyproject.toml | nl -ba echo echo "=== pyproject.toml: if optional-dependencies exists, show it (bounded) ===" if rg -n "^\[project\.optional-dependencies\]" -S pyproject.toml >/dev/null; then start=$(rg -n "^\[project\.optional-dependencies\]" -n pyproject.toml | head -n1 | cut -d: -f1) end=$((start+220)) sed -n "${start},${end}p" pyproject.toml | nl -ba else echo "No [project.optional-dependencies] section found." fiRepository: openshift-psap/forge
Length of output: 456
🏁 Script executed:
#!/bin/bash set -euo pipefail echo "=== Locate pyproject.toml files ===" fd -t f -e toml "pyproject.toml" . || true echo # Choose the root pyproject.toml if it exists; otherwise fall back to first found PYPROJECT="$(fd -t f -e toml "pyproject.toml" . | head -n1 || true)" if [[ -z "${PYPROJECT}" ]]; then echo "No pyproject.toml found in repo." exit 0 fi echo "Using: ${PYPROJECT}" echo echo "=== Imports in projects/core/agentic/on_failure.py (full scan for relevant packages) ===" rg -n "(^import (httpx|urllib3)|^from langchain(_core|_openai)?(\.| )|^from langchain import|^from langchain_openai|^from .*openai|^import .*langchain|^import .*openai|^import .*yaml|^import .*click)" projects/core/agentic/on_failure.py || true echo echo "=== pyproject.toml: search for declared dependencies (strings) ===" rg -n "httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain" "${PYPROJECT}" || echo "No matching dependency strings found." echo echo "=== pyproject.toml: show [project] dependencies sections (bounded) ===" # Show near likely dependency keys rg -n "^\[project\]" "${PYPROJECT}" || true rg -n "^(dependencies|optional-dependencies)\s*=" "${PYPROJECT}" || true # Print first 260 lines to include [project] metadata and dependency lists (bounded output) sed -n '1,260p' "${PYPROJECT}" | cat -n echo if rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" >/dev/null; then start=$(rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" | head -n1 | cut -d: -f1) end=$((start+260)) echo "=== [project.optional-dependencies] (bounded) ===" sed -n "${start},${end}p" "${PYPROJECT}" | cat -n else echo "No [project.optional-dependencies] section found." fiRepository: openshift-psap/forge
Length of output: 7970
🏁 Script executed:
#!/bin/bash set -euo pipefail rg -n "langchain_openai|langchain(_core|_openai)?|import .*langchain|from .*langchain" projects/core/agentic/on_failure.py || trueRepository: openshift-psap/forge
Length of output: 235
Declare missing dependencies in pyproject.toml.
projects/core/agentic/on_failure.pyimportshttpx,urllib3, andlangchain_core, and later dynamically importslangchain_openai.ChatOpenAI, but./pyproject.tomlhas neither of these packages in[project].dependenciesnor in[project.optional-dependencies]. This will cause ImportErrors on a fresh install.🧰 Tools
🪛 GitHub Actions: Ruff / 0_lint (3.12).txt
[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with
ruff check --fix).🪛 GitHub Actions: Ruff / lint (3.12)
[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with
ruff check projects/ bin/ --fix).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The file on_failure.py imports httpx, urllib3, and langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these packages are not declared in pyproject.toml; update the project's pyproject.toml to add httpx, urllib3, and langchain_core to [project].dependencies (or put them under [project.optional-dependencies] with appropriate extras if optional) and ensure langchain_openai (or the package that provides ChatOpenAI) is declared as well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol succeed on a fresh install.projects/core/notifications/send.py (1)
299-313:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winAdd exception handling for file read operations.
The NOTIFICATION.md file read (lines 304-305) is not wrapped in a try-except block, unlike the NOTIFICATION.html read in
get_common_message(lines 215-223). If the file read fails (e.g., permissions issue, encoding error), it will raise an exception and potentially break the Slack notification flow.🛡️ Suggested error handling
# Check for fournos_launcher generated Slack notification content artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", "")) fournos_notification_md = artifact_dir / "NOTIFICATION.md" if fournos_notification_md.exists(): - with open(fournos_notification_md, encoding="utf-8") as f: - fournos_content = f.read().strip() - if fournos_content: - # Return custom notification content with status icon - return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" + try: + with open(fournos_notification_md, encoding="utf-8") as f: + fournos_content = f.read().strip() + if fournos_content: + # Return custom notification content with status icon + return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" + except Exception as e: + logger.warning("Failed to read NOTIFICATION.md: %s", e) # Fallback to standard message if no custom notification📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.# Check for fournos_launcher generated Slack notification content artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", "")) fournos_notification_md = artifact_dir / "NOTIFICATION.md" if fournos_notification_md.exists(): try: with open(fournos_notification_md, encoding="utf-8") as f: fournos_content = f.read().strip() if fournos_content: # Return custom notification content with status icon return f"{status_icon} {get_bold(status)}\n\n{fournos_content}" except Exception as e: logger.warning("Failed to read NOTIFICATION.md: %s", e) # Fallback to standard message if no custom notification return get_common_message( finish_reason, f"{status_icon} {status}", get_link, get_italics, get_bold )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/core/notifications/send.py` around lines 299 - 313, The NOTIFICATION.md file read (artifact_dir / "NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except to avoid raising on I/O/encoding/permission errors; update the block in send.py around the with open(fournos_notification_md, ...) and fournos_content = f.read().strip() to catch Exception as e, handle/log the error (consistent with the existing logging approach used by get_common_message's NOTIFICATION.html handling), and fall back to returning get_common_message(finish_reason, f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs or the file is unreadable.projects/llm_d/orchestration/test_phase.py (2)
93-96:
⚠️ Potential issue | 🔴 Critical | ⚡ Quick winFinalizers are unconditionally disabled.
do_finalizers = Falseon line 93 prevents all finalizer callbacks from executing, including:
- Capturing inference-service state
- Writing endpoint URL to artifacts
- Capturing namespace events
- Cleaning up test resources (pods, jobs, PVCs)
The
SignalInterruptcheck on lines 94-96 is redundant sincedo_finalizersis alreadyFalse. If this is intentional debugging code, consider removing it before merge or adding a comment explaining why finalizers are disabled. Otherwise, resource leaks and missing artifacts will occur.Suggested fix to re-enable finalizers
finally: - do_finalizers = False + do_finalizers = True if primary_exc and isinstance(primary_exc[1], SignalInterrupt): logging.warning("Caught a SignalInterrupt, skipping the finalizers") do_finalizers = False🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm_d/orchestration/test_phase.py` around lines 93 - 96, The code unconditionally disables finalizers by setting do_finalizers = False before checking primary_exc; remove that unconditional assignment and instead initialize do_finalizers = True (or omit the assignment) and only set do_finalizers = False when primary_exc indicates a SignalInterrupt (i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set do_finalizers = False there); ensure the variable controlling execution of finalizer callbacks (do_finalizers) is True in the normal flow so finalizer callbacks (capture state, write endpoint, capture events, cleanup) run unless a SignalInterrupt is detected.
170-181:
⚠️ Potential issue | 🔴 Critical | ⚡ Quick winUnreachable code:
returnbefore docstring makes function body dead code.The
returnstatement on line 171 causes all subsequent code (lines 172-180) to be unreachable, including the docstring and the actualprepare_model_cache()call. This bypasses model cache preparation entirely, which per the codebase handles cache validation, PVC setup, and HF token retrieval.If this is temporary debugging code, please revert before merge. The docstring placement after
returnalso indicates this may have been an accidental edit.Suggested fix to restore model cache preparation
def _prepare_model_cache() -> None: - return """Ensure model cache PVC is ready for deployment.""" from projects.llm_d.orchestration import runtime_config model_key = runtime_config.get_model_key() logger.info("Preparing model cache for model: %s", model_key) # Use the same prepare_model_cache function as the prepare phase # This includes vault token handling and PVC existence checks prepare_model_cache()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.def _prepare_model_cache() -> None: """Ensure model cache PVC is ready for deployment.""" from projects.llm_d.orchestration import runtime_config model_key = runtime_config.get_model_key() logger.info("Preparing model cache for model: %s", model_key) # Use the same prepare_model_cache function as the prepare phase # This includes vault token handling and PVC existence checks prepare_model_cache()🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@projects/llm_d/orchestration/test_phase.py` around lines 170 - 181, The _prepare_model_cache function currently returns immediately making its docstring and body (including the call to prepare_model_cache()) unreachable; remove the stray return, place the docstring right after the def if desired, and restore the body so it imports runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via logger.info, and calls prepare_model_cache() to perform PVC/token handling; verify references to _prepare_model_cache(), runtime_config, logger, and prepare_model_cache are unchanged.vaults/psap-models-corp-rh.yaml (1)
8-8:
⚠️ Potential issue | 🟡 Minor | ⚡ Quick winFix typo in description.
"use" should be "used" for grammatical correctness.
📝 Proposed fix
- Access to the RH models use by FORGE agent + Access to the RH models used by FORGE agent📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.Access to the RH models used by FORGE agent🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@vaults/psap-models-corp-rh.yaml` at line 8, Fix the typo in the YAML description string "Access to the RH models use by FORGE agent" by changing "use" to "used" so the line reads "Access to the RH models used by FORGE agent"; locate and update that description entry in the psap-models-corp-rh YAML block.
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 59 seconds 🔴 • Link to the test results. • No reports generated... Test configuration: |
|
/test fournos llm_d |
|
🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 10 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 51 seconds 🔴 • Link to the test results. • No reports generated... Test configuration: |
0db96b1 to
c2e59de
Compare
|
/test fournos llm_d |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@projects/core/ci_entrypoint/run_common.py`:
- Line 32: EXTRA_PACKAGES is currently an unpinned list (EXTRA_PACKAGES =
["httpx", "langchain-openai", "langchain-core"]) which makes CI installs
non-deterministic; update the EXTRA_PACKAGES variable to include explicit
version pins (or modify install_extra_packages() to install using a shared
constraints/lockfile) so CI always installs fixed versions — for example,
replace unpinned names in EXTRA_PACKAGES with versioned strings or change
install_extra_packages() to call pip with --constraint pointing to your
repository's constraints/lockfile (ensure you update any related docs/tests to
reflect the pinned versions).
In `@projects/core/notifications/send.py`:
- Around line 303-309: Wrap the optional NOTIFICATION.md read for
fournos_notification_md in a try/except so I/O/encoding/race errors don't abort
Slack message generation: when evaluating fournos_notification_md.exists()
surround the open/read/strip sequence with a broad exception handler that logs
or ignores the error and falls back to returning the default Slack message
(i.e., only return the custom "{status_icon}
{get_bold(status)}\\n\\n{fournos_content}" when read succeeds and
fournos_content is non-empty); reference the fournos_notification_md variable
and the return site that builds the status_icon/get_bold(status) message to
locate the change.
In `@vaults/psap-models-corp-rh.yaml`:
- Around line 6-17: The vault entry contains an unused format field and a typo:
VaultManager._load_vault_definition only reads the content key's file and
description, so remove or relocate the example block under the format key from
the content entry (move it into a YAML comment or external docs) and correct the
description text from "use by" to "used by"; ensure the remaining content only
includes the keys consumed at load time (file/description) so maintainers aren't
misled by an ignored format field.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 91f0387f-50d1-408e-8966-0282847d630b
📒 Files selected for processing (10)
projects/core/agentic/on_failure.pyprojects/core/ci_entrypoint/run_common.pyprojects/core/dsl/runtime.pyprojects/core/dsl/toolbox.pyprojects/core/notifications/send.pyprojects/fournos_launcher/orchestration/submit.pyprojects/llm_d/orchestration/ci.pyprojects/llm_d/orchestration/config.yamlprojects/llm_d/orchestration/test_phase.pyvaults/psap-models-corp-rh.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
- projects/fournos_launcher/orchestration/submit.py
- projects/llm_d/orchestration/test_phase.py
- projects/core/dsl/runtime.py
- projects/llm_d/orchestration/ci.py
- projects/core/agentic/on_failure.py
|
🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 11 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 49 seconds 🔴 • Link to the test results. • No reports generated... • Task Logs:
Test configuration: |
|
/test fournos llm_d |
|
🔴 Test of 'llm_d test' failed after 00 hours 11 minutes 05 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 12 minutes 43 seconds 🔴 • Link to the test results. • No reports generated... • Task Logs:
Test configuration: |
895e932 to
9659c59
Compare
|
/test fournos llm_d |
|
🔴 Test of 'llm_d test' failed after 00 hours 11 minutes 28 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 13 minutes 07 seconds 🔴 • Link to the test results. • No reports generated... • Task Logs:
Test configuration: |
|
/test fournos llm_d smoke |
|
/test fournos llm_d smoke |
|
🔴 Test of 'llm_d test' failed after 00 hours 16 minutes 44 seconds 🔴 • Link to the test results. • Caliper postprocess completed but no reports generated. Test configuration: TEST DESCRIPTION:
FAILURE REVIEW 002 deploy llmisvc:
|
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 18 minutes 40 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• MLflow Tracking: • FOURNOS Job Details: • Task Logs:
Test configuration: |
|
/test fournos llm_d smoke |
|
🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 31 seconds 🔴 • Link to the test results. • FOURNOS Job Status:
• FOURNOS Job Details: • Task Logs: Test configuration: |
|
@kpouget: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
/test fournos llm_d smoke |
Summary by CodeRabbit