Skip to content

[core] agentic: on_failure: start working on agentic troubleshooting#70

Open
kpouget wants to merge 19 commits into
openshift-psap:mainfrom
kpouget:agentic
Open

[core] agentic: on_failure: start working on agentic troubleshooting#70
kpouget wants to merge 19 commits into
openshift-psap:mainfrom
kpouget:agentic

Conversation

@kpouget

@kpouget kpouget commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

Summary by CodeRabbit

  • New Features
    • Added an on-failure analysis agent with an interactive CLI, producing per-failure structured results and HTML reports.
  • Improvements
    • Automatically generates and propagates FAILURE artifacts for deeper diagnostics.
    • Upgraded notification composition (Slack/HTML) to include FAILURE review content, better artifact links, and clearer parsing when reports exist.
    • Improved retry/log visibility and refined job-status YAML capture for notifications.
  • Bug Fixes
    • Adjusted test-phase execution flow to skip certain finalization and model-cache preparation steps.
  • Chores
    • Added required LLM-related dependencies.

@openshift-ci

openshift-ci Bot commented Jun 3, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tosokin for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai

coderabbitai Bot commented Jun 3, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a LangChain-backed "on failure" agent that discovers FAILURE artifacts written by the runtime and toolbox on exceptions, runs a five-step multi-query LLM analysis, and generates FAILURE_REVIEW_* and HTML report artifacts. Notification senders are updated to consume these files and new NOTIFICATION.html/.md files generated by the fournos_launcher submit path. CI commands for llm_d get the @agent_review_on_failure decorator, and test-phase finalizers and model-cache preparation are disabled.

Changes

On Failure Analysis Agent

Layer / File(s) Summary
LLM client, vault config, and CI decorator scaffolding
projects/core/agentic/on_failure/__init__.py, vaults/psap-models-corp-rh.yaml, pyproject.toml, projects/core/ci_entrypoint/run_common.py
Module constants, agent_review_on_failure decorator, load_model_config/create_llm_client with vault-backed model selection, normalized base URL, disabled SSL, and httpx client; vault credential template and dependency additions.
Multi-query prompt system (FailureAnalysisQueries)
projects/core/agentic/failure_analysis_queries.py
FailureAnalysisQueries builds shared markdown context and five ordered prompt payloads; execute_query_sequence resolves dependencies, invokes the LLM, captures token/timing metadata, and returns aggregated results.
FAILURE discovery, log ingestion, and structured extraction
projects/core/agentic/on_failure/__init__.py
Utilities find FAILURE files, read paired ansible/task logs, extract DSL task-failure patterns, enumerate artifact files, and regex-extract structured analysis fields from LLM text.
HTML report template and generation
projects/core/agentic/failure_analysis_report_template.html, projects/core/agentic/on_failure/report.py
HTML template with collapsible CSS/JS sections; generate_html_report loads it (or falls back inline), escapes and wraps query/response content, injects per-interaction metadata, aggregates file-consumption summary, and writes failure_analysis_report.html.
Per-failure multi-query analysis orchestration
projects/core/agentic/on_failure/__init__.py
analyze_single_failure_multi_query selects relevant files, runs the query sequence, builds the structured result, generates FAILURE_REVIEW_* and optional HTML; process_failure_analysis aggregates counts; run_on_failure_agent wires vault+model end-to-end.
CLI output and module entrypoint
projects/core/agentic/on_failure/cli.py
Click command accepts artifact dir, output format, and verbosity; dispatches between JSON and formatted text output using print helpers for header, summary, per-failure metadata, structured fields, and full analysis text.

FAILURE generation, notifications, and CI wiring

Layer / File(s) Summary
FAILURE file generation in runtime and toolbox
projects/core/dsl/runtime.py, projects/core/dsl/toolbox.py, projects/core/dsl/log.py
execute_tasks exception handler calls _generate_failure_file_for_agent; run_toolbox_command calls a lazy-import wrapper; _format_failure_content_for_agent assembles timestamp/stack/context report; task log banner strips surrounding whitespace.
DSL retry-attempt logging refactor
projects/core/dsl/task.py
_log_retry_attempt centralizes formatted retry logging (task id, elapsed time, delay, sleep); both the return-value and exception retry paths in _execute_with_retry call it instead of duplicating inline logic.
Notification file generation and consumption
projects/fournos_launcher/orchestration/submit.py, projects/core/notifications/send.py
generate_notification_files scans for forge-llm-d final-status YAMLs, extracts successful MLflow run URLs, and writes NOTIFICATION.html/.md; get_common_message appends NOTIFICATION.html and prefers FAILURE_REVIEW_* files; get_slack_thread_message returns early with NOTIFICATION.md content when present.
Job YAML annotation cleanup in submit_and_wait
projects/fournos_launcher/toolbox/submit_and_wait/main.py
capture_final_job_status writes YAML to an explicit path, strips unwanted Kubernetes/kopf annotations, and rewrites the cleaned file; capture_pod_specs extracts the output path to a variable.
CI decorator application and environment wiring
projects/llm_d/orchestration/ci.py, projects/llm_d/orchestration/config.yaml
@agent_review_on_failure applied to prepare, test, and pre_cleanup subcommands; psap-models-corp-rh added to optional vaults list.
Disable finalizers and skip model cache in test phase
projects/llm_d/orchestration/test_phase.py
do_test forces do_finalizers = False in the finally block; _prepare_model_cache inserts an unconditional early return skipping all cache-preparation logic.

Sequence Diagram

sequenceDiagram
  rect rgba(173, 216, 230, 0.5)
    note over CICommand,ArtifactDir: CI execution and FAILURE artifact generation
    CICommand->>runtime_py: execute_tasks()
    runtime_py->>ArtifactDir: write FAILURE file (_generate_failure_file_for_agent)
  end

  rect rgba(255, 218, 185, 0.5)
    note over agent_review_on_failure,ChatOpenAI: On-failure LLM analysis
    CICommand->>agent_review_on_failure: non-zero exit or exception
    agent_review_on_failure->>run_on_failure_agent: run_on_failure_agent(base_artifact_dir)
    run_on_failure_agent->>Vault: load_model_config(psap-models-corp-rh)
    Vault-->>run_on_failure_agent: model credentials
    run_on_failure_agent->>ChatOpenAI: create_llm_client(model_config)
    run_on_failure_agent->>process_failure_analysis: process_failure_analysis(dir, llm)
    process_failure_analysis->>ArtifactDir: find_failure_files()
    ArtifactDir-->>process_failure_analysis: FAILURE paths
    process_failure_analysis->>execute_query_sequence: FailureAnalysisQueries + llm
    execute_query_sequence->>ChatOpenAI: 5x llm.invoke(HumanMessage)
    ChatOpenAI-->>execute_query_sequence: analysis responses
    execute_query_sequence-->>process_failure_analysis: structured results
    process_failure_analysis->>ArtifactDir: write FAILURE_REVIEW_* + failure_analysis_report.html
  end

  rect rgba(144, 238, 144, 0.5)
    note over submit_py,Slack: Notification delivery
    submit_py->>ArtifactDir: write NOTIFICATION.html + NOTIFICATION.md
    send_py->>ArtifactDir: read FAILURE_REVIEW_* + NOTIFICATION.html + NOTIFICATION.md
    send_py->>Slack: post message with failure analysis and MLflow links
  end
Loading

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • openshift-psap/forge#6: Adds the initial Click-based prepare/test/pre_cleanup subcommands in projects/llm_d/orchestration/ci.py—this PR adds @agent_review_on_failure to those same subcommands.
  • openshift-psap/forge#17: Modifies projects/core/dsl/toolbox.py around run_toolbox_command error handling—this PR extends that same handler to call _generate_failure_file on exceptions.
  • openshift-psap/forge#26: Modifies projects/core/dsl/runtime.py execute_tasks function—this PR adds FAILURE-artifact generation on exceptions while the retrieved PR changes execute_tasks to return shared_context and attach artifact_dir.

Poem

🐇 Hoppity-hop through the broken CI,
The rabbit reads logs so you don't have to cry.
A FAILURE file lands, the agent awakes,
Five queries to LLM, no tea-time it takes.
NOTIFICATION.md glows in the Slack night—
✨ The bunny brings order from stderr's fright! 🌙

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and specifically describes the main objective: introducing agentic failure analysis capabilities to the core module, as evidenced by the substantial new code across multiple files implementing failure analysis, LLM integration, and automated agent-based troubleshooting.
Docstring Coverage ✅ Passed Docstring coverage is 89.32% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1005-1077: The code currently trusts LLM-provided paths and
patterns (resolved_files, resolved_path, safe_filename, rglob) which allows
directory traversal and glob injection; fix by resolving each candidate path to
an absolute Path via resolved_path.resolve(strict=False) and verifying it is
inside base_artifact_dir.resolve() (e.g., compare parents or use
is_relative_to), and reject/skip any path that is outside the artifact tree
(return an error or mark as not found). Stop passing model-controlled patterns
into rglob: instead build a safe index once from base_artifact_dir (enumerate
base_artifact_dir.rglob("*") to collect actual artifact file paths and names)
and only allow fallback matches by exact filename lookup against that enumerated
list; do not perform wildcard rglob with user input. Ensure any
rejected/unauthorized requests are logged and not read/sent back to the LLM.
- Around line 802-815: Update the prompt construction in on_failure.py so the
initial user-facing prompt always requests the five-field schema (Root Cause,
Failed Step, Trigger, Fix, Prevent) on the very first turn; modify the logic in
the functions that build or call get_failure_explanations() (and the related
prompt text around lines referenced near 802-815 and 878-884) to prepend or
include an explicit instruction asking for the 5-field schema and the
NEED_MORE_FILES line format whenever FAILURE + _ansible.log are present,
ensuring extract_structured_analysis() receives those structured fields
populated.
- Around line 741-746: process_failure_analysis() currently writes every HTML
report to the constant name html_filename = "failure_analysis_report.html",
causing subsequent reports to overwrite previous ones; change the filename
generation to be unique per failure (e.g., include a unique failure identifier
or a timestamp/uuid) when building html_filename/html_path so each call writes
to a distinct file (reference html_filename, html_path, and the
process_failure_analysis() call site to ensure the returned/stored html_report
path matches the new unique name).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a9bedd50-d492-43e7-b8a3-7c8641712fc9

📥 Commits

Reviewing files that changed from the base of the PR and between 191d353 and 0743170.

📒 Files selected for processing (1)
  • projects/core/agentic/on_failure.py

Comment thread projects/core/agentic/on_failure/__init__.py Outdated
Comment thread projects/core/agentic/on_failure.py Outdated
Comment thread projects/core/agentic/on_failure.py Outdated
@kpouget kpouget force-pushed the agentic branch 3 times, most recently from b117cb0 to 476e171 Compare June 5, 2026 12:27
@kpouget

kpouget commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster psap-mgmt
/pipeline forge-test-only

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

♻️ Duplicate comments (4)
projects/core/agentic/on_failure.py (4)

982-1007: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ask for the 5-field schema on the first turn too.

When the model can answer from FAILURE + log content without requesting additional files, this prompt never asks for Root Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed into extract_structured_analysis() at line 1093. In that path, get_failure_explanations() will usually return empty structured fields even though the public API promises structured output.

💡 Suggested fix

Add the 5-field schema request to the initial prompt:

 ## TASK:
 1. Find failure patterns:
    - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks
    - In execution logs: Look for "----- FAILED ----" patterns or error indicators
    - Identify which specific tasks failed and why
 2. What step broke. Why.
 3. What action caused failure.
-4. Need more files? Use EXACTLY this format:
+4. Provide structured analysis in this format:
+
+1. **Root Cause**: One sentence. What broke.
+2. **Failed Step**: Which step. Why it failed.
+3. **Trigger**: Specific action/config that caused failure.
+4. **Fix**: What to change.
+5. **Prevent**: How to avoid next time.
+
+5. Need more files? Use EXACTLY this format:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial
prompt assigned to prompt_content should explicitly request the 5-field schema
(Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so
downstream extract_structured_analysis() and get_failure_explanations() receive
structured fields; modify the string built in prompt_content inside
on_failure.py to append a short line asking for the five named fields in a fixed
format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ...,
Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact
field names match what extract_structured_analysis() and
get_failure_explanations() expect.

1219-1299: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Constrain model-requested paths to the artifact tree before reading them.

The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into rglob(). A prompt-injected request such as ../../../../etc/passwd or *.pem can therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outside base_artifact_dir, and fall back only to exact matches from the enumerated artifact files.

🔒 Recommended fix
+        # Resolve and validate path is within artifact tree
+        try:
+            resolved_abs = resolved_path.resolve()
+            artifact_abs = base_artifact_dir.resolve()
+            if not resolved_abs.is_relative_to(artifact_abs):
+                logger.warning(f"❌ Rejected path outside artifact tree: {filename}")
+                file_contents[filename] = "❌ Unauthorized: File path outside artifact directory"
+                continue
+        except Exception as e:
+            logger.warning(f"Path resolution failed for {filename}: {e}")
+            file_contents[filename] = f"❌ Path resolution error: {e}"
+            continue
+
         # First try the resolved path
         if resolved_path.exists():
             found_path = resolved_path
         else:
-            # Search in the base artifact directory recursively (silent search)
+            # Search only in pre-enumerated artifact files (no glob patterns)
             try:
-                # Use a safe search pattern - escape any problematic characters
-                safe_filename = filename.replace("**", "*")  # Replace double asterisks
-                for search_path in base_artifact_dir.rglob(safe_filename):
-                    if search_path.is_file():
-                        found_path = search_path
-                        break
-
-                # If no match with pattern, try exact filename search
-                if not found_path:
-                    for search_path in base_artifact_dir.rglob("*"):
-                        if search_path.is_file() and search_path.name == filename:
-                            found_path = search_path
-                            break
+                # Only allow exact filename matches from available_files list
+                for available_file in available_files:
+                    if Path(available_file).name == filename:
+                        candidate = base_artifact_dir / available_file
+                        if candidate.exists() and candidate.is_file():
+                            found_path = candidate
+                            break
             except Exception as search_error:
                 logger.warning(f"Search error for '{filename}': {search_error}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The
LLM-controlled filenames (handled in resolved_files, resolved_path, and later
rglob search) can escape the artifact tree; fix by canonicalizing and
constraining every candidate: when building resolved_path (and before any
rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is
within base_artifact_dir.resolve() (e.g., via startswith or comparison),
rejecting or skipping any path that does not lie under base_artifact_dir; remove
use of user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.

914-921: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a unique report filename per failure.

process_failure_analysis() calls this once per FAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlier html_report paths end up pointing at the last one written.

📝 One simple fix
-    html_filename = "failure_analysis_report.html"
+    safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/"))
+    html_filename = f"failure_analysis_report_{safe_failure_dir}.html"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 914 - 921,
process_failure_analysis currently writes every failure report to the constant
"failure_analysis_report.html" (variables html_filename/html_path), causing
later reports to overwrite earlier ones; change it to generate a unique filename
per invocation (for example include the failure identifier, timestamp, or a uuid
in the filename) before constructing html_path in the same block so each call
creates a distinct file and return the new unique path string from
process_failure_analysis.

205-218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disabling TLS verification exposes the LLM client to man-in-the-middle attacks.

Setting verify=False and suppressing InsecureRequestWarning disables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configure httpx.Client(verify="/path/to/ca-bundle.crt") instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.

🔒 Recommended fix to support custom CA bundle
+    # Allow custom CA bundle or disable verification only if explicitly opted in
+    ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE")
+    verify_tls = ca_bundle if ca_bundle else True  # Default to secure
+
+    if verify_tls is False:
+        logger.warning("⚠️  TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure")
+
-    with warnings.catch_warnings():
-        warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning)
-
-        # Create HTTP client with proper headers for litellm endpoint
-        http_client = httpx.Client(
-            verify=False,
+        http_client = httpx.Client(
+            verify=verify_tls,
             headers={
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client
currently disables TLS verification by passing verify=False in the httpx.Client
call in on_failure.py; replace this with a secure approach: read a configurable
CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string
to httpx.Client(verify=ca_path) when present, and only fall back to verify=False
when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set;
also restrict suppressing InsecureRequestWarning to the insecure branch and emit
a clear log warning when verify is intentionally disabled. Ensure changes target
the http_client construction site and associated warnings.catch_warnings() usage
so the default behavior enforces certificate validation.
🧹 Nitpick comments (2)
projects/core/agentic/on_failure.py (1)

1559-1564: ⚡ Quick win

Consider parameterizing the hardcoded model key.

The model key "qwen-3-6-35b" is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.

♻️ Suggested change
-        MODEL_KEY = "qwen-3-6-35b"
+        MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the
hardcoded MODEL_KEY assignment by reading a configurable environment variable:
instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment
lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default
string; then use that variable when calling models_config.get(MODEL_KEY) and
raise the same ValueError if missing. Ensure you import os if not present and
keep the existing models_config lookup and error handling intact.
projects/fournos_launcher/orchestration/submit.py (1)

59-161: ⚡ Quick win

Consider adding artifact directory validation.

The function uses pathlib.Path(env.ARTIFACT_DIR) without verifying that ARTIFACT_DIR is set or that the path exists. If ARTIFACT_DIR is None or an invalid path, the subsequent glob operations could fail silently or raise exceptions.

🛡️ Suggested defensive check
 def generate_notification_files():
     """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information."""
     artifact_dir = pathlib.Path(env.ARTIFACT_DIR)
+    
+    if not artifact_dir or not artifact_dir.exists():
+        logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}")
+        return
+    
     mlflow_urls = []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The
generate_notification_files function uses artifact_dir =
pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.

In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.

In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.

In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.

---

Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.

---

Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.

In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877

📥 Commits

Reviewing files that changed from the base of the PR and between 0743170 and 24fa1ab.

📒 Files selected for processing (8)
  • projects/core/agentic/on_failure.py
  • projects/core/dsl/runtime.py
  • projects/core/dsl/toolbox.py
  • projects/core/notifications/send.py
  • projects/fournos_launcher/orchestration/submit.py
  • projects/llm_d/orchestration/ci.py
  • projects/llm_d/orchestration/test_phase.py
  • vaults/psap-models-corp-rh.yaml

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 6

♻️ Duplicate comments (4)
projects/core/agentic/on_failure.py (4)

982-1007: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Ask for the 5-field schema on the first turn too.

When the model can answer from FAILURE + log content without requesting additional files, this prompt never asks for Root Cause / Failed Step / Trigger / Fix / Prevent, yet the result is still fed into extract_structured_analysis() at line 1093. In that path, get_failure_explanations() will usually return empty structured fields even though the public API promises structured output.

💡 Suggested fix

Add the 5-field schema request to the initial prompt:

 ## TASK:
 1. Find failure patterns:
    - In DSL logs: Look for "==> TASK FAILED:" patterns - these mark specific failed tasks
    - In execution logs: Look for "----- FAILED ----" patterns or error indicators
    - Identify which specific tasks failed and why
 2. What step broke. Why.
 3. What action caused failure.
-4. Need more files? Use EXACTLY this format:
+4. Provide structured analysis in this format:
+
+1. **Root Cause**: One sentence. What broke.
+2. **Failed Step**: Which step. Why it failed.
+3. **Trigger**: Specific action/config that caused failure.
+4. **Fix**: What to change.
+5. **Prevent**: How to avoid next time.
+
+5. Need more files? Use EXACTLY this format:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 982 - 1007, The initial
prompt assigned to prompt_content should explicitly request the 5-field schema
(Root Cause / Failed Step / Trigger / Fix / Prevent) on the first turn so
downstream extract_structured_analysis() and get_failure_explanations() receive
structured fields; modify the string built in prompt_content inside
on_failure.py to append a short line asking for the five named fields in a fixed
format (e.g. "Return: Root Cause: ..., Failed Step: ..., Trigger: ..., Fix: ...,
Prevent: ...") whenever FAILURE and log_content are provided, ensuring the exact
field names match what extract_structured_analysis() and
get_failure_explanations() expect.

1219-1299: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Constrain model-requested paths to the artifact tree before reading them.

The requested filenames come from the LLM, but relative paths are joined verbatim and the fallback search also feeds model-controlled text into rglob(). A prompt-injected request such as ../../../../etc/passwd or *.pem can therefore disclose arbitrary local files and then send that content back to the LLM. Resolve each path, reject anything outside base_artifact_dir, and fall back only to exact matches from the enumerated artifact files.

🔒 Recommended fix
+        # Resolve and validate path is within artifact tree
+        try:
+            resolved_abs = resolved_path.resolve()
+            artifact_abs = base_artifact_dir.resolve()
+            if not resolved_abs.is_relative_to(artifact_abs):
+                logger.warning(f"❌ Rejected path outside artifact tree: {filename}")
+                file_contents[filename] = "❌ Unauthorized: File path outside artifact directory"
+                continue
+        except Exception as e:
+            logger.warning(f"Path resolution failed for {filename}: {e}")
+            file_contents[filename] = f"❌ Path resolution error: {e}"
+            continue
+
         # First try the resolved path
         if resolved_path.exists():
             found_path = resolved_path
         else:
-            # Search in the base artifact directory recursively (silent search)
+            # Search only in pre-enumerated artifact files (no glob patterns)
             try:
-                # Use a safe search pattern - escape any problematic characters
-                safe_filename = filename.replace("**", "*")  # Replace double asterisks
-                for search_path in base_artifact_dir.rglob(safe_filename):
-                    if search_path.is_file():
-                        found_path = search_path
-                        break
-
-                # If no match with pattern, try exact filename search
-                if not found_path:
-                    for search_path in base_artifact_dir.rglob("*"):
-                        if search_path.is_file() and search_path.name == filename:
-                            found_path = search_path
-                            break
+                # Only allow exact filename matches from available_files list
+                for available_file in available_files:
+                    if Path(available_file).name == filename:
+                        candidate = base_artifact_dir / available_file
+                        if candidate.exists() and candidate.is_file():
+                            found_path = candidate
+                            break
             except Exception as search_error:
                 logger.warning(f"Search error for '{filename}': {search_error}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1219 - 1299, The
LLM-controlled filenames (handled in resolved_files, resolved_path, and later
rglob search) can escape the artifact tree; fix by canonicalizing and
constraining every candidate: when building resolved_path (and before any
rglob), use (base_artifact_dir / relative).resolve() and ensure resolved_path is
within base_artifact_dir.resolve() (e.g., via startswith or comparison),
rejecting or skipping any path that does not lie under base_artifact_dir; remove
use of user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.

914-921: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Use a unique report filename per failure.

process_failure_analysis() calls this once per FAILURE, but the filename is constant, so every verbose analysis overwrites the previous report and all earlier html_report paths end up pointing at the last one written.

📝 One simple fix
-    html_filename = "failure_analysis_report.html"
+    safe_failure_dir = re.sub(r"[^A-Za-z0-9._-]+", "_", failure_dir.strip("/"))
+    html_filename = f"failure_analysis_report_{safe_failure_dir}.html"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 914 - 921,
process_failure_analysis currently writes every failure report to the constant
"failure_analysis_report.html" (variables html_filename/html_path), causing
later reports to overwrite earlier ones; change it to generate a unique filename
per invocation (for example include the failure identifier, timestamp, or a uuid
in the filename) before constructing html_path in the same block so each call
creates a distinct file and return the new unique path string from
process_failure_analysis.

205-218: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Disabling TLS verification exposes the LLM client to man-in-the-middle attacks.

Setting verify=False and suppressing InsecureRequestWarning disables certificate validation entirely. Even on internal networks, an attacker with network access can intercept credentials (user_key) and LLM traffic. If the corporate endpoint uses self-signed certificates, configure httpx.Client(verify="/path/to/ca-bundle.crt") instead. If TLS bypass is truly required, document the threat model and enforce it only when an explicit environment variable is set.

🔒 Recommended fix to support custom CA bundle
+    # Allow custom CA bundle or disable verification only if explicitly opted in
+    ca_bundle = os.environ.get("FORGE_AGENT_CA_BUNDLE")
+    verify_tls = ca_bundle if ca_bundle else True  # Default to secure
+
+    if verify_tls is False:
+        logger.warning("⚠️  TLS verification disabled via FORGE_AGENT_DISABLE_TLS_VERIFY - connections are not secure")
+
-    with warnings.catch_warnings():
-        warnings.simplefilter("ignore", urllib3.exceptions.InsecureRequestWarning)
-
-        # Create HTTP client with proper headers for litellm endpoint
-        http_client = httpx.Client(
-            verify=False,
+        http_client = httpx.Client(
+            verify=verify_tls,
             headers={
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 205 - 218, The HTTP client
currently disables TLS verification by passing verify=False in the httpx.Client
call in on_failure.py; replace this with a secure approach: read a configurable
CA bundle path (e.g., from an env var like HTTPX_CA_BUNDLE) and pass that string
to httpx.Client(verify=ca_path) when present, and only fall back to verify=False
when an explicit allow-insecure env var (e.g., DISABLE_TLS_VERIFY=true) is set;
also restrict suppressing InsecureRequestWarning to the insecure branch and emit
a clear log warning when verify is intentionally disabled. Ensure changes target
the http_client construction site and associated warnings.catch_warnings() usage
so the default behavior enforces certificate validation.
🧹 Nitpick comments (2)
projects/core/agentic/on_failure.py (1)

1559-1564: ⚡ Quick win

Consider parameterizing the hardcoded model key.

The model key "qwen-3-6-35b" is hardcoded, which limits flexibility if you need to test with different models or switch models based on environment. Consider reading it from an environment variable with a fallback to the current default.

♻️ Suggested change
-        MODEL_KEY = "qwen-3-6-35b"
+        MODEL_KEY = os.environ.get("FORGE_AGENT_MODEL_KEY", "qwen-3-6-35b")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 1559 - 1564, Replace the
hardcoded MODEL_KEY assignment by reading a configurable environment variable:
instead of setting MODEL_KEY = "qwen-3-6-35b", obtain it via an environment
lookup (e.g., os.getenv("MODEL_KEY")) with a fallback to the current default
string; then use that variable when calling models_config.get(MODEL_KEY) and
raise the same ValueError if missing. Ensure you import os if not present and
keep the existing models_config lookup and error handling intact.
projects/fournos_launcher/orchestration/submit.py (1)

59-161: ⚡ Quick win

Consider adding artifact directory validation.

The function uses pathlib.Path(env.ARTIFACT_DIR) without verifying that ARTIFACT_DIR is set or that the path exists. If ARTIFACT_DIR is None or an invalid path, the subsequent glob operations could fail silently or raise exceptions.

🛡️ Suggested defensive check
 def generate_notification_files():
     """Generate NOTIFICATION.html and NOTIFICATION.md files with MLflow tracking information."""
     artifact_dir = pathlib.Path(env.ARTIFACT_DIR)
+    
+    if not artifact_dir or not artifact_dir.exists():
+        logger.warning(f"Artifact directory not available or doesn't exist: {artifact_dir}")
+        return
+    
     mlflow_urls = []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/fournos_launcher/orchestration/submit.py` around lines 59 - 161, The
generate_notification_files function uses artifact_dir =
pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 34-49: The import block in projects/core/agentic/on_failure.py is
unsorted causing Ruff I001; fix it by running "ruff check
projects/core/agentic/on_failure.py --fix" or manually reorder the imports in
the module-level import section so that standard library imports (datetime,
functools, json, logging, os, re, sys, warnings, pathlib.Path) come first,
followed by third-party imports (click, httpx, urllib3, yaml,
langchain_core.messages.HumanMessage), and then any local imports, ensure blank
lines between groups and remove any unused imports; after reordering, run the
linter to confirm the I001 is resolved.
- Around line 34-49: The file on_failure.py imports httpx, urllib3, and
langchain_core and dynamically loads langchain_openai.ChatOpenAI, but these
packages are not declared in pyproject.toml; update the project's pyproject.toml
to add httpx, urllib3, and langchain_core to [project].dependencies (or put them
under [project.optional-dependencies] with appropriate extras if optional) and
ensure langchain_openai (or the package that provides ChatOpenAI) is declared as
well so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol
succeed on a fresh install.

In `@projects/core/notifications/send.py`:
- Around line 299-313: The NOTIFICATION.md file read (artifact_dir /
"NOTIFICATION.md" using fournos_notification_md) must be wrapped in a try/except
to avoid raising on I/O/encoding/permission errors; update the block in send.py
around the with open(fournos_notification_md, ...) and fournos_content =
f.read().strip() to catch Exception as e, handle/log the error (consistent with
the existing logging approach used by get_common_message's NOTIFICATION.html
handling), and fall back to returning get_common_message(finish_reason,
f"{status_icon} {status}", get_link, get_italics, get_bold) when an error occurs
or the file is unreadable.

In `@projects/llm_d/orchestration/test_phase.py`:
- Around line 170-181: The _prepare_model_cache function currently returns
immediately making its docstring and body (including the call to
prepare_model_cache()) unreachable; remove the stray return, place the docstring
right after the def if desired, and restore the body so it imports
runtime_config, retrieves model_key via runtime_config.get_model_key(), logs via
logger.info, and calls prepare_model_cache() to perform PVC/token handling;
verify references to _prepare_model_cache(), runtime_config, logger, and
prepare_model_cache are unchanged.
- Around line 93-96: The code unconditionally disables finalizers by setting
do_finalizers = False before checking primary_exc; remove that unconditional
assignment and instead initialize do_finalizers = True (or omit the assignment)
and only set do_finalizers = False when primary_exc indicates a SignalInterrupt
(i.e., keep the isinstance(primary_exc[1], SignalInterrupt) branch and set
do_finalizers = False there); ensure the variable controlling execution of
finalizer callbacks (do_finalizers) is True in the normal flow so finalizer
callbacks (capture state, write endpoint, capture events, cleanup) run unless a
SignalInterrupt is detected.

In `@vaults/psap-models-corp-rh.yaml`:
- Line 8: Fix the typo in the YAML description string "Access to the RH models
use by FORGE agent" by changing "use" to "used" so the line reads "Access to the
RH models used by FORGE agent"; locate and update that description entry in the
psap-models-corp-rh YAML block.

---

Duplicate comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 982-1007: The initial prompt assigned to prompt_content should
explicitly request the 5-field schema (Root Cause / Failed Step / Trigger / Fix
/ Prevent) on the first turn so downstream extract_structured_analysis() and
get_failure_explanations() receive structured fields; modify the string built in
prompt_content inside on_failure.py to append a short line asking for the five
named fields in a fixed format (e.g. "Return: Root Cause: ..., Failed Step: ...,
Trigger: ..., Fix: ..., Prevent: ...") whenever FAILURE and log_content are
provided, ensuring the exact field names match what
extract_structured_analysis() and get_failure_explanations() expect.
- Around line 1219-1299: The LLM-controlled filenames (handled in
resolved_files, resolved_path, and later rglob search) can escape the artifact
tree; fix by canonicalizing and constraining every candidate: when building
resolved_path (and before any rglob), use (base_artifact_dir /
relative).resolve() and ensure resolved_path is within
base_artifact_dir.resolve() (e.g., via startswith or comparison), rejecting or
skipping any path that does not lie under base_artifact_dir; remove use of
user-controlled patterns in rglob (do not pass raw filename patterns to
base_artifact_dir.rglob), instead pre-enumerate all files under
base_artifact_dir into a safe map (path.resolve() values and names) and fall
back only to exact matches from that enumerated set; log and return an error or
skip entries that are outside the artifact tree and never read files outside
base_artifact_dir into file_contents.
- Around line 914-921: process_failure_analysis currently writes every failure
report to the constant "failure_analysis_report.html" (variables
html_filename/html_path), causing later reports to overwrite earlier ones;
change it to generate a unique filename per invocation (for example include the
failure identifier, timestamp, or a uuid in the filename) before constructing
html_path in the same block so each call creates a distinct file and return the
new unique path string from process_failure_analysis.
- Around line 205-218: The HTTP client currently disables TLS verification by
passing verify=False in the httpx.Client call in on_failure.py; replace this
with a secure approach: read a configurable CA bundle path (e.g., from an env
var like HTTPX_CA_BUNDLE) and pass that string to httpx.Client(verify=ca_path)
when present, and only fall back to verify=False when an explicit allow-insecure
env var (e.g., DISABLE_TLS_VERIFY=true) is set; also restrict suppressing
InsecureRequestWarning to the insecure branch and emit a clear log warning when
verify is intentionally disabled. Ensure changes target the http_client
construction site and associated warnings.catch_warnings() usage so the default
behavior enforces certificate validation.

---

Nitpick comments:
In `@projects/core/agentic/on_failure.py`:
- Around line 1559-1564: Replace the hardcoded MODEL_KEY assignment by reading a
configurable environment variable: instead of setting MODEL_KEY =
"qwen-3-6-35b", obtain it via an environment lookup (e.g.,
os.getenv("MODEL_KEY")) with a fallback to the current default string; then use
that variable when calling models_config.get(MODEL_KEY) and raise the same
ValueError if missing. Ensure you import os if not present and keep the existing
models_config lookup and error handling intact.

In `@projects/fournos_launcher/orchestration/submit.py`:
- Around line 59-161: The generate_notification_files function uses artifact_dir
= pathlib.Path(env.ARTIFACT_DIR) without validating env.ARTIFACT_DIR or that the
path exists; add a defensive check at the start of generate_notification_files
to validate env.ARTIFACT_DIR is set (not None/empty) and that
artifact_dir.exists() and artifact_dir.is_dir(); if the check fails, log a clear
warning via logger (e.g., logger.warning) and return early to skip the rest of
the function, preventing runtime errors during artifact_dir.glob calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c8785fe4-d1a4-4e66-9fb9-c8118c5c3877

📥 Commits

Reviewing files that changed from the base of the PR and between 0743170 and 24fa1ab.

📒 Files selected for processing (8)
  • projects/core/agentic/on_failure.py
  • projects/core/dsl/runtime.py
  • projects/core/dsl/toolbox.py
  • projects/core/notifications/send.py
  • projects/fournos_launcher/orchestration/submit.py
  • projects/llm_d/orchestration/ci.py
  • projects/llm_d/orchestration/test_phase.py
  • vaults/psap-models-corp-rh.yaml
🛑 Comments failed to post (5)
projects/core/agentic/on_failure.py (1)

34-49: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Fix the import sorting to resolve the pipeline failure.

The Ruff linter reports that the import block is unsorted or unformatted (I001). This is causing the CI pipeline to fail.

Run ruff check projects/core/agentic/on_failure.py --fix to automatically organize the imports, or manually sort them according to PEP 8 conventions (standard library, third-party, local imports).

🧰 Tools
🪛 GitHub Actions: Ruff / 0_lint (3.12).txt

[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with ruff check --fix).

🪛 GitHub Actions: Ruff / lint (3.12)

[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with ruff check projects/ bin/ --fix).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The import block
in projects/core/agentic/on_failure.py is unsorted causing Ruff I001; fix it by
running "ruff check projects/core/agentic/on_failure.py --fix" or manually
reorder the imports in the module-level import section so that standard library
imports (datetime, functools, json, logging, os, re, sys, warnings,
pathlib.Path) come first, followed by third-party imports (click, httpx,
urllib3, yaml, langchain_core.messages.HumanMessage), and then any local
imports, ensure blank lines between groups and remove any unused imports; after
reordering, run the linter to confirm the I001 is resolved.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if httpx, urllib3, langchain-core, langchain-openai are in pyproject.toml

echo "=== Checking pyproject.toml for required dependencies ==="
rg -n 'httpx|urllib3|langchain' pyproject.toml || echo "❌ Missing dependencies detected"

echo ""
echo "=== Dependencies imported in on_failure.py ==="
rg -n '^import (httpx|urllib3)|^from langchain' projects/core/agentic/on_failure.py

Repository: openshift-psap/forge

Length of output: 288


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Imports in on_failure.py (full scan for relevant packages) ==="
rg -n "^(import (httpx|urllib3)|from langchain(_core|_openai)|from langchain|import langchain|import .*langchain|from langchain_openai|import .*openai|from .*openai)" projects/core/agentic/on_failure.py || true
echo

echo "=== pyproject.toml: look for dependency declarations (more precise) ==="
rg -n "(^|\\W)(httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain)(\\W|$)" pyproject.toml || true
echo

echo "=== pyproject.toml: show dependency sections around project.dependencies and optional-dependencies ==="
# Print the first ~200 lines to locate [project] sections quickly (bounded output)
sed -n '1,220p' pyproject.toml | nl -ba
echo

echo "=== pyproject.toml: if optional-dependencies exists, show it (bounded) ==="
if rg -n "^\[project\.optional-dependencies\]" -S pyproject.toml >/dev/null; then
  start=$(rg -n "^\[project\.optional-dependencies\]" -n pyproject.toml | head -n1 | cut -d: -f1)
  end=$((start+220))
  sed -n "${start},${end}p" pyproject.toml | nl -ba
else
  echo "No [project.optional-dependencies] section found."
fi

Repository: openshift-psap/forge

Length of output: 456


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Locate pyproject.toml files ==="
fd -t f -e toml "pyproject.toml" . || true
echo

# Choose the root pyproject.toml if it exists; otherwise fall back to first found
PYPROJECT="$(fd -t f -e toml "pyproject.toml" . | head -n1 || true)"
if [[ -z "${PYPROJECT}" ]]; then
  echo "No pyproject.toml found in repo."
  exit 0
fi
echo "Using: ${PYPROJECT}"
echo

echo "=== Imports in projects/core/agentic/on_failure.py (full scan for relevant packages) ==="
rg -n "(^import (httpx|urllib3)|^from langchain(_core|_openai)?(\.| )|^from langchain import|^from langchain_openai|^from .*openai|^import .*langchain|^import .*openai|^import .*yaml|^import .*click)" projects/core/agentic/on_failure.py || true
echo

echo "=== pyproject.toml: search for declared dependencies (strings) ==="
rg -n "httpx|urllib3|langchain-core|langchain-openai|langchain_core|langchain_openai|langchain" "${PYPROJECT}" || echo "No matching dependency strings found."
echo

echo "=== pyproject.toml: show [project] dependencies sections (bounded) ==="
# Show near likely dependency keys
rg -n "^\[project\]" "${PYPROJECT}" || true
rg -n "^(dependencies|optional-dependencies)\s*=" "${PYPROJECT}" || true

# Print first 260 lines to include [project] metadata and dependency lists (bounded output)
sed -n '1,260p' "${PYPROJECT}" | cat -n
echo

if rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" >/dev/null; then
  start=$(rg -n "^\[project\.optional-dependencies\]" "${PYPROJECT}" | head -n1 | cut -d: -f1)
  end=$((start+260))
  echo "=== [project.optional-dependencies] (bounded) ==="
  sed -n "${start},${end}p" "${PYPROJECT}" | cat -n
else
  echo "No [project.optional-dependencies] section found."
fi

Repository: openshift-psap/forge

Length of output: 7970


🏁 Script executed:

#!/bin/bash
set -euo pipefail
rg -n "langchain_openai|langchain(_core|_openai)?|import .*langchain|from .*langchain" projects/core/agentic/on_failure.py || true

Repository: openshift-psap/forge

Length of output: 235


Declare missing dependencies in pyproject.toml.

projects/core/agentic/on_failure.py imports httpx, urllib3, and langchain_core, and later dynamically imports langchain_openai.ChatOpenAI, but ./pyproject.toml has neither of these packages in [project].dependencies nor in [project.optional-dependencies]. This will cause ImportErrors on a fresh install.

🧰 Tools
🪛 GitHub Actions: Ruff / 0_lint (3.12).txt

[error] 34-49: ruff check failed (I001). Import block is un-sorted or un-formatted. Organize imports (can be fixed with ruff check --fix).

🪛 GitHub Actions: Ruff / lint (3.12)

[error] 34-49: Ruff (I001): Import block is un-sorted or un-formatted. Organize imports (1 error found; fixable with ruff check projects/ bin/ --fix).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/agentic/on_failure.py` around lines 34 - 49, The file
on_failure.py imports httpx, urllib3, and langchain_core and dynamically loads
langchain_openai.ChatOpenAI, but these packages are not declared in
pyproject.toml; update the project's pyproject.toml to add httpx, urllib3, and
langchain_core to [project].dependencies (or put them under
[project.optional-dependencies] with appropriate extras if optional) and ensure
langchain_openai (or the package that provides ChatOpenAI) is declared as well
so imports of httpx, urllib3, langchain_core, and the ChatOpenAI symbol succeed
on a fresh install.
projects/core/notifications/send.py (1)

299-313: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add exception handling for file read operations.

The NOTIFICATION.md file read (lines 304-305) is not wrapped in a try-except block, unlike the NOTIFICATION.html read in get_common_message (lines 215-223). If the file read fails (e.g., permissions issue, encoding error), it will raise an exception and potentially break the Slack notification flow.

🛡️ Suggested error handling
     # Check for fournos_launcher generated Slack notification content
     artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", ""))
     fournos_notification_md = artifact_dir / "NOTIFICATION.md"

     if fournos_notification_md.exists():
-        with open(fournos_notification_md, encoding="utf-8") as f:
-            fournos_content = f.read().strip()
-        if fournos_content:
-            # Return custom notification content with status icon
-            return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
+        try:
+            with open(fournos_notification_md, encoding="utf-8") as f:
+                fournos_content = f.read().strip()
+            if fournos_content:
+                # Return custom notification content with status icon
+                return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
+        except Exception as e:
+            logger.warning("Failed to read NOTIFICATION.md: %s", e)

     # Fallback to standard message if no custom notification
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    # Check for fournos_launcher generated Slack notification content
    artifact_dir = pathlib.Path(os.environ.get("ARTIFACT_DIR", ""))
    fournos_notification_md = artifact_dir / "NOTIFICATION.md"

    if fournos_notification_md.exists():
        try:
            with open(fournos_notification_md, encoding="utf-8") as f:
                fournos_content = f.read().strip()
            if fournos_content:
                # Return custom notification content with status icon
                return f"{status_icon} {get_bold(status)}\n\n{fournos_content}"
        except Exception as e:
            logger.warning("Failed to read NOTIFICATION.md: %s", e)

    # Fallback to standard message if no custom notification
    return get_common_message(
        finish_reason, f"{status_icon} {status}", get_link, get_italics, get_bold
    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/core/notifications/send.py` around lines 299 - 313, The
NOTIFICATION.md file read (artifact_dir / "NOTIFICATION.md" using
fournos_notification_md) must be wrapped in a try/except to avoid raising on
I/O/encoding/permission errors; update the block in send.py around the with
open(fournos_notification_md, ...) and fournos_content = f.read().strip() to
catch Exception as e, handle/log the error (consistent with the existing logging
approach used by get_common_message's NOTIFICATION.html handling), and fall back
to returning get_common_message(finish_reason, f"{status_icon} {status}",
get_link, get_italics, get_bold) when an error occurs or the file is unreadable.
projects/llm_d/orchestration/test_phase.py (2)

93-96: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Finalizers are unconditionally disabled.

do_finalizers = False on line 93 prevents all finalizer callbacks from executing, including:

  • Capturing inference-service state
  • Writing endpoint URL to artifacts
  • Capturing namespace events
  • Cleaning up test resources (pods, jobs, PVCs)

The SignalInterrupt check on lines 94-96 is redundant since do_finalizers is already False. If this is intentional debugging code, consider removing it before merge or adding a comment explaining why finalizers are disabled. Otherwise, resource leaks and missing artifacts will occur.

Suggested fix to re-enable finalizers
     finally:
-        do_finalizers = False
+        do_finalizers = True
         if primary_exc and isinstance(primary_exc[1], SignalInterrupt):
             logging.warning("Caught a SignalInterrupt, skipping the finalizers")
             do_finalizers = False
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/test_phase.py` around lines 93 - 96, The code
unconditionally disables finalizers by setting do_finalizers = False before
checking primary_exc; remove that unconditional assignment and instead
initialize do_finalizers = True (or omit the assignment) and only set
do_finalizers = False when primary_exc indicates a SignalInterrupt (i.e., keep
the isinstance(primary_exc[1], SignalInterrupt) branch and set do_finalizers =
False there); ensure the variable controlling execution of finalizer callbacks
(do_finalizers) is True in the normal flow so finalizer callbacks (capture
state, write endpoint, capture events, cleanup) run unless a SignalInterrupt is
detected.

170-181: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Unreachable code: return before docstring makes function body dead code.

The return statement on line 171 causes all subsequent code (lines 172-180) to be unreachable, including the docstring and the actual prepare_model_cache() call. This bypasses model cache preparation entirely, which per the codebase handles cache validation, PVC setup, and HF token retrieval.

If this is temporary debugging code, please revert before merge. The docstring placement after return also indicates this may have been an accidental edit.

Suggested fix to restore model cache preparation
 def _prepare_model_cache() -> None:
-    return
     """Ensure model cache PVC is ready for deployment."""
     from projects.llm_d.orchestration import runtime_config

     model_key = runtime_config.get_model_key()
     logger.info("Preparing model cache for model: %s", model_key)

     # Use the same prepare_model_cache function as the prepare phase
     # This includes vault token handling and PVC existence checks
     prepare_model_cache()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

def _prepare_model_cache() -> None:
    """Ensure model cache PVC is ready for deployment."""
    from projects.llm_d.orchestration import runtime_config

    model_key = runtime_config.get_model_key()
    logger.info("Preparing model cache for model: %s", model_key)

    # Use the same prepare_model_cache function as the prepare phase
    # This includes vault token handling and PVC existence checks
    prepare_model_cache()
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@projects/llm_d/orchestration/test_phase.py` around lines 170 - 181, The
_prepare_model_cache function currently returns immediately making its docstring
and body (including the call to prepare_model_cache()) unreachable; remove the
stray return, place the docstring right after the def if desired, and restore
the body so it imports runtime_config, retrieves model_key via
runtime_config.get_model_key(), logs via logger.info, and calls
prepare_model_cache() to perform PVC/token handling; verify references to
_prepare_model_cache(), runtime_config, logger, and prepare_model_cache are
unchanged.
vaults/psap-models-corp-rh.yaml (1)

8-8: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix typo in description.

"use" should be "used" for grammatical correctness.

📝 Proposed fix
-      Access to the RH models use by FORGE agent
+      Access to the RH models used by FORGE agent
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

      Access to the RH models used by FORGE agent
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@vaults/psap-models-corp-rh.yaml` at line 8, Fix the typo in the YAML
description string "Access to the RH models use by FORGE agent" by changing
"use" to "used" so the line reads "Access to the RH models used by FORGE agent";
locate and update that description entry in the psap-models-corp-rh YAML block.

@psap-forge-bot

psap-forge-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 01 minutes 59 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster psap-mgmt
/pipeline forge-test-only

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 12:33:29
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-123136 failed: Resolution failed: Job has reached the specified backoff limit


[...]

Execution logs

@kpouget

kpouget commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 10 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260605-130024
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:11:28
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

TASK EXECUTION DETAILS:
  Task Name: wait_pods_appear
  Task Description: Wait for llm-d pods to appear
  Task Location: projects/kserve/toolbox/deploy_llmisvc/main.py:127
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
  Original Exception: RetryFailure
  Original Message: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

ENVIRONMENT CONTEXT:
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc

STACK TRACE:
----------------------------------------
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/task.py", line 151, in _execute_with_retry
    raise RetryFailure(
projects.core.dsl.task.RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 172, in execute_tasks
    _execute_single_task(current_task_info, args, shared_context)
  File "/app/forge/projects/core/dsl/runtime.py", line 362, in _execute_single_task
    raise task_error from e
projects.core.dsl.runtime.TaskExecutionError: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

================================================================================

## /workspace/artifacts/000__test/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/kserve/toolbox/deploy_llmisvc/main.py:127
~~ TASK: wait_pods_appear: Wait for llm-d pods to appear
~~ ARTIFACT_DIR: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ LOG_FILE: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/task.log
~~ ARGS:
~~     namespace: forge-llm-d
~~     inference_service_manifest_path: /workspace/artifacts/000__test/001__llmd_test/artifacts/llminferenceservice.yaml
~~     inference_service_name: llm-d
~~     gateway_status_address_name: gateway-external
~~     artifact_dir: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ CONTEXT:
~~     src_manifest_path: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/src/llminferenceservice.yaml
~~     service_name: llm-d
~~     selector: app.kubernetes.io/name=llm-d
~~
~~ EXCEPTION: RetryFailure
~~     All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 51 seconds 🔴

• Link to the test results.

• No reports generated...

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:12:06
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-130024 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@kpouget kpouget force-pushed the agentic branch 2 times, most recently from 0db96b1 to c2e59de Compare June 5, 2026 13:24
@kpouget

kpouget commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@projects/core/ci_entrypoint/run_common.py`:
- Line 32: EXTRA_PACKAGES is currently an unpinned list (EXTRA_PACKAGES =
["httpx", "langchain-openai", "langchain-core"]) which makes CI installs
non-deterministic; update the EXTRA_PACKAGES variable to include explicit
version pins (or modify install_extra_packages() to install using a shared
constraints/lockfile) so CI always installs fixed versions — for example,
replace unpinned names in EXTRA_PACKAGES with versioned strings or change
install_extra_packages() to call pip with --constraint pointing to your
repository's constraints/lockfile (ensure you update any related docs/tests to
reflect the pinned versions).

In `@projects/core/notifications/send.py`:
- Around line 303-309: Wrap the optional NOTIFICATION.md read for
fournos_notification_md in a try/except so I/O/encoding/race errors don't abort
Slack message generation: when evaluating fournos_notification_md.exists()
surround the open/read/strip sequence with a broad exception handler that logs
or ignores the error and falls back to returning the default Slack message
(i.e., only return the custom "{status_icon}
{get_bold(status)}\\n\\n{fournos_content}" when read succeeds and
fournos_content is non-empty); reference the fournos_notification_md variable
and the return site that builds the status_icon/get_bold(status) message to
locate the change.

In `@vaults/psap-models-corp-rh.yaml`:
- Around line 6-17: The vault entry contains an unused format field and a typo:
VaultManager._load_vault_definition only reads the content key's file and
description, so remove or relocate the example block under the format key from
the content entry (move it into a YAML comment or external docs) and correct the
description text from "use by" to "used by"; ensure the remaining content only
includes the keys consumed at load time (file/description) so maintainers aren't
misled by an ignored format field.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 91f0387f-50d1-408e-8966-0282847d630b

📥 Commits

Reviewing files that changed from the base of the PR and between 24fa1ab and c2e59de.

📒 Files selected for processing (10)
  • projects/core/agentic/on_failure.py
  • projects/core/ci_entrypoint/run_common.py
  • projects/core/dsl/runtime.py
  • projects/core/dsl/toolbox.py
  • projects/core/notifications/send.py
  • projects/fournos_launcher/orchestration/submit.py
  • projects/llm_d/orchestration/ci.py
  • projects/llm_d/orchestration/config.yaml
  • projects/llm_d/orchestration/test_phase.py
  • vaults/psap-models-corp-rh.yaml
🚧 Files skipped from review as they are similar to previous changes (5)
  • projects/fournos_launcher/orchestration/submit.py
  • projects/llm_d/orchestration/test_phase.py
  • projects/core/dsl/runtime.py
  • projects/llm_d/orchestration/ci.py
  • projects/core/agentic/on_failure.py

Comment thread projects/core/ci_entrypoint/run_common.py
Comment thread projects/core/notifications/send.py
Comment thread vaults/psap-models-corp-rh.yaml
@psap-forge-bot

psap-forge-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

🔴 Test of 'llm_d test' failed after 00 hours 10 minutes 11 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260605-132554
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:37:00
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

TASK EXECUTION DETAILS:
  Task Name: wait_pods_appear
  Task Description: Wait for llm-d pods to appear
  Task Location: projects/kserve/toolbox/deploy_llmisvc/main.py:127
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
  Original Exception: RetryFailure
  Original Message: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

ENVIRONMENT CONTEXT:
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc

STACK TRACE:
----------------------------------------
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/task.py", line 151, in _execute_with_retry
    raise RetryFailure(
projects.core.dsl.task.RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 172, in execute_tasks
    _execute_single_task(current_task_info, args, shared_context)
  File "/app/forge/projects/core/dsl/runtime.py", line 362, in _execute_single_task
    raise task_error from e
projects.core.dsl.runtime.TaskExecutionError: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

================================================================================

## /workspace/artifacts/000__test/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/kserve/toolbox/deploy_llmisvc/main.py:127
~~ TASK: wait_pods_appear: Wait for llm-d pods to appear
~~ ARTIFACT_DIR: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ LOG_FILE: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/task.log
~~ ARGS:
~~     namespace: forge-llm-d
~~     inference_service_manifest_path: /workspace/artifacts/000__test/001__llmd_test/artifacts/llminferenceservice.yaml
~~     inference_service_name: llm-d
~~     gateway_status_address_name: gateway-external
~~     artifact_dir: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ CONTEXT:
~~     src_manifest_path: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/src/llminferenceservice.yaml
~~     service_name: llm-d
~~     selector: app.kubernetes.io/name=llm-d
~~
~~ EXCEPTION: RetryFailure
~~     All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 11 minutes 49 seconds 🔴

• Link to the test results.

• No reports generated...

Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 13:37:34
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-132554 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@kpouget

kpouget commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

@psap-forge-bot

psap-forge-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

🔴 Test of 'llm_d test' failed after 00 hours 11 minutes 05 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260605-161056
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 16:21:59
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

TASK EXECUTION DETAILS:
  Task Name: wait_pods_appear
  Task Description: Wait for llm-d pods to appear
  Task Location: projects/kserve/toolbox/deploy_llmisvc/main.py:127
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
  Original Exception: RetryFailure
  Original Message: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

ENVIRONMENT CONTEXT:
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc

STACK TRACE:
----------------------------------------
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/task.py", line 151, in _execute_with_retry
    raise RetryFailure(
projects.core.dsl.task.RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 172, in execute_tasks
    _execute_single_task(current_task_info, args, shared_context)
  File "/app/forge/projects/core/dsl/runtime.py", line 362, in _execute_single_task
    raise task_error from e
projects.core.dsl.runtime.TaskExecutionError: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

================================================================================

## /workspace/artifacts/000__test/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/kserve/toolbox/deploy_llmisvc/main.py:127
~~ TASK: wait_pods_appear: Wait for llm-d pods to appear
~~ ARTIFACT_DIR: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ LOG_FILE: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/task.log
~~ ARGS:
~~     namespace: forge-llm-d
~~     inference_service_manifest_path: /workspace/artifacts/000__test/001__llmd_test/artifacts/llminferenceservice.yaml
~~     inference_service_name: llm-d
~~     gateway_status_address_name: gateway-external
~~     artifact_dir: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ CONTEXT:
~~     src_manifest_path: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/src/llminferenceservice.yaml
~~     service_name: llm-d
~~     selector: app.kubernetes.io/name=llm-d
~~
~~ EXCEPTION: RetryFailure
~~     All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@psap-forge-bot

psap-forge-bot Bot commented Jun 5, 2026

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 12 minutes 43 seconds 🔴

• Link to the test results.

• No reports generated...

Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-05 16:23:31
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:184
   RuntimeError: Job forge-llm-d-20260605-161056 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@kpouget kpouget force-pushed the agentic branch 4 times, most recently from 895e932 to 9659c59 Compare June 19, 2026 14:01
@kpouget

kpouget commented Jun 19, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'llm_d test' failed after 00 hours 11 minutes 28 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260619-140210
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d

Failure indicator:

## /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-19 14:13:18
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

TASK EXECUTION DETAILS:
  Task Name: wait_pods_appear
  Task Description: Wait for llm-d pods to appear
  Task Location: projects/kserve/toolbox/deploy_llmisvc/main.py:127
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
  Original Exception: RetryFailure
  Original Message: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

ENVIRONMENT CONTEXT:
  Artifact Directory: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc

STACK TRACE:
----------------------------------------
Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/task.py", line 179, in _execute_with_retry
    raise RetryFailure(
projects.core.dsl.task.RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/forge/projects/core/dsl/runtime.py", line 172, in execute_tasks
    _execute_single_task(current_task_info, args, shared_context)
  File "/app/forge/projects/core/dsl/runtime.py", line 362, in _execute_single_task
    raise task_error from e
projects.core.dsl.runtime.TaskExecutionError: ❌ TASK FAILURE: wait_pods_appear: Wait for llm-d pods to appear
   projects/kserve/toolbox/deploy_llmisvc/main.py:127
   RetryFailure: All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)

================================================================================

## /workspace/artifacts/000__test/FAILURE 
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
~~ projects/kserve/toolbox/deploy_llmisvc/main.py:127
~~ TASK: wait_pods_appear: Wait for llm-d pods to appear
~~ ARTIFACT_DIR: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ LOG_FILE: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/task.log
~~ ARGS:
~~     namespace: forge-llm-d
~~     inference_service_manifest_path: /workspace/artifacts/000__test/001__llmd_test/artifacts/llminferenceservice.yaml
~~     inference_service_name: llm-d
~~     gateway_status_address_name: gateway-external
~~     artifact_dir: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc
~~ CONTEXT:
~~     src_manifest_path: /workspace/artifacts/000__test/001__llmd_test/001__deploy_llmisvc/src/llminferenceservice.yaml
~~     service_name: llm-d
~~     selector: app.kubernetes.io/name=llm-d
~~
~~ EXCEPTION: RetryFailure
~~     All 120 attempts failed for task wait_pods_appear : Wait for llm-d pods to appear (last result: False)
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx


[...]

Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 13 minutes 07 seconds 🔴

• Link to the test results.

• No reports generated...

Task Logs:

Test configuration:

/test fournos llm_d
/cluster forge-smoke-testing
/pipeline forge-test-only

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-19 14:15:09
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260619-140210 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@kpouget

kpouget commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d smoke
/cluster forge-smoke-testing
/pipeline forge-test-only

@kpouget

kpouget commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d smoke
/cluster forge-smoke-testing
/pipeline forge-test-only

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'llm_d test' failed after 00 hours 16 minutes 44 seconds 🔴

• Link to the test results.

• Caliper postprocess completed but no reports generated.

Test configuration:

ci_job.cluster: forge-smoke-testing
ci_job.exclusive: true
ci_job.fjob: forge-llm-d-20260623-114926
ci_job.name: llm_d
ci_job.owner: kpouget
project.args: []
project.name: llm_d
runtime.default_preset: smoke

TEST DESCRIPTION:

This test executes a smoke evaluation on the llm_d project by deploying the Qwen3-0.6B model using the approximate scheduler profile. It validates basic inference readiness by sending a single default completion request to confirm service stability and model responsiveness.

FAILURE REVIEW 002 deploy llmisvc:

The KServe LLMInferenceService deployment workflow failed at the wait_pods_appear task after exhausting all 120 retry attempts because the expected llm-d pods were not detected in the forge-llm-d namespace. The failure resulted from a label mismatch where the query selector app.kubernetes.io/name=llm-d did not correspond to the labels assigned to the provisioned resources, causing the validation logic to consistently return false despite successful API communication.

Detailed failure analysis report

Config review report

Raw failure details
Execution logs

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 18 minutes 40 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260623-114926: Failed - Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0

MLflow Tracking:

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d smoke
/cluster forge-smoke-testing
/pipeline forge-test-only
/test fournos llm_d smoke

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-23 12:07:57
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260623-114926 failed: Tasks Completed: 2 (Failed: 1, Cancelled 0), Skipped: 0


[...]

Execution logs

@kpouget

kpouget commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d smoke
/cluster forge-smoke-testing
/pipeline forge-test-only

@psap-forge-bot

Copy link
Copy Markdown

🔴 Test of 'fournos_launcher submit' failed after 00 hours 00 minutes 31 seconds 🔴

• Link to the test results.

FOURNOS Job Status:

  • forge-llm-d-20260623-123539: Failed - Resolution failed: Job has reached the specified backoff limit

FOURNOS Job Details:

Task Logs:

Test configuration:

/test fournos llm_d smoke
/cluster forge-smoke-testing
/pipeline forge-test-only
/test fournos llm_d smoke

Failure indicator:

## /logs/artifacts/001__submit_and_wait/FAILURE 
================================================================================
TASK EXECUTION FAILURE
================================================================================
Timestamp: 2026-06-23 12:36:04
Exception Type: TaskExecutionError
Exception Message: ❌ TASK FAILURE: wait_for_job_completion: Wait for FOURNOS job to complete
   projects/fournos_launcher/toolbox/submit_and_wait/main.py:186
   RuntimeError: Job forge-llm-d-20260623-123539 failed: Resolution failed: Job has reached the specified backoff limit


[...]

Execution logs

@openshift-ci

openshift-ci Bot commented Jun 23, 2026

Copy link
Copy Markdown

@kpouget: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/fournos 7552704 link true /test fournos

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@kpouget

kpouget commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

/test fournos llm_d smoke
/cluster forge-smoke-testing
/pipeline forge-test-only

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant