
Apps: version pinning with immutable snapshots, detail pages, service runtime, and job lifecycle hardening#402

Open
krokicki wants to merge 89 commits into main from more-apps


@krokicki krokicki commented Jul 4, 2026

Member

This PR extends the Apps feature with version pinning backed by immutable snapshots; dedicated detail pages for apps, catalog listings, and jobs; a launcher-free runtime for service apps; and a hardened job lifecycle and worker environment.

App version pinning and immutable snapshots

  • Apps are pinned to a commit SHA when added (user_apps.commit_sha, plus code_commit_sha for manifests whose code lives in a separate repo_url repo). Jobs run from an immutable per-commit checkout materialized under ~/.fileglancer/apps/<owner>/<repo>/.snapshots/<sha> as a cheap hardlink clone, so a moving branch or a sibling app's update can never change what a job executes. Legacy unpinned rows are pinned on first launch.
  • Updating is an explicit per-app action that repoints the pin; already-running jobs and other apps in the same repo keep the tree they were pinned to. A new GET /api/apps/check-updates endpoint compares pins against remote tips (one batched git ls-remote round-trip through the user's worker) and the UI shows update-available badges.
  • Snapshots that no app pin or retained job references are garbage-collected, with a grace window for in-flight launches and background deletion so large trees on NFS don't stall the worker.
  • Jobs record the commit they executed (commit_sha, code_repo_url) and the job page links it to GitHub.
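
To make the "cheap hardlink clone" idea concrete, here is a minimal sketch of materializing a snapshot directory by hard-linking files from an existing checkout. The function name `hardlink_clone` and the traversal are assumptions for illustration; the PR text does not show how Fileglancer builds its snapshots.

```python
import os
from pathlib import Path


def hardlink_clone(src: Path, dst: Path) -> None:
    """Recreate src's directory tree under dst, hard-linking every file.

    Hypothetical helper: no file data is copied, so the clone costs little
    time or disk. Symlinked directories are not handled in this sketch.
    """
    for root, _dirs, files in os.walk(src):
        rel = Path(root).relative_to(src)
        (dst / rel).mkdir(parents=True, exist_ok=True)
        for name in files:
            # Both paths now refer to the same inode on disk.
            os.link(Path(root) / name, dst / rel / name)
```

Hard links only work within one filesystem, and the snapshot stays immutable only as long as nothing rewrites the linked files in place.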

Adding apps

  • New POST /api/apps/discover walks a repo for manifests; the Add App dialog gains a selection step for multi-app repos so users choose which apps to add, and the launch page installs only the app being viewed.
  • Manifests are validated at add time: repo_url must be a well-formed GitHub URL and at least one runnable is required.
  • A trust notice warns that apps execute arbitrary code as the user, shown both when adding from a URL and from the catalog.

App, catalog, and job pages

  • New app detail page (/apps/detail/:owner/:repo) and catalog listing detail page (/apps/catalog/:listingId) replace the old info dialogs, with metadata tables, entry-point lists, GitHub links for manifests and commits, and recent jobs (linked both ways between app and job pages). A shared AppPageHeader unifies the app, launch, and job page headers.
  • My Apps and the App Catalog get card/table view toggles with aligned column sets; the catalog gets search and a "hide installed" filter.
  • Catalog listings can be edited after sharing (name, description, URL) and unshared from the listing itself; the add-from-catalog dialog shows the sharer, share date, and GitHub source.
  • The job detail page is rebuilt around tabs (Overview, Parameters, Script, Output Log, Error Log) with lazy per-tab file loading, browse links and downloads for job files, runtime/queue-wait durations, and exit-code explanations.
  • Destructive actions — cancel/delete job, remove app, unshare listing — ask for confirmation first.

Service jobs

  • Generated service scripts export FG_HOSTNAME, FG_SERVICE_PORT (a free port picked on the compute node), and a random FG_SERVICE_TOKEN, so a service can bind to a known address and authenticate callers without a custom launcher script.
  • With auto_url, Fileglancer publishes the service URL itself once the port accepts connections, with an optional shell-safety-validated service_url_suffix that may splice in the token. Services that manage their own URL keep writing SERVICE_URL_PATH as before.
  • Service startup phases are reported to the UI: while a container image is downloading the job page says so instead of a generic spinner.
  • The jobs listing fetches all service URLs and phases in one batched worker action instead of one round-trip per job.
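
The service variables above can be illustrated with a short Python sketch. Fileglancer does this in a generated job script, so the code below is only an analogy: `pick_free_port`, `service_env`, and `port_accepts_connections` are hypothetical names, and the token length is an assumption.

```python
import secrets
import socket
from typing import Dict


def pick_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]


def service_env() -> Dict[str, str]:
    """Build the three variables a service job would see (sketch only)."""
    return {
        "FG_HOSTNAME": socket.gethostname(),
        "FG_SERVICE_PORT": str(pick_free_port()),
        "FG_SERVICE_TOKEN": secrets.token_urlsafe(32),  # length is assumed
    }


def port_accepts_connections(host: str, port: int, timeout: float = 1.0) -> bool:
    """Probe used before publishing a URL: True once something is listening."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A port picked this way can in principle be taken by another process before the service binds it, which is one reason to probe the port before publishing the URL.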

Job lifecycle

  • UNKNOWN scheduler status is handled as a first-class state: displayed neutrally (not as "Failed"), kept polling as active, and only marked FAILED after a configurable cutoff (apps.unknown_timeout_hours, default 24h) measured from the new status_updated_at column.
  • Cancelling a local-executor job now actually stops it: the whole process tree is signalled and termination is confirmed before the job is recorded KILLED; cancel/stop failures are reported on the job page instead of silently swallowed.
  • Deleting a job also deletes its work directory, with strict path-shape guards so a corrupt DB row can't turn deletion into arbitrary filesystem removal. Submit-time failures clean up the half-created job row and work dir instead of leaving phantom PENDING jobs.
  • Inline job-file reads are capped to a 5 MB tail with an omission marker, so a runaway log can't exhaust server memory or the IPC limit; the full file remains reachable via its browse link. Log views refetch once when a job goes terminal so the last output isn't stale.
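
The tail cap on inline job-file reads could look roughly like the sketch below. The 5 MB figure comes from the description above; the function name `read_tail` and the wording of the omission marker are assumptions.

```python
import os

MAX_INLINE_BYTES = 5 * 1024 * 1024  # 5 MB cap, per the PR description


def read_tail(path: str, limit: int = MAX_INLINE_BYTES) -> str:
    """Return at most the last `limit` bytes of a file as text.

    When the file is larger than the cap, a marker line (hypothetical
    wording) says how many leading bytes were left out.
    """
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size <= limit:
            return f.read().decode("utf-8", errors="replace")
        f.seek(size - limit)
        data = f.read(limit)
    marker = f"[... {size - limit} bytes omitted ...]\n"
    return marker + data.decode("utf-8", errors="replace")
```

Seeking to `size - limit` means only the tail is ever held in memory, regardless of how large the log has grown.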

Launch form and parameters

  • Path parameters gain an exists flag: exists: false marks outputs that need not exist before launch, and output directories are created as the target user at submit time.
  • Server-side path validation now checks file-vs-folder type, and the Browse dialog flags a mismatch immediately; it also opens at the parameter's current value.
  • Env-tab parameters are a fully separate namespace (values, errors, DOM ids), so a pipeline --profile can coexist with Nextflow's -profile.
  • Nextflow pipeline params now use --param=value when needed, so values that start with - (for example --runtime_opts=--nv) stay attached to the intended parameter; this is also enforced at command-build time for older cached manifests.
  • Nextflow boolean pipeline params are emitted explicitly as --flag=true or --flag=false, so a user-selected false overrides a pipeline default of true while generic non-Nextflow boolean switches keep their existing omit-false behavior.
  • Scheduler extra_args are shlex-tokenized, so quoted LSF resource strings like -R "select[mem>8000]" reach the scheduler as single arguments; quoting survives relaunch and params export. Relaunch restores all three form tabs, including env parameters.
  • Non-field validation errors appear in the form's error banner, and parameter labels are associated with their inputs.
  • The file selector dialog gets a home-directory default, a toolbar, and an editable path display box.
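
As an illustration of the Nextflow parameter rules above, the following sketch emits booleans explicitly and attaches dash-leading values with `=`. The helper name `nextflow_param_args` is hypothetical, "when needed" is interpreted here as "the value starts with `-`", and shell quoting is left out.

```python
from typing import List


def nextflow_param_args(name: str, value: object) -> List[str]:
    """Build argv tokens for one Nextflow pipeline parameter (sketch)."""
    flag = f"--{name}"
    if isinstance(value, bool):
        # Explicit true/false so a user-selected False overrides a default of True.
        return [f"{flag}={'true' if value else 'false'}"]
    text = str(value)
    if text.startswith("-"):
        # Keep the value attached so it is not parsed as a separate option.
        return [f"{flag}={text}"]
    return [flag, text]
```

For example, `nextflow_param_args("runtime_opts", "--nv")` yields the single token `--runtime_opts=--nv`.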

Worker and security hardening

  • The per-user worker environment is built from an allowlist instead of a blocklist, so server secrets can't leak to user processes via the environment; FGC_* is never passed through. Site-specific variables can be granted with the new apps.worker_env_passthrough setting.
  • User-supplied manifest paths, commit SHAs, and container cache dirs are strictly validated before reaching disk operations or generated scripts.
  • Git-heavy worker actions (clone, snapshot, discovery) get longer IPC timeouts, and update checks fail fast so a slow GitHub can't hang the worker's serial request loop.

Configuration and migrations

  • New settings: apps.worker_env_passthrough and apps.unknown_timeout_hours, documented in docs/config.yaml.template.
  • Two new Alembic migrations: a7e2f9d31c04 (jobs status_updated_at) and f4a1d8c62e97 (commit pinning columns on apps and jobs). Run pixi run migrate after upgrading.

Tests

  • New backend suites for snapshots (test_snapshots.py) and worker actions including process-tree cancellation (test_worker.py), plus substantial additions to the apps, endpoints, catalog, and poll-loop suites (~2,400 test lines added).
  • New frontend unit tests for app URL parsing, app icons, and job status handling.

@StephanPreibisch @JaneliaSciComp/fileglancer

krokicki and others added 30 commits July 1, 2026 17:26
Two backward-compatible improvements so container-based services (e.g.
code-server) need no launcher script:

- Auto-bind the cached repo clone into the container when a runnable
  resolves working_dir to 'repo'. Previously the repo symlink dangled
  inside containers (only the work dir and param paths were bound), so the
  documented `working_dir: repo` escape hatch was unusable. Container
  bind-path computation is extracted into _container_bind_paths().

- For service-type jobs, emit a preamble that picks a free TCP port on the
  compute node and exports it as FG_SERVICE_PORT (with FG_HOSTNAME). New
  service-only `auto_url` field makes Fileglancer write
  http://$FG_HOSTNAME:$FG_SERVICE_PORT to SERVICE_URL_PATH, so a service
  that binds $FG_SERVICE_PORT needs no URL-writing code of its own.

Net effect: a container service becomes a one-liner, e.g.
  command: code-server --bind-addr 0.0.0.0:$FG_SERVICE_PORT
with auto_url: true and container: ghcr.io/coder/code-server.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adding a repo that contains more than one app now lets the user pick which
apps to add instead of always adding all of them.

Backend:
- New POST /api/apps/discover returns the repo's manifests (path, name,
  description, already_added) without adding anything.
- AppAddRequest gains optional manifest_paths; POST /api/apps adds only that
  subset when provided, or all discovered manifests when omitted (unchanged
  default). Shared _discover_repo_manifests keeps error handling identical
  across both endpoints.

Frontend:
- AddAppDialog is now two-step: enter the URL, then — only when the repo has
  more than one app — a checklist of apps with a Select all / Deselect all
  toggle and an "Add Selected (N)" action. Single-app repos add directly as
  before. Already-added apps are shown checked and disabled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In the multi-app selection list, use foreground (black) text instead of the
purple secondary color for the description, the "already added" tag, and the
intro line, and remove the manifest-path line from each row.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Command row in the job page's Execution card wrapped across many lines.
Add an opt-in `truncate` mode to InfoRow (single line, ellipsis, full text on
hover) and use it for Command.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The Overview "Recent output" panel was gated on job.started_at, so it stayed
hidden while a job was still PENDING (started_at null) even though stdout.log
already had content — while "Recent errors" showed because it gates on stderr
content. Gate stdout on content too (hasStdout), mirroring stderr, so output
appears as soon as there is any. Drops the now-unused stdoutPending prop.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The poll loop slept a full poll_interval before clearing _poll_task and
returning when it found no active jobs. A job submitted during that window
saw _poll_task still set (not done), so submit_job's ensure_poll_loop() no-op'd
— then the loop exited, leaving no poller and the job stuck in PENDING forever
(never advancing to RUNNING, so services never surfaced their service_url).

Decide to stop *before* sleeping, clear _poll_task with no await in between,
and re-check for active jobs (catching a job submitted during the same cycle)
so the loop keeps running instead of exiting. Also wrap the locked section in
try/finally so the poll lock can't leak if the task is cancelled while held.
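
A simplified asyncio sketch of the stop-before-sleep ordering described above. The class `JobPoller` and its attributes are stand-ins, not Fileglancer's actual code, and the poll lock is omitted. The key point is that there is no `await` between the emptiness check and clearing `_poll_task`, so a concurrent submit either is seen by the check or finds the task cleared and starts a new loop.

```python
import asyncio
from typing import Optional, Set


class JobPoller:
    """Toy poll loop that exits without leaving a submit-time blind spot."""

    def __init__(self, poll_interval: float = 1.0) -> None:
        self.poll_interval = poll_interval
        self.active_jobs: Set[str] = set()
        self.poll_count = 0
        self._poll_task: Optional[asyncio.Task] = None

    def ensure_poll_loop(self) -> None:
        # No-op while a loop task is still registered and running.
        if self._poll_task is None or self._poll_task.done():
            self._poll_task = asyncio.get_running_loop().create_task(self._run())

    def submit_job(self, job_id: str) -> None:
        self.active_jobs.add(job_id)
        self.ensure_poll_loop()

    async def _run(self) -> None:
        while True:
            if not self.active_jobs:
                # Decide to stop BEFORE sleeping and clear the task with no
                # await in between, so no submit can slip into the gap.
                self._poll_task = None
                return
            self.poll_count += 1  # stand-in for querying the scheduler
            await asyncio.sleep(self.poll_interval)
```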

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Rewrite comments in the poll-loop stop path, the discover/add endpoint, and the
poll-loop regression test so they explain what the code does and why, without
referring to prior implementations.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_suffix

Make publishing a service URL a first-class part of auto_url instead of
per-app pre_run boilerplate:

- auto_url now waits until $FG_SERVICE_PORT accepts a connection before writing
  SERVICE_URL_PATH (background probe, bounded, logs to stderr on give-up), so
  the link never appears while the container image is still pulling or the
  server is still binding.
- Mint FG_SERVICE_TOKEN (URL-safe) alongside FG_SERVICE_PORT/FG_HOSTNAME, so a
  service can use it for auth and splice it into the URL.
- New service_url_suffix: a restricted template (literal URL text plus the
  ${FG_SERVICE_TOKEN}/${FG_SERVICE_PORT}/${FG_HOSTNAME} placeholders) appended
  to the published URL for one-click token auth. Validated for shell-safety and
  requires auto_url. Scheme stays http for now.
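
One way to validate such a restricted template is a whitelist regex, sketched below. The three placeholders are the ones named in this commit message; the set of allowed literal characters and the name `is_safe_suffix` are assumptions, since the real validator is not shown here.

```python
import re

# Placeholders named in the commit message.
_PLACEHOLDER = r"\$\{(?:FG_SERVICE_TOKEN|FG_SERVICE_PORT|FG_HOSTNAME)\}"
# Assumed set of harmless URL characters: no quotes, backticks, backslashes,
# whitespace, or bare "$".
_LITERAL = r"[A-Za-z0-9/_.~?&=%:#-]"
_SUFFIX_RE = re.compile(rf"(?:{_LITERAL}|{_PLACEHOLDER})*")


def is_safe_suffix(suffix: str) -> bool:
    """True if suffix is only whitelisted literals and known placeholders."""
    return _SUFFIX_RE.fullmatch(suffix) is not None
```

Because `$` is only accepted as part of a known placeholder, inputs such as `$(cmd)` or `${HOME}` are rejected before they can reach a generated script.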

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rtup

A container service can sit for minutes pulling its image before the URL
appears, with nothing telling the user why. The generated job script now
reports its phase: it writes 'pulling_image' to $FG_PHASE_PATH (a 'phase' file
in the work dir) around the apptainer pull, then 'starting'. The worker reads
it alongside the service URL, the Job model carries a `phase` field, and the
job page shows "Downloading container image… first launch can take a few
minutes" instead of a bare "starting up".

Since Fileglancer generates the script, the phase is emitted deterministically
at the pull instruction (only when the SIF isn't cached) rather than scraped
from Apptainer's logs. The phase file rides the same work-dir read path as
service_url, so no new access-model or extra worker dispatch.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The generated script only pulls when its own SIF is missing, so Apptainer's
internal layer/SIF cache (~/.apptainer/cache) mostly just duplicated every
image — gigabytes that sped up a re-pull that rarely happens. Add
--disable-cache to `apptainer pull` so the .sif we keep is the only copy; a
re-pull (if that .sif is deleted) re-downloads from the registry.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job detail route (apps/jobs/:jobId) was a standalone route outside
AppsLayout, so drilling into a job dropped the My Apps / App Catalog / Jobs tab
bar. Nest it as a child of the AppsLayout route so it renders in the layout's
Outlet with the tabs; the Jobs tab (NavLink to /apps/jobs, non-exact) stays
highlighted. The URL is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The job detail page now renders inside AppsLayout with the tab bar, so the
Jobs tab already provides the way back. Remove the button and its now-unused
icon import.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add an opt-in per-parameter boolean create_if_missing (directory params
only). When set, Fileglancer creates the resolved directory as the user,
within an allowed file share, just before the submit-time existence
check — so a home default like ~/.fileglancer/logs works on first launch
and overrides to new directories are created too.

Backend: model field + validator (directory-only), collect_creatable_dirs
mirroring collect_path_parameters, a create_dirs setuid-worker action
that enforces file-share containment before makedirs(exist_ok=True), and
a submit_job step that dispatches create_dirs before validate_paths.
Frontend type parity in shared.types.ts.
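
The containment-then-create step can be sketched as follows. `create_dir_in_share` and its `share_roots` argument are hypothetical names; the sketch assumes it runs as the target user, so `expanduser` resolves `~` to that user's home.

```python
import os
from pathlib import Path
from typing import Iterable


def create_dir_in_share(path: str, share_roots: Iterable[str]) -> Path:
    """Create `path` only if it resolves inside one of the allowed shares."""
    # realpath collapses ".." and symlinks before the containment check.
    resolved = Path(os.path.realpath(os.path.expanduser(path)))
    for root in share_roots:
        root_resolved = Path(os.path.realpath(root))
        if resolved == root_resolved or root_resolved in resolved.parents:
            os.makedirs(resolved, exist_ok=True)
            return resolved
    raise PermissionError(f"{path!r} is outside every allowed file share")
```

Checking containment on the resolved path, before any `makedirs` call, is what keeps a value like `share/../elsewhere` from creating directories outside the share.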

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two improvements to the file/folder selector dialog (useFileSelector +
FileSelectorButton):

1. Start in the user's home directory when opened with no path yet
   (opt-in via defaultToHome, enabled on the app launch form's
   directory/file pickers), instead of the top-level zones list.

2. Add "go home", "new folder", and "show/hide dot files" buttons
   mirroring the browser toolbar. Show/hide dot files is a dialog-local
   override that follows the global preference until toggled and never
   writes it. New folder creates a directory in the current file share.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The launch form validates directory paths against the worker before
submitting, and that existence check rejected a create_if_missing
default like ~/.fileglancer/tensorboard with "Path does not exist" —
before submit_job's create_dirs step could create it.

Pass the create_if_missing param keys through /api/apps/validate-paths
so the worker validates those for file-share containment only (not
existence). The directory is still created at submit time, and paths
outside any share are still rejected early.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Default to hiding dot files in the dialog regardless of the global
preference, move the filter onto the toolbar row, drop the bottom
"Selected" display, widen the dialog, and scale the file-list height
with the viewport so shorter screens shrink the scrollable list.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…unch

Unify the apps UX around a drill-down hierarchy (My Apps > app detail >
entry point launch) with the apps tabs visible throughout:

- App cards get a banner: entry-point type icon (server for services,
  terminal for jobs) and Shared tag on the left, Launch button and a
  vertical-ellipsis actions menu (Launch/View/Share|Unshare/Update/
  Remove) on the right, replacing the corner icon buttons and bottom
  Launch button. The whole card is clickable and opens the detail page.
- New app detail page replaces AppInfoDialog: info table, per-entry-
  point Launch buttons, visible Share/Unshare and Update buttons, and
  Remove tucked in the ellipsis menu.
- Launch/relaunch routes move inside AppsLayout so the tabs stay
  visible; the "Back to Apps" button is replaced by a shared
  back-arrow + app-name header. Back arrows walk up one level:
  launch -> detail (when installed), detail -> My Apps.
- My Apps tab stays highlighted on detail/launch/relaunch pages.
- App action mutations + share/remove dialogs consolidated into a
  reusable useAppActions hook shared by the cards and detail page.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…menu

Bring the App Catalog in line with the My Apps card design:

- Listing cards get the same banner treatment: type icon (from the
  installed copy's manifest when available) and an "In your apps" tag
  on the left, an Add button and vertical-ellipsis actions menu
  (Add to my apps/View/View in My Apps/Unshare) on the right. The
  whole card is clickable and opens the listing detail page.
- New listing detail page at /apps/catalog/:listingId replaces
  ListingInfoDialog: info table plus Add to my apps, with owner-only
  Unshare tucked in the ellipsis menu (navigates back to the catalog
  after unsharing). Installed listings link to the app's My Apps
  detail page.
- Listing mutations consolidated into a useListingActions hook shared
  by the catalog cards and the detail page.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Replace JobDetail's hand-rolled header with AppPageHeader so all three
apps drill-downs (app detail, launch, job detail) share the same
back-arrow + icon + title idiom: back arrow to /apps/jobs, entry-point
type icon, job title with the status badge in the badge slot, and the
Export params / Cancel / Relaunch buttons in the actions slot.

Adds getEntryPointTypeIconType so the icon can be derived from a job's
entry_point_type without a manifest.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Listings don't carry the manifest, so the listing detail page fetches a
manifest preview (POST /api/apps/manifest) and renders the same entry
points section as the app detail page, with per-entry-point Launch
buttons and the correct type icon in the header. Installed listings
use the installed copy's manifest without the extra fetch.

Extracts the entry points section into a shared EntryPointsList
component used by both detail pages.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The app detail page now lists the app's five most recent jobs (status
badge, entry point, submission time) linking to their job detail pages,
with a "View all" link when there are more. Conversely, the App row in
the job detail Execution panel links to the app's detail page (falling
back to plain text for unparseable app URLs).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The launch page header already shows the app name (with the user's
custom name preferred), so the form's app-name subtitle under the entry
point title was redundant. Removing it makes the appName and manifest
props of AppLaunchForm unused, so they are removed too.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The menu now serves data links, NG links, jobs tables, both app card
types, and two detail-page headers, so the data-links name was
misleading. Mechanical rename of the file, component, and props type;
no behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Previously all apps from one repo shared a single mutable clone, so
updating any app silently changed the code every sibling app ran (while
their cached manifests went stale), and a pull could mutate the tree
under a running job.

Each app now pins a commit_sha at add time. Jobs run from immutable
per-SHA snapshots (hardlink clones under .snapshots/, cheap in time and
disk), so Update re-pins only the targeted app: siblings and running
jobs keep the exact tree they were pinned to. Manifests with a separate
repo_url get their code repo pinned too (code_commit_sha), and jobs
record the commit they executed for provenance on the job detail page.

Unreferenced snapshots are garbage-collected opportunistically after
update/remove/submit. The keep-set spans app pins, non-terminal jobs
(UNKNOWN counts as live), and terminal jobs younger than 14 days so
recent work-dir repo symlinks stay browsable; a one-hour mtime grace
period plus a hot-path touch protect in-flight launches and
cross-process creation. Deletion renames to .trash-* and finishes on a
daemon thread so the worker's serial loop never stalls; stale trash is
re-swept on later snapshot creation.

A new /api/apps/check-updates endpoint compares pins against remote
tips (one batched worker call, concurrent ls-remotes, 10s timeout) and
powers an "Update available" badge on app cards and the detail page;
bare legacy URLs resolve the remote HEAD rather than assuming main.
Updating now toasts "Updated to <sha>" or "Already up to date", and
legacy unpinned rows are backfilled at first launch. If a pinned
snapshot can't be rebuilt (cache wiped and commit rewritten away),
manifest reads fall back to the branch clone instead of failing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The f4a1d8c62e97 migration briefly existed without jobs.code_repo_url,
so a dev database migrated during that window is stamped at this head
while missing the column, and the server fails on any jobs query. Guard
each add/drop with a column-existence check so re-running the revision
(after `alembic stamp c1f9a4e7b2d8`) converges any such state.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The monospace commit sha rendered noticeably larger than adjacent
text at the same declared size, throwing off row spacing on the
app detail and job detail pages.

Co-Authored-By: Claude Sonnet 5 <noreply@anthropic.com>
My Apps can now be viewed as a sortable, filterable table (same
TableCard used by the Jobs views) in addition to the existing card
grid. The toggle sits on the Add from URL row and the chosen mode
persists in localStorage.

Table columns: repository (org/repo linking to GitHub), revision, name
(links to app detail, with update-available badge), truncated
description, shared badge, and the same actions menu as the cards.

Menu items, the shared badge, and appRevision() are extracted into
shared helpers so card and table modes stay in sync.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Keep the pill on cards and the detail page only; the table's name cell
no longer mounts the check-updates query.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
krokicki and others added 30 commits July 4, 2026 09:04
The cancel/stop confirmation fired the mutation and closed the dialog with
no success or error feedback (JobDetail imported no toast). A failed stop of
a running service looked identical to success while the service kept
consuming cluster resources. Await the mutation, toast success, surface
errors via showErrorToast, and keep the dialog open on failure so the user
can retry.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adding an app (from a URL or the shared catalog) runs code from its
repository on the cluster as the current user, but the UI never said so —
a real trust gap for catalog apps published by other users. Add a concise
trust notice (new AppTrustNotice component) to the add-from-URL dialog and
the catalog listing detail page. Source repo/revision/commit are already
shown in the app and listing info tables.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two manifest-load hardening checks that turn confusing downstream failures
into clear author-facing errors:

- repo_url now must parse as a GitHub URL (same parser as app URLs) instead
  of any string that later fails cryptically in ensure_repo_snapshot at
  launch/update.
- runnables requires min_length=1, so a manifest with no entry points is
  rejected at parse time rather than producing an app with a Launch button
  and nothing to launch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
JobStatusBadge fell back to the FAILED style for any unmapped status. The
cluster API's UNKNOWN status (which the server treats as live) would render
as "Failed", misrepresenting a running/unknown job. Fall back to a neutral
"Unknown" badge instead.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
useJobFileQuery stopped polling the moment a job became terminal, but the
last output and the scheduler epilogue are written right around then, so the
viewed log/Overview tail was left stale until remount. Do one final refetch
on the active->terminal edge (only for the tab currently being viewed).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Job deletion (irreversible record + log removal) fired on a single row-menu
click. Add a DeleteJobDialog confirmation, matching the cancel/stop flow,
that reports success/failure and stays open on error.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unshare deleted a listing on a single click, including from an overflow menu
next to Edit. Unsharing affects other users (they can no longer see or add
the app) and discards the listing's curated name/description, so gate it
behind a confirmation. Add a two-step requestUnshare/confirmUnshare flow to
both useAppActions and useListingActions (mirroring the existing
remove/edit flows) with a shared UnshareDialog, and route the app/listing
buttons and menu items through it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Parameter field labels used htmlFor="param-<key>" but the inputs set no
matching id, so clicking a label didn't focus its field and screen readers
got no association — add the id to each input type. Also give the catalog
search box an aria-label.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Show the "not in your library" banner only once /api/apps has loaded, so a
  failed apps query doesn't flag every app (installed ones included).
- Suppress a misleading "0 apps added" success toast.
- Guard the AddAppDialog Enter handlers with the same disabled condition as
  the Continue button to avoid a double discover/add.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The previous local-cancel fix signalled only the recorded launcher bash PID.
bash does not forward SIGTERM to its foreground child, so the real workload
was orphaned and kept running while the DB was marked KILLED unconditionally
— Fileglancer reported success for a job that was still alive.

cancel_local now snapshots the launcher's full descendant tree before
signalling (children reparent to init once the launcher dies and can no
longer be traced), SIGTERMs the tree, then SIGKILLs any survivors after a
grace period, and reports whether the workload is actually gone (liveness is
/proc-based so a not-yet-reaped zombie counts as dead). cancel_job only
records KILLED when termination is confirmed; otherwise it raises, and the
UI surfaces the failure instead of a false success.
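
A rough sketch of the snapshot-then-signal sequence. The names `collect_descendants` and `terminate_tree` are illustrative, the pid-to-parent map would come from reading `/proc` in practice, and the liveness check is injected rather than implemented, so this is not the actual `cancel_local` code.

```python
import os
import signal
import time
from typing import Callable, Dict, Iterable, List


def collect_descendants(root_pid: int, parent_of: Dict[int, int]) -> List[int]:
    """Return root_pid plus all its descendants, given a pid -> ppid map."""
    children: Dict[int, List[int]] = {}
    for pid, ppid in parent_of.items():
        children.setdefault(ppid, []).append(pid)
    ordered, stack = [], [root_pid]
    while stack:
        pid = stack.pop()
        ordered.append(pid)
        stack.extend(children.get(pid, []))
    return ordered


def _send(kill: Callable[[int, int], None], pid: int, sig: int) -> None:
    try:
        kill(pid, sig)
    except ProcessLookupError:
        pass  # process already exited


def terminate_tree(pids: Iterable[int], is_alive: Callable[[int], bool],
                   grace_seconds: float = 5.0,
                   kill: Callable[[int, int], None] = os.kill) -> bool:
    """SIGTERM every pid, SIGKILL survivors after the grace period.

    Returns True only if nothing in the snapshot is still alive. A real
    implementation would likely also wait briefly after SIGKILL.
    """
    pids = list(pids)
    for pid in pids:
        _send(kill, pid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline and any(is_alive(p) for p in pids):
        time.sleep(0.05)
    for pid in pids:
        if is_alive(pid):
            _send(kill, pid, signal.SIGKILL)
    return not any(is_alive(p) for p in pids)
```

Taking the snapshot first matters because, once the launcher exits, its children are reparented and can no longer be found by walking from the launcher's PID.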

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Stripping FGC_* protected Fileglancer's own settings but left generic
deployment secrets in the server environment (AWS_SECRET_ACCESS_KEY,
GITHUB_TOKEN, DATABASE_URL, ...) readable by any user via the worker's
/proc/<pid>/environ. Switch to an allowlist: only pass variables the worker
legitimately needs (PATH/HOME/locale/tmp/proxy/SSH, scheduler LSF_/SLURM_/
SGE_/PBS_, environment modules, conda/pixi, containers, PYTHONPATH). FGC_* is
dropped unconditionally, even if listed in passthrough.

Site-specific vars go through the new settings.apps.worker_env_passthrough
(exact names or '_'-suffixed prefixes) rather than widening the allowlist in
code.
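
The allowlist approach can be sketched like this. The names and prefixes listed are an illustrative subset, not the real allowlist, and `build_worker_env` is a hypothetical function name; only the passthrough semantics (exact names or `_`-suffixed prefixes, `FGC_*` always dropped) come from this commit message.

```python
from typing import Dict, Iterable, Mapping

# Illustrative subset only; the real allowlist is larger.
ALLOWED_NAMES = {"PATH", "HOME", "LANG", "TMPDIR", "PYTHONPATH"}
ALLOWED_PREFIXES = ("LC_", "LSF_", "SLURM_", "SGE_", "PBS_")


def build_worker_env(server_env: Mapping[str, str],
                     passthrough: Iterable[str] = ()) -> Dict[str, str]:
    """Copy only allowlisted variables into the worker environment."""
    passthrough = list(passthrough)
    names = ALLOWED_NAMES | {p for p in passthrough if not p.endswith("_")}
    prefixes = ALLOWED_PREFIXES + tuple(p for p in passthrough if p.endswith("_"))
    env: Dict[str, str] = {}
    for key, value in server_env.items():
        if key.startswith("FGC_"):
            continue  # dropped unconditionally, even if listed in passthrough
        if key in names or key.startswith(prefixes):
            env[key] = value
    return env
```

With an allowlist, a newly introduced server secret is excluded by default, whereas a blocklist would have to be updated for every new variable.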

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Use shlex.join when exposing cluster default extra_args so values with spaces or scheduler metacharacters round-trip through the launch form and backend shlex.split handling. Add endpoint coverage for quoted args.
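
The round trip relies on standard-library behavior, shown below with two thin wrappers whose names (`parse_extra_args`, `format_extra_args`) are made up for this example.

```python
import shlex
from typing import List


def parse_extra_args(text: str) -> List[str]:
    """Tokenize a user-entered extra_args string the way a POSIX shell would."""
    return shlex.split(text)


def format_extra_args(args: List[str]) -> str:
    """Re-quote tokens so the string parses back to the same list."""
    return shlex.join(args)


args = parse_extra_args('-R "select[mem>8000]" -q long')
# The quoted resource string stays a single argument.
assert args == ["-R", "select[mem>8000]", "-q", "long"]
assert parse_extra_args(format_extra_args(args)) == args
```

Note that `shlex.join` may choose different quote characters than the user typed (single instead of double quotes), but the parsed tokens are identical.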

Co-authored-by: Codex <codex@openai.com>
The code-execution trust notice only appeared in the add-from-URL dialog and
(inline) on the listing detail page. The catalog card "Add" button and the
row/card overflow menu called useListingActions.add() directly, installing an
app — which runs its repo's code as the user on launch — with no warning.

Route all catalog adds (card button, menu, listing detail) through a
requestAdd/confirmAdd flow backed by a new AddFromCatalogDialog that shows
the source repo/revision and the trust notice at the point of adding. The
now-redundant inline notice on the listing detail page is dropped in favor of
the dialog, so every add path surfaces the warning consistently.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The catalog add confirmation now shows who shared the app and when, and
renders the source as a GitHub link with the brand icon (the same
presentation used in the listing info table) instead of a plain "owner/repo"
label. Extract that GitHub-URL-with-icon presentation into a shared
GithubUrlValue component and reuse it in both places.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The a11y fix used id="param-<key>" / htmlFor="param-<key>", but the same
parameter key is allowed in both the pipeline and env-tab namespaces (e.g. a
pipeline --profile alongside Nextflow's -profile). That produced duplicate
DOM ids and labels that could focus the wrong input. Thread a namespace
prefix (param-main vs param-env) through ParameterFieldRow/SectionContent so
each input id is unique.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Centralize job terminal-status handling around DONE, FAILED, and KILLED so UNKNOWN and scheduler-specific statuses keep polling, remain cancellable, and cannot be deleted as if terminal. Add backend and frontend coverage.

Co-authored-by: Codex <codex@openai.com>
Remove a terminal job's stored work directory through the user worker before deleting its database record, with safety checks around the expected Fileglancer jobs path. Update the delete dialog to warn that the entire working directory, including logs, is removed and cover the behavior in endpoint tests.

Co-authored-by: Codex <codex@openai.com>
Since UNKNOWN jobs are now treated as active, a job the scheduler can no
longer report (aged out of the queue/history) would be polled forever. Add a
configurable cutoff (apps.unknown_timeout_hours, default 24) in the poll
loop: a job that has sat in UNKNOWN longer than the cutoff is marked FAILED
and dropped from polling, alongside the existing zombie-timeout handling.

To measure time-in-UNKNOWN accurately (not since creation), add a
status_updated_at column set on every status change, with an alembic
migration; the cutoff falls back to created_at for pre-migration rows. Users
can still clear a stuck job sooner by cancelling it.
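
The cutoff check can be sketched roughly as follows. This is an illustration only: the function name and signature are hypothetical, and the 24-hour default simply mirrors the `apps.unknown_timeout_hours` default mentioned above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def unknown_timed_out(
    status: str,
    status_updated_at: Optional[datetime],
    created_at: datetime,
    now: Optional[datetime] = None,
    timeout_hours: float = 24.0,
) -> bool:
    """Return True when a job has sat in UNKNOWN longer than the cutoff."""
    if status != "UNKNOWN":
        return False
    now = now or datetime.now(timezone.utc)
    # Pre-migration rows have no status_updated_at, so fall back to created_at.
    since = status_updated_at or created_at
    return now - since > timedelta(hours=timeout_hours)
```

A poll loop would mark the job FAILED and stop polling it when this returns True.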

The complementary upstream fix — resolving aged-out jobs from scheduler
history so they rarely become UNKNOWN — is planned in py-cluster-api.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolve user-preferred Apptainer cache paths like ~/... to the target user's home before shell-quoting them, so the generated job script does not create a literal ~ directory. Add container script tests for tilde expansion and paths with spaces.
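
A small sketch of the ordering issue, assuming the target user's home directory is already known as a string. `resolve_cache_dir` is a hypothetical helper, not the project's function:

```python
import posixpath
import shlex


def resolve_cache_dir(path: str, user_home: str) -> str:
    """Expand a leading '~' against the target user's home, then shell-quote.

    Quoting first would yield '~/...' in single quotes, which bash does not
    expand, so the job script would create a literal '~' directory.
    """
    if path == "~":
        resolved = user_home
    elif path.startswith("~/"):
        resolved = posixpath.join(user_home, path[2:])
    else:
        resolved = path
    return shlex.quote(resolved)
```

The expansion deliberately uses the supplied home rather than `os.path.expanduser`, which would resolve `~` for the server process instead of the job's owner.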

Co-authored-by: Codex <codex@openai.com>
Render explicit error states when catalog listing entry-point previews or job file loads fail, instead of treating failed fetches as missing data/files.

Co-authored-by: Codex <codex@openai.com>
Normalize exported job launch params so scheduler extra args are written as top-level extra_args, matching the import path used by AppLaunchForm, instead of being nested under resources.
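
As an illustration of the normalization, the sketch below lifts a nested `extra_args` to the top level. The dictionary layout beyond the `resources` and `extra_args` keys is an assumption, and `normalize_export` is a hypothetical name:

```python
from typing import Any, Dict


def normalize_export(params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of params with extra_args written at the top level."""
    out = {k: v for k, v in params.items() if k != "resources"}
    resources = dict(params.get("resources") or {})
    nested = resources.pop("extra_args", None)
    # An explicit top-level value wins over a nested one.
    if nested is not None and "extra_args" not in out:
        out["extra_args"] = nested
    out["resources"] = resources
    return out
```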

Co-authored-by: Codex <codex@openai.com>
Delete the newly created job row if validation, script assembly, or worker submission fails after the row has been inserted, preventing orphan PENDING jobs from lingering in the jobs list.
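
The cleanup pattern can be sketched as below, with an in-memory dict standing in for the jobs table and a callable standing in for validation, script assembly, and worker submission. All names here are hypothetical:

```python
from typing import Callable, Dict


def submit_with_cleanup(
    jobs: Dict[int, dict],
    job_id: int,
    submit: Callable[[dict], str],
) -> str:
    """Insert a PENDING row, then remove it again if submission fails."""
    jobs[job_id] = {"status": "PENDING"}
    try:
        scheduler_id = submit(jobs[job_id])
    except Exception:
        # Any failure after the insert removes the row, so no orphan
        # PENDING job lingers in the jobs list.
        jobs.pop(job_id, None)
        raise
    jobs[job_id]["scheduler_id"] = scheduler_id
    return scheduler_id
```

The original exception is re-raised so the caller still sees why the submission failed.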

Co-authored-by: Codex <codex@openai.com>
submit_job's assembly — the script text and resource spec dispatched to
the worker, which is what actually runs as the user — previously had
only failure-path tests. TestSubmitJobAssembly covers the script layout
(preamble, conda activation, env exports, pre/post_run ordering,
parameter shell-quoting, local-executor exit-code trap), service
preambles, container wrapping with default-path binds, resource
overrides with extra_args tokenization, and worker path-validation
errors mapped back to parameter names.

At the HTTP level, POST/GET /api/jobs, GET /api/jobs/{id} (including
derived file paths), POST .../cancel, GET .../files/{type}, and
POST /api/apps/validate-paths were untested; each now has endpoint
tests, with the service-URL/phase and job-file reads exercising the
real in-process worker actions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
When server-side path validation could not run, the failure was stored
under the `_general` error key, which matches no parameter field: the
banner claimed there were highlighted errors while nothing was
highlighted and the real message was never shown. The banner now prints
the message of any error key that has no corresponding field, and only
claims highlighted errors when a field-keyed error actually exists.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
GET /api/jobs dispatched one get_service_url worker action per running
service job. The endpoint is polled every few seconds and the worker
executes one action at a time, so each running service added a serial
IPC round-trip that also stalled the user's file browsing.

The listing now sends all running service jobs to a single
get_service_urls action. The server passes each job's stored work dir
(rows already username-scoped by the DB query); the legacy fallback for
rows predating stored work dirs is resolved in the worker so '~'
expands to the target user's home. The single-job action used by
GET /api/jobs/{id} is unchanged, and both now share the same
url/phase file readers.
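
A rough sketch of the batching idea, assuming a `dispatch(action, payload)` callable that performs one worker round-trip. The job-dict fields and payload shape are illustrative guesses, not the actual wire format:

```python
from typing import Any, Callable, Dict, List


def collect_service_urls(
    jobs: List[Dict[str, Any]],
    dispatch: Callable[[str, Dict[str, Any]], Dict[Any, Any]],
) -> Dict[Any, Any]:
    """Resolve URLs for all running service jobs in one worker action."""
    running = [
        j for j in jobs
        if j.get("is_service") and j.get("status") == "RUNNING"
    ]
    if not running:
        return {}  # nothing to ask the worker, so skip the round-trip
    payload = {
        "jobs": [{"id": j["id"], "work_dir": j.get("work_dir")} for j in running]
    }
    # One IPC call regardless of how many services are running.
    return dispatch("get_service_urls", payload)
```

Because the worker handles one action at a time, the cost of the listing stays at a single round-trip instead of growing with the number of running services.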

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Batch of remaining small items from the apps feature review:

- Preserve custom app names on manifest cache refresh: refreshes now
  sync only the manifest column (new update_user_app_manifest_cache)
  instead of upserting name/description, so a catalog app added under
  a custom name no longer reverts to the raw manifest name when its
  cache is refilled after schema drift or a NULL backfill.
- Restrict AppParameter.flag to a conservative CLI-flag pattern; flags
  are emitted into the job script unquoted and the Nextflow adapter
  derives them from schema property names.
- Validate env-tab parameters client-side: the same rules as pipeline
  params (required/number/path), inline field errors in the Environment
  tab, section auto-expand, tab focus on error, and inclusion of env
  file/directory params in pre-submit path validation and path
  normalization (keys prefixed to avoid cross-namespace collisions).
- Stop late-arriving cluster defaults from refilling an Extra Arguments
  field the user deliberately cleared: track edits (typing or params
  import) instead of inferring from the empty string.
- Consolidate the jobs table on the shared formatDuration, which
  treats zone-less backend timestamps as UTC; the duplicate in
  appsJobsColumns silently depended on the API always emitting offsets.
- Match installed apps by canonical GitHub URL in Catalog and
  ListingDetail, aligning with AppDetail/AppLaunch so the comparison
  styles can't drift.
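
To illustrate the flag restriction from the list above: because flags are written into the job script unquoted, a conservative allow-list pattern keeps shell metacharacters out. The regular expression below is an assumed example and may differ from the pattern the PR actually uses:

```python
import re

# Hypothetical pattern: one or two leading dashes, then letters, digits,
# underscores, or hyphens. No spaces, quotes, or shell metacharacters.
FLAG_PATTERN = re.compile(r"-{1,2}[A-Za-z0-9][A-Za-z0-9_-]*")


def is_safe_flag(flag: str) -> bool:
    """Return True if the flag is safe to emit into a script unquoted."""
    return FLAG_PATTERN.fullmatch(flag) is not None
```

`fullmatch` is used rather than `match` with `$`, since `$` would also accept a trailing newline.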

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comments should document current state, not change history. Drop references to removed code (apps/worker.py, the old shell-metachar denylist, --env: flags), a past-commit citation, and "previously/now/anymore" phrasing; reword regression-test docstrings as present-tense invariants. Also correct the AnsiText palette comment, which claimed background-code support the parser doesn't implement.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bash-script path composition must stay POSIX regardless of the host OS: join the Apptainer cache dir with '/' instead of pathlib (which rewrites to backslashes on Windows), compute bind-path parents with PurePosixPath, and count Windows drive paths as absolute in bind computation so a dev/test server on Windows composes the same script that path validation already accepts.
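
A short sketch of host-independent path handling along these lines, using only the standard library. The helper names are hypothetical:

```python
from pathlib import PurePosixPath, PureWindowsPath


def join_posix(base: str, *parts: str) -> str:
    """Join with '/' explicitly; pathlib.Path would use '\\' on Windows."""
    return "/".join([base.rstrip("/"), *parts])


def bind_parent(path: str) -> str:
    """Parent directory computed with POSIX rules on any host OS."""
    return str(PurePosixPath(path).parent)


def is_absolute_any(path: str) -> bool:
    """Treat both POSIX-absolute and Windows drive paths as absolute."""
    return (
        PurePosixPath(path).is_absolute()
        or PureWindowsPath(path).is_absolute()
    )
```

The `Pure*` classes never touch the filesystem, so the results are identical on Linux, macOS, and Windows.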

Skip the service-helper/publisher tests on Windows (bash there is the WSL stub, and the snippets only ever run on Linux compute nodes) and the poll-loop stop-race test (fcntl is POSIX-only), matching the existing requirements-check skip. Make the submit-assembly assertions OS-neutral: os.path.join for the OS-native work-dir log paths, as_posix() for the '/'-normalized bind path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Use an equals separator for Nextflow schema-derived parameters so values that begin with '-' stay attached to their intended flag while still preserving shell quoting for spaces. Keep the default space separator for other app parameters.

Co-authored-by: Codex <codex@openai.com>
Emit explicit boolean values for Nextflow pipeline parameters so false overrides true defaults. Apply the Nextflow-safe form during command generation for older cached manifests while preserving generic switch behavior for other apps.

Co-authored-by: Codex <codex@openai.com>
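
The two Nextflow rendering rules above (equals separator, explicit booleans) can be sketched together. `render_param` and its `nextflow` switch are hypothetical, and the exact `true`/`false` strings are an assumption:

```python
import shlex
from typing import Any, List


def render_param(flag: str, value: Any, nextflow: bool = False) -> List[str]:
    """Render one parameter as shell-ready command tokens."""
    if isinstance(value, bool):
        if nextflow:
            # Explicit value so a false can override a true default.
            return [f"{flag}={'true' if value else 'false'}"]
        # Generic switch: present when true, omitted when false.
        return [flag] if value else []
    text = shlex.quote(str(value))
    if nextflow:
        # '=' keeps values starting with '-' attached to their flag.
        return [f"{flag}={text}"]
    return [flag, text]
```

For example, `render_param("--range", "-5", nextflow=True)` gives a single token, so the value cannot be mistaken for another option.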