feat(async): add task::kill() for force-terminating async tasks#698
feat(async): add task::kill() for force-terminating async tasks#698ytakano wants to merge 22 commits into
Conversation
Adds `task::kill(task_id)` to `awkernel_async_lib`, which marks a task as `Terminated` and removes it from the global TASKS registry. Also fixes a race in `run_main()` where `Poll::Pending` could overwrite a `Terminated` state set by a concurrent `kill()` call. Phase 0 of the PCIe detach roadmap. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a kernel test application that verifies task::kill() semantics at runtime: killing a sleeping task removes it from the registry, killing an unknown ID returns false, and a second kill on a dead task returns false. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds resources_freed_after_kill() to the test_task_kill application. A DropTracker struct increments a global counter on drop; the test kills the owner task while it is sleeping and confirms the counter reaches 1 after the sleep timer fires and releases the last Arc<Task>. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In nographic mode QEMU provides no GOP framebuffer, so draw() returns Err(NoFrameBuffer) immediately. Instead of ignoring the error and looping every 5 seconds forever, probe on startup and return if there is nothing to draw — eliminating needless scheduler churn and making the service self-cleaning in headless environments. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When task::kill() removes a task from TASKS, the timer closure for that task's ongoing sleep() becomes the sole Arc<Task> holder. If the timer fires before the Arc refcount reaches zero, the closure calls waker.wake() while still holding the Sleep state mutex. For a Terminated task, wake() returns immediately and drops the Arc inline; that triggers Task::drop() → Future::drop() → Sleep::drop(), which tries to re-acquire the same state mutex — instant spinlock deadlock, hanging the timer thread and blocking all subsequent sleeps. Fix: capture the wake decision under the state lock, release the lock, then call waker.wake() only if the state was Wait. Sleep::drop() now sees State::Finished and exits without trying to lock. Reproducer: test_task_kill::resources_freed_after_kill() kills a task mid-sleep(200ms); the 200ms timer then fires and previously deadlocked the scheduler, leaving the test task's own sleep(300ms) never waking up. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This reverts commit 4769720.
There was a problem hiding this comment.
Pull request overview
This PR extends awkernel_async_lib with a task::kill(task_id) API to force-terminate async tasks and hardens the executor against a race where Poll::Pending could overwrite a concurrently-set Terminated state. It also adds an integration-style test application wired through userland/kernel features.
Changes:
- Add
task::kill(u32) -> boolthat sets task state toTerminatedand removes it from the globalTASKSregistry. - Fix
run_main()to avoid overwritingTerminated/Panickedback toWaitingonPoll::Pending. - Avoid a potential deadlock in
Sleeptimer callback by releasing its internal state lock before callingwaker.wake(), and add atest_task_killapp to validate behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| userland/src/lib.rs | Wires test_task_kill::run() into the userland entrypoint under a feature flag. |
| userland/Cargo.toml | Adds optional dependency + feature flag for the new test application. |
| kernel/Cargo.toml | Propagates the test_task_kill feature to userland. |
| awkernel_async_lib/src/task.rs | Implements task::kill() and fixes executor state handling under concurrent termination. |
| awkernel_async_lib/src/sleep_task.rs | Drops Sleep’s state lock before waking to prevent deadlock during kill-driven drops. |
| applications/tests/test_task_kill/src/lib.rs | Adds a test app covering kill success/unknown/idempotence and drop/resource release. |
| applications/tests/test_task_kill/Cargo.toml | Defines the new no-std test crate and its dependencies. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| /// | ||
| /// Returns `true` if the task was found and killed, `false` if it was not found or was | ||
| /// already in a terminal state. | ||
| pub fn kill(task_id: u32) -> bool { |
There was a problem hiding this comment.
MAJOR — kill() defers resource reclamation, possibly forever
kill() removes the task from TASKS and sets Terminated, but it never drops the task's future. The Arc<Task> (and therefore the future and all resources it owns) is freed only when the last Arc<Task> clone is dropped — i.e. when every Waker clone held outside TASKS is dropped.
For a task parked in Sleep, that Waker lives in the timer delta-list and is dropped only when the timer fires — so reclamation is deferred for up to the full remaining sleep duration. In kill_sleeping_task the target sleeps for 3600s, so after kill() its memory stays pinned for ~an hour even though the registry check passes.
Worse: a task parked on a Waker that never fires (e.g. a channel recv with no remaining sender, or any custom future that stashed its waker elsewhere) is leaked permanently — invisible, since it is already out of TASKS.
The doc says "the caller is responsible for deregistering them", but no API is exposed to deregister internal wakers, so callers generally cannot. Consider having kill() proactively reclaim: try_lock the future mutex and drop / fuse-terminate the inner future to release resources immediately. Note the in-flight-poll case (future lock held by a polling CPU): if try_lock fails, the Poll::Pending arm in run_main (around task.rs:884-895) currently also does not drop the future, so resources still are not freed — that arm should drop / terminate the future when it observes Terminated.
There was a problem hiding this comment.
Addressed. kill() now tries to drop the task future when it terminates a non-preempted task, and run_main also drops the future when it observes Terminated/Panicked after an in-flight poll returns Pending. For Preempted, kill_pending defers the termination until the next poll boundary, then removes the task from TASKS and drops the future there. I also updated the kill() docs to describe the in-flight poll and terminal-state behavior.
| } | ||
| }; | ||
| if should_wake { | ||
| waker.wake(); |
There was a problem hiding this comment.
MAJOR — killed-task Drop chain now runs under the global SLEEPING lock
The timer handler is invoked from SleepingTasks::wake_task() (scheduler.rs:327, handler()), which runs while the global SLEEPING mutex is held (scheduler.rs:383-384). With kill(), it is now common for waker.wake() here to drop the last Arc<Task> inline (state is Terminated → Task::wake returns and drops self). That triggers the full Task → future → nested-future Drop chain synchronously, under SLEEPING.
Releasing the sleep state lock before wake() (good fix!) only covers Sleep::drop. The broader invariant is now: arbitrary user Drop code runs while SLEEPING is held. Any Drop impl in a task's future that schedules a timer / sleeps / otherwise locks SLEEPING will self-deadlock — MCS locks are non-reentrant. Sleep::drop is safe today (locks only its own state), but this is a fragile, easy-to-violate invariant.
Consider deferring the final Arc<Task> drop until after SLEEPING is released (e.g. pop the fired handlers, drop the SLEEPING guard, then invoke them), and/or documenting the invariant.
There was a problem hiding this comment.
Addressed. I kept the Sleep-local state lock release, and also changed scheduler wakeup handling so expired sleep handlers are taken while SLEEPING is locked but invoked only after the SLEEPING guard is dropped. That avoids running waker.wake() and the possible final Arc/future Drop chain under the global SLEEPING lock.
| } | ||
| }; | ||
|
|
||
| // Step 2: Under the info lock, transition state to Terminated. |
There was a problem hiding this comment.
MINOR — kill() is asynchronous w.r.t. an in-flight poll
kill() returns true as soon as it sets Terminated, but if the task is being polled on another CPU the future keeps running to the end of that poll (cooperative — expected). Worth documenting that kill() does not interrupt an in-flight poll, and (per the resource-reclamation comment) that the future is not necessarily dropped even after that poll completes.
| kill_idempotent().await; | ||
| resources_freed_after_kill().await; | ||
|
|
||
| log::info!("TASK_KILL_TEST done"); |
There was a problem hiding this comment.
MINOR — test gap & timing fragility
kill_sleeping_task spawns a task that sleeps 3600s and only asserts registry removal — it cannot observe whether the task/future was actually reclaimed (the timer won't fire for an hour), so it gives false confidence that kill() "frees" the task. Only resources_freed_after_kill (200ms sleep) exercises reclamation, and it depends on the exact deferred-drop timing.
All cases also use fixed wall-clock sleeps (50/300ms), which are timing-fragile under load or a different timer resolution. Consider a shorter sleep + an explicit reclamation assertion for the "sleeping task" case.
| /// Sets the task state to `Terminated` and removes it from the global task registry. | ||
| /// Any subsequent `wake()` call for this task will be a no-op. If the task was in | ||
| /// `Waiting` state, wakers may still hold `Arc<Task>` references; the `Task` is freed | ||
| /// only after those wakers are dropped — the caller is responsible for deregistering them. |
There was a problem hiding this comment.
NIT — doc comment precision
"the caller is responsible for deregistering them" — clarify how (which API), or soften the wording, since no such API is exposed. Also worth noting: kill() returns false when the task became terminal between Step 1 and Step 2 (natural completion racing the kill), which is correct behavior but currently undocumented.
Review summary (concurrency focus)The The two substantive concerns are about lifecycle rather than the state machine:
Inline comments have details and suggestions. |
…ination - Introduce kill_pending flag in TaskInfo (cfg not no_preempt): when kill() is called on a State::Preempted task, set the flag instead of overwriting state. yield_preempted_and_wake_task() unconditionally writes State::Preempted which would clobber a direct State::Terminated write, so deferral is required. - Poll::Pending handler in run_main() checks take_kill_pending() and finalizes termination (state=Terminated + TASKS.remove) at the task's next await point. - Drop the Arc<Task> in run_main() before calling yield_and_pool(ctx) so the thread pool does not hold a stale reference. Without this drop, the pooled thread's task Arc would only be released on the next preemption, which may never happen after kill() — leaving the future (and its DropTracker) alive. - Fix do_preemption() TOCTOU: keep the Arc<Task> from the first get_task() lookup alive through the entire function, removing the second unwrap() lookup and handling the None case (task already removed by kill()) gracefully. - Add kill_preempted_task() test using PrioritizedRR to exercise the deferred kill path; verify kill() returns true, task leaves TASKS, and DropTracker fires. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yuuki Takano <ytakanoster@gmail.com>
Description
Adds
task::kill(task_id)toawkernel_async_lib, which marks a task asTerminatedand removes it from the global TASKS registry. Also fixes a race inrun_main()wherePoll::Pendingcould overwrite aTerminatedstate set by a concurrentkill()call.Adds files for AI agents also.
Related links
Issue #362 and #363.
How was this PR tested?
Qemu Test
Tested on a Qemu VM by using
applications/tests/test_task_kill.SPIN Model Chcker
Update
specification/awkernel_async_lib/src/task/preemptive_spinto handlekill()and tested it works correctly.Notes for reviewers