Skip to content

feat(async): add task::kill() for force-terminating async tasks#698

Open
ytakano wants to merge 22 commits into
tier4:mainfrom
ytakano:feat/task-kill
Open

feat(async): add task::kill() for force-terminating async tasks#698
ytakano wants to merge 22 commits into
tier4:mainfrom
ytakano:feat/task-kill

Conversation

@ytakano

@ytakano ytakano commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator

Description

Adds task::kill(task_id) to awkernel_async_lib, which marks a task as Terminated and removes it from the global TASKS registry. Also fixes a race in run_main() where Poll::Pending could overwrite a Terminated state set by a concurrent kill() call.

Adds files for AI agents also.

Related links

Issue #362 and #363.

How was this PR tested?

Qemu Test

Tested on a Qemu VM by using applications/tests/test_task_kill.

Welcome to Awkernel!

You can use BLisp language as follows.
https://ytakano.github.io/blisp/

> (factorial 20)
2432902008176640000
> (+ 10 20)
30

Enjoy!

> [         5112 INFO] TASK_KILL_TEST start
[         5134 INFO] TASK_KILL_TEST kill_sleeping_task: PASS (kill returned true)
[         5135 INFO] TASK_KILL_TEST task_removed_from_registry: PASS
[         5135 INFO] TASK_KILL_TEST kill_sleeping_task: PASS (resources freed)
[         5135 INFO] TASK_KILL_TEST kill_unknown_id: PASS
[         5186 INFO] TASK_KILL_TEST kill_idempotent: PASS (second kill returned false)
[         5237 INFO] TASK_KILL_TEST resources_freed_after_kill: PASS
[         5237 INFO] TASK_KILL_TEST kill_preempted_task: step 1 start
[         5239 INFO] TASK_KILL_TEST kill_preempted_task: step 2 spawned id=3
[         5290 INFO] TASK_KILL_TEST kill_preempted_task: step 3 after sleep 50ms
[         5290 INFO] TASK_KILL_TEST kill_preempted_task: step 4 kill=true
[         5591 INFO] TASK_KILL_TEST kill_preempted_task: step 5 after sleep 300ms
[         5591 INFO] TASK_KILL_TEST kill_preempted_task: PASS
[         5592 INFO] TASK_KILL_TEST kill_panicked_task: step 1 start
[         5592 ERROR] kernel/src/nostd.rs:162: panic: panicked at applications/tests/test_task_kill/src/lib.rs:171:13:
intentional panic for kill semantics test
[         5598 INFO] 00:        0x000000005c4b03 - UNKNOWN
[         5600 INFO] 01:        0x000000004313e8 - UNKNOWN
[         5602 INFO] 02:        0x0000000061097b - UNKNOWN
[         5604 INFO] 03:        0x00000000430ed1 - UNKNOWN
[         5607 INFO] 04:        0x0000000059869d - UNKNOWN
[         5608 INFO] 05:        0x0000000059661b - UNKNOWN
[         5610 INFO] 06:        0x000000004261ae - UNKNOWN
[         5610 INFO] 07:        0x0000000042ce1f - UNKNOWN
[         5610 INFO] 08:        0x000000000000fb - UNKNOWN
[         5692 INFO] TASK_KILL_TEST kill_panicked_task: PASS (panicked task removed from registry)
[         5693 INFO] TASK_KILL_TEST kill_panicked_task: PASS (kill returned false)
[         5694 INFO] TASK_KILL_TEST done

SPIN Model Chcker

Update specification/awkernel_async_lib/src/task/preemptive_spin to handle kill() and tested it works correctly.

Notes for reviewers

ytakano and others added 6 commits June 16, 2026 17:43
Adds `task::kill(task_id)` to `awkernel_async_lib`, which marks a task
as `Terminated` and removes it from the global TASKS registry. Also fixes
a race in `run_main()` where `Poll::Pending` could overwrite a `Terminated`
state set by a concurrent `kill()` call. Phase 0 of the PCIe detach roadmap.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds a kernel test application that verifies task::kill() semantics at
runtime: killing a sleeping task removes it from the registry, killing an
unknown ID returns false, and a second kill on a dead task returns false.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds resources_freed_after_kill() to the test_task_kill application.
A DropTracker struct increments a global counter on drop; the test
kills the owner task while it is sleeping and confirms the counter
reaches 1 after the sleep timer fires and releases the last Arc<Task>.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
In nographic mode QEMU provides no GOP framebuffer, so draw() returns
Err(NoFrameBuffer) immediately.  Instead of ignoring the error and
looping every 5 seconds forever, probe on startup and return if there
is nothing to draw — eliminating needless scheduler churn and making
the service self-cleaning in headless environments.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When task::kill() removes a task from TASKS, the timer closure for that
task's ongoing sleep() becomes the sole Arc<Task> holder.  If the timer
fires before the Arc refcount reaches zero, the closure calls waker.wake()
while still holding the Sleep state mutex.  For a Terminated task, wake()
returns immediately and drops the Arc inline; that triggers Task::drop() →
Future::drop() → Sleep::drop(), which tries to re-acquire the same state
mutex — instant spinlock deadlock, hanging the timer thread and blocking
all subsequent sleeps.

Fix: capture the wake decision under the state lock, release the lock,
then call waker.wake() only if the state was Wait.  Sleep::drop() now
sees State::Finished and exits without trying to lock.

Reproducer: test_task_kill::resources_freed_after_kill() kills a task
mid-sleep(200ms); the 200ms timer then fires and previously deadlocked
the scheduler, leaving the test task's own sleep(300ms) never waking up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends awkernel_async_lib with a task::kill(task_id) API to force-terminate async tasks and hardens the executor against a race where Poll::Pending could overwrite a concurrently-set Terminated state. It also adds an integration-style test application wired through userland/kernel features.

Changes:

  • Add task::kill(u32) -> bool that sets task state to Terminated and removes it from the global TASKS registry.
  • Fix run_main() to avoid overwriting Terminated/Panicked back to Waiting on Poll::Pending.
  • Avoid a potential deadlock in Sleep timer callback by releasing its internal state lock before calling waker.wake(), and add a test_task_kill app to validate behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
userland/src/lib.rs Wires test_task_kill::run() into the userland entrypoint under a feature flag.
userland/Cargo.toml Adds optional dependency + feature flag for the new test application.
kernel/Cargo.toml Propagates the test_task_kill feature to userland.
awkernel_async_lib/src/task.rs Implements task::kill() and fixes executor state handling under concurrent termination.
awkernel_async_lib/src/sleep_task.rs Drops Sleep’s state lock before waking to prevent deadlock during kill-driven drops.
applications/tests/test_task_kill/src/lib.rs Adds a test app covering kill success/unknown/idempotence and drop/resource release.
applications/tests/test_task_kill/Cargo.toml Defines the new no-std test crate and its dependencies.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread awkernel_async_lib/src/task.rs Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
///
/// Returns `true` if the task was found and killed, `false` if it was not found or was
/// already in a terminal state.
pub fn kill(task_id: u32) -> bool {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR — kill() defers resource reclamation, possibly forever

kill() removes the task from TASKS and sets Terminated, but it never drops the task's future. The Arc<Task> (and therefore the future and all resources it owns) is freed only when the last Arc<Task> clone is dropped — i.e. when every Waker clone held outside TASKS is dropped.

For a task parked in Sleep, that Waker lives in the timer delta-list and is dropped only when the timer fires — so reclamation is deferred for up to the full remaining sleep duration. In kill_sleeping_task the target sleeps for 3600s, so after kill() its memory stays pinned for ~an hour even though the registry check passes.

Worse: a task parked on a Waker that never fires (e.g. a channel recv with no remaining sender, or any custom future that stashed its waker elsewhere) is leaked permanently — invisible, since it is already out of TASKS.

The doc says "the caller is responsible for deregistering them", but no API is exposed to deregister internal wakers, so callers generally cannot. Consider having kill() proactively reclaim: try_lock the future mutex and drop / fuse-terminate the inner future to release resources immediately. Note the in-flight-poll case (future lock held by a polling CPU): if try_lock fails, the Poll::Pending arm in run_main (around task.rs:884-895) currently also does not drop the future, so resources still are not freed — that arm should drop / terminate the future when it observes Terminated.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. kill() now tries to drop the task future when it terminates a non-preempted task, and run_main also drops the future when it observes Terminated/Panicked after an in-flight poll returns Pending. For Preempted, kill_pending defers the termination until the next poll boundary, then removes the task from TASKS and drops the future there. I also updated the kill() docs to describe the in-flight poll and terminal-state behavior.

}
};
if should_wake {
waker.wake();

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAJOR — killed-task Drop chain now runs under the global SLEEPING lock

The timer handler is invoked from SleepingTasks::wake_task() (scheduler.rs:327, handler()), which runs while the global SLEEPING mutex is held (scheduler.rs:383-384). With kill(), it is now common for waker.wake() here to drop the last Arc<Task> inline (state is TerminatedTask::wake returns and drops self). That triggers the full Task → future → nested-future Drop chain synchronously, under SLEEPING.

Releasing the sleep state lock before wake() (good fix!) only covers Sleep::drop. The broader invariant is now: arbitrary user Drop code runs while SLEEPING is held. Any Drop impl in a task's future that schedules a timer / sleeps / otherwise locks SLEEPING will self-deadlock — MCS locks are non-reentrant. Sleep::drop is safe today (locks only its own state), but this is a fragile, easy-to-violate invariant.

Consider deferring the final Arc<Task> drop until after SLEEPING is released (e.g. pop the fired handlers, drop the SLEEPING guard, then invoke them), and/or documenting the invariant.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed. I kept the Sleep-local state lock release, and also changed scheduler wakeup handling so expired sleep handlers are taken while SLEEPING is locked but invoked only after the SLEEPING guard is dropped. That avoids running waker.wake() and the possible final Arc/future Drop chain under the global SLEEPING lock.

}
};

// Step 2: Under the info lock, transition state to Terminated.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR — kill() is asynchronous w.r.t. an in-flight poll

kill() returns true as soon as it sets Terminated, but if the task is being polled on another CPU the future keeps running to the end of that poll (cooperative — expected). Worth documenting that kill() does not interrupt an in-flight poll, and (per the resource-reclamation comment) that the future is not necessarily dropped even after that poll completes.

kill_idempotent().await;
resources_freed_after_kill().await;

log::info!("TASK_KILL_TEST done");

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MINOR — test gap & timing fragility

kill_sleeping_task spawns a task that sleeps 3600s and only asserts registry removal — it cannot observe whether the task/future was actually reclaimed (the timer won't fire for an hour), so it gives false confidence that kill() "frees" the task. Only resources_freed_after_kill (200ms sleep) exercises reclamation, and it depends on the exact deferred-drop timing.

All cases also use fixed wall-clock sleeps (50/300ms), which are timing-fragile under load or a different timer resolution. Consider a shorter sleep + an explicit reclamation assertion for the "sleeping task" case.

Comment thread awkernel_async_lib/src/task.rs Outdated
/// Sets the task state to `Terminated` and removes it from the global task registry.
/// Any subsequent `wake()` call for this task will be a no-op. If the task was in
/// `Waiting` state, wakers may still hold `Arc<Task>` references; the `Task` is freed
/// only after those wakers are dropped — the caller is responsible for deregistering them.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT — doc comment precision

"the caller is responsible for deregistering them" — clarify how (which API), or soften the wording, since no such API is exposed. Also worth noting: kill() returns false when the task became terminal between Step 1 and Step 2 (natural completion racing the kill), which is correct behavior but currently undocumented.

@ytakano

ytakano commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

Review summary (concurrency focus)

The kill() lock protocol (TASKS → info → TASKS, never nested), the run_main Poll::Pending guard against resurrecting a killed task, and the sleep state-lock-release deadlock fix are all correct and carefully reasoned. I also checked that NUM_TASK_IN_QUEUE stays balanced (a killed-but-queued task is decremented when popped in get_next_task) and that the registry protocol is safe against id reuse — both look fine.

The two substantive concerns are about lifecycle rather than the state machine:

  1. Resource reclamationkill() never drops the future, so reclamation is deferred until the last Waker drops (up to the full sleep duration, or never for a waker that never fires → permanent leak).
  2. Drop-under-SLEEPING — with kill() it is now common for the last Arc<Task> to be dropped inside the timer handler while the global SLEEPING lock is held, so arbitrary user Drop code runs under that lock.

Inline comments have details and suggestions.

ytakano and others added 14 commits June 17, 2026 07:43
…ination

- Introduce kill_pending flag in TaskInfo (cfg not no_preempt): when kill()
  is called on a State::Preempted task, set the flag instead of overwriting
  state. yield_preempted_and_wake_task() unconditionally writes State::Preempted
  which would clobber a direct State::Terminated write, so deferral is required.
- Poll::Pending handler in run_main() checks take_kill_pending() and finalizes
  termination (state=Terminated + TASKS.remove) at the task's next await point.
- Drop the Arc<Task> in run_main() before calling yield_and_pool(ctx) so the
  thread pool does not hold a stale reference. Without this drop, the pooled
  thread's task Arc would only be released on the next preemption, which may
  never happen after kill() — leaving the future (and its DropTracker) alive.
- Fix do_preemption() TOCTOU: keep the Arc<Task> from the first get_task()
  lookup alive through the entire function, removing the second unwrap() lookup
  and handling the None case (task already removed by kill()) gracefully.
- Add kill_preempted_task() test using PrioritizedRR to exercise the deferred
  kill path; verify kill() returns true, task leaves TASKS, and DropTracker fires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Yuuki Takano <ytakanoster@gmail.com>
Signed-off-by: Yuuki Takano <ytakanoster@gmail.com>
Signed-off-by: Yuuki Takano <ytakanoster@gmail.com>
@ytakano ytakano marked this pull request as ready for review June 18, 2026 01:14
@ytakano ytakano requested a review from mkakh June 18, 2026 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants