The Problem
This post is about a “weird” problem we encountered in GreptimeDB. And, a little spoiler, it’s about “async cancellation”.
Let’s first describe the (simplified) scenario. We observed metadata corruption in a long-running test. A series number was duplicated, though it should increase monotonically. The update logic is very straightforward: load the value from an in-memory atomic counter, persist the new series number to a file, and then update the in-memory counter. The entire procedure is serialized (file is a mutable reference):
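A minimal sketch of the procedure as described, not the actual GreptimeDB code. MetaFile, the String error type and the tiny block_on executor are hypothetical stand-ins so that the sketch runs with the standard library only; only the names update_metadata and persist_number come from the text.

```rust
use std::future::Future;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for the metadata file: it remembers the last number written.
#[derive(Default)]
pub struct MetaFile {
    pub persisted: Option<u64>,
}

/// Hypothetical persistence step; a real one would do async file I/O and may fail.
pub async fn persist_number(file: &mut MetaFile, number: u64) -> Result<(), String> {
    file.persisted = Some(number);
    Ok(())
}

/// Load the counter, persist the next number, then publish it in memory.
pub async fn update_metadata(counter: &AtomicU64, file: &mut MetaFile) -> Result<(), String> {
    let next_number = counter.load(Ordering::Relaxed) + 1;
    persist_number(file, next_number).await?; // early return if the write fails
    counter.store(next_number, Ordering::Relaxed); // only after a successful write
    Ok(())
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Busy-polling executor, present only so the sketch runs without a runtime crate.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Runs one update and reports (persisted number, in-memory counter).
pub fn run_one_update() -> (Option<u64>, u64) {
    let counter = AtomicU64::new(0);
    let mut file = MetaFile::default();
    block_on(update_metadata(&counter, &mut file)).expect("persist failed");
    (file.persisted, counter.load(Ordering::Relaxed))
}
```

When the future is driven to completion, the file and the counter agree, which is the behaviour we expected all along.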
For some reason we are not using fetch_add here, though it would work, and had we done so this story would not have happened 🤪 For example, we don’t want to update the in-memory counter when the procedure fails halfway, such as when persist_number() cannot write to the file. That is the case the ? sugar covers: we know this call may fail, and if it does, the caller returns early to propagate the error, so we handle it carefully with that in mind.
But things become tricky with .await: a hidden control flow appears because of async cancellation.
Async Cancellation
async task and runtime
If you have already figured out the shape of the “criminal”, you may want to skip this section. Otherwise, I’ll start with some pseudocode to show what happens at an “await point” and how it interacts with the runtime. First is poll_future, which comes from the Future’s poll function, as every async fn we write is desugared to an anonymous Future implementation.
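To make the desugaring concrete, here is a hand-written sketch of a parent future polling a subtask. The structs PersistNumber and UpdateMetadata are hypothetical and far simpler than what the compiler generates; they only illustrate the shape of poll.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical subtask: Pending on the first poll, Ready(number) on the second.
pub struct PersistNumber {
    number: u64,
    polled_once: bool,
}

impl Future for PersistNumber {
    type Output = u64;
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u64> {
        if self.polled_once {
            Poll::Ready(self.number)
        } else {
            self.polled_once = true; // pretend the write is still in flight
            Poll::Pending
        }
    }
}

/// Hand-written future that waits for the subtask, then updates a counter.
pub struct UpdateMetadata {
    subtask: PersistNumber,
    counter: u64,
}

impl Future for UpdateMetadata {
    type Output = u64;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        // Roughly what awaiting the subtask turns into: poll it...
        let polled = Pin::new(&mut self.subtask).poll(cx);
        match polled {
            // ...continue past the await point once it is ready...
            Poll::Ready(number) => {
                self.counter = number;
                Poll::Ready(self.counter)
            }
            // ...or hand control back to whoever polled us.
            Poll::Pending => Poll::Pending,
        }
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls the parent twice by hand and returns what each poll reported.
pub fn poll_twice() -> (Poll<u64>, Poll<u64>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut task = UpdateMetadata {
        subtask: PersistNumber { number: 7, polled_once: false },
        counter: 0,
    };
    let first = Pin::new(&mut task).poll(&mut cx);
    let second = Pin::new(&mut task).poll(&mut cx);
    (first, second)
}
```

Note that the Pending arm is a plain return: whether the function is ever entered again is entirely up to the caller.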
An async block usually contains other async functions, like update_metadata and persist_number. Say persist_number is a sub-async-task of update_metadata. Each .await point is expanded to something like poll_future: awaiting the subtask’s output and making progress when the subtask is ready. Here we need to wait until persist_number’s task returns Ready before we update the counter; otherwise we cannot do it.
The second one is a (toy) runtime, which is responsible for polling the futures delivered to it. In GreptimeDB we use tokio as our runtime. An async runtime may have tons of features and logic, but the most basic one is to poll: as the name tells, keep running unfinished tasks until they finish (but considering what I’m going to write later, “until” might not be the proper word).
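A toy runtime along these lines can be sketched with the standard library alone. ToyRuntime and YieldOnce are hypothetical names; the runtime busy-polls and ignores wake-ups, which no real runtime such as tokio would do.

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pending on the first poll, Ready on the second: stands in for slow I/O.
pub struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Toy runtime: a queue of tasks, polled round-robin until none is left.
#[derive(Default)]
pub struct ToyRuntime {
    tasks: VecDeque<Task>,
    pub polls: usize,
}

impl ToyRuntime {
    pub fn spawn(&mut self, fut: impl Future<Output = ()> + 'static) {
        self.tasks.push_back(Box::pin(fut));
    }

    pub fn run(&mut self) {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        while let Some(mut task) = self.tasks.pop_front() {
            self.polls += 1;
            // An unfinished task goes back to the queue; a finished one is dropped.
            if task.as_mut().poll(&mut cx).is_pending() {
                self.tasks.push_back(task);
            }
        }
    }
}

/// Spawns two tasks that each yield once; returns (finished tasks, total polls).
pub fn run_two_tasks() -> (u32, usize) {
    let finished = Rc::new(Cell::new(0u32));
    let mut rt = ToyRuntime::default();
    for _ in 0..2 {
        let finished = Rc::clone(&finished);
        rt.spawn(async move {
            YieldOnce(false).await;
            finished.set(finished.get() + 1);
        });
    }
    rt.run();
    (finished.get(), rt.polls)
}
```

Nothing in run() forces the push_back: a runtime that decides not to re-queue a pending task has effectively cancelled it.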
That is it, a very minimalist model of future and runtime. Combining these two functions, you will find that in some aspects it is just a loop (again, I’ve omitted lots of details to stay on topic; the real world is way more complex). I want to stress that each .await implies one or more function calls (calls to poll() or poll_future()). This is the “hidden control flow” in the title and the place where cancellation takes effect.
Knowing this is not hard, but reasoning with it is not easy (at least for me). I stared at these lines for minutes after narrowing the scope down to something as simple as the first code snippet. I knew the problem was definitely in the .await, but I don’t know whether too many successful async calls had numbed me or my mental model just hadn’t linked these two points. The bulky garbage, me, spent a whole sleepless night doubting life and the world.
cancellation
So far this is the standard part. We will now talk about cancellation, which is runtime-dependent. Though many runtimes in Rust behave similarly, this is not a required feature, i.e., a runtime may not support cancellation at all, like this toy one. I’ll take tokio as an example because that is where the story happens. Other runtimes may be similar.
In tokio, one can use JoinHandle::abort() to cancel a task. A task has a “cancel marker bit” that tracks whether it has been cancelled, and if the runtime finds that a task is cancelled, it kills that task (code from here):
The theory behind async cancellation is also elementary: the runtime simply gives up polling your task before it has finished, just like ?, or even tougher, because we cannot catch this cancellation the way we catch an Err. But does it mean that we need to take care of every single .await? That would be very annoying. Take this metadata update as an example. If we had to consider this, we would need to check whether the file is consistent with the in-memory state and revert the persisted change if it is not. Well… 🫠 In some aspects, yes. The runtime can literally do anything to your future. But the good thing is that most of them are disciplined.
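The effect on the metadata update can be reproduced without any runtime: poll the future once, then drop it. This is a sketch under assumptions, not tokio’s mechanism: MetaFile and YieldOnce are hypothetical, and persist_number is assumed to write first and only then hit an await point.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pending on the first poll, Ready on the second: stands in for slow I/O.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Hypothetical stand-in for the metadata file: every persisted number is appended.
#[derive(Default)]
pub struct MetaFile {
    pub written: Vec<u64>,
}

async fn persist_number(file: &mut MetaFile, number: u64) -> Result<(), String> {
    file.written.push(number); // the write itself lands...
    YieldOnce(false).await; // ...then we wait, e.g. for a flush
    Ok(())
}

async fn update_metadata(counter: &AtomicU64, file: &mut MetaFile) -> Result<(), String> {
    let next_number = counter.load(Ordering::Relaxed) + 1;
    persist_number(file, next_number).await?;
    counter.store(next_number, Ordering::Relaxed); // never reached if dropped above
    Ok(())
}

/// Cancels the first update halfway, completes a second one, returns the file content.
pub fn cancel_then_retry() -> Vec<u64> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let counter = AtomicU64::new(0);
    let mut file = MetaFile::default();

    // First attempt: poll once, then drop the future. That is all "cancellation" is.
    let mut first = Box::pin(update_metadata(&counter, &mut file));
    assert!(first.as_mut().poll(&mut cx).is_pending());
    drop(first);

    // Second attempt: polled to completion, starting from the stale counter.
    let mut second = Box::pin(update_metadata(&counter, &mut file));
    while second.as_mut().poll(&mut cx).is_pending() {}
    drop(second);

    file.written
}
```

The file ends up with the same series number twice, with no error returned and no panic anywhere.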
Runtime Behavior
This section will discuss what I would expect from a runtime and what we can get for now.
marker trait
I want the runtime not to cancel my task unconditionally, and I turn to the type system for help. That is, I wonder whether there could be a marker trait like CancelSafe. As for the term cancellation safety, tokio describes it in its documentation:
To determine whether your own methods are cancellation safe, look for the location of uses of .await. This is because when an asynchronous method is cancelled, that always happens at an .await. If your function behaves correctly even if it is restarted while waiting at an .await, then it is cancellation safe.
That is, whether a task is safe to be cancelled. This is definitely an “attribute” of an async task. From the link above you can find tokio’s long list of what in the library is cancel safe and what isn’t. And, in some ways, I think it’s just like the UnwindSafe marker. Both are about “this sort of control flow is not always anticipated” and “has the possibility of causing subtle bugs”.
With such a CancelSafe trait, we can tell the runtime whether our spawned future is OK to be cancelled, and we promise that the “cancelling” control flow is carefully handled. Leaving it out means we don’t want the task to be cancelled. Simple and clear. It is also a way for runtimes to require their users (like you and me) to check whether their tasks can be cancelled. Take timeout() as an example:
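The idea can be sketched as a trait bound. Everything here is hypothetical: neither CancelSafe nor AssertCancelSafe exists in tokio or std, and with_poll_budget is a stand-in for timeout() that counts polls instead of measuring time so the sketch needs no timer.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical marker trait: "dropping this future at any await point is fine".
pub trait CancelSafe {}

/// Hypothetical opt-in wrapper, in the spirit of std's AssertUnwindSafe.
pub struct AssertCancelSafe<F>(pub Pin<Box<F>>);

impl<F> CancelSafe for AssertCancelSafe<F> {}

impl<F: Future> Future for AssertCancelSafe<F> {
    type Output = F::Output;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<F::Output> {
        self.0.as_mut().poll(cx)
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Stand-in for a timeout: poll at most `budget` times, then give up.
/// Only futures that promise to be cancel safe are accepted.
pub fn with_poll_budget<F>(fut: F, budget: usize) -> Option<F::Output>
where
    F: Future + CancelSafe,
{
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    for _ in 0..budget {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return Some(out);
        }
    }
    None // `fut` is dropped here, i.e. cancelled
}

/// Pending on the first poll, Ready on the second.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

async fn two_step_job() -> u32 {
    YieldOnce(false).await;
    42
}

/// Returns (result with a budget of 2 polls, result with a budget of 1 poll).
pub fn budget_demo() -> (Option<u32>, Option<u32>) {
    // `with_poll_budget(two_step_job(), 2)` would not compile: no CancelSafe promise.
    let finished = with_poll_budget(AssertCancelSafe(Box::pin(two_step_job())), 2);
    let cancelled = with_poll_budget(AssertCancelSafe(Box::pin(two_step_job())), 1);
    (finished, cancelled)
}
```

The wrapper moves the decision to the call site: whoever writes AssertCancelSafe has to have looked at the await points first.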
voluntary cancel
Another approach is to cancel voluntarily, like cooperative cancellation in Kotlin, which has an isActive property for a task to check whether it has been cancelled. This is only a check; whether to cancel or not is entirely up to the task itself. I paste an example from Kotlin’s documentation below; the “cooperative cancellation” happens in line 5. This approach brings the “hidden control flow” onto the table and makes it more natural to consider and handle cancellation, just like Option or Result.
This is not hard to achieve, in my opinion. Tokio already has the Cancelled bit and CancellationToken, but they look a bit different from what I describe. And after all of this, we still need the runtime to hand the right to cancel back to our task; otherwise the situation would not be much different.
explicit detach
Can we force the runtime not to cancel our tasks at present? In tokio we can “detach” a task to the background by dropping its JoinHandle. A detached task has no foreground handle to it, so, in some aspects, others cannot wrap a timeout or select over it, which makes it un-cancellable. The problem from the very beginning is solved in this way.
A JoinHandle detaches the associated task when it is dropped, which means that there is no longer any handle to the task, and no way to join on it.
Though the functionality exists, I wonder if it would be better to have an explicit detach method like glommio's, or even a detach method in the runtime, like spawn, that doesn’t return a JoinHandle. But these are trifles. A runtime usually won’t cancel a task for no reason; in most cases the cancellation is requested by the user. But sometimes you don’t notice it, like the unselected branches in select, or the logic in tonic’s request handler. And if we are sure that a task is not ready for cancellation, explicit detach may save it from tragedy.
Back To The Problem
So far everything is clear. Let’s start to wipe out this bug! First, why was our future cancelled? Through the function call graph we can easily find that the entire procedure is executed in place in tonic’s request listening runtime, and it’s common for a network request to have a timeout. The solution is also simple: detach the server processing logic into another runtime to prevent it from being cancelled together with the request. Only a few lines:
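The shape of the fix can be illustrated with the standard library only. This is an analogy, not the actual change: a worker thread with a tiny block_on stands in for the second tokio runtime, an mpsc channel stands in for the handle the request awaits, and Metadata and YieldOnce are hypothetical.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pending on the first poll, Ready on the second: stands in for slow I/O.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Busy-polling executor: once started, it always drives the future to the end.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

/// Hypothetical metadata: the persisted numbers plus the in-memory counter.
#[derive(Default)]
pub struct Metadata {
    pub file: Vec<u64>,
    pub counter: u64,
}

async fn update_metadata(meta: &mut Metadata) {
    let next_number = meta.counter + 1;
    meta.file.push(next_number);
    YieldOnce(false).await;
    meta.counter = next_number;
}

/// The request gives up waiting, yet the update still completes; returns (file, counter).
pub fn detached_update() -> (Vec<u64>, u64) {
    let (reply_tx, reply_rx) = mpsc::channel::<u64>();

    // The worker owns the work, so nothing on the request side can drop it halfway.
    let worker = thread::spawn(move || {
        let mut meta = Metadata::default();
        block_on(update_metadata(&mut meta));
        let _ = reply_tx.send(meta.counter); // the requester may already be gone
        meta
    });

    // The request is cancelled: it stops waiting for the reply.
    drop(reply_rx);

    let meta = worker.join().expect("worker panicked");
    (meta.file, meta.counter)
}
```

Cancelling the request now only discards the reply; the update itself runs to the end on the other side.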
Now all is fine.