Let’s first describe the (simplified) scenario. We observed metadata corruption in a long-running test: a series number was duplicated, although it should increase monotonically. The update logic is very straightforward: load the value from an in-memory atomic counter, persist the new series number to the file, and then update the in-memory counter. The entire procedure is serialized (`file` is a mutable reference):
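A minimal sketch of that procedure, written against the standard library only so that it runs without tokio. `MetaFile`, `update_metadata`, the body of `persist_number` and the tiny `block_on` driver are all assumptions made for illustration, not GreptimeDB's actual code:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for the metadata file: remembers the last number written.
#[derive(Default)]
pub struct MetaFile {
    pub persisted: Option<u64>,
}

/// Hypothetical async write. A real one would await file I/O and could fail.
pub async fn persist_number(file: &mut MetaFile, number: u64) -> Result<(), String> {
    file.persisted = Some(number);
    Ok(())
}

/// Load, persist, then publish: the counter changes only after the write succeeded.
pub async fn update_metadata(counter: &AtomicU64, file: &mut MetaFile) -> Result<u64, String> {
    let next = counter.load(Ordering::Relaxed) + 1;
    persist_number(file, next).await?; // early return on error, counter untouched
    counter.store(next, Ordering::Relaxed);
    Ok(next)
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling driver so the sketch runs without an async runtime.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    let counter = AtomicU64::new(0);
    let mut file = MetaFile::default();
    let first = block_on(update_metadata(&counter, &mut file));
    let second = block_on(update_metadata(&counter, &mut file));
    println!("{first:?} {second:?} {:?}", file.persisted);
}
```

As long as every call runs to completion, the file and the counter move in lockstep.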
For some reason, we are not using `fetch_add` here, though it does work, and if we had done so, this story would not have happened 🤪 For example, we don’t want to update the in-memory counter when the procedure fails halfway, such as when `persist_number()` cannot write to the file. Here the sweet sugar `?` stands for exactly that situation. We know clearly that this function call may fail, and that if it fails the caller returns early to propagate the error. So we handle it carefully with that in mind.
But things become tricky with `.await`: a hidden control flow appears because of async cancellation.
If you have already figured out the shape of the “criminal”, you may want to skip this section. Otherwise, I’ll start with some pseudocode to show what happens at an “await point” and how it interacts with the runtime. First is `poll_future`. It comes from the `poll` function of the `Future` trait, as every `async fn` we write is desugared into an anonymous type that implements `Future`.
An `async` block usually contains other async functions; here, `persist_number` is a sub-async-task of the update procedure. Every `.await` point is expanded to something like `poll_future`: awaiting the subtask’s output and making progress when the subtask is ready. Here we need to wait until `persist_number`’s task returns `Ready` before we update the counter; otherwise we cannot do it.
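To make the expansion concrete, here is a hand-written state machine standing in for what the compiler generates. It is only a sketch under assumptions: `Persist`, `Update` and `drive` are hypothetical names, and the real desugaring is more involved. It shows that a single `.await` turns into repeated `poll` calls on the subtask:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical subtask: reports Pending a fixed number of times, then Ready.
pub struct Persist {
    pub pending_polls: u32,
}

impl Future for Persist {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.pending_polls == 0 {
            return Poll::Ready(());
        }
        self.pending_polls -= 1;
        cx.waker().wake_by_ref(); // ask to be polled again
        Poll::Pending
    }
}

/// Hand-written stand-in for `async { persist.await; counter += 1; counter }`.
pub struct Update {
    pub persist: Persist,
    pub counter: u64,
    pub polls: u32, // how many times the outer future was polled
}

impl Future for Update {
    type Output = u64;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<u64> {
        self.polls += 1;
        // This is the `.await`: poll the subtask and propagate Pending upwards.
        let sub = Pin::new(&mut self.persist).poll(cx);
        match sub {
            Poll::Pending => Poll::Pending,
            Poll::Ready(()) => {
                // Only reached once the subtask is Ready.
                self.counter += 1;
                Poll::Ready(self.counter)
            }
        }
    }
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Polls `fut` to completion; returns its output and how often it was polled.
pub fn drive(mut fut: Update) -> (u64, u32) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    loop {
        let polled = Pin::new(&mut fut).poll(&mut cx);
        if let Poll::Ready(out) = polled {
            return (out, fut.polls);
        }
    }
}

fn main() {
    let update = Update { persist: Persist { pending_polls: 2 }, counter: 41, polls: 0 };
    let (value, polls) = drive(update);
    println!("counter = {value} after {polls} polls");
}
```

Every `Pending` returned by the subtask is a point where the caller gets control back, and nothing forces the caller to come back for another round.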
The second one is a (toy) runtime, which is responsible for polling the futures delivered to it. In GreptimeDB we use `tokio` as our runtime. An async runtime may have tons of features and logic, but the most basic one is to poll: as the name tells, keep running unfinished tasks until they finish (though, considering what I’m going to write later, “until” might not be the proper word).
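A toy runtime of this kind fits in a few lines. The sketch below is an assumption-laden illustration (`ToyRuntime`, `YieldOnce` and `demo` are made-up names, and it ignores wakers, I/O and timers entirely); it has nothing to do with how tokio is implemented:

```rust
use std::cell::RefCell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

type Task = Pin<Box<dyn Future<Output = ()>>>;

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy runtime: a queue of tasks polled round-robin. No cancellation support.
#[derive(Default)]
pub struct ToyRuntime {
    queue: VecDeque<Task>,
}

impl ToyRuntime {
    pub fn spawn(&mut self, fut: impl Future<Output = ()> + 'static) {
        self.queue.push_back(Box::pin(fut));
    }

    /// Keeps polling unfinished tasks; returns the total number of polls.
    pub fn run(&mut self) -> usize {
        let waker = Waker::from(Arc::new(NoopWake));
        let mut cx = Context::from_waker(&waker);
        let mut polls = 0;
        while let Some(mut task) = self.queue.pop_front() {
            polls += 1;
            if task.as_mut().poll(&mut cx).is_pending() {
                self.queue.push_back(task); // not finished, try again later
            }
        }
        polls
    }
}

/// Hypothetical helper: Pending on the first poll, Ready on the second.
pub struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Runs two tasks and returns the poll count plus the order of their log lines.
pub fn demo() -> (usize, Vec<String>) {
    let log = Rc::new(RefCell::new(Vec::new()));
    let mut rt = ToyRuntime::default();
    for name in ["a", "b"] {
        let log = Rc::clone(&log);
        rt.spawn(async move {
            log.borrow_mut().push(format!("{name}: start"));
            YieldOnce(false).await;
            log.borrow_mut().push(format!("{name}: done"));
        });
    }
    let polls = rt.run();
    let lines = log.borrow().clone();
    (polls, lines)
}

fn main() {
    let (polls, lines) = demo();
    println!("{polls} polls: {lines:?}");
}
```

The two tasks interleave at their await points, which is all the "scheduling" this toy does.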
That is it: a very minimalist model of future and runtime. Combining these two functions, you will find that, in some aspects, it is just a loop (again, I’ve omitted lots of details to stay on topic; the real world is way more complex). I want to stress that each `.await` implies one or more function calls (calls to `poll_future()`). This is the “hidden control flow” in the title, and the place where cancellation takes effect.
Knowing it is not hard, but thinking of it is not easy (at least for me). I stared at these lines for minutes after narrowing the scope down to something as simple as the first code snippet. I knew the problem was definitely in the `.await`, but I don’t know whether too many successful async calls had numbed me or my mental model simply hadn’t linked these two points. The bulky garbage, me, spent a whole sleepless night doubting life and the world.
So far this is the standard part. We will then talk about cancellation, which is runtime-dependent. Though many runtimes in Rust behave similarly, cancellation is not a required feature, i.e., a runtime may not support cancellation at all, like this toy. I’ll take tokio as an example because the story happens there. Other runtimes may be similar.
In tokio, one can use `JoinHandle::abort()` to cancel a task. Each task has a “cancel marker bit” that tracks whether it has been cancelled. If the runtime finds that a task is cancelled, it will kill that task (code from here):
The theory behind async cancellation is also elementary. The runtime simply gives up polling your task before it has finished, just like `?`, or even tougher, because we cannot catch this cancellation the way we catch an `Err`. But does it mean that we need to take care of every single `.await`? That would be very annoying. Take this metadata update as an example: if we had to consider this, we would need to check whether the file is consistent with the in-memory state and revert the persisted change if an inconsistency is found. Well… in some aspects, yes. The runtime can literally do anything to your future. But the good thing is that most of them are disciplined.
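Here is a small reproduction of the duplicated number, again a sketch with the standard library only. The names (`MetaFile`, `YieldOnce`, `update_metadata`, `demo`) are hypothetical, and "cancellation" is modelled as what it boils down to: the future is polled once and then dropped.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Pending once, then Ready: stands in for waiting on a flush to finish.
struct YieldOnce(bool);

impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Hypothetical metadata file that keeps every series number written to it.
#[derive(Default)]
pub struct MetaFile {
    pub written: Vec<u64>,
}

async fn persist_number(file: &mut MetaFile, number: u64) -> Result<(), String> {
    file.written.push(number);
    YieldOnce(false).await; // the task may be dropped while parked here
    Ok(())
}

async fn update_metadata(counter: &AtomicU64, file: &mut MetaFile) -> Result<u64, String> {
    let next = counter.load(Ordering::Relaxed) + 1;
    persist_number(file, next).await?;
    counter.store(next, Ordering::Relaxed); // never reached if cancelled above
    Ok(next)
}

/// Returns the numbers written to the file and the final counter value.
pub fn demo() -> (Vec<u64>, u64) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let counter = AtomicU64::new(0);
    let mut file = MetaFile::default();

    // First attempt: polled once, then dropped. Dropping is the cancellation.
    let mut fut = Box::pin(update_metadata(&counter, &mut file));
    assert!(fut.as_mut().poll(&mut cx).is_pending());
    drop(fut);

    // Second attempt runs to completion and reuses the stale counter value.
    let mut fut = Box::pin(update_metadata(&counter, &mut file));
    while fut.as_mut().poll(&mut cx).is_pending() {}
    drop(fut);

    (file.written, counter.load(Ordering::Relaxed))
}

fn main() {
    let (written, counter) = demo();
    println!("file saw {written:?}, counter is {counter}");
}
```

The first attempt leaves the file ahead of the counter, so the second attempt persists the same series number again. No error is returned anywhere, which is what makes this harder to spot than a failing `?`.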
This section will discuss what I would expect from a runtime and what we can get for now.
I want the runtime not to cancel my task unconditionally, and I turn to the type system for help. That is, I wonder whether there could be a marker trait like `CancelSafe`. As for the term “cancellation safety”, tokio describes it in its documentation:
To determine whether your own methods are cancellation safe, look for the location of uses of `.await`. This is because when an asynchronous method is cancelled, that always happens at an `.await`. If your function behaves correctly even if it is restarted while waiting at an `.await`, then it is cancellation safe.
That is, whether a task is safe to be cancelled. This is definitely an “attribute” of an async task. From the link above, you can find that tokio keeps a long list of what is cancellation safe and what isn’t in the library. And, in some ways, I think it’s just like the `UnwindSafe` marker. Both are about “this sort of control flow is not always anticipated” and “has the possibility of causing subtle bugs”.
With such a `CancelSafe` trait, we could tell the runtime whether our spawned future is OK to be cancelled, and promise that the “cancelling” control flow is carefully handled. Without it, we mean that we don’t want the task to be cancelled. Simple and clear. This is also a way for runtimes to require their users (like you and me) to check whether their tasks can be cancelled. Take `timeout()` as an example:
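What such an API could look like, as a sketch only: `CancelSafe` is the hypothetical marker trait from the text and exists in neither std nor tokio, and this `timeout` counts polls instead of wall-clock time to stay self-contained. It is not tokio's `timeout`.

```rust
use std::future::Future;
use std::pin::{pin, Pin};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical marker trait: "dropping me at any await point is fine".
pub trait CancelSafe {}

/// Gives up on `inner` once the poll budget is used up.
pub struct Timeout<F> {
    inner: Pin<Box<F>>,
    budget: u32,
}

/// Only accepts futures whose author opted in to being dropped midway.
pub fn timeout<F: Future + CancelSafe>(inner: F, budget: u32) -> Timeout<F> {
    Timeout { inner: Box::pin(inner), budget }
}

impl<F: Future> Future for Timeout<F> {
    type Output = Option<F::Output>;
    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        if self.budget == 0 {
            // Timed out: `inner` is dropped together with this wrapper.
            return Poll::Ready(None);
        }
        self.budget -= 1;
        self.inner.as_mut().poll(cx).map(Some)
    }
}

/// Never completes and has no side effects, so cancelling it is harmless.
pub struct Never;

impl Future for Never {
    type Output = ();
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Pending
    }
}

impl CancelSafe for Never {}
impl<T> CancelSafe for std::future::Ready<T> {}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny busy-polling driver, only here to run the sketch.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}

fn main() {
    println!("{:?}", block_on(timeout(Never, 3)));
    println!("{:?}", block_on(timeout(std::future::ready(7), 3)));
    // An `async fn` future does not implement `CancelSafe`, so passing one
    // to `timeout` would be rejected at compile time.
}
```

The point of the bound is that a future without the marker, such as the metadata update, simply cannot be handed to an API that may drop it halfway.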
Another approach is to cancel voluntarily. Take the cooperative cancellation in Kotlin: it has an `isActive` property for a task to check whether it has been cancelled. This is only a check; whether to cancel or not depends fully on the task itself. I paste an example from Kotlin’s documentation below, where the “cooperative cancellation” happens in line 5. This way puts the “hidden control flow” on the table and makes it more natural to consider and handle cancellation, just like we handle an error.
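The same idea can be sketched in Rust. This is not the Kotlin listing, only a rough analogue under assumptions: a plain thread plays the task, an `AtomicBool` named `is_active` plays the cancellation flag, and `State`, `worker` and `demo` are hypothetical names.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical pair of values that must never drift apart.
#[derive(Debug, Default, PartialEq)]
pub struct State {
    pub persisted: u64,
    pub counter: u64,
}

/// The task decides where cancellation may happen: only at the loop head,
/// never between "persist" and "publish".
pub fn worker(is_active: &AtomicBool) -> State {
    let mut state = State::default();
    while is_active.load(Ordering::Acquire) {
        let next = state.counter + 1;
        state.persisted = next; // persist the new number
        state.counter = next; // then publish it
        thread::yield_now();
    }
    state
}

/// Starts the worker, asks it to stop shortly after, and returns its final state.
pub fn demo() -> State {
    let is_active = Arc::new(AtomicBool::new(true));
    let flag = Arc::clone(&is_active);
    let handle = thread::spawn(move || worker(&flag));

    thread::sleep(Duration::from_millis(10));
    is_active.store(false, Ordering::Release); // a request, not a kill

    handle.join().expect("worker panicked")
}

fn main() {
    let state = demo();
    println!("persisted = {}, counter = {}", state.persisted, state.counter);
}
```

Because the flag is only consulted between complete updates, the two values agree no matter when the cancellation request arrives.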
And this is not hard to achieve, in my opinion. Tokio already has the `Cancelled` bit and `CancellationToken`, but they look a bit different from what I describe. And after all of this, we still need the runtime to give the right of cancellation back to our task; otherwise the situation might not differ much.
Can we force the runtime not to cancel our tasks at present? In tokio we can “detach” a task to the background by dropping the `JoinHandle`. A detached task has no foreground handle to the spawned task, and in some aspects, others cannot wrap a `select` over it, which makes it un-cancellable. The problem from the very beginning is solved in this way.
A `JoinHandle` detaches the associated task when it is dropped, which means that there is no longer any handle to the task, and no way to `join` on it.
Though the functionality is there, I wonder whether it would be better to have an explicit `detach` method like `glommio`'s, or even a `detach` method in the runtime like `spawn`, which doesn’t return the `JoinHandle`. But these are trifles. A runtime usually won’t cancel a task for no reason; in most cases, cancellation is requested by the users. But sometimes you may not notice it, like those “unselected branches” in `select`, or the logic in `tonic`’s request handler. And if we are not sure that a task is ready for cancellation, an explicit detach may save it from tragedy.
So far everything is clear. Let’s start to wipe out this bug! First, why is our future cancelled? Through the function call graph, we can easily find that the entire processing procedure is executed in place in `tonic`’s request-listening runtime, and it’s common for a network request to have a timeout. The solution is also simple: detach the server’s processing logic into another runtime to prevent it from being cancelled along with the request. Only a few lines:
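The shape of that fix can be shown without tokio. In the sketch below a dedicated thread stands in for the second runtime and a channel stands in for awaiting the spawned task; `spawn_update`, `Recv`, `State` and `demo` are hypothetical names, and the `go` channel exists only to make the timing of the demo deterministic. It is an analogy under those assumptions, not the actual GreptimeDB change.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::mpsc::{self, Receiver};
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, JoinHandle};

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct State {
    pub persisted: u64,
    pub counter: u64,
}

/// Resolves once the background work has sent its result.
struct Recv(Receiver<u64>);

impl Future for Recv {
    type Output = u64;
    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<u64> {
        match self.0.try_recv() {
            Ok(value) => Poll::Ready(value),
            Err(_) => Poll::Pending, // toy: a real future would register the waker
        }
    }
}

/// Starts the update on its own thread, which stands in for a second runtime.
fn spawn_update(state: Arc<Mutex<State>>, go: Receiver<()>) -> (JoinHandle<()>, Receiver<u64>) {
    let (tx, rx) = mpsc::channel();
    let worker = thread::spawn(move || {
        let _ = go.recv(); // wait until the demo lets us proceed
        let mut s = state.lock().unwrap();
        let next = s.counter + 1;
        s.persisted = next;
        s.counter = next;
        let _ = tx.send(next); // the requester may already be gone
    });
    (worker, rx)
}

/// Cancels the request future midway and returns the state the worker left behind.
pub fn demo() -> State {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let state = Arc::new(Mutex::new(State::default()));
    let (go_tx, go_rx) = mpsc::channel();
    let (worker, rx) = spawn_update(Arc::clone(&state), go_rx);

    // The request handler only waits for the result; it does not own the work.
    let mut request = Box::pin(async move { Recv(rx).await });
    assert!(request.as_mut().poll(&mut cx).is_pending());
    drop(request); // the request times out and its future is dropped

    go_tx.send(()).unwrap();
    worker.join().unwrap();
    let snapshot = *state.lock().unwrap();
    snapshot
}

fn main() {
    println!("{:?}", demo());
}
```

The request future is dropped halfway, yet the update still runs from start to finish, because the work is owned by something the request cannot cancel.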
Now all is fine.