[rust-dev] threads, tasks, scheduling, migrating, failing

Graydon Hoare graydon at mozilla.com
Wed May 25 10:53:37 PDT 2011


Hi,

We had a brief discussion on IRC yesterday that ended in me storming off 
in a very unprofessional manner. I'd like to publicly apologize for that 
behavior; it was not cool and had little to do with the conversation at 
hand. My stress level was very high coming into work yesterday, and I 
was letting personal life spill into work life. My fault, sorry.

I'd also like to restart (or at least restate) parts of that discussion 
here so we can actually get this worked out to everyone's satisfaction 
(including Rafael, who clearly has strong feelings on the matter). This 
will be a long rambly email full of back-story to set the stage; if you 
have specific points to follow-up on, just snip those parts out for your 
replies.


Preface
~~~~~~~

We know (or at least, anyone who's poked at it knows) that the tasking 
and threading model in rustboot was pretty unsatisfactory. It passed 
some interesting tests ("millions of tasks, real cheap!") and had a 
number of needs it was trying to meet -- it was not designed in 
*complete* ignorance -- but it also imposed strange burdens that we'd 
like to dispense with this time around. Hopefully without losing the 
good stuff.

So, some background "goals" in order to make this story make sense. The 
three primary design pressures are:

   (1) Support *isolation*, in some useful sense, between tasks. That 
is, a task should be able to reason locally about its data and code 
without worrying about whether other tasks are mucking with that data 
and code, with the exception of message-IO points and unsafe blocks, which 
obviously involve the potential for non-isolated action.

   (2) Support *lots* of tasks. Millions. Such that a programmer has no 
fear about making "too many" tasks in a system, if it decomposes nicely. 
If for no other reason than support for isolation-based local reasoning 
(though concurrent, or interleaved, or truly parallel execution is also 
nice to exploit whenever it's surfaced this way).

   (3) Run at relatively high, but more importantly *predictable* 
performance. As few magical parts as possible in the concurrency model. 
Take the M:N performance-penalty if necessary to achieve the other 2 
goals, so long as there are no random performance discontinuities in the 
model.

Concurrency model is intimately connected to memory model, unwinding, 
gc, and several other things; so when I say we're going to be revisiting 
design decisions in rustboot's concurrency model, I implicitly include 
parts of the memory model too, and other parts.


The Past (rustboot)
~~~~~~~~~~~~~~~~~~~

The "lots of tasks" pressure breaks down into two sub-issues: making 
tasks small (in the sense of memory) and making them independently 
scheduled. We approached the "small" issue via growable stacks (doubling 
vectors with pointer-rewriting) and a very large dose of ugly magic for 
doing calls "between stacks" (from rust to C). This had lots of 
unfortunate fallout: debuggers and tools got upset, calling back into 
rust code from C was mostly impossible, and to support it safely we'd 
need to be flushing pointers to the stack and re-reading them 
*constantly*, much more than the "make sure values are pinned 
somewhere" discipline needed for GC. We approached the "scheduling" 
issue by 
even *more* magic return-address patching during suspended C calls, and 
a custom cooperative scheduler.

The "isolation" pressure was approached by stratifying the heap memory 
model into private and shared (between-tasks) memory, with the shared 
stuff always immutable and acyclic. Sharing was not always possible -- 
not between threads and not between processes -- but between 
tasks-in-a-thread it could work, and we figured that scenario was 
valuable to users. Cheap sending of shared bits between tasks in a 
thread. Then we'd do a deep copy when we hit domain boundaries.

But to support that scenario, tasks had to be pinned to threads. That 
is, the concurrency scheme in rustboot involved tasks running within 
domains (threads or processes; though the latter never materialized), 
where the user explicitly constructed domains and injected threads into 
the domain. Once spawned in a domain, a task could not leave it (be 
migrated to another thread), because it might have pointers into the 
"shared memory" of the domain. This pinning to domains has the 
unfortunate performance characteristic of *requiring* a user to pay 
attention to task:thread assignments; this could be a benefit in some 
cases -- explicit control can be good -- but in many cases it seemed 
bad, or at least over-complex. It's not just a matter of "having an M:N 
scheduler in userspace" (which will, says the literature, always 
underperform a 1:1 scheduler with the kernel involved) but also pinning 
each individual task in M to a thread in N, such that one task blocking 
(or even just monopolizing a core) could block or slow down a whole 
group of tasks on the same thread. This is a usability hazard.


The Future (rustc)
~~~~~~~~~~~~~~~~~~

Rustboot is dead (yay!) and we're in the process of working through the 
leftover cruft in the runtime and removing parts which were bad ideas 
and/or only around to support rustboot's various limitations and design 
choices. Rustc doesn't really *have* a tasking system yet -- there are 
pieces of the communication layer, but eholk is just getting spawn 
working this week -- so we're mostly doing "rewrite this subsystem" work 
now. There's been a fair amount of disjointed conversation over this 
matter, I'm hoping to consolidate what's on the agenda here.

The "lots of tasks" issue still has two sub-parts: size and scheduling.

We're going to approach "size" via "the other, more standard technique", 
which is to use a linked list of stack segments and never move them. We 
will most likely reuse the *exact* technique Go is using here, in the 
sense of trying to be ABI compatible (at least insofar as this is 
possible or makes sense) and possibly even use the same runtime support 
library. This approach is easier for LLVM to cope with (there's a GSoC 
student implementing it currently), and more tools understand it. It 
also makes stack segments recyclable between tasks, which should reduce 
overall memory pressure (equivalent to "shrinking" in our other model). 
We're also going to use the same "unified" approach to growth and 
cross-language calling as Go uses -- just grow into a "sufficiently big" 
segment that may be recycled between tasks in between C calls -- and 
that may well permit C to call back into rust (assuming it can provide a 
task* and can be made to play nice with unwinding and GC; see below).
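
To make that concrete, here's a rough sketch of the kind of segment 
record involved; the names and layout are illustrative only, not the 
actual Go or rustc runtime structures:

    /* Hypothetical segmented-stack record; one task's stack is a chain of
     * these, grown (and recycled) a segment at a time, never moved. */
    #include <stddef.h>
    #include <stdint.h>

    struct stk_seg {
        struct stk_seg *prev;   /* the segment we grew out of (NULL at the base) */
        struct stk_seg *next;   /* a cached follow-on segment, if any */
        size_t size;            /* usable bytes in data[] */
        uint8_t data[];         /* the stack memory itself */
    };

    /* The compiler emits a prologue check in each function: if the stack
     * pointer would pass the current segment's limit, call into the runtime
     * to chain on a new (or recycled) segment before proceeding. */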

We're also going to approach "scheduling" via "the other, more standard 
technique", which is to use the posix (and before that, system V) 
<ucontext.h> schedulable user contexts and (sadly) again our own 
scheduler. Where ucontext isn't OS-provided, we'll provide our own 
implementation; it's not actually much more than a "save registers to 
structure A and load them from structure B" routine anyways, just with a 
standard API. And on some OSs -- specifically those where we discover 
threads are sufficiently cheap, if running on small stacks -- we're 
going to lean much more heavily on the OS thread scheduler. See below.
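
For anyone who hasn't poked at ucontext before, the primitives involved 
are about this simple (this is just an illustration of the API, not our 
scheduler; the task body and the fixed 64K stack are placeholders):

    #include <stdio.h>
    #include <stdlib.h>
    #include <ucontext.h>

    static ucontext_t sched_ctx, task_ctx;

    static void task_fn(void) {
        printf("task: running\n");
        /* a cooperative yield is just a swap back to the scheduler */
        swapcontext(&task_ctx, &sched_ctx);
        printf("task: resumed, finishing\n");
        /* returning falls through to uc_link, i.e. back to the scheduler */
    }

    int main(void) {
        char *stack = malloc(64 * 1024);
        getcontext(&task_ctx);
        task_ctx.uc_stack.ss_sp = stack;
        task_ctx.uc_stack.ss_size = 64 * 1024;
        task_ctx.uc_link = &sched_ctx;       /* where control goes when task_fn returns */
        makecontext(&task_ctx, task_fn, 0);

        swapcontext(&sched_ctx, &task_ctx);  /* run the task until it yields */
        printf("sched: task yielded\n");
        swapcontext(&sched_ctx, &task_ctx);  /* run it to completion */
        printf("sched: task done\n");
        free(stack);
        return 0;
    }

As noted, where the OS doesn't ship <ucontext.h>, the save/load-registers 
routine behind these calls is small enough to supply ourselves.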

We're going to approach the "isolation" pressure differently. Rather 
than permit tasks to share pointers at all, we'll be shifting to a 
stratification of memory based on unique pointers. This will mean that 
the only possible kinds of send are "move" and "deep copy". Move will 
happen everywhere in-OS-process, deep-copy between processes. Move 
semantics -- making a copy while indivisibly de-initializing the source 
of the copy -- are going to be the focus of a fair bit of work over the 
next while, and we're betting on them for the messaging/isolation system.
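
In runtime terms a "move" send is about as simple as it sounds; a rough 
sketch, with made-up names rather than the real channel API:

    /* Send-as-move: the channel takes ownership of a unique pointer and the
     * sender's copy is de-initialized in the same step, so no two tasks ever
     * hold the same pointer.  "chan_t" and "chan_push" are hypothetical. */
    #include <stddef.h>

    typedef struct chan chan_t;
    void chan_push(chan_t *c, void *msg);   /* assumed: hands msg to the receiver */

    void send_move(chan_t *c, void **slot) {
        void *m = *slot;    /* take the unique pointer out of the sender's slot ... */
        *slot = NULL;       /* ... and de-initialize the source indivisibly */
        chan_push(c, m);    /* the receiving task now owns m: no copy, no refcount */
    }

Crossing an OS-process boundary is the one case where this degenerates 
to a deep copy instead.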

Along with minimizing refcounting (and a host of other thorny semantic 
issues associated with accidental copying, such as environment capture 
and double-execution of destructors) this will permit tasks to migrate 
between threads. Or, put more simply, it will permit us to treat threads 
as an undifferentiated pool of N workers, and tasks as a pool of M work 
units; when a thread blocks in C code it will have no effect on the 
other tasks (other than temporarily using up a "large segment" of stack) 
and M>N runnable tasks should always "saturate" the threads (thus cores) 
with work. Moreover, when we're on an OS that has "really cheap threads" 
we can spin up N == M threads and fall into the case of 1:1 scheduling: 
back off and let the OS kernel do all the scheduling, have our scheduler 
always say "oh, just keep running the same task you're running" every 
time it checks at a yield point. On Linux, for example, I believe that 
this may very well perform better than getting in the way with our own 
scheduler and ucontext logic. I'm not sure about other OSs, but I'd like 
to retain this "dial" to be able to dial scheduling back into our 
runtime if we're on an OS with horrendously bad or otherwise expensive 
kernel threads.
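
The worker-pool shape I have in mind looks roughly like this (all names 
are invented for illustration; the real runtime will differ):

    /* N OS threads pulling from one queue of M runnable tasks; any worker
     * can run any task, since nothing pins a task to a thread anymore. */
    typedef struct task task_t;

    task_t *runq_pop(void);             /* assumed: blocks until a task is runnable */
    void    runq_push(task_t *t);
    void    task_resume(task_t *t);     /* swapcontext into the task's stack segment */
    int     task_finished(task_t *t);

    void *worker_thread(void *arg) {
        (void)arg;
        for (;;) {
            task_t *t = runq_pop();
            task_resume(t);             /* runs until the task yields or blocks in C */
            if (!task_finished(t))
                runq_push(t);           /* perhaps a different worker picks it up next */
        }
        return NULL;
    }

With N == M and yield points that always answer "keep running", this 
collapses into the 1:1 case where the kernel does all the real scheduling.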

In the process of this change, we'll eliminate the concept of a domain 
from the language and runtime, and just model OS processes as OS 
processes. We'll still need some runtime support for interacting with 
subprocesses, of course, just avoid trying to mix metaphors.

Moving to a pool-of-threads model should also permit leaning on "the 
other, more standard technique" for unwinding: the C++ unwinder (or a 
large part of it). Since a task blocked-in-C doesn't necessarily block 
any other tasks (they can run on other threads) we don't need to 
deschedule the unwinder, which was a large part of my concern that it 
might be unusable in this role. We can just let unwinding run to 
completion (scheduler yield-points can always opt to not-yield when 
unwinding). A secondary concern has to do with double-faulting (failing 
within a destructor) but we'll cross that bridge when we come to it; I 
doubt the policy to call terminate() in C++ is so hard-wired into the 
unwinder that there are no possible ways of overriding it. Opinions on 
this welcome.

(Incidentally, the more we drift away from "our own per-frame metadata 
tables" towards relying on stock components, the more I think using a 
conservative stack scanner for GC root-finding may be perfectly fine, 
obviating the need for per-frame GC info and explicit GC-safe points. 
But that's not terribly related to choice of concurrency strategy; just 
an interesting note for anyone following along with the Saga Of Rust Frame Info.)

Our argument yesterday hit a breaking point when we were discussing the 
relationship between C++ unwind semantics and pthread cancellation. I 
think this was actually a red herring: 'fail' could only really map to 
'pthread_cancel' in the case that we were hard-wired to a 1:1 scheduling 
model, always using a kernel thread for a task, which, as I've said, I 
would like to retain only as an option (when on an OS with cheap / 
fast kernel threads) rather than a semantic requirement. And even if we 
*did* swallow that requirement, I was arguing yesterday (and I believe 
further googling today supports) the contention that pthread_cancel is 
just not a good fit for 'fail' (or kill). It will:

   (a) Only cancel at cancellation points, not "any instruction"; so it's
       probing the presence of a flag in the pthread library anyways. Not
       terribly different, cost-wise, from using our own flag.

   (b) Not run the C++ unwinder in a reliable, portable fashion;
       on some platforms this works but on some it does not, and boost
       has opted to not-present the interface for precisely this reason.

Overall I don't think pthread_cancel was a productive avenue for the 
discussion and I'm sorry we wound up in it. I think it's more useful to 
talk about ways to make cooperative scheduling (== kill) points cheap 
enough to live with -- including options for unsafe loops that omit them 
-- even in the degenerate case where they always say "keep running the 
task you're running" (1:1 mapping to threads). At least until killed.

Phew! Ok. Done. Comments? Preferences? Field datapoints to contribute?

-Graydon

