[rust-dev] Performance of task switching using the new Rust scheduler
Brian Anderson
banderson at mozilla.com
Sat May 18 15:27:31 PDT 2013
The new scheduler is still far from complete, but there is enough in
place that we can start to get a sense for its performance
characteristics. I've done a simulation of making message-passing calls
into a layout task and have some preliminary numbers.
The new scheduler is a work-stealing scheduler. It generally executes
tasks greedily: when a task sends a message and sees that the receiving
task is already blocked waiting to receive, the scheduler context
switches directly to the receiver to process the message, deferring the
sending task to a work-stealing deque (where it may be stolen by other
threads). The
scheduler does not actually do multithreading yet, but it does have the
parallel work queue in place. What I've tested is the single-threaded
scenario where the script task can context switch directly to layout.
This won't be a perfect test because in the full scheduler you might
have other threads that can steal the dormant script task and e.g. run
the GC, but I think we're in the ballpark of the overall cost model.
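As a purely conceptual sketch of that decision (the types and the method below are made up for illustration; this is not the real scheduler API):

```rust
use std::collections::VecDeque;

struct Task {
    name: &'static str,
}

struct Scheduler {
    // Stand-in for the work-stealing deque; in the real scheduler, other
    // threads could steal from the far end of this queue.
    work_queue: VecDeque<Task>,
}

impl Scheduler {
    // Called when `sender` sends a message and finds `receiver` already
    // blocked waiting for it. Returns the task to run next.
    fn send_to_blocked_receiver(&mut self, sender: Task, receiver: Task) -> Task {
        // Defer the currently running task to the deque, where an idle
        // thread may later steal it.
        self.work_queue.push_back(sender);
        // Context switch directly to the receiver so it can process the
        // message immediately.
        receiver
    }
}

fn main() {
    let mut sched = Scheduler { work_queue: VecDeque::new() };
    let script = Task { name: "script" };
    let layout = Task { name: "layout" };
    let running = sched.send_to_blocked_receiver(script, layout);
    println!("running: {}, queued: {}", running.name, sched.work_queue[0].name);
}
```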
This test makes 50000 calls to a no-op layout function, either through a
task or as a direct function call, then calculates the average time. The
cross-task call amounts to two send/recv pairs. The source is here:
https://github.com/brson/rust-sched-bench/blob/master/coroutine-call.rs.
Requires my 'io' branch of Rust.
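For a rough idea of the shape of the test, here is an illustrative re-creation using standard library channels and an OS thread in place of the scheduler tasks and pipes the linked source uses; the structure is the same (each cross-task "layout call" is a request send/recv plus a response send/recv), but the numbers it produces are not comparable to those below.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Instant;

const ITERS: u32 = 50_000;

fn main() {
    let (req_tx, req_rx) = mpsc::channel::<u32>();
    let (resp_tx, resp_rx) = mpsc::channel::<u32>();

    // The "layout task": receive a request, do no work, send a response.
    let layout = thread::spawn(move || {
        for msg in req_rx {
            resp_tx.send(msg).unwrap();
        }
    });

    let start = Instant::now();
    for i in 0..ITERS {
        req_tx.send(i).unwrap(); // request send/recv pair
        resp_rx.recv().unwrap(); // response send/recv pair
    }
    let elapsed = start.elapsed();
    println!("cross-task call ns per call: {}",
             elapsed.as_nanos() / u128::from(ITERS));

    drop(req_tx); // close the channel so the layout thread exits
    layout.join().unwrap();
}
```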
# The numbers
function call ns per call: 2
new scheduler cross-task call ns per call: 2217
old scheduler cross-task call ns per call: 14737
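Put differently, the new scheduler's cross-task call is roughly 6.6x faster than the old scheduler's (14737 / 2217 ≈ 6.6), but still about three orders of magnitude slower than a direct function call (2217 / 2 ≈ 1100).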
The [profile] is encouraging in that most of the hottest functions are
not the synchronization and context-switching functions that I expect to
be our limiting factors, but this performance is probably still on the
order of what we might ultimately expect.
[profile]: https://gist.github.com/brson/5605807
Various observations:
* The accounting we do on the global heap, just atomic inc/dec pairs,
has a huge cost. This is the only thing the Rust global_heap::malloc/free
functions do before deferring to libc via fast_ffi, and these two
functions are at the top of the profile. We're going to have to turn this
off eventually (the first sketch after this list shows the shape of this
accounting).
* A large amount of time is spent in malloc/free. There are two main
offenders here: one is the uv idle callback used in the scheduler, which
is trivially optimizable, and the other is the pipes allocation (they
allocate on send). I suspect that the protocol between script and layout
is bounded, so we can use a non-allocating message abstraction here (this
could be an argument for keeping the pipes compiler, but we'll have to
think about it more).
* The time spent hitting thread-local storage will be minimized once we
add split stacks and can avoid the pthread_getspecific FFI call by using
the TCB directly. There are some other hotspots relating to FFI calls
around TLS, like `rust_get_tls_key`, which doesn't need to exist.
* The `run_cleanup_job` function is one I've been worried about and it
shows up high in the profile. This function executes a single command
after every context switch on behalf of the previous context and
contains a virtual call.
* The `shim` functions will go away with pending FFI changes.
* `try_recv` is not fully optimized and currently always costs 2 context
switches and an atomic swap with full fence. This can be optimized in a
way that would remove two context switches from one of the receives here.
* `send` and `try_recv::anon` are both dominated by atomic swaps.
* The work queue here is implemented with a lock, which in this test is
never contested. The final work queue will be a Chase/Lev lock-free
deque, so the costs will be different (the second sketch after this list
shows the locked version).
* The scheduler doesn't yet have any facility for waking sleeping
schedulers. This will have some performance impact on the receive path.
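To make the first observation concrete: the heap accounting amounts to an atomic increment on every allocation and a matching decrement on every free, wrapped around the underlying malloc/free. The sketch below only illustrates that pattern with made-up names; it is not the actual global_heap code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Stand-in for the runtime's live-allocation counter.
static LIVE: AtomicUsize = AtomicUsize::new(0);

fn accounted_alloc(size: usize) -> Vec<u8> {
    // One atomic RMW on every allocation...
    LIVE.fetch_add(1, Ordering::SeqCst);
    vec![0u8; size]
}

fn accounted_free(buf: Vec<u8>) {
    // ...and another on every free. Even uncontended, these fenced
    // pairs add up on a hot allocation path.
    drop(buf);
    LIVE.fetch_sub(1, Ordering::SeqCst);
}

fn main() {
    let buf = accounted_alloc(64);
    accounted_free(buf);
    assert_eq!(LIVE.load(Ordering::SeqCst), 0);
}
```

Likewise, the locked work queue used in this test is conceptually just a deque behind a mutex, along the lines of the sketch below (again illustrative, not the scheduler's code). A Chase/Lev deque lets the owning scheduler push and pop its own end without a lock while other threads steal from the opposite end.

```rust
use std::collections::VecDeque;
use std::sync::Mutex;

struct Task;

// A deque protected by a mutex: every push, pop, and steal takes the lock,
// even when (as in this single-threaded test) it is never contested.
struct WorkQueue {
    inner: Mutex<VecDeque<Task>>,
}

impl WorkQueue {
    fn new() -> WorkQueue {
        WorkQueue { inner: Mutex::new(VecDeque::new()) }
    }
    // The owning scheduler pushes and pops at the back...
    fn push(&self, task: Task) {
        self.inner.lock().unwrap().push_back(task);
    }
    fn pop(&self) -> Option<Task> {
        self.inner.lock().unwrap().pop_back()
    }
    // ...while stealing threads take from the front.
    fn steal(&self) -> Option<Task> {
        self.inner.lock().unwrap().pop_front()
    }
}

fn main() {
    let q = WorkQueue::new();
    q.push(Task);
    assert!(q.pop().is_some());
    assert!(q.steal().is_none());
}
```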
Finally, for the curious I've listed here the exact sequence of events
that happens in a single call to layout. I've left out the calls to
thread-local storage because I don't think they are particularly
illuminating.
# The request
* script task allocates a new 'oneshot' pipe that layout will use to
send the response.
* script task allocates a new buffer as part of the 'stream' send.
* script task places the message payload into the pipe buffer.
* script task does an atomic swap with full fence to take the ~Task
pointer of the blocked layout task from the 'stream'.
* script task context switches to the layout task.
* layout task (on behalf of the scheduler) pushes script task onto the
work queue, here taking and dropping an uncontested mutex.
* layout task (on behalf of the scheduler) schedules an event loop idle
callback to check up on the work queue later.
* layout task takes the message payload.
* layout task performs layout.
# The response
* layout task places the response message payload into the 'oneshot'
buffer sent by script.
* layout does an atomic swap with full fence to check the status of the
script task and sees that it is not waiting for the response message
(it's in the work queue).
* layout goes to wait for the next message from script...
* layout context switches to the scheduler to recv the next request.
* layout (in scheduler context) does an atomic swap with full fence to
place its ~Task pointer into the pipe buffer and discover that no
message is available. Layout is now blocked.
* the scheduler drops back into the uv event loop.
* the scheduler wakes up later in response to the idle callback.
* the scheduler takes the work queue lock.
* the scheduler takes the script task out of the work queue.
* the scheduler drops the work queue lock.
* the scheduler context switches to the script task.
* the script task (as part of the unoptimized try_recv) context switches
*back* to scheduler context to recv.
* the script task (in scheduler context) does an atomic swap with full
fence and sees that the response payload is available.
* the script task context switches back to its own context and takes the
response payload.
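To make the swap-based handoff in the walkthrough concrete, here is a minimal single-slot sketch of the idea. It is not the pipes implementation: the state constants, types, and methods are invented for illustration, and the payload and ~Task-pointer storage are omitted. It only shows how a single atomic swap with a full fence per side tells each party whether the other has already arrived.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Packet state, always accessed with an atomic swap (full fence):
const EMPTY: usize = 0;    // nothing has happened yet
const HAS_DATA: usize = 1; // the sender arrived first and left a payload
const BLOCKED: usize = 2;  // the receiver arrived first and parked itself

struct Packet {
    state: AtomicUsize,
    // The real packet also holds the message payload and the blocked
    // task's ~Task pointer; omitted to keep the sketch short.
}

enum SendResult {
    WakeReceiver, // receiver was already blocked: context switch to it
    Deferred,     // no one was waiting: leave the payload and keep running
}

impl Packet {
    fn new() -> Packet {
        Packet { state: AtomicUsize::new(EMPTY) }
    }

    // Sender side: one swap tells us whether the receiver is parked.
    fn send(&self) -> SendResult {
        match self.state.swap(HAS_DATA, Ordering::SeqCst) {
            BLOCKED => SendResult::WakeReceiver,
            _ => SendResult::Deferred,
        }
    }

    // Receiver side: one swap either finds the payload or parks us.
    fn try_recv(&self) -> bool {
        match self.state.swap(BLOCKED, Ordering::SeqCst) {
            HAS_DATA => true, // payload already there: take it and continue
            _ => false,       // not ready: we are now blocked in the packet
        }
    }
}

fn main() {
    // Receiver arrives first and blocks; the sender then wakes it directly.
    let p = Packet::new();
    assert!(!p.try_recv());
    assert!(matches!(p.send(), SendResult::WakeReceiver));
}
```

The unoptimized try_recv described earlier performs its swap from scheduler context, which is where the extra pair of context switches on the receive path comes from.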