[rust-dev] Performance of task switching using the new Rust scheduler

Brian Anderson banderson at mozilla.com
Sat May 18 15:27:31 PDT 2013

The new scheduler is still far from complete, but there is enough in 
place that we can start to get a sense for its performance 
characteristics. I've done a simulation of making message-passing calls 
into a layout task and have some preliminary numbers.

The new scheduler is a work-stealing scheduler. It generally executes 
tasks greedily, so when it sends a message and sees that the other task 
is already waiting to receive, it will context switch directly to the 
other task and process the message, deferring the running task to a 
work-stealing deque (where it may be stolen by other threads). The 
scheduler does not actually do multithreading yet, but it does have the 
parallel work queue in place. What I've tested is the single-threaded 
scenario where the script task can context switch directly to layout. 
This won't be a perfect test because in the full scheduler you might 
have other threads that can steal the dormant script task and e.g. run 
the GC, but I think we're in the ballpark of the overall cost model.

This test makes 50000 calls to a no-op layout function, either through a 
task or as a direct function call, then calculates the average time. The 
cross-task call amounts to two send/recv pairs. The source is here: 
Requires my 'io' branch of Rust.

# The numbers

function call ns per call:                                    2
new scheduler cross-task call ns per call:   2217
old scheduler cross-task call ns per call:   14737

The [profile] is encouraging in that most of the hottest functions are 
not the synchronization and context switching functions that I expect to 
be our limiting factors, but this performance is still probably on the 
order we might ultimately expect.

[profile]: https://gist.github.com/brson/5605807

Various observations:

* The accounting we do on the global heap, just atomic atomic inc/dec 
pairs, has a huge cost. This is the only thing that the Rust 
global_heap::mallac/free functions do before deferring to libc via 
fast_ffi and these are the two functions at the profile. We're going to 
have to turn this off eventually.

* A large amount of time is spent in malloc/free. There are two main 
offenders here: one is the uv idle callback used in the scheduler, which 
is trivially optimizable, and the other is the pipes allocation (they 
allocate on send). I suspect that the protocol between script and layout 
is bounded so we can use non-allocating message abstraction here (this 
could be an argument for keeping the pipes compiler but we'll have to 
think about it more).

* The time hitting thread-local-storage will be minimized once we add 
split stacks and can avoid the pthread_getspecific FFI call by using the 
TCB directly. There are some other hotspots relating to FFI calls around 
TLS, like `rust_get_tls_key` which doesn't need to exist.

* The `run_cleanup_job` function is one I've been worried about and it 
shows up high in the profile. This function executes a single command 
after every context switch on behalf of the previous context and 
contains a virtual call.

* The `shim` functions will go away with pending FFI changes.

* `try_recv` is not fully optimized and currently always costs 2 context 
switches and an atomic swap with full fence. This can be optimized in a 
way that would remove two context switches from one of the receives here.

* `send` and `try_recv::anon` are both dominated by atomic swaps.

* The work queue here is implemented with a lock, which in this test is 
never contested. The final work queue will be a Chase/Lev lock free 
deque. The costs will be different.

* The scheduler doesn't yet have any facility for waking sleeping 
schedulers. This will have some performance impact on the receive path.

Finally, for the curious I've listed here the exact sequence of events 
that happens in a single call to layout. I've left out the calls to 
thread-local storage because I don't think they are particularly 

# The request

* script task allocates a new 'oneshot' pipe that layout will use to 
send the response.
* script task allocates a new buffer as part of the 'stream' send.
* script task places the message payload into the pipe buffer.
* script task does an atomic swap with full fence to take the ~Task 
pointer of the blocked layout task from the 'stream'.
* script task context switches to the layout task.
* layout task (on behalf of the scheduler) pushes script task onto the 
work queue, here taking and dropping an uncontested mutex.
* layout task (on behalf of the scheduler) schedules an event loop idle 
callback to check up on the work queue later.
* layout task takes the message payload.
* layout task performs layout.

# The response

* layout task places the response message payload into the 'oneshot' 
buffer send by script.
* layout does an atomic swap with full fence to check the status of the 
script task and sees that it is not waiting for the response message 
(it's in the work queue).
* layout goes to wait for the next message from script...
* layout context switches to the scheduler to recv the next request.
* layout (in scheduler context) does an atomic swap with full fence to 
place it's ~Task pointer into the pipe buffer and discover that no 
message is available. Layout is now blocked.
* the scheduler drops back into the uv event loop.
* the scheduler wakes up later in response to the idle callback.
* the scheduler takes the work queue lock.
* the scheduler takes the script task out of the work queue.
* the scheduler drops the work queue lock.
* the scheduler context switches to the script task.
* the script task (as part of the unoptimized try_recv) context switches 
*back* to scheduler context to recv.
* the script task (in scheduler context) does an atomic swap with full 
fence and sees that the response payload is available
* the script task context switches back to its own context and takes the 
response payload.

More information about the Rust-dev mailing list