[rust-dev] The future of M:N threading

Daniel Micay danielmicay at gmail.com
Wed Nov 13 02:45:19 PST 2013


Before getting right into the gritty details about why I think we should think
about a path away from M:N scheduling, I'll go over the details of the
concurrency model we currently use.

Rust uses a user-mode scheduler to cooperatively schedule many tasks onto OS
threads. Due to the lack of preemption, tasks need to manually yield control
back to the scheduler. Performing I/O with the standard library blocks the
*task*, but yields control of the underlying OS thread back to the scheduler
until the I/O is completed.

The scheduler manages a thread pool where the unit of work is a task, rather
than a queue of closures to be executed or data to be passed to a function. A
task consists of a stack, register context and task-local storage much like an
OS thread.

In the world of high-performance computing, this is a proven model for
maximizing throughput for CPU-bound tasks. By abandoning preemption, there's
zero overhead from context switches. For socket servers with only negligible
server-side computations, the avoidance of context switching is a boon for
scalability and predictable performance.

# Lightweight?

Rust's tasks are often called *lightweight*, but at least on Linux the only
optimization is the lack of preemption. Since segmented stacks have been
dropped, the resident/virtual memory usage will be identical to that of OS
threads.

# Spawning performance

An OS thread can actually spawn nearly as fast as a Rust task on a system with
one CPU. On a multi-core system, there's a high chance of the new thread being
spawned on a different CPU resulting in a performance loss.

Sample C program, if you need to see it to believe it:

```
#include <pthread.h>
#include <stddef.h>

static const size_t n_thread = 100000;

static void *foo(void *arg) {
    return arg;
}

int main(void) {
    for (size_t i = 0; i < n_thread; i++) {
        pthread_attr_t attr;
        if (pthread_attr_init(&attr) != 0) {
            return 1;
        }
        if (pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED) != 0) {
            return 1;
        }
        pthread_t thread;
        if (pthread_create(&thread, &attr, foo, NULL) != 0) {
            return 1;
        }
    }
    pthread_exit(NULL);
}
```

Sample Rust program:

```
fn main() {
    for _ in range(0, 100000) {
        do spawn {
        }
    }
}
```

For both programs, I get around 0.9s consistently when pinned to a core. The
Rust version slows to 1.1s when not pinned, and the OS thread one to about 2s.
The Rust version slows further when asked to allocate 8MiB stacks as the C
program does, and further still once it has to make the same `mmap` and
`mprotect` calls as the pthread API.

# Asynchronous I/O

Rust's requirements for asynchronous I/O would be filled well by direct usage
of IOCP on Windows. However, Linux only has solid support for non-blocking
*sockets*, since file operations usually just retrieve a result from the page
cache and do not truly have to block. This results in libuv being significantly
slower than blocking I/O for the most common cases, a price paid for the sake
of scalable socket servers.

On modern systems with flash memory, including mobile, there is a *consistent*
and relatively small worst-case latency for accessing data on the disk so
blocking is essentially a non-issue. Memory mapped I/O is also an incredibly
important feature for I/O performance, and there's almost no reason to use
traditional I/O on 64-bit. However, it's a no-go with M:N scheduling because
page faults block the whole scheduler thread rather than just the task.

# Overview

Advantages:

* lack of preemptive/fair scheduling, leading to higher throughput
* very fast context switches to other tasks on the same scheduler thread

Disadvantages:

* lack of preemptive/fair scheduling (lower-level model)
* poor profiler/debugger support
* async I/O stack is much slower for the common case; for example, `stat` is
  35x slower when run in a loop for an mlocate-like utility
* true blocking code will still block a scheduler thread
* most existing libraries use blocking I/O and OS threads
* cannot directly use the fast, easy-to-use, linker-supported thread-local data
  (ELF `__thread` variables)
* many existing libraries rely on thread-local storage, so there's a need to be
  wary of hidden yields in Rust function calls and it's very difficult to
  expose a safe interface to these libraries
* every CPU architecture revision that adds registers needs explicit support
  from Rust's context-switching code, and the right variant must be selected at
  runtime when not targeting a specific CPU (this is currently not done
  correctly)

# User-mode scheduling

Windows 7 introduced user-mode scheduling[1] to replace fibers on 64-bit.
Google implemented the same thing for Linux (perhaps even before Windows 7 was
released), and plans on pushing for it upstream.[2] The linked video does a
better job of covering this than I can.

User-mode scheduling provides a 1:1 threading model including full support for
normal thread-local data and existing debuggers/profilers. It can yield to the
scheduler on system calls and page faults. The operating system is responsible
for details like context switching, so a large maintenance/portability burden
is lifted from the language runtime. It narrows the above disadvantage list
down to just the point
about not having preemptive/fair scheduling and doesn't introduce any new ones.

I hope this is where concurrency is headed, and I hope Rust doesn't miss this
boat by concentrating too much on libuv. I think it would allow us to simply
drop support for pseudo-blocking I/O in the Go style and ignore asynchronous
I/O and non-blocking sockets in the standard library. It may be useful to have
the scheduler use them, but it wouldn't be essential.

[1] http://msdn.microsoft.com/en-us/library/windows/desktop/dd627187(v=vs.85).aspx
[2] http://www.youtube.com/watch?v=KXuZi9aeGTw