[rust-dev] Appeal for CORRECT, capable, future-proof math, pre-1.0

Owen Shepherd owen.shepherd at e43.eu
Sat Jan 11 13:31:00 PST 2014

So I just did a test. Took the following rust code:
pub fn test_wrap(x : u32, y : u32) -> u32 {
    return x.checked_mul(&y).unwrap().checked_add(&16).unwrap();

And got the following blob of assembly out. What we have there, my friends,
is a complete failure of the optimizer (N.B. it works for the simple case
of checked_add alone)


    cmpl    %fs:20, %esp
    ja    LBB0_2
    pushl    $12
    pushl    $20
    calll    ___morestack
    pushl    %ebp
    .cfi_def_cfa_offset 8
    .cfi_offset %ebp, -8
    movl    %esp, %ebp
    .cfi_def_cfa_register %ebp

Align stack (for what? We don't do any SSE)

    andl    $-8, %esp
    subl    $16, %esp

Multiply x * y

    movl    12(%ebp), %eax
    mull    16(%ebp)
    jno    LBB0_4

If it didn't overflow, stash a 0 at top of stack

    movb    $0, (%esp)
    jmp    LBB0_5

If it did overflow, stash a 1 at top of stack (we are building an
Option<u32> here)
    movb    $1, (%esp)
    movl    %eax, 4(%esp)

Take pointer to &this for __thiscall:
    leal    (%esp), %ecx
    calll    __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E

Do the addition to the result

    addl    $16, %eax

Repeat the previous circus

    jae    LBB0_7
    movb    $0, 8(%esp)
    jmp    LBB0_8
    movb    $1, 8(%esp)
    movl    %eax, 12(%esp)
    leal    8(%esp), %ecx
    calll    __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
    movl    %ebp, %esp
    popl    %ebp

Yeah. Its' not fast because its' not inlining through option::unwrap.

I'm not sure what can be done for this, and whether its' on the LLVM side
or the Rust side of things. My first instinct: find out what happens when
fail! is moved out-of-line from unwrap() into its' own function (especially
if that function can be marked noinline!), because optimizers often choke
around EH.

I tried to test the "optimal" situation in a synthetic benchmark:
(In C for expediency. N.B. you must set core affinity before running this
benchmark because I hackishly just read the TSC. i386 only.)

but the results are really bizzare and seem to have a multitude of
affecting factors (For example, if you minimally unroll and have the JCs
jump straight to abort, you get vastly different performance from jumping
to a closer location and then onwards to abort. Bear in mind that the
overflow case never happens during the test). It would be interesting to do
a test in which a "trivial" implementation of trap-on-overflow is added to
rustc (read: the overflow case just jumps straight to abort or similar, to
minimize optimizer influence and variability) to see how defaulting to
trapping ints affects real world workloads.

I wonder what level of performance impact would be considered "acceptable"
for improved safety by default?

Mind you, I think that what I'd propose is that i32 = Trapping, i32w =
wrapping, i32s = saturating, or something similar

Owen Shepherd
http://owenshepherd.net | owen.shepherd at e43.eu

On 11 January 2014 19:33, Daniel Micay <danielmicay at gmail.com> wrote:

> On Sat, Jan 11, 2014 at 11:54 AM, Owen Shepherd <owen.shepherd at e43.eu>
> wrote:
> > On 11 January 2014 06:20, Daniel Micay <danielmicay at gmail.com> wrote:
> >>
> >> The branch on the overflow flag results in a very significant loss in
> >> performance. For example, I had to carefully write the vector `push`
> >> method for my `Vec<T>` type to only perform one overflow check. With
> >> two checks, it's over 5 times slower due to failed branch predictions.
> >
> >
> > What did the generated code look like? I suspect that LLVM wasn't
> generating
> > optimal code, perhaps because Rust wasn't giving it appropriate hints or
> > because of optimizer bugs. For reference, on AMD64 the code should look
> > something like the following hypothetical code:
> >
> > vec_allocate:
> > MOV $SIZE, %eax
> > MUL %rsi
> > JC Lerror
> > ADD $HEADER_SIZE, %rax
> > JC Lerror
> > MOV %rax, %rsi
> > JMP malloc
> > Lerror:
> > // Code to raise error here
> >
> > Note that the ordering is EXTREMELY important! x86 doesn't give you any
> > separate branch hints (excluding two obsolete ones which only the
> Pentium IV
> > ever cared about) so your only clue to the optimizer is the branch
> > direction.
> >
> > I suspect your generated code had forward branches for the no overflow
> case.
> > Thats absolutely no good (codegen inerting "islands" of failure case
> code);
> > it will screw up the branch predictor.
> >
> > x86 defaults to predicting all (conditional) forward jumps not taken, all
> > conditional backwards jumps taken (Loops!). If the optimizer wasn't
> informed
> > correctly, it will probably not have obeyed that.
> >
> > Being as the overflow case should basically be never hit, there is no
> reason
> > for it to ever be loaded into the optimizer, so that is good
> >
> > (P.S. If the rust compiler is really good it'll convince LLVM to put the
> > error case branch code in a separate section so it can all be packed
> > together far away from useful cache lines and TLB entries)
> Rust directly exposes the checked overflow intrinsics so these are
> what was used. It already considers branches calling a `noreturn`
> function to be colder, so adding an explicit branch hint (which is
> easy enough via `llvm.expect` doesn't help). Feel free to implement it
> yourself if you think you can do better. The compiler work is already
> implemented.  I doubt you'll get something performing in the same
> ballpark as plain integers.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/rust-dev/attachments/20140111/24d52dbd/attachment.html>

More information about the Rust-dev mailing list