[rust-dev] Appeal for CORRECT, capable, future-proof math, pre-1.0

Daniel Micay danielmicay at gmail.com
Sat Jan 11 13:42:40 PST 2014


On Sat, Jan 11, 2014 at 4:31 PM, Owen Shepherd <owen.shepherd at e43.eu> wrote:
> So I just did a test. Took the following rust code:
> pub fn test_wrap(x : u32, y : u32) -> u32 {
>     return x.checked_mul(&y).unwrap().checked_add(&16).unwrap();
> }
>
> And got the following blob of assembly out. What we have there, my friends,
> is a complete failure of the optimizer (N.B. it works for the simple case of
> checked_add alone)
>
> Preamble:
>
> __ZN9test_wrap19hc4c136f599917215af4v0.0E:
>     .cfi_startproc
>     cmpl    %fs:20, %esp
>     ja    LBB0_2
>     pushl    $12
>     pushl    $20
>     calll    ___morestack
>     ret
> LBB0_2:
>     pushl    %ebp
> Ltmp2:
>     .cfi_def_cfa_offset 8
> Ltmp3:
>     .cfi_offset %ebp, -8
>     movl    %esp, %ebp
> Ltmp4:
>     .cfi_def_cfa_register %ebp
>
> Align stack (for what? We don't do any SSE)
>
>     andl    $-8, %esp
>     subl    $16, %esp

The compiler aligns the stack for performance.

> Multiply x * y
>
>     movl    12(%ebp), %eax
>     mull    16(%ebp)
>     jno    LBB0_4
>
> If it did overflow (fall-through), stash a 0 (None) at top of stack
>
>     movb    $0, (%esp)
>     jmp    LBB0_5
>
> If it didn't overflow, stash a 1 (Some) and the product at top of stack
> (we are building an Option<u32> here)
> LBB0_4:
>     movb    $1, (%esp)
>     movl    %eax, 4(%esp)
>
> Take pointer to &this for __thiscall:
> LBB0_5:
>     leal    (%esp), %ecx
>     calll    __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
>
> Do the addition to the result
>
>     addl    $16, %eax
>
> Repeat the previous circus
>
>     jae    LBB0_7
>     movb    $0, 8(%esp)
>     jmp    LBB0_8
> LBB0_7:
>     movb    $1, 8(%esp)
>     movl    %eax, 12(%esp)
> LBB0_8:
>     leal    8(%esp), %ecx
>     calll    __ZN6option6Option6unwrap21h05c5cb6c47a61795Zcat4v0.0E
>     movl    %ebp, %esp
>     popl    %ebp
>     ret
>     .cfi_endproc
>
>
> Yeah. It's not fast because it's not inlining through option::unwrap.

The code to initiate failure is gigantic and LLVM doesn't do partial
inlining by default. It's likely far above the inlining threshold.

> I'm not sure what can be done for this, and whether it's on the LLVM side or
> the Rust side of things. My first instinct: find out what happens when fail!
> is moved out-of-line from unwrap() into its own function (especially if
> that function can be marked noinline!), because optimizers often choke
> around EH.

I was testing with `rust-core` and calling `abort`, as it doesn't use unwinding.
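
A minimal sketch of the out-of-line failure path suggested in the quoted
text, written against current stable Rust (where `checked_mul` takes its
argument by value). The helper name `overflow_failure` and the use of
`#[cold]` and `#[inline(never)]` are assumptions for illustration, not what
`rust-core` does. It panics so the failure can be observed in a test;
`std::process::abort()` could be substituted to avoid unwinding entirely.

```rust
/// Hypothetical failure helper, kept out of line so the hot path stays small.
/// The name and attributes are illustrative assumptions.
#[cold]
#[inline(never)]
fn overflow_failure() -> ! {
    panic!("arithmetic overflow");
}

/// Same computation as `test_wrap` in the quoted message: x * y + 16,
/// failing on overflow instead of wrapping.
pub fn test_wrap(x: u32, y: u32) -> u32 {
    // Checked multiply; only the failure branch calls the cold helper.
    let product = match x.checked_mul(y) {
        Some(p) => p,
        None => overflow_failure(),
    };
    // Checked add of the constant, handled the same way.
    match product.checked_add(16) {
        Some(sum) => sum,
        None => overflow_failure(),
    }
}
```

Whether this actually produces tighter code than going through `unwrap` has
to be checked in the generated assembly.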

> I tried to test the "optimal" situation in a synthetic benchmark:
> https://gist.github.com/oshepherd/8376705
> (In C for expediency. N.B. you must set core affinity before running this
> benchmark because I hackishly just read the TSC. i386 only.)
>
>
> but the results are really bizarre and seem to have a multitude of affecting
> factors (For example, if you minimally unroll and have the JCs jump straight
> to abort, you get vastly different performance from jumping to a closer
> location and then onwards to abort. Bear in mind that the overflow case
> never happens during the test). It would be interesting to do a test in
> which a "trivial" implementation of trap-on-overflow is added to rustc
> (read: the overflow case just jumps straight to abort or similar, to
> minimize optimizer influence and variability) to see how defaulting to
> trapping ints affects real world workloads.
>
> I wonder what level of performance impact would be considered "acceptable"
> for improved safety by default?
>
> Mind you, I think that what I'd propose is that i32 = Trapping, i32w =
> wrapping, i32s = saturating, or something similar

A purely synthetic benchmark only executing the unchecked or checked
instruction isn't interesting. You need to include in the loop several of
the optimizations that real code would rely on, and you will often
see a massive drop in performance from the serialization of the
pipeline. Register renaming is not as clever as you'd expect.
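
As a rough illustration of the kind of loop worth measuring, here is a
sketch in current stable Rust of the same reduction written with wrapping
and with checked addition. The function names are hypothetical, and nothing
here claims a particular result: the checked version has an exit on every
iteration, so which loop transformations survive is something to measure
rather than assume.

```rust
/// Sum with wrapping arithmetic: no per-element overflow branch,
/// so the loop body is a plain add.
pub fn sum_wrapping(values: &[u32]) -> u32 {
    let mut total: u32 = 0;
    for &v in values {
        total = total.wrapping_add(v);
    }
    total
}

/// Sum with checked arithmetic: every iteration may leave the loop,
/// and each add depends on the overflow test before it.
pub fn sum_checked(values: &[u32]) -> Option<u32> {
    let mut total: u32 = 0;
    for &v in values {
        // `?` returns None from the function on the first overflow.
        total = total.checked_add(v)?;
    }
    Some(total)
}
```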

The impact of trapping is known, because `clang` and `gcc` expose `-ftrapv`.
Integer-heavy workloads like cryptography and video codecs are
several times slower with the checks.
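
For reference, the three behaviours proposed in the quoted message
(trapping, wrapping, saturating) can be sketched with the integer methods in
current stable Rust. The wrapper names are hypothetical, and "trapping" is
modelled here as returning `None` rather than aborting, so that the caller
decides how to fail.

```rust
/// "Trapping" flavour: overflow is reported instead of producing a value.
pub fn add_trapping(a: i32, b: i32) -> Option<i32> {
    a.checked_add(b)
}

/// "Wrapping" flavour: overflow wraps around in two's complement.
pub fn add_wrapping(a: i32, b: i32) -> i32 {
    a.wrapping_add(b)
}

/// "Saturating" flavour: overflow clamps to the nearest representable bound.
pub fn add_saturating(a: i32, b: i32) -> i32 {
    a.saturating_add(b)
}
```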

