Inline fail! causes significant slowdown in write() #14485

Closed
mahkoh opened this issue May 28, 2014 · 21 comments
Labels
I-compiletime Issue: Problems and improvements with respect to compile times. I-slow Issue: Problems and improvements with respect to performance of generated code. P-low Low priority

Comments

@mahkoh (Contributor) commented May 28, 2014

Consider the following snippet:

#[inline(never)]
fn fail() {
    fail!("assertion failed");
}

impl<'a, W: Writer> Writer for UnsafeWriter<'a, W> {
    fn write(&mut self, buf: &[u8]) -> IoResult<()> {
        let dst = self.buf.mut_slice_from(self.pos);
        if buf.len() > dst.len() {
            fail();
        }
        unsafe {
            copy_nonoverlapping_memory(dst.as_mut_ptr(), buf.as_ptr(), buf.len())
        }
        self.pos += buf.len();
        Ok(())
    }

    fn flush(&mut self) -> IoResult<()> {
        self.flush_buf().and_then(|()| self.inner.flush())
    }
}

Inlining fail() causes a significant slowdown in the following program:

https://gist.github.com/anonymous/48ba255d89c2f9bf636c

Inlined:

real    0m3.256s
user    0m3.237s
sys 0m0.020s

Non-inlined:

real    0m1.562s
user    0m1.537s
sys 0m0.023s

Compiled with -O. The program is tested with time ./cat -v /tmp/blabla.txt > /dev/null, where the text file is 100 MB of random data.

Note that core::slice::MutableVector::copy_memory has the form

#[inline]
unsafe fn copy_memory(self, src: &[T]) {
    let len_src = src.len();
    assert!(self.len() >= len_src);
    ptr::copy_nonoverlapping_memory(self.as_mut_ptr(), src.as_ptr(), len_src)
}

where assert! expands to something similar to the inlined fail!() version above. This function is used by some Writers.

cc @cmr

@mahkoh (Contributor, Author) commented May 28, 2014

perf non-inlined

~/rust/coreutils/cat$ perf stat ./cat -v /tmp/blabla.txt > /dev/null

 Performance counter stats for './cat -v /tmp/blabla.txt':

       1598.236550      task-clock (msec)         #    0.999 CPUs utilized          
                12      context-switches          #    0.008 K/sec                  
                26      cpu-migrations            #    0.016 K/sec                  
               313      page-faults               #    0.196 K/sec                  
     3,356,501,774      cycles                    #    2.100 GHz                    
     1,378,688,552      stalled-cycles-frontend   #   41.08% frontend cycles idle   
     1,012,720,553      stalled-cycles-backend    #   30.17% backend  cycles idle   
     4,485,975,285      instructions              #    1.34  insns per cycle        
                                                  #    0.31  stalled cycles per insn
     1,435,550,790      branches                  #  898.209 M/sec                  
        76,588,047      branch-misses             #    5.34% of all branches        

       1.599360146 seconds time elapsed

perf inlined:

~/rust/coreutils/cat$ perf stat ./cat -v /tmp/blabla.txt > /dev/null

 Performance counter stats for './cat -v /tmp/blabla.txt':

       3283.608533      task-clock (msec)         #    1.001 CPUs utilized          
                 7      context-switches          #    0.002 K/sec                  
                21      cpu-migrations            #    0.006 K/sec                  
               336      page-faults               #    0.102 K/sec                  
     7,105,046,317      cycles                    #    2.164 GHz                    
     1,729,009,226      stalled-cycles-frontend   #   24.33% frontend cycles idle   
     1,119,307,210      stalled-cycles-backend    #   15.75% backend  cycles idle   
    12,545,235,483      instructions              #    1.77  insns per cycle        
                                                  #    0.14  stalled cycles per insn
     3,403,847,620      branches                  # 1036.618 M/sec                  
        78,237,041      branch-misses             #    2.30% of all branches        

       3.281030203 seconds time elapsed

nm non-inlined:

~/rust/coreutils/cat$ nm cat| grep "UnsafeWriter.*write"
~/rust/coreutils/cat$ 

nm inlined:

~/rust/coreutils/cat$ nm cat| grep "UnsafeWriter.*write"
000000000040c110 t _ZN40UnsafeWriter$LT$$x27a$C$$x20W$GT$.Writer5write21h130431865473282111524v0.0E

@huonw (Member) commented May 28, 2014

BTW, theoretically that .write should return an Err rather than failing, so that callers can handle the error.

@mahkoh (Contributor, Author) commented May 28, 2014

In this particular program we always call unwrap() after write(). In fact, I'd rather replace all failing and unwinding with abort().

For comparison:

        if buf.len() > dst.len() {
            return Err(std::io::IoError { kind: std::io::OtherIoError, desc: "", detail: None });
        }
real    0m1.670s
user    0m1.650s
sys 0m0.017s

        if buf.len() > dst.len() {
            return Err(std::io::standard_error(std::io::OtherIoError));
        }
real    0m1.790s
user    0m1.783s
sys 0m0.007s

        if buf.len() > dst.len() {
            unsafe { std::intrinsics::abort(); }
        }
real    0m1.675s
user    0m1.643s
sys 0m0.030s

        if buf.len() > dst.len() {
            unsafe { libc::exit(1); }
        }
real    0m1.567s
user    0m1.537s
sys 0m0.027s

The numbers are somewhat stable.

@emberian (Member)

cc @alexcrichton @thestinger

@mahkoh (Contributor, Author) commented Jul 9, 2014

With the current nightly:

real    0m1.665s
user    0m1.643s
sys 0m0.017s

Inlined:

real    0m1.716s
user    0m1.673s
sys 0m0.037s

@thestinger thestinger added I-slow Issue: Problems and improvements with respect to performance of generated code. I-compiletime Issue: Problems and improvements with respect to compile times. labels Sep 19, 2014
@thestinger (Contributor)

See #17386

@thestinger (Contributor)

Nominating because fail!() is a major performance issue and presents a backwards compatibility problem. I don't think it makes sense for failure to involve string formatting and injecting line numbers (which are usually useless, due to lack of context).

@nikomatsakis nikomatsakis changed the title Inline fail! causes significant slowdown Inline fail! causes significant slowdown in microbenchmark Sep 25, 2014
@nikomatsakis (Contributor)

Updated title to be more precise.

@pnkfelix (Member)

Assigning P-low, not 1.0.

(We should also investigate whether this benchmark still has the same performance problems, since there was some discussion that some of these paths have been improved.)

@pnkfelix pnkfelix added P-low Low priority and removed I-nominated labels Sep 25, 2014
@mahkoh (Contributor, Author) commented Sep 25, 2014

cat -v is a real-life application, and its being over 100% slower when you use fail! has nothing to do with microbenchmarks.

@huonw (Member) commented Sep 25, 2014

Where is the 100% slower? #14485 (comment) doesn't suggest this; also, it's been a while since that measurement and IIRC there have been even more improvements since then.

@mahkoh (Contributor, Author) commented Sep 25, 2014

It was 100% slower in the original post. If it is a micro benchmark now then it was a micro benchmark then. If anything, the "significant" part has been improved.

@zwarich commented Sep 25, 2014

Also, is the performance impact caused by instruction cache effects, poor register allocation of the inlined code, or something else? The perf results show that more instructions are being executed, with less pipeline stalling and branch misprediction, which suggests register allocation as a cause.

@huonw (Member) commented Sep 25, 2014

The historical performance is not relevant; we only care about the performance now. It's wrong and disingenuous to use the 100+% slower number as 'evidence' for something if it is no longer actually the case.

@mahkoh (Contributor, Author) commented Sep 25, 2014

I'm not using it as evidence for anything. The facts are very simple:

  • cat is a real life program
  • Trying to write a cat version that is as fast as possible, so that it can possibly replace GNU cat, is not an exercise in microbenchmarking
  • At some point, fail! made it impossible to use the built-in readers
  • There is still a noticeable slowdown when you inline fail

Trying to rewrite history by now calling it a microbenchmark doesn't make Rust's performance problems go away.

Here are some updated numbers:

Inlined: 1.793
Not inlined: 1.685

As you can see, both versions have gotten even slower since the last time I posted some numbers. But the inlined version has gotten "more worse" than the non-inlined version. I'll post perf results shortly.

@mahkoh (Contributor, Author) commented Sep 25, 2014

Inlined:

       1776.939970      task-clock (msec)         #    0.994 CPUs utilized          
                65      context-switches          #    0.037 K/sec                  
                10      cpu-migrations            #    0.006 K/sec                  
               148      page-faults               #    0.083 K/sec                  
     3,827,770,223      cycles                    #    2.154 GHz                    
     1,616,251,336      stalled-cycles-frontend   #   42.22% frontend cycles idle   
     1,177,573,818      stalled-cycles-backend    #   30.76% backend  cycles idle   
     4,558,947,291      instructions              #    1.19  insns per cycle        
                                                  #    0.35  stalled cycles per insn
     1,393,158,726      branches                  #  784.021 M/sec                  
        80,200,854      branch-misses             #    5.76% of all branches        

       1.787362035 seconds time elapsed

Not inlined:

       1684.161520      task-clock (msec)         #    0.994 CPUs utilized          
                40      context-switches          #    0.024 K/sec                  
                17      cpu-migrations            #    0.010 K/sec                  
               147      page-faults               #    0.087 K/sec                  
     3,648,479,803      cycles                    #    2.166 GHz                    
     1,504,806,858      stalled-cycles-frontend   #   41.24% frontend cycles idle   
     1,044,159,728      stalled-cycles-backend    #   28.62% backend  cycles idle   
     4,556,741,666      instructions              #    1.25  insns per cycle        
                                                  #    0.33  stalled cycles per insn
     1,392,766,625      branches                  #  826.979 M/sec                  
        80,104,480      branch-misses             #    5.75% of all branches        

       1.693872718 seconds time elapsed

@aturon (Member) commented Sep 25, 2014

I think there was some misunderstanding during the meeting where this was discussed -- I agree with @mahkoh that the example code is representative of an application (part of the Rust coreutils implementation) and not a microbenchmark.

That said, it looks like the gap between the two versions has shrunk considerably.

@nikomatsakis nikomatsakis changed the title Inline fail! causes significant slowdown in microbenchmark Inline fail! causes significant slowdown Sep 26, 2014
@nikomatsakis nikomatsakis changed the title Inline fail! causes significant slowdown Inline fail! causes significant slowdown in write() Sep 26, 2014
@nikomatsakis (Contributor)

@mahkoh fair enough, I updated the title again. The main thing I was trying to do was to have something more specific than "significant slowdown", full stop, not to denigrate the bug report. (Mostly because, when skimming titles of bug reports, I wanted something that would jog my memory as to what specific situation was being described.)

@dotdash (Contributor) commented Oct 8, 2014

This seems to have been resolved. With rustc 0.12.0-dev (57af34b9a 2014-10-08 05:32:09 +0000) I get

$ perf stat -r10 ./cat-noline -v ~/garbage  > /dev/null

 Performance counter stats for './cat-noline -v /home/bs/garbage' (10 runs):

        941.592477      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.78% )
                 1      context-switches          #    0.001 K/sec                    ( +- 11.11% )
                 1      cpu-migrations            #    0.001 K/sec                    ( +- 44.44% )
               146      page-faults               #    0.155 K/sec                    ( +-  0.31% )
     3,623,287,334      cycles                    #    3.848 GHz                      ( +-  0.18% )
     1,554,100,197      stalled-cycles-frontend   #   42.89% frontend cycles idle     ( +-  0.38% )
   <not supported>      stalled-cycles-backend   
     4,733,907,018      instructions              #    1.31  insns per cycle        
                                                  #    0.33  stalled cycles per insn  ( +-  0.00% )
     1,366,195,900      branches                  # 1450.942 M/sec                    ( +-  0.00% )
        80,324,010      branch-misses             #    5.88% of all branches          ( +-  0.01% )

       0.941340757 seconds time elapsed                                          ( +-  0.78% )

and

$ perf stat -r10 ./cat-inline -v ~/garbage  > /dev/null

 Performance counter stats for './cat-inline -v /home/bs/garbage' (10 runs):

        931.492671      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.59% )
                 1      context-switches          #    0.001 K/sec                    ( +- 11.11% )
                 1      cpu-migrations            #    0.001 K/sec                    ( +- 30.77% )
               146      page-faults               #    0.157 K/sec                    ( +-  0.23% )
     3,609,375,727      cycles                    #    3.875 GHz                      ( +-  0.25% )
     1,548,644,880      stalled-cycles-frontend   #   42.91% frontend cycles idle     ( +-  0.70% )
   <not supported>      stalled-cycles-backend   
     4,733,869,621      instructions              #    1.31  insns per cycle        
                                                  #    0.33  stalled cycles per insn  ( +-  0.00% )
     1,366,189,092      branches                  # 1466.666 M/sec                    ( +-  0.00% )
        80,322,444      branch-misses             #    5.88% of all branches          ( +-  0.01% )

       0.931251603 seconds time elapsed                                          ( +-  0.59% )

@mahkoh can you confirm this?

@mahkoh (Contributor, Author) commented Oct 8, 2014

They both take the same amount of time now, but the non-inlined version has again gotten slower: 0m1.727s

@thestinger (Contributor)

Closing in favour of #17386. It will always cause a performance hit in some scenarios if it's more than a function call.

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 11, 2023
…rences, r=Veykril

fix: resolve Self type references in delegate method assist

This PR makes the delegate method assist resolve any `Self` type references in the parameters or return type. It also works across macros such as the `uint_impl!` macro used for `saturating_mul` in the issue example.

Closes rust-lang#14485

9 participants