Inline fail! causes significant slowdown in write() #14485

Closed
mahkoh opened this issue May 28, 2014 · 21 comments
Labels
I-compiletime Issue: Problems and improvements with respect to compile times. I-slow Issue: Problems and improvements with respect to performance of generated code. P-low Low priority

Comments

@mahkoh (Contributor) commented May 28, 2014

Consider the following snippet:

#[inline(never)]
fn fail() {
    fail!("assertion failed");
}

impl<'a, W: Writer> Writer for UnsafeWriter<'a, W> {
    fn write(&mut self, buf: &[u8]) -> IoResult<()> {
        let dst = self.buf.mut_slice_from(self.pos);
        if buf.len() > dst.len() {
            fail();
        }
        unsafe {
            copy_nonoverlapping_memory(dst.as_mut_ptr(), buf.as_ptr(), buf.len())
        }
        self.pos += buf.len();
        Ok(())
    }

    fn flush(&mut self) -> IoResult<()> {
        self.flush_buf().and_then(|()| self.inner.flush())
    }
}

Inlining fail() causes a significant slowdown in the following program:

https://gist.github.com/anonymous/48ba255d89c2f9bf636c

Inlined:

real    0m3.256s
user    0m3.237s
sys 0m0.020s

Non-inlined:

real    0m1.562s
user    0m1.537s
sys 0m0.023s

Compiled with -O. The program is tested with time ./cat -v /tmp/blabla.txt > /dev/null, where the text file is 100 MB of random data.

Note that core::slice::MutableVector::copy_memory has the form

#[inline]
unsafe fn copy_memory(self, src: &[T]) {
    let len_src = src.len();
    assert!(self.len() >= len_src);
    ptr::copy_nonoverlapping_memory(self.as_mut_ptr(), src.as_ptr(), len_src)
}

where assert! expands to something similar to the inlined fail!() version above. This function is used by some Writers.

cc @cmr

@mahkoh (Contributor, Author) commented May 28, 2014

perf non-inlined

~/rust/coreutils/cat$ perf stat ./cat -v /tmp/blabla.txt > /dev/null

 Performance counter stats for './cat -v /tmp/blabla.txt':

       1598.236550      task-clock (msec)         #    0.999 CPUs utilized          
                12      context-switches          #    0.008 K/sec                  
                26      cpu-migrations            #    0.016 K/sec                  
               313      page-faults               #    0.196 K/sec                  
     3,356,501,774      cycles                    #    2.100 GHz                    
     1,378,688,552      stalled-cycles-frontend   #   41.08% frontend cycles idle   
     1,012,720,553      stalled-cycles-backend    #   30.17% backend  cycles idle   
     4,485,975,285      instructions              #    1.34  insns per cycle        
                                                  #    0.31  stalled cycles per insn
     1,435,550,790      branches                  #  898.209 M/sec                  
        76,588,047      branch-misses             #    5.34% of all branches        

       1.599360146 seconds time elapsed

perf inlined:

~/rust/coreutils/cat$ perf stat ./cat -v /tmp/blabla.txt > /dev/null

 Performance counter stats for './cat -v /tmp/blabla.txt':

       3283.608533      task-clock (msec)         #    1.001 CPUs utilized          
                 7      context-switches          #    0.002 K/sec                  
                21      cpu-migrations            #    0.006 K/sec                  
               336      page-faults               #    0.102 K/sec                  
     7,105,046,317      cycles                    #    2.164 GHz                    
     1,729,009,226      stalled-cycles-frontend   #   24.33% frontend cycles idle   
     1,119,307,210      stalled-cycles-backend    #   15.75% backend  cycles idle   
    12,545,235,483      instructions              #    1.77  insns per cycle        
                                                  #    0.14  stalled cycles per insn
     3,403,847,620      branches                  # 1036.618 M/sec                  
        78,237,041      branch-misses             #    2.30% of all branches        

       3.281030203 seconds time elapsed

nm non-inlined:

~/rust/coreutils/cat$ nm cat| grep "UnsafeWriter.*write"
~/rust/coreutils/cat$ 

nm inlined:

~/rust/coreutils/cat$ nm cat| grep "UnsafeWriter.*write"
000000000040c110 t _ZN40UnsafeWriter$LT$$x27a$C$$x20W$GT$.Writer5write21h130431865473282111524v0.0E

@huonw (Member) commented May 28, 2014

BTW, theoretically that .write should return an Err rather than failing, so that callers can handle the error.

@mahkoh (Contributor, Author) commented May 28, 2014

In this particular program we always call unwrap() after write(). In fact, I'd rather replace all failing and unwinding with abort().

For comparison:

        if buf.len() > dst.len() {
            return Err(std::io::IoError { kind: std::io::OtherIoError, desc: "", detail: None });
        }
real    0m1.670s
user    0m1.650s
sys 0m0.017s

        if buf.len() > dst.len() {
            return Err(std::io::standard_error(std::io::OtherIoError));
        }
real    0m1.790s
user    0m1.783s
sys 0m0.007s

        if buf.len() > dst.len() {
            unsafe { std::intrinsics::abort(); }
        }
real    0m1.675s
user    0m1.643s
sys 0m0.030s

        if buf.len() > dst.len() {
            unsafe { libc::exit(1); }
        }
real    0m1.567s
user    0m1.537s
sys 0m0.027s

The numbers are somewhat stable.

@emberian (Member)

cc @alexcrichton @thestinger

@mahkoh (Contributor, Author) commented Jul 9, 2014

With the current nightly:

real    0m1.665s
user    0m1.643s
sys 0m0.017s

Inlined:

real    0m1.716s
user    0m1.673s
sys 0m0.037s

@thestinger thestinger added I-slow Issue: Problems and improvements with respect to performance of generated code. I-compiletime Issue: Problems and improvements with respect to compile times. labels Sep 19, 2014
@thestinger (Contributor)

See #17386

@thestinger (Contributor)

Nominating because fail!() is a major performance issue and presents a backwards compatibility problem. I don't think it makes sense for failure to involve string formatting and injecting line numbers (which are usually useless, due to lack of context).

@nikomatsakis nikomatsakis changed the title Inline fail! causes significant slowdown Inline fail! causes significant slowdown in microbenchmark Sep 25, 2014
@nikomatsakis (Contributor)

Updated title to be more precise.

@pnkfelix (Member)

Assigning P-low, not 1.0.

(We should also investigate whether this benchmark still has the same performance problems, since there was some discussion that some of these paths have been improved.)

@pnkfelix pnkfelix added P-low Low priority and removed I-nominated labels Sep 25, 2014
@mahkoh (Contributor, Author) commented Sep 25, 2014

cat -v is a real-life application, and its being over 100% slower when you use fail! has nothing to do with microbenchmarks.

@huonw (Member) commented Sep 25, 2014

Where is the 100% slower? #14485 (comment) doesn't suggest this; also, it's been a while since that measurement and IIRC there have been even more improvements since then.

@mahkoh (Contributor, Author) commented Sep 25, 2014

It was 100% slower in the original post. If it is a micro benchmark now then it was a micro benchmark then. If anything, the "significant" part has been improved.

@zwarich commented Sep 25, 2014

Also, is the performance impact caused by instruction cache effects, poor register allocation of the inlined code, or something else? The perf results show that more instructions are being executed, with less pipeline stalling and branch misprediction, which suggests register allocation as a cause.

@huonw (Member) commented Sep 25, 2014

The historical performance is not relevant; we only care about the performance now. It's wrong and disingenuous to use the 100+% slower number as 'evidence' for something if it is no longer actually the case.

@mahkoh (Contributor, Author) commented Sep 25, 2014

I'm not using it as evidence for anything. The facts are very simple:

  • cat is a real life program
  • Trying to write a cat version that is as fast as possible, so that it can possibly replace GNU cat, is not an exercise in microbenchmarking
  • At some point, fail! made it impossible to use the built-in readers
  • There is still a noticeable slowdown when you inline fail

Trying to rewrite history by now calling it a microbenchmark doesn't make Rust's performance problems go away.

Here are some updated numbers:

Inlined: 1.793
Not inlined: 1.685

As you can see, both versions have gotten even slower since the last time I posted some numbers. But the inlined version has gotten "more worse" than the non-inlined version. I'll post perf results shortly.

@mahkoh (Contributor, Author) commented Sep 25, 2014

Inlined:

       1776.939970      task-clock (msec)         #    0.994 CPUs utilized          
                65      context-switches          #    0.037 K/sec                  
                10      cpu-migrations            #    0.006 K/sec                  
               148      page-faults               #    0.083 K/sec                  
     3,827,770,223      cycles                    #    2.154 GHz                    
     1,616,251,336      stalled-cycles-frontend   #   42.22% frontend cycles idle   
     1,177,573,818      stalled-cycles-backend    #   30.76% backend  cycles idle   
     4,558,947,291      instructions              #    1.19  insns per cycle        
                                                  #    0.35  stalled cycles per insn
     1,393,158,726      branches                  #  784.021 M/sec                  
        80,200,854      branch-misses             #    5.76% of all branches        

       1.787362035 seconds time elapsed

Not inlined:

       1684.161520      task-clock (msec)         #    0.994 CPUs utilized          
                40      context-switches          #    0.024 K/sec                  
                17      cpu-migrations            #    0.010 K/sec                  
               147      page-faults               #    0.087 K/sec                  
     3,648,479,803      cycles                    #    2.166 GHz                    
     1,504,806,858      stalled-cycles-frontend   #   41.24% frontend cycles idle   
     1,044,159,728      stalled-cycles-backend    #   28.62% backend  cycles idle   
     4,556,741,666      instructions              #    1.25  insns per cycle        
                                                  #    0.33  stalled cycles per insn
     1,392,766,625      branches                  #  826.979 M/sec                  
        80,104,480      branch-misses             #    5.75% of all branches        

       1.693872718 seconds time elapsed

@aturon (Member) commented Sep 25, 2014

I think there was some misunderstanding during the meeting where this was discussed -- I agree with @mahkoh that the example code is representative of an application (part of the Rust coreutils implementation) and not a microbenchmark.

That said, it looks like the gap between the two versions has shrunk considerably.

@nikomatsakis nikomatsakis changed the title Inline fail! causes significant slowdown in microbenchmark Inline fail! causes significant slowdown Sep 26, 2014
@nikomatsakis nikomatsakis changed the title Inline fail! causes significant slowdown Inline fail! causes significant slowdown in write() Sep 26, 2014
@nikomatsakis (Contributor)

@mahkoh fair enough, I updated the title again. The main thing I was trying to do was to have something more specific than "significant slowdown", full stop, not to denigrate the bug report. (Mostly because, when skimming titles of bug reports, I wanted something that would jog my memory as to what specific situation was being described.)

@dotdash (Contributor) commented Oct 8, 2014

This seems to have been resolved. With rustc 0.12.0-dev (57af34b9a 2014-10-08 05:32:09 +0000) I get

$ perf stat -r10 ./cat-noline -v ~/garbage  > /dev/null

 Performance counter stats for './cat-noline -v /home/bs/garbage' (10 runs):

        941.592477      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.78% )
                 1      context-switches          #    0.001 K/sec                    ( +- 11.11% )
                 1      cpu-migrations            #    0.001 K/sec                    ( +- 44.44% )
               146      page-faults               #    0.155 K/sec                    ( +-  0.31% )
     3,623,287,334      cycles                    #    3.848 GHz                      ( +-  0.18% )
     1,554,100,197      stalled-cycles-frontend   #   42.89% frontend cycles idle     ( +-  0.38% )
   <not supported>      stalled-cycles-backend   
     4,733,907,018      instructions              #    1.31  insns per cycle        
                                                  #    0.33  stalled cycles per insn  ( +-  0.00% )
     1,366,195,900      branches                  # 1450.942 M/sec                    ( +-  0.00% )
        80,324,010      branch-misses             #    5.88% of all branches          ( +-  0.01% )

       0.941340757 seconds time elapsed                                          ( +-  0.78% )

and

$ perf stat -r10 ./cat-inline -v ~/garbage  > /dev/null

 Performance counter stats for './cat-inline -v /home/bs/garbage' (10 runs):

        931.492671      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.59% )
                 1      context-switches          #    0.001 K/sec                    ( +- 11.11% )
                 1      cpu-migrations            #    0.001 K/sec                    ( +- 30.77% )
               146      page-faults               #    0.157 K/sec                    ( +-  0.23% )
     3,609,375,727      cycles                    #    3.875 GHz                      ( +-  0.25% )
     1,548,644,880      stalled-cycles-frontend   #   42.91% frontend cycles idle     ( +-  0.70% )
   <not supported>      stalled-cycles-backend   
     4,733,869,621      instructions              #    1.31  insns per cycle        
                                                  #    0.33  stalled cycles per insn  ( +-  0.00% )
     1,366,189,092      branches                  # 1466.666 M/sec                    ( +-  0.00% )
        80,322,444      branch-misses             #    5.88% of all branches          ( +-  0.01% )

       0.931251603 seconds time elapsed                                          ( +-  0.59% )

@mahkoh can you confirm this?

@mahkoh (Contributor, Author) commented Oct 8, 2014

They both take the same amount of time now, but the non-inlined version has again gotten slower: 0m1.727s

@thestinger (Contributor)

Closing in favour of #17386. It will always cause a performance hit in some scenarios if it's more than a function call.

bors added a commit to rust-lang-ci/rust that referenced this issue Dec 11, 2023
…rences, r=Veykril

fix: resolve Self type references in delegate method assist

This PR makes the delegate method assist resolve any `Self` type references in the parameters or return type. It also works across macros such as the `uint_impl!` macro used for `saturating_mul` in the issue example.

Closes rust-lang#14485

9 participants