Shared memory performance #126

Open · kyleabeauchamp opened this issue May 10, 2017 · 6 comments

@kyleabeauchamp

Does anyone (e.g., @ilveroluca or @avilella) have any thoughts on the performance of bwa mem when run with a shared-memory index (bwa shm)? I've measured a 24% performance penalty when using a pre-loaded index, which to my naive mind points to either increased cache misses or suboptimal virtual memory paging (possibly related to mmap flags). Ideally, I would love for the pre-loaded index to improve performance, given the lower overall RAM usage, the reduced time spent on IO, and the increased flexibility for threading / multiplexing.

Does this number seem "reasonable" to others who have thought more carefully about memory management?

My benchmark code is below. FWIW, I've observed similar behavior on both OSX and Linux.

wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz

time bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz > /dev/null
[M::bwa_idx_load_from_disk] ...
[...]
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz
[main] Real time: 146.956 sec; CPU: 1690.897 sec

bwa shm ref.bwa_mem.fa
time bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz > /dev/null
[M::main_mem] load the bwa index from shared memory
[...]
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz
[main] Real time: 182.335 sec; CPU: 2153.612 sec
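For context on what bwa shm sets up: it publishes the index in a POSIX shared memory object, which each subsequent bwa mem maps into its address space. Below is a minimal sketch of that reader-side attach, with a hypothetical object name and simplified error handling; it illustrates the mechanism, not bwa's actual code.

/* Attach to a POSIX shared-memory object created by a loader process. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "/hypothetical_idx" is an illustrative name, not bwa's. */
    int fd = shm_open("/hypothetical_idx", O_RDONLY, 0);
    if (fd < 0) { perror("shm_open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Read-only mapping; every aligner thread shares these pages. */
    void *idx = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (idx == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);

    /* ... alignment threads would read the index from `idx` here ... */
    munmap(idx, st.st_size);
    return 0;
}

(On older glibc you may need to link with -lrt for shm_open.) Nothing here forces the mapped pages to be resident, which is relevant to the discussion below.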
@ilveroluca

That's interesting. Have you repeated the test, and do you consistently get similar results?

Some time ago I implemented a simpler approach that directly accesses the reference with mmap, without a POSIX shared memory object (PR #40, which, incidentally, I still use). I remember that to ensure good performance I had to make sure the reference files were loaded all at once (with the MAP_POPULATE flag); without it you'd end up with random disk accesses as the alignment hits random sections of the reference. Also, if the bwa shm code isn't locking the reference in memory (and I don't think it is), parts of it may be swapped out, causing page faults as the alignment runs.
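A minimal sketch of that combination on a file-backed mapping: prefault everything with MAP_POPULATE, then pin it with mlock so pages are neither faulted in lazily nor swapped out mid-run. MAP_POPULATE is Linux-only, and the file name here is illustrative rather than taken from PR #40.

/* Prefault and pin a file-backed mapping of a reference file. */
#define _GNU_SOURCE   /* for MAP_POPULATE */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("ref.bwa_mem.fa.bwt", O_RDONLY);  /* illustrative file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* MAP_POPULATE reads the whole file in up front, so alignment
     * touching random offsets later doesn't trigger random disk IO. */
    void *ref = mmap(NULL, st.st_size, PROT_READ,
                     MAP_SHARED | MAP_POPULATE, fd, 0);
    if (ref == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);

    /* mlock pins the pages so they can't be swapped out mid-run
     * (subject to RLIMIT_MEMLOCK). */
    if (mlock(ref, st.st_size) != 0) perror("mlock");

    /* ... alignment would read from `ref` here ... */
    munmap(ref, st.st_size);
    return 0;
}

PR #40 may differ in detail; this just shows the two knobs described above, prefaulting and pinning.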

@kyleabeauchamp
Author

I have repeated the test with several threading settings and always get the same answer.

@kyleabeauchamp
Author

FWIW, I've tried adjusting the various flags (e.g., MAP_POPULATE) to make bwa shm behave more like #40. However, I never saw any improvement.

I also tried running #40, but I saw some segfaults when running that branch, so I was never able to get a comparable benchmark against bwa shm.

I'm definitely not an expert on Unix memory management, though, so it's still possible that someone with more experience could uncover performance gains here.

@ihaque-freenome
Contributor

What happens if you run a second bwa mem after the first (i.e., bwa shm, bwa mem, bwa mem)? I wonder whether the first-run slowdown is caused by paging-in, as @ilveroluca suggested, and whether a second run would do better with the reference hot in memory.

@ihaque-freenome
Contributor

@kyleabeauchamp you say that you can reproduce this on OSX, so it probably isn't the issue, but if you're running bwa on multi-socket servers, you may want to see if you're getting hit by NUMA issues: https://www.systutorials.com/docs/linux/man/8-numactl/

@kyleabeauchamp
Author

PS: I never answered some of the follow-up questions on this thread, so here goes. I can confirm that the slower performance of shm is not resolved by doing a burn-in run to get the reference hot in memory. Regarding NUMA, I believe I've seen similar behavior on single-CPU machines, so I think that's probably not it.
