Shared memory performance #126

Open · kyleabeauchamp opened this issue May 10, 2017 · 6 comments

@kyleabeauchamp

Does anyone (e.g., @ilveroluca or @avilella) have any thoughts on the performance of bwa mem when run with a shared-memory index (bwa shm)? I've measured a 24% performance penalty when using a pre-loaded index, which to my naive mind points to either increased cache misses or suboptimal virtual memory paging (possibly related to mmap flags). Ideally, I would love for the pre-loaded index to improve performance, given the lower overall RAM usage, the reduced time spent on IO, and the increased flexibility for threading / multiplexing.

Does this number seem "reasonable" to others who have thought more carefully about memory management?

My benchmark code is below. FWIW, I've observed similar behavior on both OSX and Linux.

wget ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/131219_D00360_005_BH814YADXX/Project_RM8398/Sample_U0a/U0a_CGATGT_L001_R1_001.fastq.gz

time bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz > /dev/null
[M::bwa_idx_load_from_disk] ...
[...]
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz
[main] Real time: 146.956 sec; CPU: 1690.897 sec

bwa shm ref.bwa_mem.fa
time bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz > /dev/null
[M::main_mem] load the bwa index from shared memory
[...]
[main] Version: 0.7.15-r1140
[main] CMD: bwa mem -t 12 ref.bwa_mem.fa U0a_CGATGT_L001_R1_001.fastq.gz
[main] Real time: 182.335 sec; CPU: 2153.612 sec
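For context on what bwa shm sets up: it publishes the index in a POSIX shared memory object, which each subsequent bwa mem maps into its address space. Below is a minimal sketch of that reader-side attach, with a hypothetical object name and simplified error handling; it illustrates the mechanism, not bwa's actual code.

/* Attach to a POSIX shared-memory object created by a loader process. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* "/hypothetical_idx" is an illustrative name, not bwa's. */
    int fd = shm_open("/hypothetical_idx", O_RDONLY, 0);
    if (fd < 0) { perror("shm_open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Read-only mapping; every aligner thread shares these pages. */
    void *idx = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (idx == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);

    /* ... alignment threads would read the index from `idx` here ... */
    munmap(idx, st.st_size);
    return 0;
}

(On older glibc you may need to link with -lrt for shm_open.) Nothing here forces the mapped pages to be resident, which is relevant to the discussion below.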
@ilveroluca

That's interesting. Have you repeated the test, and do you consistently get similar results?

Some time ago I implemented a simpler approach that directly accesses the reference with mmap, without a POSIX shared memory object (PR #40, which, incidentally, I still use). I remember that to ensure good performance I had to make sure the reference files were loaded all at once (with the MAP_POPULATE flag); without it you'd end up with random disk accesses as the alignment hits random sections of the reference. Also, if the bwa shm code isn't locking the reference in memory (and I don't think it is), parts of it may be swapped out, causing page faults as the alignment runs.
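A minimal sketch of that combination on a file-backed mapping: prefault everything with MAP_POPULATE, then pin it with mlock so pages are neither faulted in lazily nor swapped out mid-run. MAP_POPULATE is Linux-only, and the file name here is illustrative rather than taken from PR #40.

/* Prefault and pin a file-backed mapping of a reference file. */
#define _GNU_SOURCE   /* for MAP_POPULATE */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("ref.bwa_mem.fa.bwt", O_RDONLY);  /* illustrative file */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* MAP_POPULATE reads the whole file in up front, so alignment
     * touching random offsets later doesn't trigger random disk IO. */
    void *ref = mmap(NULL, st.st_size, PROT_READ,
                     MAP_SHARED | MAP_POPULATE, fd, 0);
    if (ref == MAP_FAILED) { perror("mmap"); return 1; }
    close(fd);

    /* mlock pins the pages so they can't be swapped out mid-run
     * (subject to RLIMIT_MEMLOCK). */
    if (mlock(ref, st.st_size) != 0) perror("mlock");

    /* ... alignment would read from `ref` here ... */
    munmap(ref, st.st_size);
    return 0;
}

PR #40 may differ in detail; this just shows the two knobs described above, prefaulting and pinning.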

@kyleabeauchamp
Author

I have repeated the test with several threading settings and always get the same answer.

@kyleabeauchamp
Author

FWIW, I've tried adjusting the various flags (e.g., MAP_POPULATE) to make bwa shm behave more like #40. However, I never saw any improvement.

I also tried running #40, but I saw some segfaults when running that branch, so I was never able to get a comparable benchmark against bwa shm.

I'm definitely not an expert on Unix memory management, though, so it's still possible that someone with more experience could uncover performance gains here.

@ihaque-freenome
Contributor

What happens if you run a second bwa mem after the first (i.e., bwa shm, bwa mem, bwa mem)? I wonder whether the first-run slowdown is caused by paging-in, as @ilveroluca suggested, and whether a second run would do better with the reference hot in memory.

@ihaque-freenome
Contributor

@kyleabeauchamp you say that you can reproduce this on OSX, so it probably isn't the issue, but if you're running bwa on multi-socket servers, you may want to see if you're getting hit by NUMA issues: https://www.systutorials.com/docs/linux/man/8-numactl/

@kyleabeauchamp
Author

PS: I never answered some of the follow-up questions on this thread, so here goes. I can confirm that the slower performance of shm is not resolved by doing a burn-in run to get the reference hot in memory. Regarding NUMA, I believe I've seen similar behavior on single-CPU machines, so I think that's probably not it.
