Memory leak in GenotypeGVCFs with -all-sites
#8989
Comments
Can you reduce the maximum number of alleles per site when you run this analysis?
Hi @brisk022, there is an update for this issue. We were able to reproduce this problem ourselves, and it looks like there is a memory management issue somewhere in the GenomicsDB-related code inside GenotypeGVCFs. Our temporary solution, until we make an updated release, would be to convert imported GenomicsDB instances to GVCF using
and later using this GVCF file as input for the GenotypeGVCFs tool. This ensures that memory usage stays at reasonable levels and avoids any apparent leaks. I hope this helps. Regards.
@nalinigans: We now believe this is actually a GenomicsDB issue (or possibly an issue in the JNI layer). @gokalpcelik was able to reproduce this problem on a set of 330 whole exomes. He found that when he ran GenotypeGVCFs from a GenomicsDB, the memory usage climbed to tens of GB while the Java heap memory remained constant. He then tested first extracting the combined GVCF from GenomicsDB and then running GenotypeGVCFs, and saw that memory usage for GenotypeGVCFs remained constant at 1 GB. So we think this is probably a GenomicsDB issue.
GenomicsDBImport > GenotypeGVCFs: memory ramps up immediately to tens of gigabytes.
He can fill in more detail about the exact configuration if it helps.
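The reasoning in the comment above (process memory climbs while the Java heap stays flat, so the growth must be outside the heap) can be sketched as a small calculation. This is only an illustration: the function name `off_heap_growth_mb` is hypothetical, and the sample values in the usage line are made-up round numbers, not measurements from this issue.

```python
def off_heap_growth_mb(samples):
    """Estimate memory growth that the Java heap does not account for.

    `samples` is a time-ordered list of (rss_mb, heap_used_mb) pairs:
    the resident set size of the whole process and the used Java heap.
    The result is the change in (rss - heap) between the first and the
    last sample. A large positive value suggests native (off-heap)
    allocations, for example in a library reached through JNI.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (rss_start, heap_start), (rss_end, heap_end) = samples[0], samples[-1]
    return (rss_end - heap_end) - (rss_start - heap_start)


# Illustrative values only: RSS grows by 18.5 GB while the heap stays at 1 GB.
print(off_heap_growth_mb([(1500, 1000), (20000, 1000)]))
```

If the same calculation on the GVCF-based run stays near zero, that is consistent with the leak sitting in the GenomicsDB or JNI path rather than in the GenotypeGVCFs Java code.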
Between 4.1.0.0 and 4.2.0.0 we moved from GenomicsDB
@lbergelson @gokalpcelik Any chance of giving me access to the workspace for the 330 whole exomes?
Thanks, @gokalpcelik! I tested the workaround, and indeed, when GenotypeGVCFs is used with a GVCF file rather than GenomicsDB, the memory consumption remains reasonable. I only tried GATK 4.6, but it is probably the same with the other versions that have the issue.
Hi @nalinigans
Bug Report
Affected tool(s) or class(es)
GenotypeGVCFs with -all-sites
Affected version(s)
Description
We tried to run GenotypeGVCFs from GATK 4.5 with -all-sites on a dataset with 120 samples and GRCh37 as the reference. Each run was limited to a single chromosome. All of them failed after consuming 3 TB of memory. Subsequently, I tried a smaller subset of 8 samples, limiting the memory to 32 GB, and all the runs failed after 3-10 Mbp, depending on the chromosome.
Finally, I randomly picked chromosome 9 and ran GATK versions 4.1 to 4.6; only 4.1 did not experience the problem. It finished the whole chromosome (141 Mbp) with a maximum memory usage of around 8 GB. All others failed after 3-6 Mbp. (Sorry, I used different memory settings for 4.5, so I did not include it.)
Time is in seconds, memory is in MB.
If I run the same command without -all-sites, the maximum memory usage is around 1.6 GB.
Steps to reproduce
GenomicsDB was created using the corresponding GATK version as:
GenotypeGVCFs was run as:
All runs were performed with resource_monitor and it was instructed to kill the process if it consumes more than 14000 MB of memory. Thus, at least 2 GB was allocated for reading GenomicsDB. The size of the GenomicsDB on disk is around 3.1 GB for versions >=4.2 and 3.0 GB for version 4.1.
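For readers without resource_monitor, the kill-on-limit behaviour described above can be approximated with a short Python watchdog. This is a minimal sketch, not how resource_monitor works: the names `parse_vm_rss_mb` and `run_with_memory_limit` are hypothetical, it assumes a Linux `/proc` filesystem, and it watches only the process it starts, not child processes, so the command would need to be the Java process itself rather than a wrapper script.

```python
import subprocess
import sys
import time
from pathlib import Path


def parse_vm_rss_mb(status_text):
    """Return the resident set size in MB from /proc/<pid>/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:   123456 kB".
            return int(line.split()[1]) / 1024
    return None


def run_with_memory_limit(cmd, limit_mb, poll_seconds=1.0):
    """Run cmd and kill it once its RSS exceeds limit_mb.

    Returns (exit_code, peak_rss_mb, killed).
    """
    proc = subprocess.Popen(cmd)
    peak_mb = 0.0
    killed = False
    while proc.poll() is None:
        try:
            status = Path(f"/proc/{proc.pid}/status").read_text()
            rss_mb = parse_vm_rss_mb(status)
        except OSError:
            rss_mb = None  # process just exited, or /proc is unavailable
        if rss_mb is not None:
            peak_mb = max(peak_mb, rss_mb)
            if rss_mb > limit_mb:
                proc.kill()
                killed = True
                break
        time.sleep(poll_seconds)
    return proc.wait(), peak_mb, killed


if __name__ == "__main__":
    # Demo with a trivial command; 14000 MB mirrors the limit quoted above.
    print(run_with_memory_limit([sys.executable, "-c", "pass"], 14000, 0.1))
```

Logging `peak_mb` alongside the elapsed time for each GATK version would give the same kind of time and memory comparison as in the report.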