Memory consumption for extremely large search spaces #97
Comments
Hi George, I agree that it's a necessity for large search spaces. I have been messing around with some internal database splitting, but it's not ready for prime time yet. In the meantime, it's possible to perform external database splitting: generate slices of FASTA files, run Sage multiple times, then combine the results and rescore. Perhaps not ideal, but this is essentially what would be done with internal database splitting as well. See below for an example Python script for accomplishing this.

import subprocess

import pandas as pd
from Bio import SeqIO

SLICES = 5

# Load every record so the database can be cut into equal-sized slices.
records = list(SeqIO.parse("fasta/human_contaminant.fasta", format="fasta"))

# Ceiling division, so trailing records are not dropped when
# len(records) is not a multiple of SLICES.
N = -(-len(records) // SLICES)

for i in range(SLICES):
    # Write slice i to its own FASTA file.
    with open(f"fasta/human_slice_{i}.fasta", "w") as f:
        SeqIO.write(records[i * N : (i + 1) * N], f, format="fasta")

    # Search the slice with Sage and write a .pin file into semi_{i}/.
    cmd = [
        "sage",
        "search.json",
        "-o",
        f"semi_{i}",
        "-f",
        f"fasta/human_slice_{i}.fasta",
        "--write-pin",
        "HeLa_chytry_HCD_1.mzML.gz",
    ]
    subprocess.run(cmd, check=True)  # stop if a slice search fails

# Combine the per-slice results and keep the best-scoring PSM per spectrum.
dfs = [pd.read_csv(f"semi_{i}/results.sage.pin", sep="\t") for i in range(SLICES)]
pd.concat(dfs).sort_values(by="ln(hyperscore)", ascending=False).drop_duplicates(
    subset=["FileName", "ScanNr"], keep="first"
).to_csv("sliced.pin", sep="\t", index=False)
Hey, just a question related to this issue. Could it be that by sorting on the hyperscore and only retaining the best match you might lose hits (this is also not compatible with chimera searching)? Would a valid alternative be to split the work by precursor m/z into consecutive searches (each on the whole FASTA rather than on FASTA slices)? I once tried this (by generating alternative JSONs in a loop), but the memory consumption did not decrease.
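One way to soften the "only the best match survives" concern is to keep the top few distinct peptides per spectrum when merging slices, instead of a single row. The following is a minimal sketch, not something Sage provides: `merge_top_k` is a hypothetical helper, rows are assumed to be plain dicts, and the `Peptide` column name is an assumption (`FileName`, `ScanNr` and `ln(hyperscore)` are the columns used in the script above).

```python
from collections import defaultdict


def merge_top_k(slices, k=2, score_key="ln(hyperscore)", peptide_key="Peptide"):
    """Merge PSM rows from several slice searches.

    Keeps up to k distinct peptides per (FileName, ScanNr), best score first.
    Column names are assumptions about the merged table, not a Sage API.
    """
    by_spectrum = defaultdict(list)
    for rows in slices:
        for row in rows:
            by_spectrum[(row["FileName"], row["ScanNr"])].append(row)

    merged = []
    for key in sorted(by_spectrum):
        ranked = sorted(by_spectrum[key], key=lambda r: r[score_key], reverse=True)
        seen = set()
        for row in ranked:
            # The same peptide can be reported by several slices; keep it once.
            if row[peptide_key] in seen:
                continue
            seen.add(row[peptide_key])
            merged.append(row)
            if len(seen) == k:
                break
    return merged
```

Rank-dependent features in the merged rows would still need to be recomputed before rescoring, since each slice ranked its candidates in isolation.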
Thanks for the feedback! We have been using similar workarounds before. FragPipe also uses similar mechanisms for very large databases.
Interesting that this didn't decrease memory consumption - setting the [...]
That is a valid point about chimeric searches, but those are already kind of heuristic (subtractive method vs something potentially smarter). One potential alternative would be to pre-digest the FASTA database (and pass in "$" as the cleavage enzyme to Sage), and then chunk the FASTA database by peptide mass. That should help with improving chimeric searches and possibly make it go faster as well - this is basically what would be implemented internally.
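The pre-digest-then-chunk idea could look roughly like the sketch below. It is an illustration under stated assumptions, not Sage's internal logic: the digest is a simple tryptic rule (cleave after K/R, not before P, no missed cleavages, no modifications), and `tryptic_digest`, `peptide_mass` and `chunk_by_mass` are hypothetical helper names. Each chunk could then be written out as its own FASTA of peptides and searched without further cleavage.

```python
import re

# Monoisotopic residue masses (Da) for the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def tryptic_digest(sequence, min_len=7, max_len=50):
    """Cleave after K/R unless followed by P; no missed cleavages (simplification)."""
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [
        p for p in pieces
        if min_len <= len(p) <= max_len and all(aa in MONO for aa in p)
    ]


def chunk_by_mass(proteins, n_chunks, min_len=7, max_len=50):
    """Digest all proteins, then split the unique peptides into mass-ordered chunks.

    Returns (chunks, peptide_to_proteins) so protein origin is not lost.
    """
    peptide_to_proteins = {}
    for name, seq in proteins.items():
        for pep in tryptic_digest(seq, min_len, max_len):
            peptide_to_proteins.setdefault(pep, set()).add(name)
    ordered = sorted(peptide_to_proteins, key=peptide_mass)
    size = max(1, -(-len(ordered) // n_chunks))  # ceiling division
    chunks = [ordered[i : i + size] for i in range(0, len(ordered), size)]
    return chunks, peptide_to_proteins
```

Because every chunk covers a contiguous mass range, all candidates for a given precursor land in one chunk (or two neighbouring ones), which is what makes this layout friendlier to chimeric searches than slicing by protein.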
Will splitting the FASTA file not lead to inconsistencies when defining the protein groups? Groups will only be defined within a FASTA slice/search, and protein A from slice 1 and protein B from slice 2 that share peptides won't get grouped together. Or will the groups somehow be updated after concatenation?
Hi @lazear, I was wondering if anyone has started working on implementing database splitting into Sage itself. If not, we might take a look at it. Best,
Draft implementation which works reasonably well: #154
After iterating over the chunks, a final database is created from the found peptides as if it was a normal digest. As such, peptides from proteins in individual chunks are actually correctly merged.
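To illustrate why rebuilding the mapping from the found peptides resolves the grouping concern, here is a small sketch. It is not the code from the draft implementation: `remap_peptides` and `group_indistinguishable` are hypothetical helpers, and the lookup is a plain substring search that ignores enzymatic context.

```python
from collections import defaultdict


def remap_peptides(found_peptides, proteins):
    """Map each found peptide to every protein in the FULL database containing it."""
    return {
        pep: sorted(name for name, seq in proteins.items() if pep in seq)
        for pep in found_peptides
    }


def group_indistinguishable(mapping):
    """Group proteins that are supported by exactly the same set of found peptides."""
    peptides_by_protein = defaultdict(set)
    for pep, names in mapping.items():
        for name in names:
            peptides_by_protein[name].add(pep)
    groups = defaultdict(list)
    for name, peps in peptides_by_protein.items():
        groups[frozenset(peps)].append(name)
    return sorted(sorted(group) for group in groups.values())
```

The key point is that the second pass looks at all proteins at once, so it no longer matters which chunk a peptide was originally found in.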
Hi Michael,
When using Sage with very large search spaces (e.g. many PTMs, non-specific digestion, etc.), the memory consumption frequently goes beyond the available resources on standard workstations. In such scenarios, MSFragger partitions the search space and iteratively processes them.
I was wondering whether similar functionality would be possible to implement in Sage. For example, a "batch size" parameter could be manually set (or estimated based on available memory) to partition the search space. I think there are several options for how this could be implemented. One option could be to partition candidate peptide precursors by precursor m/z, with a separate partition for each m/z range. For DIA, the partitions could correspond to the precursor isolation windows; for DDA, it might make sense to just select the range according to batch size. The main search algorithm could then iterate over the partitions for scoring, and the results from the individual partitions would be assembled before ML and statistical validation. The search space could be generated per partition on the fly and kept in memory, or alternatively exported to disk (similar to how MSFragger does it).
What do you think about these options? Would there be a preferred solution?
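As a rough sketch of the batch-size idea for DDA: candidate masses are sorted and cut into consecutive batches, and each spectrum is only scored against batches whose mass range overlaps its precursor tolerance window. The function names and the Dalton-based tolerance are assumptions made for illustration, not existing Sage parameters.

```python
def make_batches(sorted_masses, batch_size):
    """Split sorted candidate masses into consecutive (low, high) mass ranges."""
    n = len(sorted_masses)
    return [
        (sorted_masses[i], sorted_masses[min(i + batch_size, n) - 1])
        for i in range(0, n, batch_size)
    ]


def batches_for_precursor(batches, mass, tol_da):
    """Indices of batches whose range overlaps [mass - tol_da, mass + tol_da]."""
    return [
        i for i, (low, high) in enumerate(batches)
        if low <= mass + tol_da and high >= mass - tol_da
    ]


masses = [500.0, 600.0, 700.0, 800.0, 900.0]
batches = make_batches(masses, batch_size=2)
# A wide tolerance window can straddle a batch boundary, so the spectrum
# must be scored against both neighbouring batches and the hits merged.
hits = batches_for_precursor(batches, mass=650.0, tol_da=60.0)
```

Only one batch needs to be in memory at a time, at the cost of revisiting boundary spectra in more than one pass.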
Best regards,
George