Preserve Raw ColabFold MSA Output Files #35
Summary
These changes preserve the raw ColabFold MSA output files (`.a3m` files) when the user disables `cleanup_msa_dir`. Previously, the `raw` directory containing the batch MSA files was always deleted, regardless of the `cleanup_msa_dir` setting, which prevented users from accessing the raw ColabFold MSA data for inspection or reuse.

The pipeline sends batched requests to ColabFold, each containing multiple sequences. This PR separates the resulting files by the unique sequence identifier `rep_id` and stores them in a new `raw_colabfold_output` folder, which persists between runs.

Changes
- `_parse_a3m_file_by_m()` function: utility function that parses batch A3M files and extracts individual MSA sections by M value (ColabFold's internal sequence identifier). This function handles:
  - the null bytes (`\x00`) that separate M sections in batch files
- `_organize_raw_main_outputs_by_query()` method: extracts individual MSA sections from batch A3M files and saves them in a persistent directory, `raw_colabfold_output`:
  - `raw_colabfold_output/{rep_id}/{filename}.a3m`
  - `uniref.a3m` and `bfd.mgnify30.metaeuk30.smag30.a3m` for each `rep_id`
- `class TestParseA3mFileByM`: contains a few pytest tests that check `_parse_a3m_file_by_m()` functionality.

Closes #31
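
For reviewers unfamiliar with the batch format, the splitting and per-`rep_id` layout described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the helper names `split_a3m_by_m` and `write_sections_by_rep_id` and the `m_to_rep_id` mapping are hypothetical. It assumes each M section opens with a header line of the form `>M` (an integer) and that later sections are preceded by a `\x00` byte.

```python
from pathlib import Path
from typing import Dict, List


def split_a3m_by_m(text: str) -> Dict[int, str]:
    """Split batch A3M text into one A3M string per M value.

    Hypothetical helper. Assumes a section starts at the first header
    (">M", with M an integer) that follows a null byte or the file start.
    """
    sections: Dict[int, List[str]] = {}
    current_m = None
    expect_new_m = True  # the first header in the file opens a section
    for line in text.split("\n"):
        if "\x00" in line:
            # A null byte marks the boundary to the next M section.
            line = line.replace("\x00", "")
            expect_new_m = True
        if not line:
            continue
        if line.startswith(">") and expect_new_m:
            current_m = int(line[1:].split()[0])
            sections.setdefault(current_m, [])
            expect_new_m = False
        if current_m is not None:
            sections[current_m].append(line)
    return {m: "\n".join(lines) + "\n" for m, lines in sections.items()}


def write_sections_by_rep_id(
    sections: Dict[int, str],
    m_to_rep_id: Dict[int, str],  # assumed mapping, built by the caller
    out_root: str,
    filename: str,
) -> List[Path]:
    """Write each section to {out_root}/{rep_id}/{filename}."""
    written = []
    for m, content in sections.items():
        rep_id = m_to_rep_id.get(m)
        if rep_id is None:
            continue  # skip sections that have no known query
        target = Path(out_root) / rep_id / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)
        written.append(target)
    return written
```

With a layout like this, calling `write_sections_by_rep_id(sections, m_to_rep_id, "raw_colabfold_output", "uniref.a3m")` would place one `uniref.a3m` under each `rep_id` directory.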