
@qurat-ul-ain95

Summary

These changes preserve the raw ColabFold MSA output files (.a3m files) when the user disables cleanup_msa_dir. Previously, the raw directory containing the batch MSA files was always deleted, regardless of the cleanup_msa_dir setting, which prevented users from accessing the raw ColabFold MSA data for inspection or reuse.

The pipeline sends batched requests containing multiple sequences to ColabFold. This PR splits the resulting files by the unique sequence identifier rep_id and stores them in a new raw_colabfold_output folder, which persists between runs.
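For readers who want a concrete picture of the per-rep_id storage, here is a minimal sketch. It is not the code in this PR: save_sections_by_rep_id is a hypothetical helper, and it assumes the A3M text for each sequence has already been extracted and keyed by rep_id. It only illustrates the raw_colabfold_output/{rep_id}/{filename} layout.

```python
from pathlib import Path
from typing import Dict, List


def save_sections_by_rep_id(
    sections: Dict[str, str], out_root: Path, filename: str
) -> List[Path]:
    """Write one A3M file per rep_id under out_root/{rep_id}/{filename}.

    Hypothetical helper for illustration; `sections` maps rep_id -> A3M text.
    """
    written = []
    for rep_id, a3m_text in sections.items():
        target_dir = out_root / rep_id
        # exist_ok=True lets repeated runs reuse the same directory
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / filename
        target.write_text(a3m_text)
        written.append(target)
    return written
```

Calling it once per output type (for example with filename "uniref.a3m") would yield one directory per rep_id that survives later cleanup of the batch directory.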

Changes

  1. _parse_a3m_file_by_m() function: Utility function that parses batch A3M files and extracts individual MSA sections by M value (ColabFold's internal sequence identifier). This function handles:

    • Null byte delimiters (\x00) that separate M sections in batch files
    • UniRef headers vs. M value headers
    • Optional filtering by M values
  2. _organize_raw_main_outputs_by_query() method: Extracts individual MSA sections from batch A3M files and saves them in a persistent directory raw_colabfold_output:

    • Directory structure: raw_colabfold_output/{rep_id}/{filename}.a3m
    • Files preserved: uniref.a3m and bfd.mgnify30.metaeuk30.smag30.a3m for each rep_id
    • Called after: Main MSA processing completes, before cleanup
  3. class TestParseA3mFileByM: Contains a few pytest tests that check the behavior of _parse_a3m_file_by_m().
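As a rough illustration of the parsing described in item 1, the sketch below splits batch A3M text on null bytes and keys each section by its numeric M header. This is an assumption-laden sketch, not the PR's _parse_a3m_file_by_m(): the name parse_a3m_by_m, its signature, and the rule that a section starts with a purely numeric header such as >101 are all chosen here for illustration.

```python
import re
from typing import Dict, Iterable, Optional


def parse_a3m_by_m(
    text: str, m_values: Optional[Iterable[int]] = None
) -> Dict[int, str]:
    """Split batch A3M text into {M value: A3M section text}.

    Hypothetical sketch: sections are assumed to be separated by null
    bytes and to begin with a numeric header line such as ">101".
    """
    wanted = None if m_values is None else set(m_values)
    sections: Dict[int, str] = {}
    for chunk in text.split("\x00"):
        chunk = chunk.strip()
        if not chunk:
            continue
        header = chunk.splitlines()[0]
        # Numeric headers are M values; headers like ">UniRef100_..." are not
        match = re.fullmatch(r">(\d+)\s*", header)
        if match is None:
            continue
        m = int(match.group(1))
        # Optional filtering by M value
        if wanted is None or m in wanted:
            sections[m] = chunk + "\n"
    return sections
```

UniRef headers inside a section are left untouched because only the first line of each null-delimited chunk is inspected for an M value.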

Closes #31


@jnwei jnwei left a comment

Hi @qurat-ul-ain95

First, thank you for the detailed investigation and the suggested improvements to the ColabFold pipeline. We really value your contributions to making the ColabFold MSA pipeline more reusable for other applications.

Overall, the PR looks great with very thorough test examples and good documentation.

I have one general question: It looks like the goal of these changes is to save the alignments as individual a3m files, one per query. The end goal would be to reuse these alignments for future OpenFold3 predictions, or as alignments for other applications. Is this correct?

If that is the case, I believe the following runner.yml settings will save the alignments as a3m files, one per sequence, labeled by the rep_id, which is the hash of the query sequence.

msa_computation_settings:
  msa_file_format: a3m  # npz by default
  cleanup_msa_dir: false
  msa_output_directory: /path/to/msas

The code that handles processing of colabfold MSAs into a3m files can be found here. I think this has some similar functionality to the added _organize_raw_main_outputs_by_query?

I do like the refactoring in this PR and I think it makes sense to use these changes in some of the colabfold parsing functions. @gnikolenyi what do you think?

def test_parse_multiple_m_values(self, tmp_path):
    """Test parsing a3m file with multiple M values separated by null bytes."""
    # Real a3m files have null bytes (\x00) before new M value headers
    a3m_content = ">101\nSEQUENCE1\n>UniRef100_A0A123\nMATCH1\n\x00>102\nSEQUENCE2\n>UniRef100_B0B456\nMATCH2\n\x00>103\nSEQUENCE3\n>UniRef100_C0C789\nMATCH3\n"

nit: Consider adding textwrap.dedent here and in other test examples for easier readability
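One possible shape for this, as a sketch only: the helper name build_a3m_content is hypothetical, and the point is just that textwrap.dedent lets the fixture be written one A3M line per source line while producing the same string as the single-line literal above. The \x00 escapes stay inside a regular (non-raw) triple-quoted string, so they still become null bytes.

```python
import textwrap


def build_a3m_content() -> str:
    """Return the multi-M fixture with one A3M line per source line."""
    # The backslash after the opening quotes suppresses the leading newline;
    # dedent removes the common four-space margin from every line.
    return textwrap.dedent(
        """\
        >101
        SEQUENCE1
        >UniRef100_A0A123
        MATCH1
        \x00>102
        SEQUENCE2
        >UniRef100_B0B456
        MATCH2
        \x00>103
        SEQUENCE3
        >UniRef100_C0C789
        MATCH3
        """
    )
```

The null byte counts as a non-whitespace character, so it does not change the margin that dedent strips.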

Development

Successfully merging this pull request may close these issues.

raw colabfold outputs always deleted even if cleanup_msa_dir is disabled