Preserve Raw ColabFold MSA Output Files #35

qurat-ul-ain95 · 2025-11-12T10:26:22Z

Summary

These changes preserve the raw ColabFold MSA output files (.a3m files) when the user disables cleanup_msa_dir. Previously, the raw directory containing the batch MSA files was always deleted, regardless of the cleanup_msa_dir setting, which prevented users from accessing the raw ColabFold MSA data for inspection or reuse.

The pipeline sends batched requests to ColabFold, each containing multiple sequences. The change suggested in this PR splits the resulting batch files by the unique sequence identifier rep_id and stores them in a new raw_colabfold_output folder, which persists between runs.

Changes

_parse_a3m_file_by_m() function: utility function that parses batch A3M files and extracts individual MSA sections by M value (ColabFold's internal sequence identifier). This function handles:
- Null byte delimiters (\x00) that separate M sections in batch files
- UniRef headers vs. M value headers
- Optional filtering by M values

_organize_raw_main_outputs_by_query() method: extracts individual MSA sections from batch A3M files and saves them in a persistent directory, raw_colabfold_output:
- Directory structure: raw_colabfold_output/{rep_id}/{filename}.a3m
- Files preserved: uniref.a3m and bfd.mgnify30.metaeuk30.smag30.a3m for each rep_id
- Called after: Main MSA processing completes, before cleanup

class TestParseA3mFileByM: contains pytest tests that check the _parse_a3m_file_by_m() functionality.
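
For readers unfamiliar with the batch format, the splitting and per-rep_id layout described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names parse_a3m_by_m, write_sections_by_rep_id, and the m_to_rep_id mapping are hypothetical, and it assumes that each null-byte-delimited section starts with a numeric M header.

```python
import tempfile
from pathlib import Path


def parse_a3m_by_m(text, m_values=None):
    """Split a batch A3M string into {M value: section text}.

    Hypothetical helper: assumes sections are delimited by null bytes and
    that each section begins with a numeric header such as ">101".
    """
    wanted = None if m_values is None else set(m_values)
    sections = {}
    for chunk in text.split("\x00"):
        chunk = chunk.strip()
        if not chunk.startswith(">"):
            continue  # empty chunk or content without a header
        parts = chunk.splitlines()[0][1:].split()
        # A UniRef header (e.g. ">UniRef100_A0A123") is not numeric, so it
        # cannot be mistaken for an M value header.
        if not parts or not (parts[0].isascii() and parts[0].isdigit()):
            continue
        m = int(parts[0])
        if wanted is None or m in wanted:
            sections[m] = chunk + "\n"
    return sections


def write_sections_by_rep_id(sections, m_to_rep_id, out_dir, filename):
    """Write each section to out_dir/{rep_id}/{filename}.

    m_to_rep_id is an assumed mapping from M value to rep_id; sections
    without a known rep_id are skipped.
    """
    written = []
    for m, section in sections.items():
        rep_id = m_to_rep_id.get(m)
        if rep_id is None:
            continue
        target = Path(out_dir) / rep_id / filename
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(section)
        written.append(target)
    return written


# Small demonstration with two M sections separated by a null byte.
batch = (
    ">101\nSEQUENCE1\n>UniRef100_A0A123\nMATCH1\n"
    "\x00>102\nSEQUENCE2\n>UniRef100_B0B456\nMATCH2\n"
)
sections = parse_a3m_by_m(batch)

with tempfile.TemporaryDirectory() as tmp:
    paths = write_sections_by_rep_id(
        sections,
        {101: "rep_a", 102: "rep_b"},  # made-up rep_ids for illustration
        Path(tmp) / "raw_colabfold_output",
        "uniref.a3m",
    )
    relative = sorted(p.relative_to(tmp).as_posix() for p in paths)
```

Keying the output directories by rep_id rather than by M value is what makes the files reusable across runs, since M values are only meaningful within a single batch request.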

Closes #31

jnwei

Hi @qurat-ul-ain95

First, thank you for the detailed investigations and suggested improvements to the colabfold pipeline. We really value your contributions to making the colabfold msa pipeline more reusable for other applications.

Overall, the PR looks great with very thorough test examples and good documentation.

I have one general question: It looks like the goal of these changes is to save the alignments as individual a3m files, one per query. The end goal would be to reuse these alignments for future OpenFold3 predictions, or as alignments for other applications. Is this correct?

If that is the case, I believe the following runner.yml settings will save the alignments as a3m files, one per sequence, labeled by the rep_id, which is the hash of the query sequence.

msa_computation_settings:
  msa_file_format: a3m  # npz by default
  cleanup_msa_dir: false
  msa_output_directory: /path/to/msas

The code that handles processing of colabfold MSAs into a3m files can be found here. I think this has some similar functionality to the added _organize_raw_main_outputs_by_query?

I do like the refactoring in this PR and I think it makes sense to use these changes in some of the colabfold parsing functions. @gnikolenyi what do you think?

jnwei · 2025-11-14T07:51:28Z

openfold3/tests/test_colabfold_msa.py

+    def test_parse_multiple_m_values(self, tmp_path):
+        """Test parsing a3m file with multiple M values separated by null bytes."""
+        # Real a3m files have null bytes (\x00) before new M value headers
+        a3m_content = ">101\nSEQUENCE1\n>UniRef100_A0A123\nMATCH1\n\x00>102\nSEQUENCE2\n>UniRef100_B0B456\nMATCH2\n\x00>103\nSEQUENCE3\n>UniRef100_C0C789\nMATCH3\n"


nit: Consider adding textwrap.dedent here and in other test examples for easier readability
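
As a rough illustration of this suggestion (a sketch only, with variable names chosen here for the comparison), the same fixture can be written as an indented multi-line string and passed through textwrap.dedent. The \x00 escapes remain at the start of the lines that open a new M section, and the result is identical to the single-line literal:

```python
import textwrap

# Single-line form, as in the test under review.
a3m_inline = ">101\nSEQUENCE1\n>UniRef100_A0A123\nMATCH1\n\x00>102\nSEQUENCE2\n>UniRef100_B0B456\nMATCH2\n\x00>103\nSEQUENCE3\n>UniRef100_C0C789\nMATCH3\n"

# Same content, one A3M line per source line. The backslash after the
# opening quotes suppresses the leading newline; dedent removes the
# common indentation.
a3m_dedented = textwrap.dedent(
    """\
    >101
    SEQUENCE1
    >UniRef100_A0A123
    MATCH1
    \x00>102
    SEQUENCE2
    >UniRef100_B0B456
    MATCH2
    \x00>103
    SEQUENCE3
    >UniRef100_C0C789
    MATCH3
    """
)
```

Because the null byte is not whitespace, dedent treats it as ordinary content and only strips the shared indentation in front of it.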

save raw colabfold msas

8f4a508

jnwei mentioned this pull request Nov 14, 2025

Always delete raw Colabfold folder before processing #39

Open

jnwei reviewed Nov 14, 2025
