Bulk Download of MSA Files from AlphaFold DB #1111

@hughplay

Description:

I'm trying to download Multiple Sequence Alignment (MSA) files for several million proteins from the AlphaFold database (https://alphafold.ebi.ac.uk/).

For individual proteins, I can successfully download MSA files through the "Download files" section on the protein entry pages (e.g., https://alphafold.ebi.ac.uk/files/msa/AF-G1JSI4-F1-msa_v6.a3m). However, I need to download MSA files at scale using a list of protein IDs.

I've explored several options, but each has limitations:

  • Direct API-style downloads using the individual protein links - This appears to work for single files, but I'm concerned about potential rate limiting when scaling to millions of requests. I couldn't find documentation about API rate limits or bulk download policies.
  • Google Cloud bucket - The available data appears to be limited to version v4 and doesn't include MSA files.
  • EBI FTP server (https://ftp.ebi.ac.uk/pub/databases/alphafold/) - While the changelog mentions MSA updates, I couldn't locate the actual MSA files in the directory structure.
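
For reference, the first option (per-protein downloads) could be sketched roughly as follows. This is only a minimal sketch under assumptions, not a recommended or documented method: the URL pattern is inferred from the single example link above, the `F1` fragment and `v6` version are assumed to apply to every entry, and the one-second delay and retry count are arbitrary placeholders because I could not find a documented rate limit. The helper names (`msa_url`, `fetch`, `download_msas`) are hypothetical.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path

# Assumption: base path inferred from the one example link in this issue.
BASE = "https://alphafold.ebi.ac.uk/files/msa"


def msa_url(accession, fragment=1, version=6):
    """Build the MSA URL for an accession (pattern is an assumption)."""
    return f"{BASE}/AF-{accession}-F{fragment}-msa_v{version}.a3m"


def fetch(url, timeout=30):
    """Download one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_msas(accessions, out_dir, fetcher=fetch, delay=1.0, retries=3):
    """Download MSAs one at a time, pausing between requests.

    Files that already exist are skipped so an interrupted run can resume.
    Returns the accessions that still failed after all retries.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for acc in accessions:
        target = out / f"{acc}.a3m"
        if target.exists():
            continue  # already downloaded in an earlier run
        for attempt in range(retries):
            try:
                data = fetcher(msa_url(acc))
                target.write_bytes(data)
                break
            except (urllib.error.URLError, OSError):
                # Back off exponentially before retrying (placeholder policy).
                time.sleep(delay * 2 ** attempt)
        else:
            failed.append(acc)
        time.sleep(delay)  # placeholder throttle; the real limit is unknown
    return failed
```

Calling `download_msas(["G1JSI4"], "msas")` would write `msas/G1JSI4.a3m` and return an empty list on success. At one request per second, several million proteins would take weeks, which is why I am asking whether a bulk route exists.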

Questions:

  • What is the recommended approach for bulk downloading MSA files given a list of protein IDs?
  • Are there any rate limits or best practices I should follow when making large numbers of requests to the individual download endpoints?

Thank you for your assistance and for maintaining this valuable resource!
