Description:
I'm trying to download Multiple Sequence Alignment (MSA) files for several million proteins from the AlphaFold database (https://alphafold.ebi.ac.uk/).
For individual proteins, I can successfully download MSA files through the "Download files" section on the protein entry pages (e.g., https://alphafold.ebi.ac.uk/files/msa/AF-G1JSI4-F1-msa_v6.a3m). However, I need to download MSA files at scale using a list of protein IDs.
I've explored several options but encountered limitations:
- Direct API-style downloads using the individual protein links: this works for single files, but I'm concerned about rate limiting when scaling to millions of requests. I couldn't find any documentation on API rate limits or bulk download policies.
- Google Cloud bucket: the available data appears to be limited to version v4 and doesn't include MSA files.
- EBI FTP server (https://ftp.ebi.ac.uk/pub/databases/alphafold/): the changelog mentions MSA updates, but I couldn't locate the actual MSA files in the directory structure.
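For reference, here is a minimal sketch of what the first option would look like when scaled to a list of IDs. It is only an illustration under several assumptions: the URL pattern is copied from the single-file link above, and I am assuming that `v6` and the `F1` fragment apply to every entry; the 0.5 s pause and the retry count are arbitrary placeholders, not documented limits; and the function names (`msa_url`, `download_msas`) are my own.

```python
import time
import urllib.error
import urllib.request
from pathlib import Path

# Pattern copied from the single-file link above. The "v6" version and the
# "F1" fragment are assumptions and may not hold for every entry.
BASE_URL = "https://alphafold.ebi.ac.uk/files/msa"


def msa_url(accession: str, version: str = "v6", fragment: int = 1) -> str:
    """Build the per-protein MSA URL for a UniProt accession."""
    return f"{BASE_URL}/AF-{accession}-F{fragment}-msa_{version}.a3m"


def fetch(url: str, timeout: float = 30.0) -> bytes:
    """Download one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def download_msas(accessions, out_dir, fetch_fn=fetch, delay=0.5,
                  max_retries=3, sleep_fn=time.sleep):
    """Download MSAs one at a time, skipping files that already exist.

    Returns the list of accessions that could not be downloaded.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    failed = []
    for accession in accessions:
        target = out_path / f"{accession}.a3m"
        if target.exists():  # lets an interrupted job resume
            continue
        for attempt in range(max_retries):
            try:
                target.write_bytes(fetch_fn(msa_url(accession)))
                break
            except urllib.error.HTTPError as err:
                if err.code not in (429, 500, 502, 503, 504):
                    failed.append(accession)  # e.g. 404: no MSA for this ID
                    break
                sleep_fn(delay * 2 ** (attempt + 1))  # exponential backoff
            except urllib.error.URLError:
                sleep_fn(delay * 2 ** (attempt + 1))  # network error: retry
        else:
            failed.append(accession)  # all retries exhausted
        sleep_fn(delay)  # fixed pause between proteins (placeholder value)
    return failed
```

Called as `download_msas(["G1JSI4"], "msa_files")`, this fetches one file at a time and returns the IDs that failed. At several million proteins, even a sequential loop like this would take weeks, which is why I would prefer an officially supported bulk route.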
Questions:
- What is the recommended approach for bulk downloading MSA files given a list of protein IDs?
- Are there any rate limits or best practices I should follow when making large numbers of requests to the individual download endpoints?
Thank you for your assistance and for maintaining this valuable resource!