Notes on typical setup pipelines

Based on an initial conversation at MolSSI 8 Oct 2016.

missing loops
incomplete residues
cofactors: keep or discard? where to get parameters?
ions: keep or substitute?
crystal waters: keep or throw away?
crystal contacts, domain swapping
Read the PDB paper to ensure that this is the protein structure you want to use
Note that assay conditions may differ from crystallographic conditions
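A first pass over several of the items above (missing residues, waters, ions/cofactors, multiple chains, disulfides) can be automated by scanning the PDB file. The sketch below is a minimal illustration using only the standard library and the fixed-column PDB format. The function name summarize_pdb and the REMARK 465 row heuristic are assumptions made for this example; insertion codes, alternate locations, and mmCIF input are not handled.

```python
from collections import Counter

WATER_NAMES = {"HOH", "WAT", "DOD"}  # common water residue names


def summarize_pdb(pdb_text):
    """Collect records relevant to the setup checklist from PDB-format text."""
    missing = []          # (resname, chain, resseq) rows from REMARK 465
    chains = set()        # chain IDs seen in ATOM records
    het_residues = set()  # unique (resname, chain, resseq) from HETATM records
    n_ssbond = 0
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "REMARK" and line[6:10].strip() == "465":
            fields = line[10:].split()
            # Heuristic (assumption): data rows end in an integer residue
            # number, while the explanatory header rows do not.
            if len(fields) >= 3 and fields[-1].lstrip("-").isdigit():
                missing.append(tuple(fields[-3:]))
        elif record == "SSBOND":
            n_ssbond += 1
        elif record in ("ATOM", "HETATM") and len(line) >= 26:
            resname = line[17:20].strip()
            chain = line[21]
            resseq = line[22:26].strip()
            if record == "ATOM":
                chains.add(chain)
            else:
                het_residues.add((resname, chain, resseq))
    het_counts = Counter(name for name, _, _ in het_residues)
    n_waters = sum(het_counts.pop(name, 0) for name in WATER_NAMES)
    return {
        "missing_residues": missing,
        "chains": sorted(chains),
        "waters": n_waters,
        "hetero_groups": dict(het_counts),  # ions, cofactors, ligands
        "ssbonds": n_ssbond,
    }
```

Calling summarize_pdb on the text of a downloaded entry gives a quick inventory to compare against the PDB paper. The keep/discard decisions themselves still need a human.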

Decide which structure you want to use
Decide which chain to use if multiple copies
Revert mutations or simulate a different construct
Add disulfide bonds if not in a reducing environment
Address PTMs
Julien typically uses the Maestro Protein Prep Wizard to:
- add missing loops (up to a certain length)
- add N/C-termini? Most people omit these
- assign protonation states for desired pH
- keep crystal waters; add hydrogens
- interactively check histidine
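Why histidine gets checked by hand: its side-chain pKa sits close to typical assay pH, so neither protonation state is negligible. A minimal sketch using the Henderson-Hasselbalch relation follows. The pKa of 6.5 is an assumed nominal solution value used only for illustration; pKa values inside a protein can shift substantially, which is what tools such as PROPKA try to estimate.

```python
def fraction_protonated(pka, ph):
    """Henderson-Hasselbalch: fraction of a titratable site in its protonated form."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


# Assumed nominal histidine side-chain pKa (illustrative, not a measured value)
HIS_PKA = 6.5

for ph in (6.0, 7.0, 7.4, 8.0):
    print(f"pH {ph}: {fraction_protonated(HIS_PKA, ph):.2f} protonated")
```

With these numbers roughly one histidine in nine would be doubly protonated at pH 7.4, so the local environment (hydrogen-bond partners, nearby metals) should decide each case.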
Structural metal ions (e.g. Zn2+, Ca2+):
- decide whether to retain
- substitute with multisite models (alternatives: covalently bonded (harmonically restrained); single-site LJ)
Ligands and cofactors:
- pick protonation state / tautomer
- find or create parameters
- covalently bound cofactors?
- Consult Uppsala EDS to verify that ligand density justifies binding mode
- model in rest of ligand or replace the ligand with another one (CCSD? swap from other PDB file? OpenEye)
Protonation states?
- (PROPKA? 3.1 can do ligands; MCCE2?)
Counterions and solvent:
- can do in either order
- how big should the box be? what shape? what buffer should be used? (Peter Kasson uses a 20 Å buffer; Julien uses 12 Å; Oliver uses 15 Å)
- for membrane proteins, at least 3-4 layers of lipids sideways; z-axis is very tricky
- ionic strength
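The buffer and ionic-strength choices above translate into simple arithmetic. The sketch below assumes a cubic box and a hypothetical solute with a 60 Å maximum extent. It also uses the full box volume to count ions, which overestimates slightly because the solute excludes some solvent. Function names are made up for this example.

```python
AVOGADRO = 6.02214076e23  # 1/mol
LITRES_PER_A3 = 1e-27     # 1 cubic angstrom = 1e-30 m^3 = 1e-27 L


def cubic_box_edge(solute_extent_A, buffer_A):
    """Edge of a cubic box that leaves buffer_A of solvent on every side."""
    return solute_extent_A + 2.0 * buffer_A


def ion_pairs_for_concentration(volume_A3, conc_molar):
    """Number of salt ion pairs that gives conc_molar in the given volume."""
    return round(conc_molar * volume_A3 * LITRES_PER_A3 * AVOGADRO)


# Hypothetical 60 Å solute, 150 mM salt, the three buffer sizes from the notes
for buffer_A in (12.0, 15.0, 20.0):
    edge = cubic_box_edge(60.0, buffer_A)
    pairs = ion_pairs_for_concentration(edge ** 3, 0.15)
    print(f"buffer {buffer_A} Å: edge {edge} Å, {pairs} ion pairs")
```

Counterions to neutralize the solute's net charge would be added on top of these pairs. Non-cubic shapes (e.g. a rhombic dodecahedron) need a different volume formula.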

Participants:

Priorities over dependencies:

Decide on source structure data
Input: sequence(s) / biological units / ligands / assay conditions
- Important factors: Resolution, missing loops, bound ligands, sequence identity, conformation/diversity, structural bio techniques
- Solution idea: construct explorer: UniProt + domains + splice mutants + more domain knowledge, e.g. a Python dictionary: {'PTMs (3-letter code)': '1 protein', 'c1cccccc1': '1 ligand', 'NaCl': '20 mM', 'Tris': '20 mM', 'pH': 8.0}
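One possible way to make the dictionary idea above machine-checkable is a small typed specification. This is only a sketch: the class names, field names, and validation rules are assumptions for illustration, not an agreed format. The ligand string is copied from the notes and treated as an opaque identifier (it is not parsed as SMILES here).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    identifier: str                         # e.g. SMILES string or salt name
    kind: str                               # "ligand", "salt", or "buffer"
    copies: Optional[int] = None            # explicit molecule count
    concentration_mM: Optional[float] = None


@dataclass
class AssayConditions:
    ph: float
    components: List[Component] = field(default_factory=list)

    def validate(self):
        """Return a list of human-readable problems (empty if none found)."""
        problems = []
        if not 0.0 <= self.ph <= 14.0:
            problems.append(f"pH {self.ph} outside 0-14")
        for c in self.components:
            if (c.copies is None) == (c.concentration_mM is None):
                problems.append(
                    f"{c.identifier}: give exactly one of copies or concentration_mM"
                )
        return problems


# The example from the notes, expressed in this hypothetical format
conditions = AssayConditions(
    ph=8.0,
    components=[
        Component("c1cccccc1", "ligand", copies=1),
        Component("NaCl", "salt", concentration_mM=20.0),
        Component("Tris", "buffer", concentration_mM=20.0),
    ],
)
print(conditions.validate())
```

Returning a list of problems rather than raising on the first one lets an automated front end report everything that is missing in a single pass.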

Additional information needed would be, e.g., ligand expd tlc, protonation states/tautomers, or any new chemistry

Input should be generated automatically and could take the form of a topology-like object or nested lists.

Building blocks:

Clean API
Best practices, i.e. a fully automated pipeline, e.g. using XML-style input.
Questions: Should decisions based on best practices potentially allow for interactive intervention? Can a default choice be modified after running through the automated setup? What kinds of warnings and override hints should be allowed? Modularity should be allowed, with different entry levels along the workflow.
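One way the default/override question could be handled is sketched below: best-practice defaults live in one table, user overrides are merged on top, and every deviation is reported as a warning instead of being applied silently. The option names and default values are hypothetical placeholders (the 12 Å buffer is one of the values mentioned earlier in these notes), and plan_setup is an invented name.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical best-practice defaults, for illustration only
DEFAULTS = {
    "keep_crystal_waters": True,
    "build_missing_loops": True,
    "buffer_A": 12.0,
    "ph": 7.0,
}


@dataclass
class SetupPlan:
    choices: Dict[str, object]  # final value for every option
    warnings: List[str]         # one entry per default that was changed


def plan_setup(overrides=None):
    """Merge user overrides into the defaults and record what changed."""
    overrides = dict(overrides or {})
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"unknown options: {unknown}")
    warnings = [
        f"{key}: default {DEFAULTS[key]!r} overridden with {overrides[key]!r}"
        for key in sorted(overrides)
        if overrides[key] != DEFAULTS[key]
    ]
    return SetupPlan(choices={**DEFAULTS, **overrides}, warnings=warnings)


plan = plan_setup({"buffer_A": 20.0})
print(plan.choices["buffer_A"], plan.warnings)
```

Because the plan is built before anything is run, it can be inspected or edited interactively and then passed to the automated steps, which keeps both entry points available.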