This repository was archived by the owner on Aug 29, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathdriver.py
393 lines (296 loc) · 11.7 KB
/
driver.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
#!/usr/bin/python
#
#
# Driver script for biomolecular simulations structures setup pipeline
#
#
class Sequence():
"""
Describes the sequence of a molecule to include in the system
being prepared.
Sequence is an abstract base class. Specialised classes implement
'organic', 'protein', 'dna', 'rna' sequences.
"""
def __init__:
pass
class Structure():
"""Describes the coordinates, topology, residue and atom identifiers present
in a molecule structure.
Structure is an abstract base class. Specialised classes are
used for 'organic', 'protein', 'dna', 'rna' structures.
"""
def __init__:
pass
def coordinates:
"""Returns the coordinates of the structure"""
pass
def connectivity:
"""Returns the bond/bond-orders of the structure"""
def sequence:
"""Returns the sequence described by the structure"""
pass
class Model():
"""
Describe the sequence and structure of a molecule to include in
the system being prepared.
Model is an abstract base class. Specialised classes are
used for 'organic', 'protein', 'dna', 'rna' models.
Remark:
clarify differences between Model and Structure classes
"""
def __init__:
pass
def loadSequences([]):
"""
Input:
A list of strings that contains paths to input files
Action:
Load sequences contained in list of input files,
Code has to handle various sequence input formats
Initially we can rely on file extensions.
We store different types of sequences in different categories
e.g. first fasta file read -->
if it contains a protein, store in proteinsequence1
if it contains dna, store in dnasequence1
first smiles file read --> organicsequence1
We should also do some sanity checks (is this a valid SMILES etc..?)
classify sequence type and create sequence objects
Allowed sequence types are
organicX --> described by SMILES
proteinX --> described by IUPAC/IAB one-letter amino acid code
dnaX --> described by IUPAC/IAB one-letter base-pair code
rnaX --> described by IUPAC/IAB one-letter base-pair code
Output:
A sequences dictionnary e.g.
sequences = { "protein1" : ProteinSequenceObject ,
"organic1": OrganicSequenceObject }"""
return {}
def loadStructures([]):
"""
Input:
A list of strings that contains paths to input files
We must support a variety of file formats (PDB, mol2, ...)
Action:
Load structures contained in list of input files.
Classify structure types and create structure objects
Remark:
* structural waters, ions .... handled as 'organic' structures
Output:
A structures dictionnary e.g.
structures ={ "protein1" : ProteinStructureObject,
"protein2" : ProteinStructureObject,
"organic1" : OrganicStructureObject }
"""
return {}
def mapProteinSequences(sequences, structures):
"""Input:
A system dictionnary
Action:
Find all passed protein sequences
For each protein sequence
- create a new proteinmodel variable (ProteinStructureClass)
perform pairwise sequence alignment to sequences present in all
protein structures passed
If exact match (100% identity)
use coordinates of protein structure
if sequence is fully contained within template structure (substructure):
use coordinates of protein structure
if partial match (similarity > minimum threshold)
pick template protein sequence with highest similarity
use homology modelling to construct structures of query sequence
if no match or similarity < minimum threshold
BLAST sequence to find possible templates (need access to database)
Remark:
if no match, it may be challenging to model quarternary structures so
should consider bailing out if more than one protein sequence needs a
structure for this setup
Some protein structures may have no matching sequences. Default behavior
is to ignore structures.
Output:
A list of proteinmodelX entries
each proteinmodelX variable is a ProteinModelobject
"""
return []
def reviewProteinModels([]):
"""
Input:
A list of protein models
Action:
For each model, create a scene to render the structure of the model
Allow user to edit structure coordinates
Output:
A list of protein models
"""
return []
def mapOrganicSequences(sequences, structures):
"""
Input:
A list of sequences and structures objects
Action:
For each organic sequence in the passed sequences
create an organicmodel object
Do a fingerprints similarity calculation against all
sequences in organic structures
full match (Tanimoto 1):
use coordinates of matched sequence for model coordinates.
partial match (Tanimoto > threshold):
perform unsupervised alignment onto coordinates of matched sequence.
no match:
generate 3D coordinates from SMILES, activate flag for random
position/orientation upon embedding
Output:
A list of organicmodelX entries.
each organicmodelX variable is a OrganicModelObject
"""
return []
def reviewOrganicModels([]):
"""
Input:
A list of organic models
Action:
For each model, create a scene to render the structure of the model
Allow user to edit structure coordinates
Output:
A list of organic models
"""
return []
def embedModels(proteinmodels, organicmodels):
"""
Input:
A list of proteinmodel and organicmodel objects
Action:
Create an empty 'space' container
Add every protein and organic models that have known
'absolute' coordinates to that container
Define system bounding box
Compute bounding box of each positioned protein and organic
objects
For each remaining model lacking absolute coordinates
Compute bounding box of model
Keep randomly inserting model in a random orientation
into system bounding box, until no clashes between molecules
bounding boxes.
Remark:
Allow flexibility in bounding box geometry
Output:
A system dictionnary that contains a list of protein and organic models,
and information about the bounding box geometry
"""
return {}
def solvate(system, environment):
"""
Input:
A system dictionnary that contains models within a bounding box
An environment dictionary that defines the (co)solvents
Action:
Work out number of each cosolvent molecule to insert in system bounding box
to achieve concentrations specified by environment
Here need alg to position cosolvent molecules without significant clashes
Remark:
- We get more realistic distributions by replicating
pre-equilibrated solvent boxes
Output:
A system dictionnary with organic models for each cosolvent molecule
"""
return system
def adjustProtons(system, pH=7.0):
"""
Input:
A system dictionary containing models of molecules that
have titrable sites
A pH value
Action:
Algorithm that determines protonation states from structures
Output:
A system dictionary with updated titrable sites
"""
return system
def writeOutput(system):
"""Input:
A system dictionnary
Action:
Write the contents of the system to output file formats.
We need to output
- Molecules (connectivity, coordinates, bond orders, elements/atom types)
- Box dimensions
"""
if __name__ == '__main__':
###################################################################
# User input data
# We assume the user knows which sequences & molecules
# they want to structural models for !
# Here we pass fasta files to describe protein sequences and
# canonical smiles for organic molecules
# The implicit assumption is that one sequence = one molecule
sequenceFiles = ["humanthrombin-lightchain.fasta",
"humanthrombin-heavychain.fasta",
"3J.smi"]
# The user also specifies the environment that will surround
# the structures derived from the input sequences
# Arbitrarily complex buffers are described by adding
# multipled cosolvent entries
# We use disconnected smiles to group together components
# into electroneutral groups
# * For colsolvents that contain titrable residues (e.g. Tris below)
# we will have to come up with suitable heuristics to generate
# electroneutral boxes
# One drawback of describing water as a cosolvent is that we will not
# have geometries matching exactly a given rigid body water model that
# we may wish to use later on for a simulation.
# We are going to say for now that this is a problem for rigid body water
# forcefields, and deal with that later down the pipeline
# Note setting a pH value is meaningful for aqueous solutions only
environment = {'pH':7.4,
'cosolvent1':('O','55 M')
'cosolvent2':('[Na+].[Cl-]','50 mM'),
'cosolvent3':('OCC(N)(CO)CO','20 mM') }
# The user also specifies any structural data that should be used
# by the pipeline to generate models for the input sequences.
# Here we load a single PDB file
structureFiles = ["2ZC9.pdb"]
###################################################################
###################### Pipeline begins ############################
###################################################################
#
# The following steps are core to the pipeline
#
# We load the input sequences
sequences = loadSequences(sequenceFiles)
# We load the input structures
structures = loadStructures(structureFiles)
# Depending on the input the next steps may not be always executed.
# Multiple different implementations of each action may become available.
# Suggests need a different design than top-level module function for
# the pipeline 'Actions'
# We map protein sequences onto protein structures to make protein models
proteinmodels = mapProteinSequences(sequences, structures)
# * separate procedures could be implemented for DNA/RNA sequences
# Optional human interaction to review generated models, select
# one conformation among ambiguous solutions, or manually edit
# structures
# commented out as not needed for minimal implementation
# proteinmodels = ReviewProteinModels(proteinmodels)
# We map organic sequences onto organic structures
organicmodels = mapOrganicSequences(sequences, structures)
# Optional human interaction to review generated models, select
# one conformation among ambiguous solutions, or manually edit
# structures
# commented out as not needed for minimal implementation
# organicmodels = ReviewOrganicModels(organicmodels)
# We embedd all proteinmodels and organicmodels in a system
system = embedModels(proteinmodels, organicmodels)
# We solvate the system
system = solvate(system, environment)
# We optimise protonation states
# commented out as not needed for minimal implementation
# system = adjustProtons(system, pH=environment['pH'])
# Optional human interaction to review ambiguous protonation states
#
# The following step is core to the pipeline
#
# We write this information in files suitable as input
# to a simulation engine input prep utility
writeOutput(system)
###################################################################
###################### Pipeline ends ##############################
###################################################################