We can begin with a simplified version of the 2D HP model in the square lattice.
-
2D: real proteins fold in 3D. Alternatives (3D):
- simple cubic lattice
- body centered cubic lattice
- face centered cubic lattice
-
Square lattice: real proteins also don't always fold in 90º angles. Alternatives:
- Triangular lattice
This simplified model has only two types of aminoacids, H and P. H stands for hydrophobic and P for polar. As the name states it, the H is afraid of water. In our body, proteins are suspended in water (or mostly water), however, some of the aminoacids must not be in contact with it (they react), while others can. When the protein folds, the H aminoacids must be on the inside, that is why the score is measured by the number of H molecules that are bonded. The more H molecules are bonded, the more grouped are, and the less surface touching the outside, where it can be the nasty water.
-
Elements:
- H: Hydrophobic
- P: Hydrophili (polar)
- H-H edge
- H-P edge
- H-H bond
-
Score: Number of H-H bonds
Maximize the number of H-H bonds . Technically, the H-H edge is also a H-H bond, but they are not mutable, as we would be breaking the structure of the protein, so these are ignored.
- In this first 2D squared lattice, a typical node (except the first and the last) has 2 edge neighbors and 2 other neighbors.
- The optimal folding has the maximum score
- Considdering the string of molecules, only even and odd H's can possibly bond together, (because of the 2D squared lettuce).
- OPT <= 2·min{#even H's, #odd H's}. Explanation: + Every bond defines an even and an odd + Every even and odd can only get hit up to twice + If there is more even than odds its not going to help, as there is + way to use more even than odds (excludig first and last)
This problem is already NP-hard. To find the optimal folding of a given HP string is NP-Hard. First it was proved that 3D HP was NP-Hard, later was proved that 2D also.
- 1/3 - approximations: this approximation is based on the fact that the optimal number of H bonds is OPT <=2·min{#even H's, #odd H's}. The goal is to get 1/3 of OTP. To reach the 1/3, a set of stages are followed. Following them the chance of finding relatively quickli a 1/3 of the OPT is high.
Basic, implementation schema idea storm.
- Iteration 0.1
- A file stores a secuence of H and P letters and other parametters
- Seed parametter, to be able to track random walks
- Maximum number of steps, in order not to get stuck searching the optimum
- The program will check for all the combinations, storing the one with better score
- We will think about representation later
- First implementation will be in python
- Iteration 0.2
- Iteration 0.1 almost completed. Jumping to 0.2
- Fix different usages with arguments
- It will have a benchmark mode, this mode will give the time that has taken to make a given number of folds. To make the bench more even, the number of folds and the protein will be given.
- Iteration 1
- Iteration 1 reached. Functional and it begins to be not that slow
- There is a lot of symetry to study and filter, probably a +60% faster
searching for the absolute solution
- TODO: Write use in README and --help in fold.py
- Fix vertical excess padding when drawing
- There are a lot of symetries easily avoidables. Check out if half of them (only up to k/2+1) are necessary, it could be a massive improvement
- The hydrophobicity of a protein is a parametter heavily researched.
- The natural folding of proteins is a very unknown field, it is supposed that all information of how to fold a protein lies in its aminoacids, so there is no such thing as an Ikea guide that the cell has. So the combinatory problem is so large that it is already not know how it is thone for the cell to acquire the lowest energy configuration for that protein, wich is what makes it functional, (otherwise is considered "dead").
- To know the folding configuration maded naturally by the cells it is needed a process that can take a lot of research, money and time. To throw some light, it is needed to cristalise a proteine (wich can take years of research). From that you get one 3D image that is collected in a protein data bank.
correct output protein string HPHHPHHPPHHH