1. Conformation initialization
The starting point (input) of protein structure prediction is the one-dimensional amino acid sequence of the target protein, and the end point (output) is a model of its three-dimensional structure. The number of theoretically possible conformations for a protein sequence is almost infinite, but the native conformation of most proteins is unique. Folding a protein from its amino acid sequence alone is very difficult. First, we are still unable to construct a force field accurate enough to guide the folding of the target sequence in the right direction; second, the amount of computation involved in such a vast conformational search can easily exceed existing computing capacity.
However, there is no guarantee that a satisfactory structural template can be found for every target protein. Template-free methods are the best choice for hard targets for which no satisfactory template can be identified. The most straightforward way to generate the initial conformation of the target protein is at random, but this places a very heavy burden on the conformational search. Combined with the inadequacy of current force fields, it is extremely difficult to complete a simulation that involves such large conformational changes. In fact, the complex and multilevel nature of protein structure provides us with more choices.
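As a minimal sketch of random initialization, the following Python builds a random Cα trace in which consecutive atoms are a fixed virtual bond length apart. The 3.8 Å spacing is the commonly quoted approximate Cα-Cα distance; the function names and the example sequence are hypothetical, and the walk is not self-avoiding, so steric clashes are not prevented.

```python
import math
import random

CA_CA_DISTANCE = 3.8  # approximate C-alpha to C-alpha spacing in angstroms


def random_unit_vector(rng):
    """Draw a direction uniformly from the surface of the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return (r * math.cos(phi), r * math.sin(phi), z)


def random_ca_trace(n_residues, seed=None):
    """Build a random chain of C-alpha positions with a fixed virtual bond length."""
    rng = random.Random(seed)
    coords = [(0.0, 0.0, 0.0)]
    for _ in range(n_residues - 1):
        dx, dy, dz = random_unit_vector(rng)
        x, y, z = coords[-1]
        coords.append((x + CA_CA_DISTANCE * dx,
                       y + CA_CA_DISTANCE * dy,
                       z + CA_CA_DISTANCE * dz))
    return coords


sequence = "MKTAYIAKQR"  # hypothetical target sequence
trace = random_ca_trace(len(sequence), seed=0)
```

A chain generated this way carries no structural information at all, which is why the subsequent search has to cover so much conformational space.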
2. Conformational search
After the initial conformation is constructed, a simulation guided by a force field searches for near-native conformations step by step. As a typical biological macromolecule, a protein consists of thousands of atoms and has a huge number of conformational degrees of freedom. A simplified representation of protein conformations is therefore crucial for speeding up the simulation of the folding process. In fact, a structural template identified by sequence alignment is already a reduced conformation containing only backbone or Cα atoms: sequence alignment operates at the residue level, and where different residue types are matched, the side-chain conformations of the template cannot be transferred to the target protein. Currently, almost all structure assembly simulation methods perform the conformational search on some kind of simplified representation. For example, each residue can be represented by only its Cα atom and the virtual center of its side chain, or the entire backbone conformation can be represented by a series of dihedral angles.
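The idea of searching a simplified representation can be illustrated with a small Metropolis Monte Carlo sketch over a list of backbone dihedral angles. Everything here is an assumption made for illustration: `toy_energy` is a hypothetical stand-in for a force field (it simply rewards angles close to preset target values, which a real predictor would not know), and the temperature, move size and step count are arbitrary.

```python
import math
import random


def wrap(angle):
    """Map an angle in radians into the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


def toy_energy(angles, targets):
    """Hypothetical energy: zero when every dihedral equals its target value."""
    return sum(1.0 - math.cos(a - t) for a, t in zip(angles, targets))


def metropolis_search(targets, n_steps=5000, temperature=0.1,
                      max_move=0.5, seed=0):
    """Perturb one dihedral per step and accept with the Metropolis criterion."""
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in targets]
    energy = toy_energy(angles, targets)
    best_angles, best_energy = list(angles), energy
    for _ in range(n_steps):
        i = rng.randrange(len(angles))
        old = angles[i]
        angles[i] = wrap(old + rng.uniform(-max_move, max_move))
        new_energy = toy_energy(angles, targets)
        delta = new_energy - energy
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0.0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
            if energy < best_energy:
                best_angles, best_energy = list(angles), energy
        else:
            angles[i] = old  # reject the move and restore the previous angle
    return best_angles, best_energy


targets = [-1.0, -0.8, 2.4, 2.6, -1.1, -0.7]  # hypothetical dihedrals in radians
best_angles, best_energy = metropolis_search(targets)
```

With six angles instead of thousands of atomic coordinates, each energy evaluation is cheap, which is the practical motivation for reduced representations.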
3. Structure selection
The conformational search generates a large number of structures of the target protein. One of the unsolved issues in both molecular dynamics and Monte Carlo simulation is that conformations are often trapped in local minima. Even when the global minimum is identified, that conformation is not necessarily the one closest to the native state, because of the inadequacies of the force field. The common procedure is therefore to output low-energy intermediate structures at regular intervals during the simulation for subsequent conformational screening. The key factor in structure selection is the assessment method used to distinguish native-like structures from non-native ones. CASP has a specific prediction category for evaluating methods of structural quality assessment. It should be noted that methods for structure selection may be designed specifically for the reduced structural models that correspond to the simplified representation adopted during the conformational search. Developing methods of structural quality assessment based on a wide range of ideas and techniques is an important research direction in protein structure prediction.
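One simple way to screen the saved structures, sketched below under stated assumptions, is to pick the decoy that has the most structural neighbors and to use energy only as a tie-breaker. The sketch uses the distance RMSD (dRMSD), which compares internal distances and so needs no superposition; the cutoff, the toy decoys and their energies are hypothetical, and this is not the procedure of any particular published tool.

```python
import math
from itertools import combinations


def distance_rmsd(a, b):
    """Superposition-free dRMSD between two conformations with matching atoms."""
    pairs = list(combinations(range(len(a)), 2))
    total = sum((math.dist(a[i], a[j]) - math.dist(b[i], b[j])) ** 2
                for i, j in pairs)
    return math.sqrt(total / len(pairs))


def select_representative(decoys, energies, cutoff=1.0):
    """Return the index of the decoy with the most neighbors within the cutoff.

    Ties are broken in favor of the lower energy.
    """
    counts = []
    for i, di in enumerate(decoys):
        counts.append(sum(1 for j, dj in enumerate(decoys)
                          if i != j and distance_rmsd(di, dj) <= cutoff))
    return min(range(len(decoys)), key=lambda i: (-counts[i], energies[i]))


# Three similar extended decoys and one compact outlier (toy C-alpha traces).
decoys = [
    [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0)],
    [(0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0.5, 0)],
    [(0, 0, 0), (3.8, 0, 0), (7.6, 0.5, 0), (11.4, 0, 0)],
    [(0, 0, 0), (3.8, 0, 0), (3.8, 3.8, 0), (0, 3.8, 0)],
]
energies = [-10.0, -12.0, -11.0, -15.0]  # hypothetical; the outlier scores lowest
chosen = select_representative(decoys, energies)
```

In this toy set the outlier has the lowest energy but no neighbors, so it is not selected, which mirrors the point that the lowest-energy conformation is not necessarily the most native-like one.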
4. All-atom structure reconstruction
Since most prediction methods adopt a simplified protein representation for the conformational search, what has been obtained so far is only one or several reduced structural models. The all-atom structure has to be reconstructed from these reduced models. The reconstruction process varies considerably depending on the protein representation on which the reduced model is based. Some prediction methods represent each residue by its Cα atom plus a virtual side-chain center, where the virtual center only assists in determining the position of the Cα atom during the conformational search and the output structure contains only Cα atoms. In that case, the reconstruction is usually divided into two separate steps. The first step is to rebuild the remaining backbone atoms (C, N and O) from the positions of the Cα atoms, which is the primary function of many methods developed specifically for all-atom reconstruction, such as SABBAC, BBQ, PULCHRA and REMO. All these methods depend on backbone fragments cut from experimental structures. For example, the backbone isomer library built by REMO contains 528,798 fragments of four consecutive residues, collected from 2561 protein chains in the PDB. The second step is to rebuild the side chain of every residue.
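The fragment lookup behind this kind of backbone rebuilding can be sketched as follows. This is only an illustration of the general idea, not the algorithm of SABBAC, BBQ, PULCHRA or REMO: a window of four consecutive Cα atoms is described by three internal distances and matched to the nearest entry of a library. The two-entry library, its approximate distance values and all function names are hypothetical; a real library would also store the C, N and O coordinates to be copied onto the matched window.

```python
import math

# Hypothetical two-entry library keyed by a label. Each value is the
# descriptor (d13, d24, d14) in angstroms; the numbers are approximate,
# illustrative C-alpha distances and are not taken from any tool.
FRAGMENT_LIBRARY = {
    "helix-like": (5.4, 5.4, 5.0),
    "strand-like": (6.6, 6.6, 10.0),
}


def ca_descriptor(window):
    """Describe four consecutive C-alpha atoms by three internal distances."""
    a, b, c, d = window
    return (math.dist(a, c), math.dist(b, d), math.dist(a, d))


def closest_fragment(window, library=FRAGMENT_LIBRARY):
    """Return the label of the library entry nearest to the window descriptor."""
    desc = ca_descriptor(window)
    return min(library, key=lambda name: math.dist(desc, library[name]))


def ideal_helix(n, radius=2.3, turn_deg=100.0, rise=1.5):
    """Generate C-alpha positions on an idealized helix (assumed parameters)."""
    step = math.radians(turn_deg)
    return [(radius * math.cos(k * step), radius * math.sin(k * step), k * rise)
            for k in range(n)]


helix_window = ideal_helix(4)
extended_window = [(3.8 * k, 0.0, 0.0) for k in range(4)]

helix_match = closest_fragment(helix_window)
extended_match = closest_fragment(extended_window)
```

Because the descriptor uses only internal distances, the lookup is independent of where the window sits in space; placing the retrieved backbone atoms would additionally require a local coordinate frame, which is omitted here.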
5. Structure refinement
Although the previous steps yield a complete structure of the target protein, its quality is usually not very good, which may be due to defects in the force field, the conformational search or the all-atom reconstruction. Selecting structures by clustering may also introduce local structural problems if the cluster centroid structures are used [54]. It is therefore almost routine to refine the structure further after all-atom reconstruction. Since structural problems in the reduced model directly affect the quality of the final all-atom structure, some methods combine the procedures of all-atom reconstruction and refinement [55]. They refine the reduced model (such as the backbone structure) and the all-atom structure separately, following the reconstruction schedule.
Structure refinement also requires a force field to drive molecular dynamics or Monte Carlo simulation, but the procedure is quite different from the earlier conformational search. The aim of the conformational search in structure assembly simulations is to determine the backbone structure of the target protein, which sacrifices structural detail to keep the search efficient. The main purpose of structure refinement, by contrast, is to improve the quality of the all-atom structure (especially the local structure) while making only small changes to the backbone conformation.
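The contrast with the conformational search can be illustrated by a toy refinement that fixes local geometry while keeping the model close to where it started. The sketch below runs plain gradient descent on a hypothetical energy with two terms: a harmonic penalty on Cα-Cα virtual bond lengths and a weak positional restraint to the starting coordinates. The energy terms, weights and step size are assumptions for illustration; real refinement protocols use all-atom force fields with molecular dynamics or Monte Carlo sampling.

```python
import math


def bond_error(coords, ideal=3.8):
    """Sum of squared deviations of consecutive C-alpha distances from ideal."""
    return sum((math.dist(a, b) - ideal) ** 2
               for a, b in zip(coords, coords[1:]))


def refine(start, ideal=3.8, restraint=0.1, step=0.05, n_steps=200):
    """Gradient descent on bond_error plus a restraint to the start model."""
    coords = [list(p) for p in start]
    for _ in range(n_steps):
        # Gradient of the positional restraint: 2 * w * (r - r_start).
        grad = [[2.0 * restraint * (c - s) for c, s in zip(p, p0)]
                for p, p0 in zip(coords, start)]
        # Gradient of each harmonic virtual-bond term.
        for i in range(len(coords) - 1):
            d = math.dist(coords[i], coords[i + 1])
            for k in range(3):
                g = 2.0 * (d - ideal) * (coords[i + 1][k] - coords[i][k]) / d
                grad[i + 1][k] += g
                grad[i][k] -= g
        for p, g in zip(coords, grad):
            for k in range(3):
                p[k] -= step * g[k]
    return [tuple(p) for p in coords]


# Toy model with one stretched and one compressed virtual bond.
start = [(0.0, 0.0, 0.0), (4.4, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
refined = refine(start)
max_shift = max(math.dist(a, b) for a, b in zip(start, refined))
```

The restraint term is what keeps the backbone changes small: the bond geometry improves, but no atom is allowed to drift far from the input model.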
Two Categories of Protein Structure Prediction Methods:
1. Template-based methods
For most target proteins, a suitable structural template can be identified in the PDB by sequence alignment or threading. Since conformational information from a template is much more reliable than information from other sources (especially when the target protein and the template are highly homologous), the prediction accuracy of template-based methods is generally higher than that of other methods, which makes them highly popular in practical applications.
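Template identification by sequence alignment can be sketched with a small Needleman-Wunsch global alignment score used to rank candidate templates. The scoring values (match +1, mismatch -1, gap -2), the template names and the sequences are hypothetical; practical tools use substitution matrices, affine gap penalties and profile-based searches.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_template(target, templates):
    """Return the name of the template sequence that aligns best to the target."""
    return max(templates,
               key=lambda name: global_alignment_score(target, templates[name]))


# Hypothetical target and template sequences.
target = "MKTAYIAKQK"
templates = {
    "tmpl_A": "MKTAYIAKQR",
    "tmpl_B": "MKTAYLAKQR",
    "tmpl_C": "GGSGGSGGSG",
}
chosen_template = best_template(target, templates)
```

Only two rows of the dynamic programming table are kept because the sketch needs the score, not the alignment itself; recovering the residue-to-residue mapping used to copy template coordinates would require a traceback.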
2. Template-free methods
Currently, most structure prediction methods rely on information provided by experimental structures (most directly through structural templates), which does not help us explore and understand the underlying principles of protein folding. The development of template-free methods is driven not only by practical needs (not every target protein has a satisfactory template in the PDB), but also by the basic scientific problem of the protein folding code. Although template-free methods commonly exploit information from known structures as well, their development reflects the theoretical and technical level of protein structure prediction better than that of template-based methods.