Michael Levitt's Research Program (Jan 2000)
Michael Levitt's current research program reflects his wide range of interests and his focus on predicting protein structure by physical, computational and biological means.
Physical Simulation remains a key interest as it directly connects a complex biological system like a protein molecule to the simple underlying laws of physics and chemistry. It continues to amaze that simple pairwise interatomic forces combined with Newton's equations of motion are able to predict the complex behavior of liquid water [83], mixed solutions [87], peptides [59], proteins [45] and nucleic acids [84]. Simulation of molecular behavior requires significant computer resources but these have increased so rapidly that state-of-the-art calculations can now be done on inexpensive machines. A key lesson learnt from simulations is that many simple components interacting through very simple laws can give rise to very complicated behavior. This is true for the interactions of water molecules, which give rise to the complexities of liquid water, and also for the interactions of amino acids, which give rise to the complexities of structural and catalytic proteins. There is no reason to believe that we have reached any size limitation in the use of physical simulation. As more computer resources become available and we better understand how to compare the results of simulation with experiment, we will be able to simulate increasingly complex systems such as large proteins, protein/nucleic acid complexes, and even the ribosome with about half a million atoms.
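The idea that pairwise interatomic forces plus Newton's equations of motion suffice to drive a simulation can be made concrete with a toy example. The sketch below is only an illustration, not the group's simulation code: it assumes a Lennard-Jones pair potential in reduced units and the velocity Verlet integrator, and the function names (lj_forces, total_energy, velocity_verlet) are hypothetical.

```python
import copy

def lj_forces(positions, epsilon=1.0, sigma=1.0):
    """Return (forces, potential energy) for a pairwise Lennard-Jones system."""
    n = len(positions)
    forces = [[0.0, 0.0, 0.0] for _ in range(n)]
    potential = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            d = [positions[i][k] - positions[j][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            sr6 = (sigma * sigma / r2) ** 3
            potential += 4.0 * epsilon * (sr6 * sr6 - sr6)
            # Force on atom i is -dU/dr along the vector from j to i.
            scale = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r2
            for k in range(3):
                forces[i][k] += scale * d[k]
                forces[j][k] -= scale * d[k]  # Newton's third law
    return forces, potential

def total_energy(positions, velocities, mass=1.0):
    """Kinetic plus potential energy; should stay nearly constant over a run."""
    _, potential = lj_forces(positions)
    kinetic = 0.5 * mass * sum(c * c for v in velocities for c in v)
    return kinetic + potential

def velocity_verlet(positions, velocities, mass=1.0, dt=0.001, n_steps=1000):
    """Integrate Newton's equations; returns new (positions, velocities)."""
    pos = copy.deepcopy(positions)
    vel = copy.deepcopy(velocities)
    forces, _ = lj_forces(pos)
    for _step in range(n_steps):
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k] / mass  # half kick
                pos[i][k] += dt * vel[i][k]                   # drift
        forces, _ = lj_forces(pos)
        for i in range(len(pos)):
            for k in range(3):
                vel[i][k] += 0.5 * dt * forces[i][k] / mass  # second half kick
    return pos, vel

if __name__ == "__main__":
    # Two atoms released at rest, slightly beyond the potential minimum.
    start_pos = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0]]
    start_vel = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    end_pos, end_vel = velocity_verlet(start_pos, start_vel, n_steps=2000)
    print("energy before:", total_energy(start_pos, start_vel))
    print("energy after: ", total_energy(end_pos, end_vel))
```

The force loop visits every pair of atoms, so its cost grows with the square of the atom count. That scaling is one reason the computer resources mentioned above matter so much for larger systems.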
On a more down-to-earth level, group work in this direction takes up a third of our effort and aims to answer the following specific questions:
1. Can the aggregation of small non-polar solutes in water be quantified in terms of hydrophobic free-energy and its dependence on contact area? We use molecular dynamics [87] to simulate different concentrations of different non-polar molecules in water to determine the energetics of clustering.
2. Can we unfold and refold β-hairpin and α-helix folding units? Recent studies by a collaborator, Dr. Eaton at NIH, show that the C-terminal β-hairpin of protein G refolds on a microsecond time-scale. We are simulating this with our very efficient methods, which include explicit waters.
3. Can we refold small proteins that are found to refold easily experimentally? These proteins are characterized by a simple chain topology, which lends itself to fast folding, for example, Protein G and the SH3 domain [98]. We estimate that this refolding will require hundreds of nanoseconds of simulation (over 100,000,000 discrete timesteps each involving the calculation of all the forces on all the atoms) and have gained access to additional computational resources through collaborations at the Lawrence Livermore National Laboratory.
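The step count quoted above follows from simple arithmetic once a timestep is chosen. The sketch below assumes a 2 femtosecond timestep, a common choice in molecular dynamics that the text itself does not state; the function name timesteps_needed is hypothetical.

```python
def timesteps_needed(simulated_ns, timestep_fs=2.0):
    """Number of discrete integration steps for a given simulated time.

    timestep_fs is an assumed value; the source does not give the timestep.
    """
    fs_per_ns = 1_000_000  # 1 nanosecond = 10^6 femtoseconds
    return round(simulated_ns * fs_per_ns / timestep_fs)

if __name__ == "__main__":
    # 200 ns at 2 fs per step gives 100,000,000 steps.
    print(timesteps_needed(200))  # 100000000
```

Under this assumption, any run longer than 200 ns exceeds one hundred million steps, each of which requires evaluating the forces on all atoms.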
Heuristic Structure Prediction is the natural complement to physical simulation and it is surprising that there are so few practitioners of both. In structure prediction, the aim is to calculate the conformation of the molecular system that is biologically relevant; we are not interested in the folding pathway, the rates of conformational transitions or dynamical fluctuations. The crucial assumption is that the biologically relevant (or native) state is special and can be recognized against a background of many non-native states. This is true for proteins, whose native three-dimensional structures "look" so beautifully organized, with many well-formed hydrogen bonds nicely balanced by hydrophobic interactions. In our approach to this problem, we rely heavily on methods developed in applied mathematics and computer science such as exhaustive enumeration, numerical optimization and mean-field approximation.
The basic approach [80] is simple and consists of two separable steps: (1) Generate a large number of protein folds that hopefully includes some structures that are close to the native protein structure; (2) Use a discrimination function, such as the free energy, to pick out those conformations that are most like the native state. The approach has two obvious shortcomings. First, there is an astronomically large number of arrangements of the polypeptide chain and, second, the free energy is a notoriously difficult quantity to calculate accurately. We get around both problems by choosing an appropriate level of simplification. The representation of a protein is simplified so that the number of possible arrangements is manageable (about a billion for today's computers). The discrimination functions are derived empirically from the interactions commonly observed in known three-dimensional structures. More specifically, we aim to answer the following questions:
1. Can we generate non-native decoy folds of proteins that are so well-packed that they look like real native structures? Our current approach is based on our previous work with lattice [60] and off-lattice models [73], side-chain placement [54,61] and energy refinement [1,6]; it allows us to generate tens of thousands of all-atom structures using just the amino acid sequence.
2. Can energy functions derived from selected sets of protein structures correctly distinguish near-native structures from decoys? We are continuing to optimize knowledge-based energy functions by testing them on our existing decoy sets and then improving decoy generation to make such discrimination more difficult. This feedback has induced healthy competition within the group.
3. Can we fold some small proteins using just the amino acid sequence? We have combined predicted secondary structures with decoy generation and knowledge-based potentials in ongoing attempts to correctly predict folded structures before they are solved experimentally. More recently we predicted structures for all eleven CASP3 (Critical Assessment of Structure Prediction, 3rd Meeting) sequences that showed no homology to any known protein structure (ab initio prediction) and are very encouraged by the results [97].
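The two-step scheme described above (generate many simplified folds, then rank them with a discrimination function) can be illustrated with a deliberately tiny model. The sketch below is not the group's lattice model or energy function. It assumes a two-dimensional square lattice, a two-letter hydrophobic/polar (H/P) alphabet, and a toy score of -1 for each non-bonded H-H contact; all function names are hypothetical.

```python
MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]

def enumerate_folds(n_residues):
    """Step 1: yield every self-avoiding chain of n_residues lattice sites."""
    def extend(path, occupied):
        if len(path) == n_residues:
            yield list(path)
            return
        x, y = path[-1]
        for dx, dy in MOVES:
            nxt = (x + dx, y + dy)
            if nxt not in occupied:  # no two residues may share a site
                path.append(nxt)
                occupied.add(nxt)
                yield from extend(path, occupied)
                path.pop()
                occupied.remove(nxt)
    start = (0, 0)
    yield from extend([start], {start})

def contact_energy(sequence, fold):
    """Step 2: toy discrimination function, -1 per non-bonded H-H contact."""
    index_at = {site: i for i, site in enumerate(fold)}
    energy = 0
    for i, (x, y) in enumerate(fold):
        if sequence[i] != "H":
            continue
        for dx, dy in MOVES:
            j = index_at.get((x + dx, y + dy))
            # j > i + 1 counts each pair once and skips chain neighbours.
            if j is not None and j > i + 1 and sequence[j] == "H":
                energy -= 1
    return energy

def best_folds(sequence, keep=5):
    """Score every enumerated fold and return the lowest-energy ones."""
    scored = [(contact_energy(sequence, fold), fold)
              for fold in enumerate_folds(len(sequence))]
    scored.sort(key=lambda item: item[0])
    return scored[:keep]

if __name__ == "__main__":
    for energy, fold in best_folds("HPHPPHHPHH", keep=3):
        print(energy, fold)
```

The number of enumerated folds grows roughly exponentially with chain length, which is why the level of simplification has to be chosen so that the total stays manageable. This enumeration also counts rotated and mirrored copies of each fold separately; a more careful version would remove them by symmetry.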
Besides ab initio structure prediction, our heuristic approach also covers modeling proteins that are homologous to another protein of known three-dimensional structure. Once the sequence to be modeled is aligned to the template structure, our method, known as SegMod [61], will reliably model the unknown structure. This method, also tested in the CASP3 blind structure prediction contest, gave some of the best geometries for internal side chains [96]. In the area of homology structure prediction, we aim to answer the following specific questions:
1. How can we recognize the best template from which to build a model? By modeling large numbers of globin and immunoglobulin structures we are testing the accuracy of the methods.
2. How are the best loops generated? Even when proteins are very similar in sequence, there are short regions that can differ significantly. These segments are generally loops and we are testing a number of different loop modeling algorithms on the H3 hypervariable loop of antibodies as well as on the variable loops of the HIV gp120 protein. In many ways, loop modeling is like simplified ab initio folding and fits well into our general scheme.
Structural Bio-Informatics is the third part of our current research and is characterized by comparisons of both sequences and structures, classification into databases and presentation of the results via sophisticated web sites. This is the newest aspect of my work and I have been helped by close collaborations with two gifted post-doctoral associates in my group: Dr. Steven Brenner, author of the SCOP protein structure database, and Dr. Golan Yona, author of the ProtoMap protein sequence database. I have also learned a great deal through my consulting relationship with Molecular Applications Group (a company I founded in 1990) with its expertise in database management systems, detection of distant homology and analysis of gene expression data. Specific current projects include:
1. The Presage database [95], designed with Brenner, aids structural genomics by acting as a central repository for experimental and theoretical work on the structure of proteins produced by cloning genomic DNA. It provides a useful community service and is supported by a distinguished Scientific Advisory Board (Amos Bairoch, Geneva; Helen Berman, Rutgers; Tom Blundell, Cambridge; Sung-Hou Kim, Berkeley; Andrej Sali, Rockefeller; and Yokohama, Tokyo).
2. The more recent UniMap database, designed with Yona, is a comprehensive map of protein sequences and structures. The first release (due 1 Aug 99) includes 168,431 protein sequences clustered into 1,421 families, each based on a different protein domain of known three-dimensional structure. We are currently building three-dimensional structures for all proteins that can be modeled reliably (about 100,000 all-atom models).