In the September 1998 issue of Protein Engineering, two scientists describe the database they built on the 256-processor CRAY T3E of the San Diego Supercomputer Center (SDSC). Researchers can now access the new database via the Web to compare the 3D structures of over 8000 proteins. Recent experiments have shown that proteins with very different functions and sequences can be surprisingly similar in structure. The discovery of such structure neighbours may offer important clues to an energetically favourable arrangement that recurs in nature, or may point to a distant evolutionary relationship between the proteins concerned.
Ilya Shindyalov and Phil Bourne, the two SDSC researchers at the University of California, San Diego, needed only several weeks to construct the database, using 24,000 processor hours on the CRAY T3E. On a smaller computer, the task would have taken at least a year and a half. The result is a useful tool for scientists seeking a deeper understanding of all sorts of biological processes. There are many ways to compare protein structures, but the SDSC solution offers precalculated comparisons via a Web site that includes alignment and Java-based visualization applications for investigating a protein's structure neighbours in detail. Users can even submit structures that are not in the database and receive the comparison results by e-mail.
DNA can be described as a flat blueprint of life, whereas protein structures turn that blueprint into a dynamic, three-dimensional living organism. A protein curls and folds itself into a characteristic shape that closely reflects its function. At present, the three billion bases of the human genome are estimated to encode about 70,000 proteins. Until recently, scientists had not determined enough protein structures to start comparing them and drawing relevant conclusions. Today, over 8000 protein structures are available, but comparing each of 8000 proteins against all the others is a giant task.
The SDSC database tool relies on combinatorial extension, a fast method based on local geometry, in contrast to other procedures, which depend on large-scale features such as secondary structure. On a standard computer, the programme takes about 30 seconds to compare one protein polypeptide chain against another. Comparing all 8000 structures against one another at that rate would take about 57 years. Even with the usual shortcuts based on protein sequence, scientists would still need about 1.7 years. On the CRAY T3E, the job was completed within weeks. A handful of new structures are added every day, but these incremental updates can be handled on a regular computer.
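The scale of the all-against-all estimate can be checked with some back-of-envelope arithmetic. The sketch below is only an illustration: the function name, the choice to count each ordered pair of structures once, and the length of a year are assumptions, not details taken from the researchers' paper. Under these assumptions, 8000 structures at 30 seconds per comparison come to roughly 61 years, the same order of magnitude as the 57 years quoted; the exact number of chains behind the published figure is not stated here.

```python
# Rough estimate only; names and counting convention are illustrative assumptions.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def all_against_all_years(n_structures, seconds_per_comparison, ordered=True):
    """Estimate single-processor time, in years, for an all-against-all run.

    ordered=True counts every (A, B) and (B, A) pair, i.e. n * n comparisons.
    ordered=False counts each unordered pair once, i.e. n * (n - 1) / 2.
    """
    if ordered:
        pairs = n_structures * n_structures
    else:
        pairs = n_structures * (n_structures - 1) // 2
    return pairs * seconds_per_comparison / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"ordered pairs:   {all_against_all_years(8000, 30):.1f} years")
    print(f"unordered pairs: {all_against_all_years(8000, 30, ordered=False):.1f} years")
```

Counting each unordered pair only once halves the estimate, which shows how strongly the result depends on the counting convention and on the true number of chains.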
According to the researchers, however, the initial database could never have been generated without the help of a supercomputer. Until now, it was extremely difficult to detect structural similarities from the genetic blueprint alone. The 3D structure comparisons in the database reveal common features within highly contrasting protein pairs, and it seems that such pairs may be related by evolution. Besides the SDSC database, two other protein neighbour databases are available, namely the FSSP and VAST tools. The former stands for Fold classification based on Structure-Structure alignment of Proteins, while the latter stands for Vector Alignment Search Tool.
The Biological Data Representation and Query initiative and the Molecular Science thrust area of the National Partnership for Advanced Computational Infrastructure (NPACI) are supporting the SDSC database. According to researcher Phil Bourne, the SDSC method, FSSP, and VAST use fundamentally different approaches to analyse protein 3D structures. Although they produce the same result for a large number of comparisons, their outcomes differ on most of the weak comparisons. These are precisely the cases that are most challenging and most worth deeper research. Competition between the different approaches helps science advance in large leaps. The newly generated SDSC protein database is already available for consultation.