According to Womble, supercomputer manufacturers are having a hard time. High-performance computing has a limited market and suffers from a shortage of people who can work in the industry. It competes for these people with the entertainment industry and other Silicon Valley companies, and most of the time it loses.
In fact, there are not that many supercomputer vendors left in the US: only Compaq, IBM and SGI. The market is small, and they depend heavily on the ASCI programme to develop and deliver top-of-the-line machines.
One of the design problems for supercomputers with thousands of processors is that you cannot scale up a system based on workstations. The processors can be replicated, but the rest of the system, including the memory structure, communications and system software, requires special attention to scalability.
Scalability can be achieved in almost any case, but you have to start at the top: design scalable systems with thousands of processors, paying special attention to the memory structure. When that works, you can scale them down to the number of processors you need or can afford.
At Sandia, Womble works on providing computing capability to researchers. A single very large application should be able to run on their largest supercomputer. Applications that can use the entire supercomputer efficiently are simply too big to run on other computers. This requires a special architecture that is both scalable and balanced.
The Sandia ASCI Red machine, for instance, has 800 Mbyte/s of communication bandwidth per node and 333 Mflop/s of computing performance per node. Although the exact balance is application-dependent, a computation-to-communication ratio of about 1:3 seems reasonable.
In clusters of workstations you often have the reverse: three times as much performance as bandwidth. This limits the scalability to small numbers of processors.
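The balance argument can be made concrete with a small calculation. The sketch below is only illustrative: it uses the ASCI Red per-node figures quoted above, while the cluster node figures are hypothetical numbers chosen to match the "three times as much performance as bandwidth" description, not measurements.

```python
def bytes_per_flop(bandwidth_mbyte_s: float, performance_mflop_s: float) -> float:
    """Communication bandwidth available per floating-point operation."""
    return bandwidth_mbyte_s / performance_mflop_s


# Per-node figures quoted in the text for ASCI Red.
asci_red = bytes_per_flop(800.0, 333.0)

# Hypothetical workstation-cluster node with three times as much
# performance as bandwidth (illustrative values, not measurements).
cluster = bytes_per_flop(100.0, 300.0)

print(f"ASCI Red: {asci_red:.2f} bytes of bandwidth per flop")
print(f"Cluster:  {cluster:.2f} bytes of bandwidth per flop")
```

A node with more bytes of bandwidth per flop spends less of its time waiting for data, which is what allows the machine to keep scaling as nodes are added.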
Another scalability bottleneck is the operating system. You simply cannot have, and in fact do not need, a full operating system on each node. Most of its functions cannot be used and only give rise to possible errors. And with thousands of processors, even the slightest chance of an error in a node will surely bring your machine to a halt every day, or perhaps every few hours. At Sandia they developed a lightweight kernel called Puma that can be installed on the computational nodes of a scalable MPP and that does scale.
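The reliability point can be illustrated with a rough estimate. The sketch below assumes that nodes fail independently at a constant rate and that any single node failure halts the machine; the five-year per-node failure interval is an assumed figure for illustration only, not a number from Sandia.

```python
import math


def system_mtbf_hours(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures of the whole machine, assuming independent
    node failures at a constant rate and that any node failure stops it."""
    return node_mtbf_hours / nodes


def prob_failure_within(hours: float, node_mtbf_hours: float, nodes: int) -> float:
    """Probability that at least one node fails within the given time."""
    return 1.0 - math.exp(-hours * nodes / node_mtbf_hours)


# Assumed for illustration: each node fails once every five years on average.
NODE_MTBF_HOURS = 5 * 365 * 24

for nodes in (1, 100, 1000, 5000):
    mtbf = system_mtbf_hours(NODE_MTBF_HOURS, nodes)
    p_day = prob_failure_within(24, NODE_MTBF_HOURS, nodes)
    print(f"{nodes:5d} nodes: system MTBF {mtbf:9.1f} h, "
          f"P(failure within a day) {p_day:.3f}")
```

Under these assumptions, a node that almost never fails on its own still gives a 5000-node machine a mean time between failures of under nine hours, which is why removing unneeded software from each node matters.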
The MPPs built in this way are really special-purpose machines. They provide capability computing. For capacity or throughput, you can use clusters of workstations.
As Womble points out, scalable systems are not enough; you also need scalable algorithms. He is looking for algorithms whose cost scales linearly with the problem size n. Order n log n is also acceptable, because of the higher accuracy that may be needed.
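To see why order n and order n log n are both considered acceptable, it helps to look at how the work grows when the problem size doubles. The following is a minimal sketch; the quadratic case is not mentioned by Womble and is included only as an assumed point of contrast.

```python
import math


def growth_factor(cost, n: int, factor: int = 2) -> float:
    """How much the cost grows when the problem size grows by `factor`."""
    return cost(factor * n) / cost(n)


# Idealised cost models; n^2 is added here only for contrast.
costs = {
    "n": lambda n: n,
    "n log n": lambda n: n * math.log2(n),
    "n^2": lambda n: n * n,
}

n = 1_000_000
for name, cost in costs.items():
    factor = growth_factor(cost, n)
    print(f"{name:8s} doubling n multiplies the work by {factor:.2f}")
```

For large n the n log n model grows only slightly faster than the linear one, whereas a quadratic algorithm quadruples its work every time the problem doubles.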
Vendors provide support in the operating system and compilers for small-scale parallelism. Although that sounds nice, it actually prevents scaling to larger numbers of processors. "Compilers, for instance, look at loops, not at the physics," says Womble. "Vendors do not provide support in their system, even if the processors and the algorithms scale."
When developing new code, it is easy to build in scalability. Most of the time it follows naturally from the physics. When distributing an application over nodes, the most important decision is where to locate your data. Data locality is the single most important factor. Follow the physics, and you usually make the right decisions. The algorithms are often not that difficult.
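As one illustration of data locality, consider a one-dimensional grid in which each cell interacts only with its immediate neighbours. This is an assumed example, not a code described by Womble, and the function names are hypothetical. Giving each node a contiguous block of cells keeps almost all interactions local to that node.

```python
def block_range(n_cells: int, n_nodes: int, rank: int) -> tuple[int, int]:
    """Half-open range [start, stop) of cells owned by node `rank`.

    Assumes n_cells >= n_nodes, so every node owns at least one cell.
    """
    base, extra = divmod(n_cells, n_nodes)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def remote_cells_needed(n_cells: int, n_nodes: int, rank: int) -> int:
    """Cells a node must fetch from other nodes for a nearest-neighbour update."""
    start, stop = block_range(n_cells, n_nodes, rank)
    return (1 if start > 0 else 0) + (1 if stop < n_cells else 0)


n_cells, n_nodes = 1_000_000, 1000
start, stop = block_range(n_cells, n_nodes, rank=500)
print(f"node 500 owns cells {start}..{stop - 1} "
      f"and needs {remote_cells_needed(n_cells, n_nodes, 500)} remote cells")
```

With this layout each node updates its own block and exchanges at most two boundary cells per step, so communication stays constant per node while the local work grows with the block size.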
For legacy codes this does not work, of course. There it will take several redesign cycles before scalability can be put to work. Each time owners of legacy codes add features, they should check whether they can rewrite parts to make them more scalable. So for legacy codes, the whole process towards scalability is evolutionary.
At Sandia, they are also looking at Linux clusters. Their cluster is called Cplant and uses a Myrinet interconnect. One of the problems with clusters, Linux clusters included, is that the hardware does not scale. The interconnect bandwidth is still too low. Even the new QSW network hardware has a 3:1 calculation-to-communication ratio. Additional work will be required before these clusters scale to large numbers of processors.
Nevertheless, it is interesting to look at these clusters for several reasons. A nice thing about Linux, for instance, is that it is open. You can get the source and even modify it when needed; and for building large clusters, it is needed. With the current state of technology, system software and the interconnect cannot be commodity components for large-scale computing. Although commodity computing is cheap, it does not provide the necessary levels of reliability. A scalable MPP machine is not a collection of workstations.
The scientific computing market is small, so vendors cannot spend much on support there. Linux offers opportunities because of its openness and portability. For instance, the government labs in the USA are trying to come up with a standard Linux extension for parallel processing. For Sandia this means they can use an optimised and modified Linux kernel on their Cplant machine.
Surveying the developments in the field, Womble concludes: "Parallel processing is here to stay."