NERSC, the US National Energy Research Scientific Computing Center, is one of the large research centres in the USA. It houses the number 5 supercomputer in the world and serves the whole US Department of Energy research community with a focus on large scale computing.
Simon explained that the Earth Simulator should be a real catalyst for fundamental change in US science policy. It is not just a single fast machine but represents the commitment of the Japanese government to invest in science-driven computing. As Kahaner explained in his talk at the same conference, the Japanese are indeed focusing on science and applications, not on computer architecture and systems per se.
But what is very important, according to Simon, is to realise that the Earth Simulator is not a special-purpose machine. Hence, all US scientific computing communities are now potentially at a handicap of a factor of 10 to 100 in available computing capability. The importance of the Earth Simulator lies not in its impressive peak performance, but in its sustained performance on real scientific problems. That allows Japanese science policy to build strategic partnerships in climate, nanoscience and fusion, moving to dominate simulation and modelling in many disciplines, and not just in climate modelling as the name Earth Simulator could imply.
The lesson learned here, said Simon, is that to optimise architectures for scientific computing, it is necessary to establish feedback between scientific applications and computer design over multiple generations of machines.
Designing new supercomputers and testing new architectures is out of fashion in the US. In the early 1990s Simon counted some 50 supercomputing-related projects at American universities. Now, only a few are left. Virtually no one is looking at parallel languages and tools these days. Grid middleware and tools are getting all of the attention and resources.
As an example, Simon mentioned the latency numbers for interconnects. The latency of what is currently available is worse than that of the T3E in 1997! The bandwidth has hardly improved either.
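Why latency matters separately from bandwidth can be illustrated with the common first-order model in which sending one message costs a fixed latency plus the message size divided by the bandwidth. The sketch below is only an illustration: the function name `transfer_time` and the values of 10 microseconds and 1 GB/s are placeholders chosen for the example, not figures from Simon's talk or measurements of any machine.

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """First-order cost of one message: fixed latency plus size over bandwidth."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


# Placeholder values for illustration only (assumptions, not measurements).
LATENCY_S = 10e-6   # 10 microseconds per message
BANDWIDTH = 1e9     # 1 GB/s

for size in (8, 1024, 1024 * 1024):
    total = transfer_time(size, LATENCY_S, BANDWIDTH)
    latency_share = LATENCY_S / total
    print(f"{size:>8} bytes: {total * 1e6:9.2f} us, latency share {latency_share:.0%}")
```

Under this model the cost of small messages is almost entirely latency, so an application that exchanges many short messages gains little from extra bandwidth if latency stays flat.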
The Power-4 chip for the new supercomputer at NERSC also reveals what Simon dubs the "divergence problem". The chip performs only 7% better than the Power-3 on scientific applications. The problem is that the chip was not designed for scientific applications, and the requirements for commercial and scientific computing are diverging. In particular, memory latency did not improve between the generations, and the Power-4 does not scale well beyond 16 scientific tasks.
In Simon's view we have pursued the logical extreme of the commodity parts path. In the beginning of off-the-shelf computing, the commodity building block was the microprocessor - today it is an entire SMP server. Communications and memory bandwidth did not scale with processor power. And the size of systems is ever increasing: we have arrived at near football-field size computers consuming megawatts of electricity.
Simon stressed again that the requirements of high performance computing for science and engineering and the requirements of the commercial market are diverging. The commercial cluster-of-SMPs approach does not provide the highest level of performance due to:
- Lack of memory bandwidth
- High interconnect latency
- Lack of interconnect bandwidth
- Lack of high performance parallel I/O
- High cost of ownership for large scale systems
How should the problem be solved? Not the same way as in the 1990s with a lot of companies sparked by DARPA. The world has changed too much. But there is still a significant scientific market for high performance computing also outside of supercomputer centres. "For this new environment, we need a new, sustainable strategy for the future of scientific computing", Simon stated.
The ingredients of this new strategy are application teams that help design new architectures, the use of current components and research prototypes in new architectures, and the continuous redesign and testing of prototypes in a vendor partnership to create new scientific computers. Development co-operations should bring together the interested computer vendors and scientists. The result should be a feedback cycle between science and computer design that lasts for generations of machines.
A first example of such a co-operative development is "Red Storm". Sandia National Laboratories and Cray are the partners in the project. Red Storm is a MIMD parallel supercomputer, a true MPP design with distributed memory. The machine will have 108 compute node cabinets with 10,368 AMD compute node processors and a 20 Tflop/s peak. It will consume less than 2 MW of power.
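The Red Storm figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only the numbers from the text (108, 10,368 processors, 20 Tflop/s peak, less than 2 MW); the variable names are illustrative, and the per-processor power figure is only an upper bound that assumes the full 2 MW is spread evenly over the compute processors.

```python
PROCESSORS = 10_368      # AMD compute node processors, as quoted
QUOTED_UNITS = 108       # the figure of 108 quoted alongside the processor count
PEAK_TFLOPS = 20.0       # quoted peak performance
POWER_LIMIT_W = 2e6      # "less than 2 MW"

# Processors per quoted unit: 10,368 / 108
processors_per_unit = PROCESSORS // QUOTED_UNITS

# Peak performance each processor must contribute, in Gflop/s
peak_gflops_per_processor = PEAK_TFLOPS * 1000 / PROCESSORS

# Upper bound on the power budget per processor, in watts
max_watts_per_processor = POWER_LIMIT_W / PROCESSORS

print(processors_per_unit)                    # 96
print(round(peak_gflops_per_processor, 2))    # 1.93
print(round(max_watts_per_processor, 1))      # 192.9
```

The numbers are consistent with each other: 96 processors for each of the 108 units, and just under 2 Gflop/s of peak performance per processor.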
Another example is Blue Planet, a collaboration of Simon's NERSC, ANL and IBM. The design goals of Blue Planet are based on scientific applications. The design will extend the Power technology and include "Virtual Vector Processing". Blue Planet is addressing the key barriers to effective scientific computing, including memory bandwidth and latency, interconnect bandwidth and latency, and programmability for scientific applications.
For IBM this is a departure from its former strategy. For instance, neither Virtual Vectors nor a decrease in switch latency had been planned. Blue Planet also requires a radical redesign of the company's software stack, Simon said. Simon concluded by stressing again that business as usual will not preserve US leadership in advanced scientific computing. New computer architectures optimised for scientific computing are critical to enable 21st century science. In his view, US science requires a strategy to create cost-effective, science-driven computer architectures.