The history of Red Storm
From 1996 until 2001, the world's fastest supercomputer was the ASCI Tflops (ASCI RED) at Sandia National Laboratories. RED was not only fast, it was inexpensive and reliable.
Unfortunately, RED was also a one-of-a-kind machine with no follow-on. Sandia first held discussions with Compaq about the Alpha EV7 processor, but it was too expensive. AMD provided a solution with the Sledge Hammer. The huge, immediate advantage of Sledge Hammer was that it provided an open-specification, low-latency, high-bandwidth interface that could be used to connect to a custom network. A second advantage was that performance was extremely good when Sandia's codes were tested on it.
According to Bill Camp, the guiding principle was that the architecture and every hardware and software component were chosen according to the SURE methodology: the system had to be scalable, usable, reliable and economic.
The system is divided into 1,280 service, log-in, I/O and visualisation nodes and 10,368 compute nodes. It will have a peak performance of 41.47 TeraFlop/s. Camp presented benchmark results for some of Sandia's own codes and compared them with ASCI Red; Red Storm improved on it by a factor of 8 to 12.
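The quoted peak figure can be checked with simple arithmetic. The sketch below is only an illustration: the node count comes from the text, while the clock rate (2.0 GHz), the number of floating-point operations per cycle (2) and the assumption of one processor per compute node are not stated here and are marked as assumptions in the code.

```python
# Back-of-the-envelope check of the peak performance figure.
# Only COMPUTE_NODES is taken from the text; the other values are assumptions.
COMPUTE_NODES = 10_368      # from the text
PROCESSORS_PER_NODE = 1     # assumed
CLOCK_HZ = 2.0e9            # assumed: 2.0 GHz
FLOPS_PER_CYCLE = 2         # assumed: 2 floating-point operations per cycle


def peak_teraflops(nodes, procs_per_node, clock_hz, flops_per_cycle):
    """Theoretical peak = processors x clock rate x operations per cycle."""
    return nodes * procs_per_node * clock_hz * flops_per_cycle / 1e12


peak = peak_teraflops(COMPUTE_NODES, PROCESSORS_PER_NODE, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"{peak:.2f} TeraFlop/s")
```

Under these assumptions the product reproduces the 41.47 TeraFlop/s quoted above. Peak is an upper bound; the benchmark factors Camp reported are measured on real codes and are a separate matter.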
They also analysed clusters and vector processors, but their programmes are not very vectorisable.
Bill Camp concluded for Red Storm that commodity is nearly everywhere but that customization drives cost. The Earth Simulator and Cray X1 are fully custom vector systems with a good balance, which drives their high cost and, of course, their high performance. Clusters are built almost entirely from high-volume components with no truly custom parts, which drives both their low cost and their low scalability. Red Storm uses custom parts only where they are critical to performance and reliability.
The result is high scalability at a minimal cost/performance ratio.
Fred Weber from AMD then took over to talk about the anatomy of a supercomputer.
Many of AMD's design goals were not the same as Sandia's priorities. For AMD he named the instruction set, legacy support, development platforms, OS support and ISV support. The factors that were important to Sandia included balanced processing, memory bandwidth, I/O bandwidth, reliability, power efficiency, density, COTS components and long-term support.
Fred Weber presented some of the design goals of Hammer and showed where specifications could be modified. He also showed architectural features and the evolution of the interconnect, HyperTransport. He mentioned that this processor was not designed for Red Storm.
He also addressed the topic of x86 in High Performance Computing, the Six System Challenges. He said that x86 is the most widely installed instruction set in the world. The instruction set is, "to first order", not relevant to CPU performance. What is important is the system's backward compatibility with x86-32. There is an enormous investment in IA32 in all market segments. In many applications, porting code is not an option. It is necessary to provide a solution that is not only 100% backwards compatible but is also designed to run IA32 code faster than any existing 32-bit architecture. There has to be a gradual and controlled migration path for porting to AMD64, and the total cost of ownership has to be kept minimal.
The cost per processing node is driven by cost/performance and I/O constraints. IA32 clusters are limited to two processors per node, which puts additional stress on the SMP cluster interconnect. One has to bring 4- and 8-processor SMP systems closer in cost/performance to 2-processor systems. Fred Weber also addressed the need to improve performance and decrease the premium without breaking IA32 "commodity" economics. This is only possible if the same processor architecture is used on the desktop.
The third challenge is memory bandwidth. With increased system memory come data-intensive applications whose strides and block sizes cause cache thrashing. Making the cache larger is not cost-effective. Hence, performance is limited by the size of the on-chip cache and/or by memory bandwidth. Therefore one has to improve memory bandwidth and latency and limit the cache size.
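How a stride can cause cache thrashing is easiest to see in a toy model. The following sketch simulates a direct-mapped cache; the cache size (32 KiB), line size (64 bytes) and access pattern are illustrative assumptions and do not describe any particular processor mentioned in the talk.

```python
def miss_rate(addresses, cache_bytes=32 * 1024, line_bytes=64):
    """Fraction of accesses that miss in a toy direct-mapped cache.

    cache_bytes and line_bytes are illustrative assumptions.
    """
    num_lines = cache_bytes // line_bytes
    tags = [None] * num_lines          # which memory block each line holds
    misses = 0
    total = 0
    for addr in addresses:
        block = addr // line_bytes     # memory block containing this address
        index = block % num_lines      # the single line this block may occupy
        if tags[index] != block:       # line holds another block: miss, evict
            tags[index] = block
            misses += 1
        total += 1
    return misses / total


def sweep(stride_bytes, touches=8, passes=100):
    """Visit `touches` addresses spaced `stride_bytes` apart, `passes` times."""
    for _ in range(passes):
        for i in range(touches):
            yield i * stride_bytes


friendly = miss_rate(sweep(stride_bytes=64))          # neighbouring cache lines
thrashing = miss_rate(sweep(stride_bytes=32 * 1024))  # stride equals cache size
print(f"stride 64 B:    {friendly:.2%} misses")
print(f"stride 32 KiB:  {thrashing:.2%} misses")
```

With the small stride, the eight lines are loaded once and then reused, so only the first pass misses. With a stride equal to the cache size, all eight addresses map to the same line and evict each other on every access, so every access goes to memory. In that regime performance is governed by memory bandwidth and latency rather than by the cache.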
Then there is addressable memory. Large RAM-resident databases and memory-intensive applications exceed the 4-gigabyte limit of 32-bit systems. Paging is not a solution. AMD64 processing is the only real solution.
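The 4-gigabyte limit follows directly from the pointer width. The sketch below assumes a flat, byte-addressable address space and ignores operating-system reservations and address-extension schemes; the 6 GiB dataset is a hypothetical example, not a figure from the talk.

```python
GIB = 2 ** 30  # bytes in one gibibyte


def addressable_bytes(address_bits):
    """Bytes reachable with a flat, byte-addressable pointer of this width."""
    return 2 ** address_bits


limit_32 = addressable_bytes(32)
limit_64 = addressable_bytes(64)   # architectural upper bound for 64-bit pointers


def fits_in_32_bit(dataset_bytes):
    """True if a RAM-resident dataset fits in a 32-bit address space."""
    return dataset_bytes <= limit_32


print(limit_32 // GIB, "GiB addressable with 32 bits")
print(fits_in_32_bit(6 * GIB))     # hypothetical 6 GiB in-memory database
```

The 64-bit figure is a ceiling set by the pointer width. Actual processors implement fewer physical and virtual address bits, but still far more than 32.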
Concerning the I/O infrastructure, the bandwidth of a front-side bus causes an I/O bottleneck that continues to exclude IA32 from running challenging parallel applications. One has to provide a dedicated I/O bus that is separate from the memory bus and keeps pace with next-generation I/O protocols and the CPU clock.
The last issue is watt density. With clusters exceeding 10,000 processors, watt density is an important issue. As cluster size expands, cooling capacity and costs can become significant. The challenge is to design the solution with the lowest watts per gigacycle, leveraging the state-of-the-art AMD64 architecture and silicon-on-insulator process, concluded Fred Weber.