Supercomputing always has been a cost-performance trade-off - An interview with Justin Rattner - part 1

Heidelberg, 07 April 2003. One of the keynotes at Europe's main supercomputer event, the International Supercomputing Conference in Heidelberg in June, will be delivered by Justin Rattner, designer of the first machine to break the 1 Tflop/s barrier: the ASCI Red supercomputer. He currently focuses on processor design at Intel. In this first part of the interview, we asked him, of course, about the Earth Simulator and its impact, if any, on processor design. Will the next Intel processor be a vector processor? Read the complete article for the answer.

Primeur: Until last year, everyone thought vector processors were dead, but now the Earth Simulator is in the first position of the TOP500. Is this because processor designers just did not focus on building high performance processors, or is it impossible to build them faster?

Justin Rattner: Supercomputing always has been a cost-performance trade-off. There was a time when there really were not any alternatives to specially built machines. But over the last 10-15 years parallel machines came along and demonstrated that one could take relatively inexpensive processors and build very high performance machines. Now, that said, it is still possible, if one is willing to undertake the design, to build a special-purpose machine and achieve comparable or perhaps even higher performance.

The MPPs today are for the most part built from standard components and as a result they are very cost-effective, but I do not think that precludes special-purpose machines. I don't know a lot about the Earth Simulator, but I can comment on vector extensions, as this is closer to my current research. We looked at vector extensions to the Intel processors from a research point of view. Basically, both the IA32 and the Itanium processor families have SIMD capabilities. We have looked beyond that to full vector extensions. This has its pluses and minuses. We certainly have not committed to anything in terms of products, but there is nothing preventing us from adding vectors.

But the real issue is memory bandwidth. That is probably the hardest trade-off to make. High-volume machines tend to have substantially lower memory bandwidth than classic vector machines. Very high performance architectures like Itanium have similarly high bandwidth demands. It is much more difficult to justify putting a high-bandwidth memory interface on those processors and supporting that interface with a high-bandwidth memory subsystem. From my point of view, that is the hardest trade-off to make.

Primeur: I did not want to focus just on vector processing, but on any new design. And, of course, you are right: a vector processor is no more than an efficient way to get data from the memory into the processor and back again for a limited number of applications.

Justin Rattner: Even Itanium, on non-cacheable types of applications - the technical applications - tends to utilise, at least in part, higher memory bandwidth. The problem is making the trade-off in favour of those applications versus commercial applications like large-scale databases which make very effective use of the big caches. The volume of the database applications is just so much higher than it is for technical applications. But it is not just vector machines. The Itanium family has high floating-point capabilities and you certainly could design an Itanium with a much higher memory bandwidth for those technical applications.

Primeur: So we will see MPP machines for many years to come?

Justin Rattner: Yes, but we will see more floating-point capabilities in the high-volume market and perhaps that will include vector processing features as well.

Primeur: Talking about bandwidth: there are already examples of situations where the bandwidth into the machine is much higher than the machine can handle. There was an example at iGrid2002 in Amsterdam with an aggregate bandwidth of 35 Gbit/s over the ocean with low latency. There is also the work on the Optiputer. Is that the way supercomputers will go?

Justin Rattner: Absolutely. There is a long-term trend towards higher bandwidth interfaces. The very high-end MPP machines will likely drive that internal interconnect bandwidth. The broader market for MPP machines will look at standard networks for their interconnect. Today that is typically a 1 Gbit/s Ethernet connection. Over time that will probably move to 10 Gbit/s, although that is probably the second half of the decade. Then there will be interconnects like InfiniBand, which are sort of halfway between standard network technology and proprietary interconnects. That will be attractive for many applications.

If you have a perfectly parallel programme, the interconnect is much less important, but if you have applications that do a lot of communication, they really benefit from a higher-performance interconnect.

Primeur: Will that also influence the way computers are built? One can imagine a scenario where large machines will sometimes be linked together in a kind of metacomputing environment to run a specific application, or it could go in the other direction: there will be very large computers that will be configured according to the needs of lots of people and lots of different applications. If bandwidth grows and processor speed is not an issue, you can imagine that computer centres grow bigger, or you can imagine that they crumble and in the end the whole world is just one big computer.

Justin Rattner: There are many application scenarios where latency is critical. In those cases you have to have machines that are at least in close physical proximity. As soon as you start routing packets, you introduce a considerable amount of latency. It is perfectly reasonable to have highly distributed applications where portions are geographically isolated and then those application components can be interconnected with higher bandwidth and higher latency.

It really comes down to how latency-tolerant the application is. Some are and some are not. It seems to be a general rule of thumb that the more sophisticated the algorithm is, the more latency-sensitive it is. When you have an embarrassingly parallel application, it does not matter. But when you make the transition from dense matrix applications to sparse, latency becomes absolutely critical. You must go to sparse matrices to save computational cycles, but that puts much more pressure on the interconnect, much more pressure on the memory system and so forth. The application is also likely to be less cacheable.

Ad Emmen