Sustained Teraflop/s still elusive for civilian use
Heidelberg, 21 June 2001. My last article looked at the challenges of producing devices for Petaflop/s computing in the next ten years or so. It then went on to mention quantum computing and the inherent difficulties in developing Qubits with the coherence necessary to deliver stable, usable devices for computing. The article below explores the reasons why, in reality, there is no civilian general purpose system capable of delivering sustained Teraflop/s performance within normal budgets of, say, 5-10 million US dollars a year. (author: Chris Lazou)
As I stated last time, the computer industry thrives on sound-bites and the pursuit of the plausible. The moment one can see a glimmer of achieving Petaflop/s, even in a very limited specialist system such as the IBM Blue Gene, the next thing bandied about is Exaflop/s machines. This aspiring tendency is of course a great attribute of human nature and should be applauded, but in reality there is to date no civilian general purpose system capable of delivering sustained Teraflop/s performance.
The ASCI systems are special purpose military ones, require megawatt power stations to keep them running, and have a skewed (enormous) amount of resources at their disposal. The only civilian system (and again a special one) with the potential for sustained Teraflop/s is the Earth Simulator in Japan, and this system is not going to be in production until spring 2002.
The Earth Simulator consists of over 5000 NEC vector processors, made using 0.15 micron technology, each with a peak performance of 8 Gflop/s. Even with the high memory bandwidth inherent in NEC systems, the Japanese are only anticipating 5 Teraflop/s sustained, about twelve percent of the available 40 Teraflop/s peak. This low estimate is of course arrived at by factoring in parallelism, communication and the software difficulties of extracting the theoretical performance of the hardware devices.
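The arithmetic behind these figures is easy to check. The following is only a back-of-the-envelope sketch in Python using the numbers quoted above; the variable names are mine, and 5000 is taken as a lower bound for "over 5000" processors.

```python
# Back-of-the-envelope check of the Earth Simulator figures quoted in the text.
# Assumption: exactly 5000 processors (the text says "over 5000").
processors = 5000
peak_per_processor_gflops = 8.0      # peak per NEC vector processor
sustained_tflops = 5.0               # anticipated sustained performance

# Aggregate peak in Tflop/s (1 Tflop/s = 1000 Gflop/s).
peak_tflops = processors * peak_per_processor_gflops / 1000.0

# Fraction of peak that the anticipated sustained figure represents.
efficiency = sustained_tflops / peak_tflops

print(f"peak: {peak_tflops:.0f} Tflop/s, sustained/peak: {efficiency:.1%}")
```

With these inputs the peak comes to 40 Teraflop/s and the anticipated sustained figure to one eighth of it, in line with the "about twelve percent" above.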
The ASCI programme put conventional scaling of devices at the heart of its development, and yet if one looks closer at the architecture this is only part of the problem. Not only is it an inadequate solution, but it often magnifies other architectural bottlenecks as a side effect. As I have stated on many occasions in the past, high end supercomputing is more than a chip: it also involves memory bandwidth, heat extraction, pin counts for off-chip connections, and tight integration of the communication system to deliver high optimisation efficiencies. This is why massive performance within a chip is of marginal use in attaining high sustained performance.
Who needs Teraflop/s sustained performance?
At present, many established Climate/Weather forecasting centres are either preparing to procure Teraflop/s systems or have plans (RFIs) in the pipeline for doing so. When one talks to the people at these centres, the message is very clear. They need systems with Teraflop/s sustained performance so that they can refine their Global models to use a grid of, say, 35 km with 100 levels in the vertical. Ideally they would like to go down to 5 km, but that is not practical with the computer systems expected in the next five years.
Climate and Weather forecasts have evolved from the simple atmosphere models of yesteryear. Forecasters now couple their models of the atmosphere with Ocean models, and enhance them to include more chemistry, air pollution and so on. Incidentally, the urgent focus on Ocean simulation and prediction is a direct consequence of the Kyoto agreement.
Operationally, Climate and Weather forecasting has also evolved. Where in the past some 10% of computing was used for data assimilation and 90% for forecast determination, the ratio now, for example at ECMWF, is roughly 10% forecast determination, 45% data assimilation and 45% ensemble calculations. Experimentally, weather modellers have taken the next step: they have changed from 3-D VAR to 4-D VAR data assimilation and started ensemble calculations. Once ensemble simulations become the norm for forecasting, the computing requirement shifts towards throughput rather than maximum performance on a single job. It is a painful fact that at present no system is capable of delivering this level of performance (Teraflop/s sustained) within normal budgets (say 5 to 10 million US$ a year).
Large HPC systems are all very parallel.
Stating the obvious, today's large scale high end computer systems are all very parallel and also have some specialisation feature, reflected in their "McMahon specialisation number". Most large scale computer centres are also multi-vendor shops using a mix of system architectures. But even with 99.5% parallelism within a code, because of communication and the other elements involved in a real production run of any application, it is practically impossible to deliver 1 Teraflop/s using a large number, say 10,000, of PCs of 1 Gflop/s each. In this instance one would be lucky to get 200 Gflop/s.
This can easily be verified by using the law of diminishing returns, i.e. the enhanced Ware model, otherwise known as Amdahl's Law. In fact, the fewer and more powerful the processors, and the higher the memory bandwidth, the better the chance of achieving high sustained performance. Thus, the chances of delivering 1 Teraflop/s sustained are greatly improved when using, say, 256 processors of 10 Gflop/s each with high memory bandwidth. The formula for calculating the performance of parallel processor systems can be found in my book on supercomputers [1] (pp. 156-163).
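For readers who want to try the numbers themselves, here is a minimal Python sketch. It assumes the plain textbook form of Amdahl's Law, speedup = 1 / ((1 - p) + p/N), not the enhanced formula from the book, and it ignores communication and memory bandwidth entirely, so it gives an upper bound rather than a prediction. The function names are mine.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Plain Amdahl's Law: speedup over one processor for a code whose
    parallel_fraction (0..1) scales perfectly and whose remainder is serial."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


def sustained_gflops(parallel_fraction: float, processors: int,
                     gflops_per_processor: float) -> float:
    """Upper bound on sustained Gflop/s, assuming each processor could
    run at its full rate (an optimistic assumption)."""
    return amdahl_speedup(parallel_fraction, processors) * gflops_per_processor


# The two configurations discussed in the text, both at 99.5% parallelism.
many_pcs = sustained_gflops(0.995, 10_000, 1.0)     # 10,000 PCs of 1 Gflop/s
few_fast = sustained_gflops(0.995, 256, 10.0)       # 256 processors of 10 Gflop/s

print(f"10,000 x 1 Gflop/s : {many_pcs:7.1f} Gflop/s")
print(f"   256 x 10 Gflop/s: {few_fast:7.1f} Gflop/s")
```

Under these assumptions the 10,000 PCs top out just under 200 Gflop/s, while the 256 faster processors come out a little above 1 Teraflop/s, even though their aggregate peak is roughly a quarter of that of the PC farm. The serial 0.5% dominates once the processor count is large.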
Apart from high CPU performance, the architecture requirements include fast networks, large flat memories with very high bandwidth, and fast local to remote CPU communications. The next factors which kick in to frustrate the goal of achieving Teraflop/s are the operating system, languages, libraries, data management and so on. Despite all these detrimental factors, a few sustained Teraflop/s systems are likely to be in production in 2003-4, but these are likely to use proprietary chips with high memory bandwidth integration. In the meantime there is a lot of work to be done by vendors and software developers to get there.
What meteorology users are saying.
Enough of my views; let us hear what meteorology users have to say on the subject. At a recent conference in Europe with a strong meteorology content, a panel was asked to address the following question: What are the challenges in meteorology, from both the scientific and the operational points of view?
The panellists agreed that in Europe they should be capable of providing forecasts up to 3 days in advance for every country on a 5 km grid. If one looks at the Alpine Mesoscale project, a 4-5 times finer resolution is likely to be required to obtain accurate results. This translates to 5-10 Teraflop/s of sustained computing power.
They also need the capability to perform thousands of experiments comfortably, as finer resolution may not give as good a result as an ensemble. The field is also evolving towards experiments covering longer time spans. This prognostication confirms the need for Teraflop/s.
In their view, data manipulation is not keeping pace with compute power. Moving from 3-D VAR to 4-D VAR generates enormous volumes of data, so data integration and data consolidation are very important aspects of their operational needs. In the area of data compression, therefore, something drastic has to be done urgently.
There was also concern about application code portability over time, especially when moving to new hardware. In their view, having to move from one system to another is often a tragedy. Sometimes codes would not even port to the new system, let alone run in an optimised fashion.
The Rood report recommends a distributed solution (one suspects this was a pragmatic choice rather than one based on science, a consequence of the USA embargo on Japanese vector systems prevailing at the time). The panellists thought that, in that case, it is imperative for supercomputers to integrate smoothly and seamlessly.
Some panellists expressed the view that, for meteorology applications, the GRID is not of immediate usefulness. Every time one moves out of a node, the efficiency of the program is in trouble. Teraflop/s nodes are the preferred solution.
Another view expressed, concerning the modalities of the GRID, is that structuring one's program is very important. Not having to care where the program runs is not relevant to meteorology applications. Both computing and data are site specific, in the manipulation as well as in the output distribution. What is important is understanding the computer architectures. For example, IFS runs on the Cray T3E and on vector machines, but a lot of effort is required to achieve acceptable performance.
Will the new generation chips deliver?
In summary, the new generation of proprietary scalar architectures with very aggressive compact integration, such as the P4 from IBM, is making some headway (around 20% of peak is expected), but these systems are unlikely to deliver sustained Teraflop/s performance for the general computer centre.
Indeed, in meteorology applications, whether one uses a single program or an ensemble, the practical aspects of delivering Tflop/s sustained performance, given the data assimilation and computation mix of the application, are horrendous. According to some computer centre directors, vector parallel systems (delivering 70% of peak) are likely to remain the workhorse of European weather forecasting for the next 3 to 4 years.
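To put the two efficiency figures side by side, the short Python sketch below works out how much peak performance one would have to buy for 1 Teraflop/s sustained. It is simple division, assuming the 20% and 70% of peak quoted above hold for the whole workload, which is a simplification; the names are mine.

```python
def peak_needed_tflops(sustained_tflops: float, fraction_of_peak: float) -> float:
    """Peak Tflop/s required to sustain a target, given the fraction of
    peak that the architecture delivers on the application."""
    return sustained_tflops / fraction_of_peak


target = 1.0  # Tflop/s sustained

# Fractions of peak as quoted in the text; assumed constant across the workload.
scalar_peak = peak_needed_tflops(target, 0.20)   # new generation scalar systems
vector_peak = peak_needed_tflops(target, 0.70)   # vector parallel systems

print(f"scalar at 20% of peak: {scalar_peak:.2f} Tflop/s peak needed")
print(f"vector at 70% of peak: {vector_peak:.2f} Tflop/s peak needed")
```

On these figures the scalar route needs about 5 Teraflop/s of peak against roughly 1.4 Teraflop/s for the vector route, a factor of 3.5 before price per peak flop is taken into account.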
One can muse that between the idea and the reality stands the shadow of the time it takes to develop a system. Memory hierarchies, memory bandwidth, communication interfaces and system software compatibility are not only essential elements, but will have enormous influence on delivering Teraflop/s to the user. The balance of these systemic elements is most likely to determine the fortunes of any future HPC product in the market place.
Chris Lazou