How informative is the IDC Balanced Rating HPC Benchmark?

Utrecht, 13 February 2002. After several meetings of the IDC-convened HPC Forum between the end of 1999 and the end of 2001, a new benchmark for High Performance Systems was defined: the IDC Balanced Rating HPC Benchmark. The first results were released during the SC 2001 Conference in Denver in November 2001. This article discusses the usability of the new benchmark and concludes that it is not very usable. Some suggestions for better benchmarks are given.

The conception of this benchmark was motivated by the concern that previous benchmarks were too CPU-speed oriented and thus would not faithfully reflect the actual capabilities of the HPC systems considered. In the notes from the IDC HPC Forum it was justly remarked that CPU speed is just one of the aspects that define the performance of an HPC system, and that the bandwidth and latency of the (cache) memory and the scalability of the system are of at least equal importance, if not more so. These are very astute considerations, and so is the HPC Forum's goal as worded in an IDC Bulletin:

"To create a more meaningful high-level comparison of technical computers as compared to peak performance metrics, in order to show a general rating of how powerful different HPC computers are on a general purpose workload."

What is not explicitly expressed in this quote, nor anywhere else in the background information, is that the outcome of the Benchmark exercise is a rating consisting of a single number that represents the performance of the entire computer system. Such a single-number characterisation of computer systems is very desirable for both buyers and vendors, as it is a particularly simple way of assessing the value of a system in terms of performance. It should also make it easy to rank the systems in the HPC Hall of Fame. Unfortunately, in benchmarking HPC systems, simplicity is NOT a guarantee of a truthful reflection of reality. The same IDC bulletin does warn that the Balanced Rating HPC Benchmark may not suit your particular needs and that the ratings should therefore be viewed with great care. Nevertheless, the decision to represent the performance of an HPC system by a single number is an extremely unfortunate one.

This brings us to the heart of the question: How informative is the IDC Balanced Rating HPC Benchmark? To answer it we have to look at the way the benchmark is composed. First, three aspects of HPC systems are addressed: Processor Performance, the Memory Subsystem, and the Scaling Capability. This is in line with the basic philosophy of the IDC Benchmark and looks like a good starting point. It was presumably the reason for Chris Lazou to congratulate the HPC Forum on the new benchmark in his letters to HPCWire and Primeur, and I heartily wish I could agree with him. Unfortunately, because of the way the rating procedure is implemented, I have to agree with John McCalpin, who responded in HPCWire. John not only pointed out numerous mistakes in the data that were taken from his STREAM Triad benchmark for the Memory Subsystem ranking but, much more importantly, questioned the way in which the components are put together to arrive at the final number.

Let us, for instance, consider the Processor Performance part: it is the arithmetic mean of scaled results from the LINPACK benchmark and of SPECfp_rate2000. Each of the components is scaled by normalising it on a scale of 0-100, with a ranking of 100 for the top result. Reducing these scaled results to one number, with arbitrary equal weights for the components, also reduces the information content to something uninterpretable. Note that the SPECfp_rate2000 already is the geometric mean of the throughput of multiple copies of the 14 SPECfp programs in jobs/hour, presumably to cover the throughput aspect. Apart from the question whether the SPECfp_rate2000 in itself is a reasonable throughput metric, combining the results of the two components, scaled raw observed speed and throughput capability, makes it impossible to assess how much of the compound result is to be attributed to each of the aspects they represent. And if a user of this benchmark, for some valid reason, would like one of these aspects to have more emphasis, there is no way to know how to adjust the weights in the combination.
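The loss of information can be made concrete with a small sketch of the Processor Performance composite described above. It assumes that "normalising on a scale of 0-100" means dividing each result by the top result for that component and multiplying by 100; the IDC documents may define the scaling differently. The function names, system names and figures are invented for illustration only.

```python
def scale_to_top(results):
    """Scale results so that the best one scores 100.

    Assumed reading of the 0-100 normalisation, not IDC's published formula.
    """
    top = max(results.values())
    return {name: 100.0 * value / top for name, value in results.items()}


def processor_rating(linpack, specfp_rate):
    """Unweighted arithmetic mean of the two scaled components."""
    lin = scale_to_top(linpack)
    spec = scale_to_top(specfp_rate)
    return {name: (lin[name] + spec[name]) / 2.0 for name in linpack}


# Hypothetical systems with very different profiles.
linpack = {"A": 1000.0, "B": 600.0, "C": 200.0}      # raw speed
specfp_rate = {"A": 100.0, "B": 300.0, "C": 500.0}   # throughput

ratings = processor_rating(linpack, specfp_rate)
print(ratings)  # all three systems receive 60.0
```

In this invented case a system that is fast on LINPACK but weak on throughput, a system that is the reverse, and a system that is mediocre on both all receive the same rating of 60. Given only the composite, the components cannot be recovered, and neither can the effect of a different weighting.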

This is only the Processor Performance part. Similar procedures are used for the Memory Subsystem part and the Scalability part, only somewhat more involved because more parameters are included. The results of all three parts are again combined by an unweighted arithmetic mean to get the final ranking. Again I have to agree completely with John McCalpin that the information content of this overall rating is close to nil. The tables available from IDC also give the subratings, but because each system part is itself reduced to a single number this does not help much. And there is another catch. The last column in the tables contains a "1" in many cases, which means that data were missing for the system displayed. Still, a rating is given. Instead of leaving out results for systems with incomplete data, it is the IDC/HPC Forum's policy to assume a value "10% below the value of the arithmetic mean for that value for all systems involved" because "a performance reason is assumed for not supplying the value". This policy has a flavour of coercion, if not something stronger. Vendors will in this way be "encouraged" to turn in the missing data even when they, like the author, think the methodology is flawed. This might lead to something that may superficially be interpreted as a high level of acceptance of this benchmark, but that has in fact come about through fear of a bad IDC rating.
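The missing-data policy and the final combination can be sketched in the same way. This is only one possible reading of the quoted rule: it assumes the arithmetic mean is taken over the systems that did report a value, and it does not settle whether IDC fills in the value before or after scaling. All names and numbers are hypothetical.

```python
from statistics import mean


def impute_missing(results):
    """Replace missing values (None) by 90% of the mean of the reported ones.

    Assumed interpretation of "10% below the value of the arithmetic mean".
    """
    reported = [v for v in results.values() if v is not None]
    fill = 0.9 * mean(reported)
    return {name: (fill if v is None else v) for name, v in results.items()}


def overall_rating(processor, memory, scaling):
    """Final rating: unweighted arithmetic mean of the three sub-ratings."""
    return (processor + memory + scaling) / 3.0


# Hypothetical memory sub-ratings; system C supplied no data.
memory = {"A": 80.0, "B": 40.0, "C": None, "D": 30.0}
filled = impute_missing(memory)
print(filled)

# A hypothetical overall rating for C, built partly on the assumed value.
print(overall_rating(60.0, filled["C"], 90.0))
```

Under these assumptions system C is given a memory sub-rating of about 45 that was never measured, and that value then enters the overall rating on an equal footing with the measured ones.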

So, let us return to the question heading this note: How informative is the IDC Balanced Rating HPC Benchmark? The answer can be short: It is not. It looks like all the time and effort invested in the HPC Forum has yielded a sub-standard product that should be radically improved or withdrawn. Presently it only adds to the confusion that already exists on the HPC Benchmark scene.

Are there no alternatives to the IDC Benchmark? There are, if one is willing to give up one requirement: that the performance of an HPC system can be characterised by one number. This may be unpleasant, but it is as unavoidable as gravity: all the different aspects that influence the performance of an HPC system interact in a complicated way and, on top of that, the interactions are different for different application areas. So, the way to go would be to define benchmark programs that minimally cover (a large part of) the application space. Both the Linpack and the STREAM benchmark are, in fact, good examples of this approach. Of course, Linpack only covers a narrow part of the space to be looked at, but it has the advantage that one knows exactly what is measured and where it applies. Likewise, the STREAM benchmark executes a small set of bandwidth kernels that represent an important class of simple operations that turn up frequently in floating-point dominated computations. As such, they give a first hint of what may be expected in terms of performance from codes that contain them. The EuroBen benchmark is built on the same philosophy and includes, apart from Linpack and STREAM-like kernels, also basic algorithms and kernel applications. A similar reasoning holds for the throughput of a system, as configuration, operating system, and scheduling software come into play here. At SC'2000 David Bailey reported on an interesting throughput exercise, "ESP: A System Utilization Benchmark", that could be taken as a starting point; likewise, the EuroBen Throughput Benchmark Framework could be used.

For distributed-memory/scalability benchmarking it is the same: there are alternatives, like the EuroBen-DM benchmark, the NPB, and PARKBENCH. Each of these will teach you more about the scalability of systems and give you more insight than the IDC benchmark will. The only price to pay is to drop the idea that a single number could give you a good picture of the total, balanced, performance of an HPC system. This is not a high price when you really want some insight into the performance of the systems you will use or buy.


Aad van der Steen
