The conception of this benchmark was motivated by the concern that
previous benchmarks were too strongly oriented towards CPU speed and
thus would not faithfully reflect the actual capabilities of the HPC
systems considered. In the notes from the IDC HPC Forum it was justly
remarked that CPU speed is just one of the aspects that define the
performance of an HPC system, and that the bandwidth and latency of the
(cache) memory and the scalability of the system are at least of equal
importance, if not more so. These are very astute considerations, and
so is the HPC Forum's goal as worded in an IDC Bulletin:
"To create a more meaningful high-level comparison of technical
computers as compared to peak performance metrics, in order to show a
general rating of how powerful different HPC computers are on a general
purpose workload."
What is not explicitly expressed in this quote, and indeed nowhere in
the background information, is that the outcome of the Benchmark
exercise is a rating consisting of a single number that represents the
performance of the entire computer system. Such a single-number
characterisation of computer systems is very desirable for both buyers
and vendors, as it is a particularly simple way of assessing the value
of a system in terms of performance. It should also make it easy to
rank the systems in the HPC Hall of Fame. Unfortunately, in
benchmarking HPC systems, simplicity is NOT a guarantee of a truthful
reflection of reality. Although the same IDC bulletin does warn that
the Balanced Rating HPC Benchmark may not suit your particular needs
and that the ratings should therefore be viewed with great care, the
decision to represent the performance of an HPC system by a single
number is an extremely unfortunate one.
This brings us to the heart of the question: How informative is the IDC
Balanced Rating HPC Benchmark? For this we have to look at the way it
is composed. First, three aspects of HPC systems are addressed:
Processor Performance, the Memory Subsystem, and the Scaling
Capability. This is in line with the basic philosophy of the IDC
Benchmark and looks like a good starting point. It was presumably the
reason for Chris Lazou to congratulate the HPC Forum on the new
benchmark in his letters to HPCWire and Primeur. I heartily wish I
could agree with him, but because of the way the rating procedure is
implemented I unfortunately have to agree with John McCalpin, who
responded in HPCWire. John not only pointed out numerous mistakes in
the data that were taken from his STREAMS TRIAD benchmark for the
Memory Subsystem ranking but, much more importantly, questioned the way
the components are put together to arrive at the final number.
Let us, for instance, consider the Processor Performance part: it is
the arithmetic mean of scaled results from the LINPACK benchmark and of
SPECfp_rate2000. The scaling for each of the components is done by
normalising them on a scale of 0-100, with a rating of 100 for the top
result. Reducing these scaled results to one number, with arbitrary
equal weights for the components, also reduces the information content
to something uninterpretable. Note that SPECfp_rate2000 is itself
already the geometric mean of the throughput of multiple copies of the
14 SPECfp programs in jobs/hour, presumably to cover the throughput
aspect. Apart from the question whether SPECfp_rate2000 is in itself a
reasonable throughput metric, combining the two components, scaled raw
observed speed and throughput capability, makes it impossible to assess
how much of the compound result is to be attributed to each of the
aspects they represent. And if a user of this benchmark would, for some
valid reason, like to give one of these aspects more emphasis, there is
no way of knowing how to adjust the weights in the combination.
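To make the objection concrete, here is a minimal sketch of such a
compound rating. It assumes that "normalising on a scale of 0-100"
simply means dividing each result by the top result and multiplying by
100; the exact IDC scaling rule is not spelled out here, so this is an
assumption. The function names, the systems A, B and C, and all numbers
are hypothetical and only serve as illustration.

```python
def scale_to_top(results):
    """Scale raw results so that the best one gets 100 (assumed rule)."""
    top = max(results.values())
    return {name: 100.0 * value / top for name, value in results.items()}


def processor_rating(linpack, spec_rate):
    """Unweighted arithmetic mean of the two scaled components."""
    lin = scale_to_top(linpack)
    spec = scale_to_top(spec_rate)
    return {name: (lin[name] + spec[name]) / 2.0 for name in linpack}


# Hypothetical raw results, invented for illustration only.
linpack = {"A": 1000.0, "B": 600.0, "C": 200.0}   # e.g. Gflop/s
spec_rate = {"A": 30.0, "B": 50.0, "C": 40.0}     # e.g. SPECfp_rate2000

ratings = processor_rating(linpack, spec_rate)
for name in sorted(ratings):
    print(name, ratings[name])
```

In this made-up example systems A and B end up with the same rating,
although A is the fastest on LINPACK and B the best in throughput. The
compound number no longer tells which of the two is the case.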
This is only for the Processor Performance part. Similar
procedures are used for the Memory Subsystem part and the Scalability
part, only somewhat more involved because more parameters are included.
The results of all three parts are again combined by an unweighted
arithmetic mean to get the final ranking. Again I have to agree
completely with John McCalpin that the information content of this
overall rating is close to nil. In the tables available from IDC the
subratings are also given, but because each system part is itself
reduced to a single number this does not help much. And there is
another catch: the last column in the tables contains a "1" in many
cases, which means that data were missing for the system displayed.
Still, a rating is given. Instead of leaving out results for systems
with incomplete data, it is the IDC/HPC Forum's policy to assume a
value "10% below the value of the arithmetic mean for that value for
all systems involved" because "a performance reason is assumed for not
supplying the value". This policy has a flavour of coercion, if not
something stronger. Vendors will in this way be "encouraged" to turn in
the missing data even when they, like the author, think the methodology
is flawed. This might lead to something that may superficially be
interpreted as a high acceptance level of this benchmark but that has
in fact come about through fear of a bad IDC rating.
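The substitution rule and the final combination quoted above can be
sketched as follows. This is an illustration under stated assumptions,
not IDC's actual procedure: "the arithmetic mean for all systems
involved" is taken here to be the mean over the systems that did report
a value, and the helper names, systems and scores are hypothetical.

```python
def fill_missing(scores):
    """Replace a missing score (None) by a value 10% below the mean.

    Assumption: the mean is taken over the systems that reported a value.
    """
    reported = [v for v in scores.values() if v is not None]
    mean = sum(reported) / len(reported)
    substitute = 0.9 * mean
    return {name: substitute if v is None else v
            for name, v in scores.items()}


def overall_rating(processor, memory, scaling):
    """Unweighted arithmetic mean of the three part ratings."""
    return (processor + memory + scaling) / 3.0


# Hypothetical scaled Memory Subsystem scores; system D supplied no data.
memory_scores = {"A": 100.0, "B": 60.0, "C": 20.0, "D": None}
filled = fill_missing(memory_scores)
print(filled["D"])  # depends only on what A, B and C reported

# Hypothetical part ratings for one system, combined into one number.
print(overall_rating(90.0, filled["D"], 30.0))
```

Note that the value assigned to D is determined entirely by the results
of the other systems; nothing measured on D itself enters into it, and
yet it is averaged into D's final rating like any measured value.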
So, let us return to the question heading this note: How informative is
the IDC Balanced Rating HPC Benchmark? The answer can be short: It is
not. It looks like all the time and effort invested in the HPC Forum has
yielded a sub-standard product that should be radically improved or
withdrawn. At present it only adds to the confusion that already exists
in the HPC benchmark scene.
Are there no alternatives to the IDC Benchmark? There are, if one is
willing to give up one requirement: that the performance of an HPC
system can be characterised by one number. This may be unpleasant, but
it is as unavoidable as gravity: all the different aspects that
influence the performance of an HPC system interact in a complicated
way and, on top of that, the interactions differ between application
areas. So, the way to go would be to define benchmark programs that
minimally cover (a large part of) the application space. Both the
Linpack and the STREAMS benchmark are in fact good examples of this
approach. Of course, Linpack only covers a narrow part of the space to
be looked at, but it has the advantage that one knows exactly what is
measured and where it applies. Likewise, the STREAMS benchmark executes
a small set of bandwidth kernels that represent an important class of
simple operations that turn up frequently in floating-point dominated
computations. As such, they give a first hint of what may be expected
in terms of performance from codes that contain them. The EuroBen
benchmark is built on the same philosophy and includes, apart from
Linpack and STREAMS-like kernels, also basic algorithms and kernel
applications.
A similar reasoning holds for the throughput of a system, as
configuration, operating system, and scheduling software come into play
here. David Bailey reported on an interesting throughput exercise at
SC'2000, "ESP: A System Utilization Benchmark", that could be taken as
a starting point; likewise, the EuroBen Throughput Benchmark Framework
could be used.
For distributed-memory/scalability benchmarking it is the same: there
are alternatives, like the EuroBen-DM benchmark, the NPB, and
PARKBENCH. Each of these will teach you more about the scalability of
systems and give you more insight than the IDC benchmark will.
The only price to pay is to drop the idea that a single number could
give you a good picture of the total, balanced, performance of an HPC
system. This is not a high price when you really want to have some
insight into the performance of the systems you will use or buy.