SC2000 Technical programme - OpenMP coming of age - first results for Giganet
Utrecht 30 November 2000
The interplay between MPI and OpenMP was another sign of the coming of
age of OpenMP. The results that were discussed by David Henty, EPCC and
Hongzhan Shan, Ames Research Center in their respective talks were not
surprising but nevertheless useful in their observations that OpenMP
will not (yet) outperform MPI on an SMP system due to thread management
and synchronisation overhead. Hongzhan Shan also reported results using
the Cache Coherent Shared Address Space approach, which outperformed both
the MPI and the OpenMP implementations on a Barnes-Hut code and a dynamic
mesh application. Frank Cappello likewise found that all-MPI
implementations did better on multi-way SMP IBM SP systems than
MPI/OpenMP codes in the NAS Parallel Benchmark 2.3 set.
In the cluster infrastructure session the first results of comparing
Myrinet against Giganet's cLan were presented. While Myrinet has been
around for quite a while, cLan is rather new and builds on a hardware
implementation of the Virtual Interface Architecture (VIA). It turned
out that the latency for point-to-point communication is roughly 3 times
lower for the Myrinet network than for cLan. However, for long
messages cLan outperforms Myrinet by 10-20% despite the fact that the
theoretical peak bandwidth for Myrinet is about 20% higher. This can at
least partly be explained by the polling done by Myrinet's MPICH-GM,
which results in a noticeable CPU overhead. In the same session Toshiyuki
Takahashi from the Real World Computing Partnership (RWCP) discussed a
middleware layer, PM2, that supports various MPI environments
transparently. Currently Myrinet's MPICH-GM, Ethernet-based MPICH and an
SMP aware MPI version are supported by RWCP's MPICH-SCore. Linking with
the appropriate library enables the use of binary code across platforms
using the different MPI flavours.
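The Myrinet/cLan comparison above (lower latency on one side, higher sustained bandwidth for long messages on the other) can be illustrated with the usual linear cost model, time = latency + size / bandwidth. The sketch below is only an illustration: the report gives ratios, not absolute figures, so the latency and bandwidth values are hypothetical placeholders chosen to respect those ratios (a factor 3 in latency, 15% in sustained bandwidth), and the function names are invented here.

```python
def transfer_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: fixed start-up latency plus size over sustained bandwidth."""
    return latency_s + nbytes / bandwidth_bytes_per_s


def crossover_size(lat_a, bw_a, lat_b, bw_b):
    """Message size (bytes) above which network B beats network A.

    Returns None if B has no bandwidth advantage and so never catches up.
    """
    if bw_b <= bw_a:
        return None
    # Solve lat_a + n / bw_a == lat_b + n / bw_b for n.
    n = (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)
    return max(n, 0.0)


# Hypothetical values, NOT measurements from the talk:
# network A has 3x lower latency, network B 15% higher sustained bandwidth.
LAT_A, BW_A = 10e-6, 100e6   # 10 microseconds, 100 MB/s
LAT_B, BW_B = 30e-6, 115e6   # 30 microseconds, 115 MB/s

n_cross = crossover_size(LAT_A, BW_A, LAT_B, BW_B)
print(f"B overtakes A above roughly {n_cross / 1024:.1f} KiB")
```

Under this model the low-latency network wins for short messages and the high-bandwidth network for long ones; where the crossover lies depends entirely on the actual measured values.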
In the applications sessions Bill Gropp from Argonne National
Laboratory reported on experiments with an unstructured mesh CFD
code using PETSc, a parallel scientific library developed at ANL. He
described the tuning of this code, which included carrying out the
preconditioning of the matrix describing the model in single precision,
because the preconditioner determines only a very approximate solution anyway. This idea
did indeed result in a net speedup of the code. Furthermore, MPI/OpenMP
and MPI-only implementations were considered. With 2 CPUs/node a
somewhat better performance was observed for the former implementation
on the ASCI Red machine, contrary to present experiences on other
platforms. The actual bandwidth as derived for this code is low:
4.6-6.9 MB/s as contrasted with the available bandwidth of about 90
MB/s. A third item in the tuning was the use of the grid partitioner
KMetis instead of the more generally used PMetis. KMetis attempts to
minimise the distances between the nodes and achieved a somewhat better
performance than PMetis. In line with recent trends, applications from
the genomics field are now entering the application sessions, as
exemplified by a talk by Quinn Snell on Parallel Phylogenetic
Inference. His talk contained several remarkable elements. First,
although parallel algorithms exist to search DNA databases, they are
not used because researchers do not consider them to be as reliable as
the current sequential algorithms. Snell described the DOGMA package,
which is parallel but uses the "classical" algorithms. A second
remarkable fact was the large speedup that was achieved: a serial
search that would take 11 months can be reduced to 2 hours on an
eight-processor system, clearly a speedup that is far beyond the ideal linear
speedups hoped for in most numerical codes.
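To put the quoted figures in perspective, the speedup and the parallel efficiency can be worked out directly. The short calculation below assumes 30-day months (an assumption made here, since the report only says "11 months"); the function names are illustrative.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day months


def speedup(serial_hours, parallel_hours):
    """Ratio of serial run time to parallel run time."""
    return serial_hours / parallel_hours


def parallel_efficiency(speedup_value, nprocs):
    """Speedup per processor; 1.0 corresponds to ideal linear scaling."""
    return speedup_value / nprocs


serial_hours = 11 * HOURS_PER_MONTH      # 11 months of serial search
s = speedup(serial_hours, 2.0)           # reduced to 2 hours
eff = parallel_efficiency(s, 8)          # on an eight-processor system

print(f"speedup ~ {s:.0f}, efficiency ~ {eff:.0f}x the linear ideal")
```

With these assumptions the speedup is several thousand on eight processors, so the bulk of the gain cannot come from parallelism alone; the report does not say where the remainder comes from.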
Compared with a few years ago, the situation for performance analysis based
on hardware support has improved enormously. Virtually all important
hardware platforms now contain CPUs that are equipped with hardware
counters that capture events like floating point operations, cache
misses, TLB misses, etc. Tools built upon these counters can
give detailed information about the behaviour of a
program and, ultimately, about its level of performance. Brian Buck
presented work in which cache misses are recorded in order to pinpoint
the address regions they are associated with. Performance counter
interrupts and n-way search were compared as ways of finding the problematic regions.
Simulations indicate that both methods can provide the required
information. A second talk in this session presented the performance
analysis tools from the PAPI project at the University of Tennessee.
Jack Dongarra first showed some results of this project in June at the
HPC workshop in Cetraro, Italy. Since then, the base of platforms for
which it is available and the functionality have been extended
considerably. The library contains a "performometer" that can
display the flop rate of a code instantaneously in a graph against the
elapsed time. Furthermore, regions of code can be colour-marked to bring
out parts that perform at distinct levels. Presently, a "cacheometer"
routine is near completion, which can display the cache
misses/hits graphically in the same way. Evidently, such tools can be of great use in
tuning codes.
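The idea of displaying a flop rate against elapsed time can be sketched as follows. This is not the PAPI interface and makes no calls into it: it assumes that some counter-reading layer has already delivered periodic samples of a cumulative floating point operation count, and the sample values shown are hypothetical.

```python
def flop_rates(samples):
    """Turn cumulative counter samples into per-interval rates.

    samples: list of (elapsed_seconds, cumulative_flop_count), time increasing.
    Returns a list of (interval_end_time, flops_per_second).
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((t1, (c1 - c0) / dt))
    return rates


def text_graph(rates, width=40):
    """Render the rates as a crude horizontal bar chart, one line per interval."""
    if not rates:
        return []
    peak = max(r for _, r in rates) or 1.0  # avoid dividing by zero
    lines = []
    for t, r in rates:
        bar = "#" * int(round(width * r / peak))
        lines.append(f"{t:6.2f}s {r / 1e6:8.1f} Mflop/s {bar}")
    return lines


# Hypothetical samples: (elapsed seconds, cumulative flop count).
samples = [(0.0, 0), (1.0, 50_000_000), (2.0, 250_000_000), (3.0, 300_000_000)]
rates = flop_rates(samples)
for line in text_graph(rates):
    print(line)
```

A phase that performs at a distinctly different level shows up as a longer or shorter bar, which is the kind of information such graphical tools are meant to bring out.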
Aad van der Steen