SC2000 Technical programme - OpenMP coming of age - first Giganet results

Utrecht 30 November 2000 The interplay between MPI and OpenMP was another sign of the coming of age of OpenMP. In the cluster infrastructure session the first results of comparing Myrinet against Giganet's cLan were presented. In the applications sessions Bill Gropp from Argonne National Laboratory reported on experiments with an unstructured mesh CFD code using PETSc, a parallel scientific library developed at ANL.

The interplay between MPI and OpenMP was another sign of the coming of age of OpenMP. The results discussed by David Henty (EPCC) and Hongzhan Shan (Ames Research Center) in their respective talks were not surprising, but they were nevertheless useful in their observation that OpenMP will not (yet) outperform MPI on an SMP system because of thread management and synchronisation overhead. Hongzhan Shan also reported results using the Cache Coherent Shared Address Space approach, which outperformed both the MPI and the OpenMP implementations on a Barnes-Hut code and a dynamic mesh application. Frank Cappello likewise found that all-MPI implementations did better on multi-way SMP IBM SP systems than MPI/OpenMP codes in the NAS Parallel Benchmark 2.3 set.

In the cluster infrastructure session the first results of comparing Myrinet against Giganet's cLan were presented. While Myrinet has been around for quite a while, cLan is rather new and builds on a hardware implementation of the Virtual Interface Architecture (VIA). It turned out that the latency for point-to-point communication is roughly 3 times lower for the Myrinet network than for cLan. For long messages, however, cLan outperforms Myrinet by 10-20%, despite the fact that the theoretical peak bandwidth of Myrinet is about 20% higher. This can at least partly be explained by the polling done by Myrinet's MPICH-GM, which results in a noticeable CPU overhead. In the same session Toshiyuki Takahashi from the Real World Computing Partnership (RWCP) discussed a middleware layer, PM2, that supports various MPI environments transparently. Currently Myrinet's MPICH-GM, Ethernet-based MPICH and an SMP-aware MPI version are supported by RWCP's MPICH-SCore. Linking with the appropriate library enables the use of binary code across platforms using the different MPI flavours.
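The trade-off between low latency and high delivered bandwidth can be illustrated with the usual linear cost model, in which sending a message takes the latency plus the message size divided by the bandwidth. The following is only a sketch: the latencies and effective bandwidths are hypothetical values chosen to mirror the reported ratios (a threefold latency gap, about 15% more delivered bandwidth for cLan on long messages). They are not figures from the talk, and the function names are illustrative.

```python
def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Linear cost model: time = latency + size / bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s


def crossover_size(lat_a, bw_a, lat_b, bw_b):
    """Message size at which network B (higher latency, higher bandwidth)
    becomes as fast as network A (lower latency, lower bandwidth)."""
    if not (lat_b > lat_a and bw_b > bw_a):
        raise ValueError("expects B to have higher latency and higher bandwidth")
    return (lat_b - lat_a) / (1.0 / bw_a - 1.0 / bw_b)


# Hypothetical numbers, NOT measurements from the session:
# latency in seconds, effective (delivered) bandwidth in bytes per second.
MYRINET = {"latency_s": 10e-6, "bandwidth": 100e6}
CLAN = {"latency_s": 30e-6, "bandwidth": 115e6}

n_star = crossover_size(
    MYRINET["latency_s"], MYRINET["bandwidth"],
    CLAN["latency_s"], CLAN["bandwidth"],
)

for size in (64, 1024, 65536, 1_000_000):
    t_myri = transfer_time(size, MYRINET["latency_s"], MYRINET["bandwidth"])
    t_clan = transfer_time(size, CLAN["latency_s"], CLAN["bandwidth"])
    faster = "Myrinet" if t_myri < t_clan else "cLan"
    print(f"{size:>9} bytes: Myrinet {t_myri * 1e6:9.1f} us, "
          f"cLan {t_clan * 1e6:9.1f} us -> {faster}")

print(f"crossover at roughly {n_star:.0f} bytes under these assumptions")
```

Under such a model short messages favour the low-latency network and long messages the one with the higher delivered bandwidth; where the crossover falls depends entirely on the assumed numbers.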

In the applications sessions Bill Gropp from Argonne National Laboratory reported on experiments with an unstructured mesh CFD code using PETSc, a parallel scientific library developed at ANL. He described the tuning of this code, which included doing the preconditioning of the matrix describing the model in single precision, because the preconditioner only determines a very approximate solution anyway. This idea did indeed result in a net speedup of the code. Furthermore, MPI/OpenMP and MPI-only implementations were considered. With 2 CPUs/node a somewhat better performance was observed for the former implementation on the ASCI Red machine, contrary to present experiences on other platforms. The actual bandwidth as derived for this code is low: 4.6-6.9 MB/s, as contrasted with the available bandwidth of about 90 MB/s. A third item in the tuning was the use of the grid partitioner KMetis instead of the more generally used PMetis. KMetis attempts to minimise the distances between the nodes and achieved a somewhat better performance than PMetis.

Following recent trends, applications from the genomics field are now entering the application sessions, exemplified by a talk by Quinn Snell on Parallel Phylogenetic Inference. His talk contained several remarkable elements. First, although parallel algorithms exist to search DNA databases, they are not used because researchers do not consider them to be as reliable as the current sequential algorithms. Quinn described the DOGMA package, which is parallel but uses the "classical" algorithms. A second remarkable fact was the large speedup that was achieved: a serial search that would take 11 months can be reduced to 2 hours on an eight-processor system, clearly a speedup far beyond the ideal linear speedups hoped for in most numerical codes.
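To get a feeling for how far beyond linear this is, the quoted figures can be turned into a speedup and a parallel efficiency. The sketch below uses only the numbers mentioned above (11 months, 2 hours, eight processors); the 30-day month is an assumption made for the arithmetic, and the function names are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30  # assumption: the report only says "11 months"


def speedup(serial_hours, parallel_hours):
    """Ratio of serial run time to parallel run time."""
    return serial_hours / parallel_hours


def efficiency(speedup_value, processors):
    """Speedup per processor; 1.0 corresponds to ideal linear speedup."""
    return speedup_value / processors


serial_hours = 11 * DAYS_PER_MONTH * HOURS_PER_DAY  # 11 months in hours
parallel_hours = 2
processors = 8

s = speedup(serial_hours, parallel_hours)
e = efficiency(s, processors)

print(f"serial run time : {serial_hours} hours")
print(f"speedup         : {s:.0f}x on {processors} processors")
print(f"efficiency      : {e:.0f} (ideal linear speedup would give 1)")
```

With an ideal linear speedup of 8, the serial run would only have shrunk to about a month and a half, so most of the gain must come from something other than simply dividing the work over the processors.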

Compared with a few years ago, the situation of performance analysis based on hardware support has improved enormously. Virtually all important hardware platforms now contain CPUs that are equipped with hardware counters that capture events like floating point operations, cache misses, TLB misses, etc. Tools can be built upon these counters that give detailed information about the behaviour of a program and, ultimately, about its level of performance. Brian Buck presented work in which cache misses are recorded in order to pinpoint the address regions they are associated with. Performance counter interrupts and n-way search were compared as ways of locating the problematic regions. Simulations indicate that both methods can provide the required information.

A second talk in this session presented the performance analysis tools from the PAPI project at the University of Tennessee. Jack Dongarra first showed some results of this project in June at the HPC workshop in Cetraro, Italy. Since then, the base of platforms for which it is available and the functionality have been extended considerably. The library contains a "performometer" that can display the flop rate of a code instantaneously in a graph against the elapsed time. Furthermore, regions of code can be colour-marked to bring out parts that perform at distinct levels. Presently, a "cacheometer" routine is near completion which can display the cache misses/hits graphically in the same way. Evidently, such tools can be of great use in tuning codes.
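The idea behind a flop-rate display can be sketched in a few lines: read a cumulative floating-point operation counter at intervals and divide the difference between two readings by the elapsed time. The sketch below does not use the PAPI API. The sample readings are made up for illustration and `flop_rates` is a hypothetical helper, not a routine from the library.

```python
def flop_rates(samples):
    """Turn (elapsed_seconds, cumulative_flops) readings into
    (elapsed_seconds, flops_per_second) pairs, one per interval."""
    rates = []
    for (t0, f0), (t1, f1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("sample times must be strictly increasing")
        rates.append((t1, (f1 - f0) / dt))
    return rates


# Made-up counter readings: (elapsed time in s, cumulative flop count).
SAMPLES = [
    (0.0, 0),
    (1.0, 200_000_000),
    (2.0, 450_000_000),
    (3.0, 500_000_000),
]

rates = flop_rates(SAMPLES)

# Crude text "graph" of Mflop/s against elapsed time.
for t, rate in rates:
    mflops = rate / 1e6
    print(f"t = {t:4.1f} s  {mflops:7.1f} Mflop/s  " + "#" * int(mflops // 10))
```

A drop in the rate between two readings, as in the last interval of the made-up data, is the kind of feature that would prompt a closer look at the corresponding region of code.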


Aad van der Steen
