Compaq ports Shmem and Checkpoint/Restart to ASCI clusters
Munich, 23 Sep 99. Within the US ASCI Pathforward Program, Compaq, in collaboration with Quadrics Supercomputers World (QSW) Ltd, Bristol, UK, is porting Shmem, the fast Cray communication library, to Alpha clusters. In addition, they are implementing Checkpoint/Restart as an add-on to the QSW Resource Management System for production support of long-running jobs. These elements maximise compatibility with current systems and protect investments in software. As Cray's highly optimised scientific library (SCILIB) still runs on the Alphas, Compaq and QSW can offer the Cray T3E environment on their cluster, which is connected via QsNet, the Quadrics communication network. With all these elements, Compaq plans to position its ASCI system, codenamed Sierra, in the T3E arena, as SGI/Cray has announced no successor to this computer. Will Sierra be a T3F, or even, as others call it, a U4F, since all the dimensions of the T3E have been incremented?
Within the scope of the U.S. ASCI Pathforward Program, Compaq and Quadrics are porting Shmem, the fast communication library, to their cluster of workstations for the high-performance technical market. The cluster, ranked number 49 in the current Top500 list, is based on standard Alpha workstations, the DS20. The first release will probably be available by the end of this year. Quadrics delivers the fast interconnect between the DS20 nodes, the system area network.
Another important issue for the support of production-level computing is Checkpoint/Restart. Long-running jobs write checkpoints to disk at certain time intervals. If the computer fails because of hardware or software problems, or if the job has to be dropped for operational reasons, the user restarts the job from the last checkpoint. This preserves the computing time already used; only the small amount of time between the last checkpoint and the drop is lost. This work is also being carried out as part of the ASCI activities. Scilib, Cray's highly optimised scientific library, is still available on the Alpha platforms.
Shmem, the innovative Cray Shared Memory Access Library, extends the explicit parallel programming capabilities of MPI and PVM. It provides very fast interprocessor communication using data passing or one-sided communication techniques. It also includes several highly optimised subroutines for collective reduction and other global operations. On the Cray T3E, it enables low-latency, high-bandwidth data transfer to and from the memory of other processing elements (PEs). On the cluster, such remote memory operations and other atomics make optimum use of the Quadrics QsNet network adaptor (called Elan). Some unnecessary communication protocol layers have been removed to improve the communication speed. Use of this library reduces communication latency by an order of magnitude compared with optimised MPI or PVM implementations on all Cray and SGI architectures, as it can be implemented very efficiently on both globally shared and distributed memory systems.
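The one-sided data passing described above can be pictured with a small toy model. The Python sketch below is not the Shmem API and does not run on several processors; the class `ToyShmem` and its methods `put`, `get`, `swap` and `sum_reduce` are hypothetical names chosen only for illustration. It keeps one memory block per PE in a single process and shows the key idea: the initiating PE reads or writes the memory of another PE directly, and the target PE does not have to post a matching receive.

```python
class ToyShmem:
    """Toy in-process model of one-sided communication (NOT the Shmem API)."""

    def __init__(self, n_pes, words_per_pe):
        # One equally sized memory block per PE; the same offset is valid on every PE.
        self.mem = [[0] * words_per_pe for _ in range(n_pes)]

    def _check(self, pe, offset, count):
        if not (0 <= pe < len(self.mem)):
            raise IndexError("no such PE")
        if offset < 0 or offset + count > len(self.mem[pe]):
            raise IndexError("access outside the PE's memory block")

    def put(self, target_pe, offset, values):
        """Write values into the target PE's memory; the target takes no action."""
        self._check(target_pe, offset, len(values))
        self.mem[target_pe][offset:offset + len(values)] = values

    def get(self, source_pe, offset, count):
        """Read count words from the source PE's memory; returns a copy."""
        self._check(source_pe, offset, count)
        return self.mem[source_pe][offset:offset + count]

    def swap(self, target_pe, offset, new_value):
        """Store new_value and return the old one (atomic on real hardware;
        trivially so here because the model is single-threaded)."""
        self._check(target_pe, offset, 1)
        old = self.mem[target_pe][offset]
        self.mem[target_pe][offset] = new_value
        return old

    def sum_reduce(self, offset):
        """Global sum of the word at the same offset on every PE."""
        return sum(block[offset] for block in self.mem)


def ring_demo(n_pes=4):
    """Each PE puts its own rank into word 0 of its right-hand neighbour."""
    world = ToyShmem(n_pes, words_per_pe=1)
    for rank in range(n_pes):
        world.put((rank + 1) % n_pes, 0, [rank])
    return world
```

In `ring_demo`, every PE ends up holding the rank of its left-hand neighbour without ever having issued a receive, which is the difference from two-sided message passing in MPI or PVM.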
For portability reasons, a subset of these routines is also available on Cray's parallel vector processor systems such as the Cray T90. Because this library interface has so far been installed only on Cray/SGI systems, programs using Shmem routines have not been portable to other platforms. Supported operations include remote data transfer (put and get), atomic swap, broadcast, reductions, and synchronisation operations. Programmers with heavy, long-running jobs use this very fast communication library to get the most performance out of the T3E.
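The Checkpoint/Restart cycle described earlier (periodic checkpoints written to disk, restart from the last one after a failure) can be illustrated with a minimal sketch. It is an application-level example in Python, not the mechanism being built into the QSW Resource Management System; the function `run_job`, its parameters and the JSON checkpoint format are assumptions made purely for illustration.

```python
import json
import os


def run_job(total_steps, checkpoint_every, path, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing to `path`.

    `fail_at` simulates a crash just before the given step (hypothetical
    parameter, used only to demonstrate the restart).
    """
    # Restart: resume from the last checkpoint if one exists.
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}
    step, total = state["step"], state["total"]

    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated hardware or software failure")
        total += step  # the actual "work" of one step
        step += 1
        if step % checkpoint_every == 0:
            # Write to a temporary file first, then rename, so that a crash
            # during the write never leaves a damaged checkpoint behind.
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"step": step, "total": total}, f)
            os.replace(tmp, path)
    return total
```

With a checkpoint every 10 steps and a failure at step 25, a second call to `run_job` with the same `path` resumes at step 20, so only the five steps since the last checkpoint are repeated.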
Uwe Harms