© The HOISe-NM Consortium 1997
Experiences with a Tflop/s machine - ASCI Red: number 1 in the TOP500

Mannheim, 19-6-97

The Accelerated Strategic Computing Initiative (ASCI) focuses on advancing three-dimensional, full-physics calculations up to "full-system" simulation, applied to virtual testing. The Sandia, Los Alamos and Lawrence Livermore National Laboratories are involved. ASCI relies on massively parallel supercomputers, initially capable of delivering over 1 Tflop/s, to perform such demanding computations. The ASCI "Red" machine at Sandia consists of over 4500 nodes (over 9000 processors) with a peak computational rate of 1.8 Tflop/s, 567 GBytes of memory, and 2 TBytes of disk storage. Using MPPs in a "production" environment raises many new issues, e.g. parallel I/O, mesh generation, visualisation, archival storage, high-bandwidth networking and the development of parallel algorithms. Mark Christon discussed the issues and lessons learned to date on the ASCI Red machine.

The ASCI program

The Accelerated Strategic Computing Initiative (ASCI) is a multi-laboratory program sponsored by the United States (U.S.) Department of Energy. It focuses on creating the advanced simulation capabilities that are essential to maintaining the safety and reliability of the U.S. nuclear stockpile. The program clearly stimulates the U.S. computer manufacturing industry. It was initiated with the procurement of the "Red" machine. The MP-Linpack record of 1.05 Tflop/s was achieved on the "three-row" configuration in December 1996. Additionally, two 0.1 Tflop/s systems have been delivered as a prelude to the two "Blue" 3 Tflop/s computers (IBM SP systems). The ASCI Red machine is a distributed-memory machine, while the ASCI Blue machines use clusters of symmetric multiprocessors.
The ASCI Red architecture and its environment

The fully configured ASCI Red machine will consist of 4536 computational nodes with 567 GBytes of memory (64 MBytes/processor) plus 32 service, 36 I/O, 2 system and 6 network nodes, totalling 9224 Pentium Pros. In its final configuration, the Red machine will consist of 85 cabinets in four rows (1600 square feet). The status as of March 1997 is 1.4 Tflop/s peak, 3536 nodes and 442 GBytes of memory.

Each compute node comprises two 200 MHz Intel Pentium Pro processors with a total of 128 MBytes of memory. The inter-processor communication network is configured in a two-dimensional, 2-plane mesh topology with a bi-directional bandwidth of 800 MBytes/second. The cross-sectional bandwidth of the inter-processor network, 51 GBytes/second, is a measure of the total potential message traffic capacity between all nodes in the system.

Interactive access to the ASCI Red machine is provided via service nodes that consist of two Pentium Pro processors sharing 256 MBytes of memory. Like the service nodes, the I/O nodes consist of two Pentium Pro processors that share 256 MBytes of memory. Archival storage is provided by the Scalable Mass Storage System (SMSS) hardware, currently more than 27 TBytes of disk and uncompressed, high-speed, automated tape storage. All SMSS systems are connected to Ethernet and ATM networks.

Software issues

Two operating systems are installed. The "TFLOPs Operating System" is a distributed OSF UNIX that runs on the service and I/O nodes. It is used for boot and configuration support, system administration, user logins, user commands and services, and development tools. The Cougar operating system, a descendant of SUNMOS and PUMA, runs on the compute nodes. Cougar is a very efficient, high-performance operating system providing program loading, memory management, message passing, signal and exit handling, and run-time support for application languages.
Cougar occupies less than 500 KBytes of memory.

Although ASCI Red is a distributed-memory, Multiple-Instruction Multiple-Data (MIMD) machine, it supports both explicit and implicit parallel programming models. In the explicit model, the data structures are manually decomposed and distributed among the computational nodes. The code executes on each compute node, and messages are passed using a message-passing protocol, i.e. MPI or NX. MPI and NX are layered on top of Cougar's communications protocol in a way that gives optimal performance for both protocols. In the implicit parallel model, source-level directives inform the compiler about the data structures that are to be distributed. This model is supported on the Red machine via HPF (High-Performance Fortran). C, C++, Fortran 77, Fortran 90, and HPF are supported on the compute nodes of ASCI Red. An interactive, parallel debugger is available with a character interface and an X-window interface.

Applications issues for MPPs

Christon presented several application programs in some detail.

ALEGRA, which is used in the field of electromagnetics, is a parallel Arbitrary Lagrangian-Eulerian (ALE) code written in a mixture of C++, C, and Fortran 77. The code can treat 1-D, 2-D and 3-D geometries and uses advanced second-order accurate, unstructured grid algorithms for advection and interface reconstruction.

CTH is an Eulerian computational shock physics code capable of treating 1-D, 2-D and 3-D geometries. It uses a logically structured grid and can simulate multi-phase and mixed-phase materials, elastic-viscoplastic solids, and fracture. CTH was used on the ASCI Red to simulate an asteroid crashing into Earth. The simulation involved a 1 km diameter comet (weighing about 1 billion tons) traveling at 60 km/s and entering Earth's atmosphere at a 45 degree angle. After 0.7 seconds, the comet hits the water surface. The comet and large quantities of water are vaporized by the tremendous energy of the impact (equivalent to 300 x 10^9 tons of TNT).
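The impact energy quoted above can be checked to order of magnitude with the classical kinetic energy formula. The sketch below is not part of Christon's talk; it assumes metric tons, assumes the comet still travels at the full 60 km/s on impact, and uses the conventional 4.184 x 10^9 J per ton of TNT. All function and variable names are illustrative.

```python
# Order-of-magnitude check of the impact energy quoted in the text.
# Assumptions (not from the talk): metric tons, no loss of speed before
# impact, and the conventional value of 4.184e9 J per ton of TNT.

TON_KG = 1.0e3            # one metric ton in kilograms (assumed unit)
TNT_J_PER_TON = 4.184e9   # conventional energy of one ton of TNT, in joules


def kinetic_energy_joules(mass_kg: float, speed_m_s: float) -> float:
    """Classical kinetic energy, E = 1/2 * m * v^2."""
    return 0.5 * mass_kg * speed_m_s ** 2


def tnt_equivalent_tons(energy_j: float) -> float:
    """Convert an energy in joules to tons of TNT."""
    return energy_j / TNT_J_PER_TON


mass_kg = 1.0e9 * TON_KG   # "about 1 billion tons"
speed_m_s = 60.0e3         # 60 km/s
energy_j = kinetic_energy_joules(mass_kg, speed_m_s)
tnt_tons = tnt_equivalent_tons(energy_j)

print(f"kinetic energy : {energy_j:.2e} J")
print(f"TNT equivalent : {tnt_tons:.2e} tons")
```

Under these assumptions the estimate comes out at roughly 4 x 10^11 tons of TNT, the same order of magnitude as the 300 x 10^9 tons quoted. The remaining difference is well within the uncertainty of "about 1 billion tons" for the mass.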
GILA is a computational fluid dynamics code for transient, viscous, incompressible flow calculations that require high-resolution, unstructured meshes. It also includes a large eddy simulation model that has been exercised on exterior flow problems with Reynolds numbers as high as 10^5. GILA ran on over 800 processors of the ASCI Red machine with parallel efficiencies ranging from about 75% to 95%, depending upon the problem and the quality of the mesh decomposition.

MPSalsa is a three-dimensional, massively parallel, chemically reacting flow code for the steady and transient simulation of incompressible and low Mach number compressible fluid flow, heat transfer, and mass transfer with bulk fluid and surface chemical reaction kinetics.

Researchers at Sandia are currently implementing scalable, parallel visualization tools, for example an isosurface generator with polygon decimation and a parallel volume rendering system for structured volume data sets.

Applications issues and lessons learned

To obtain high flop rates on a massively parallel supercomputer, the hierarchical memory access has to be considered carefully. This means programming for cache re-use, ensuring thread safety on compute nodes that consist of symmetric multiprocessors, and making appropriate use of domain-based parallelism. On massively parallel machines it is difficult to achieve a balance between compute power, I/O bandwidth, and disk and memory capacity. Mesh generation, domain decomposition, and visualization are still primarily serial processes. Thus, simply scaling up the analysis model for massively parallel supercomputers is not practical.

The data explosion - I/O limitations

MPPs bring some very new aspects into the programmer's life. A transient, unstructured grid calculation with 10^8 - 10^9 grid points generates O(10) TBytes of data, far exceeding the O(1) TBytes disk capacity of the Red machine.
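A rough estimate shows how a transient calculation of this size reaches the quoted data volume. The sketch below is only an illustration: the number of fields stored per grid point, the 8-byte value size and the number of time dumps are hypothetical values chosen for the example, not figures from the talk. Only the grid sizes come from the text.

```python
# Back-of-the-envelope estimate of the output volume of a transient run.
# Only the grid sizes come from the text; the fields per grid point, the
# 8-byte value size and the number of time dumps are hypothetical.

BYTES_PER_VALUE = 8  # double precision (assumed)


def output_bytes(grid_points: int, fields_per_point: int, dumps: int) -> int:
    """Total bytes written if every field is saved at every time dump."""
    return grid_points * fields_per_point * BYTES_PER_VALUE * dumps


def terabytes(n_bytes: int) -> float:
    """Convert bytes to TBytes (10^12 bytes)."""
    return n_bytes / 1e12


for grid_points in (10**8, 10**9):
    total = output_bytes(grid_points, fields_per_point=5, dumps=250)
    print(f"{grid_points} grid points -> {terabytes(total):.1f} TBytes")
```

With these illustrative parameters, the upper end of the grid-size range alone produces about 10 TBytes, which is why the full set of time dumps cannot be kept on an O(1) TBytes disk system.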
The problem of finite disk resources is compounded by limited parallelism in the I/O subsystem. The 18 I/O nodes have to service the requests of approximately 4500 compute nodes. Thus only a group of processors can perform I/O while the rest are idle. To obtain 1 GB/s of I/O bandwidth, the parallel file system requires large blocks of data. The cwrite and fwrite functions provide the best performance, approximately 20 MB/s. The high-level libraries NetCDF and HDF do not deliver acceptable levels of performance. Efforts are underway to develop a high-performance I/O library for parallel collective I/O operations that achieves the bandwidth of the system.

Conclusion

The ASCI Red machine at Sandia provides a significant resource for "virtual experimentation" and design development. Many parallel codes have been developed, and the use of parallel computing for production simulation of a wide variety of physics is becoming the norm. The ASCI Red machine brings new infrastructure challenges, which are being met with innovative algorithmic approaches and methods for treating simulation results.