EnterTheGrid - PrimeurMonthly

EnterTheGrid - PrimeurMagazine is the premier Grid Computing and Supercomputing information source in the world. With PrimeurMonthly we provide you a free update with grid computing and supercomputer-news and in-depth analysis.

Contents May 2004
HPCx Industry Day at CCLRC Daresbury Laboratory
Daresbury, 05 April 2004. CCLRC Daresbury Laboratory hosted a one-day Forum in HPC for Industry. Over 50 people attended this event: potential users of the HPCx service from the industrial sector, a selection of software and hardware vendors, academic researchers and Research Council officials with high-performance computing interests. The talks covered computational engineering, life sciences, environment, materials and chemistry simulations. The focus was on reviewing the productivity impact that systems with sustained performances of 1 Teraflop/s, 10 Teraflop/s and even 100 Teraflop/s would have on industrial R&D applications. (Chris Lazou)

The lofty objectives of the meeting were to introduce, raise awareness of and demonstrate how Terascale class systems such as HPCx and successive generation systems can meet the challenges of industrial R&D; to promote the skill-base available in HPCx for efficiently and effectively exploiting HPC systems and for developing new scientific methodologies and simulation technologies; and to explore the scale and scope of potential commercial interest in the HPCx service and successor facilities.

Daresbury houses a large-scale system, named HPCx, owned by a consortium consisting of the University of Edinburgh, EPCC, CCLRC and IBM. It is the UK's premier academic research computing service. Its mission is to create a world-class organisation for leading edge capability class simulations requiring access to the highest levels of computational performance.

This is not an idle boast. An extensive study simulating amphiphilic fluids uncovered some very novel behaviour, including the self-assembly of the beautiful liquid crystalline gyroid phase. This was run on the Reality in Phoenix, Arizona, using 1,024 CPUs on the HPCx system plus 2,048 CPUs at PSC, Pittsburgh, and other resources. It performed the biggest Lattice-Boltzmann simulation in the world and, for its innovative results, was awarded a Gold Star at SC2003 last autumn.

To give a flavour of the computer resources needed for this type of application, here is an example from another field using the NAMD code. This is a large-scale molecular dynamics simulation of the interactions between a T-cell and an antigen-presenting cell, the TCR-peptide-MHC complex, using 96,796 atoms. A 2ps simulation requires 20,000 CPU seconds using 1,024 IBM P4 processors. A 1 nanosecond simulation can be performed in 10 hours elapsed time instead of months, which means that progress on tackling large-scale Molecular Dynamics problems has been much faster than originally anticipated.

The HPCx system comprises 1,280 IBM Power 4 p690 processors with 1Terabyte of memory, the "Colony" switch and 18Terabytes of high-speed disk.

The hardware is being replaced with 1,536 IBM P690+ processors at 1.7GHz, 1.5Terabytes of memory and the "Federation" High Performance Switch, HPS. This enhanced system, with a peak performance of 10.44Teraflop/s, is expected to be fully operational this June and promises to deliver about 6 Teraflop/s on Linpack, placing it, for the moment, in the top ten most powerful academic research supercomputers in the world. How much of this potential performance is delivered to the user application as productive work is the biggest challenge facing computing service providers.
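The quoted peak figure can be checked with simple arithmetic. The sketch below assumes that each processor completes 4 floating-point operations per clock cycle (two fused multiply-add units); that value is an assumption of this example, not something stated in the article, and the function name is purely illustrative.

```python
def peak_teraflops(processors: int, clock_ghz: float, flops_per_cycle: int = 4) -> float:
    """Theoretical peak in Teraflop/s: processors x clock rate x flops per cycle.

    flops_per_cycle=4 is an ASSUMPTION of this sketch (two fused
    multiply-add units per processor); the article does not state it.
    """
    return processors * clock_ghz * flops_per_cycle / 1000.0


# Figures quoted in the article: 1,536 processors at 1.7GHz, ~6 Teraflop/s on Linpack.
peak = peak_teraflops(1536, 1.7)
linpack_fraction = 6.0 / peak

print(f"peak = {peak:.2f} Teraflop/s")
print(f"Linpack fraction of peak = {linpack_fraction:.0%}")
```

Under that assumption the product reproduces the 10.44Teraflop/s peak, and the expected Linpack result corresponds to a little under 60% of peak.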

The final phase for this 85 million US dollars system is an upgrade in year 2006 to provide a 22Teraflop/s peak performance. By year 2006, a new system should also be in place, purchased under a new competitive tender, codenamed "HECTOR" and costing an estimated 200-300 million US dollars, amortised over six years.

In the Daresbury upgrade, the OS software was also upgraded; for example, PSSP had to be replaced by CSM to run the GPFS file system on the new p690+/HPS hardware, and the internal frame partitioning had to be changed from the 4x8-way LPAR to 1x32-way LPAR. Incidentally, the replacement of the "Colony" switch with HPS reduces latency from 10usec to 8usec, so the system should be more balanced than in the past.

However, to make full use of scalar parallel computers with thousands of processors, computational scientists and engineers face three daunting challenges: managing the memory hierarchy, as in the IBM P690 family of processors; expressing and managing concurrency in their application codes; and using optimisation techniques to achieve efficient sequential execution. With complex systems a small mismatch can easily be magnified into massive bottlenecks, which substantially reduce efficiency.

As in many other supercomputer centres, Daresbury set up an HPCx Terascaling team of computer scientists to tackle the "performance gap" problem. The team, led by Martyn Guest, collaborates with consortia developing large application codes, targeting modifications that enable these codes to use 1,000 or more processors "efficiently".

Martyn started his talk by listing some of the existing large-scale parallel computers, mainly in the Federal Labs and NSF sites in the USA, built mainly with IBM p-series commodity compute servers tied together by a relatively high-speed communication fabric. According to Martyn, the planned 100Teraflop/s ASCI Purple, based on the IBM Power 5 processor, seems to be a limiting plateau in the evolution of parallel scalar computer architectures of this type. Hence, the new PetaOPs architecture projects are looking at alternative paradigms. For example, the IBM BG/L project envisages a cellular architecture with 100 thousand or more CPUs and is intended for general-purpose use. Other approaches, with R&D funding from DARPA, include the Cascade project at Cray and new hardware developments by Sun Microsystems.

In the past 10 years, peak performance on supercomputers increased a hundredfold; in the next 5+ years it is likely to increase by another 1,000 times. Efficiency, however, has declined from the 30-40% common in the 1990s, and still common today on vector supercomputers from NEC, to as little as 5-10% on today's parallel scalar supercomputers, and may decrease even further, to 1-3%, on future scalar machines. This is the so-called "performance gap" crisis.

The biggest conundrum in large-scale computing is that, for an increase by O(N) in the size of the science/engineering problem to be studied, classical algorithms require O(N^2) or even O(N^3) computational resources to perform the simulation.
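To make the scaling problem concrete, the short sketch below computes how much more work a tenfold larger problem needs under linear, quadratic and cubic cost models. The factor of ten and the function name are illustrative choices, not figures from the talk.

```python
def cost_growth(size_factor: float, exponent: int) -> float:
    """Factor by which the work grows when the problem size grows by
    size_factor, for an algorithm whose cost scales as N**exponent."""
    return size_factor ** exponent


# Illustrative case: the problem becomes 10 times larger.
for exponent, label in [(1, "O(N)"), (2, "O(N^2)"), (3, "O(N^3)")]:
    growth = cost_growth(10, exponent)
    print(f"{label}: 10x larger problem -> {growth:,.0f}x more work")
```

A machine ten times faster therefore buys a tenfold larger problem only for a linear-scaling method; with a cubic method the same machine accommodates a problem only about 2.15 times larger (the cube root of ten).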

The research challenge is therefore for new software, implementing new algorithms matching present and future hardware, enabling scientific codes to model and simulate physical processes and systems at near linear scaling. This is a continuous challenge; as computer architectures undergo fundamental changes, numerical algorithms need to track them and scale linearly, to enable the use of thousands and even millions of processors.

Martyn went on to say that, on present scalar IBM hardware, one is faced with managing thousands of CPUs and a memory hierarchy, using commodity DRAM and a communication fabric far slower than access to local memory. Each CPU has registers, on and off chip caches, main memory and "virtual" memory (disks). Each successive level requires more time to access (latency) and has slower transfer rates (bandwidth). A parallel computer adds an additional level, that of remote main memory.

A programming model based on non-uniform memory access (NUMA) explicitly recognises this hierarchy and allows performance improvements developed for sequential algorithms, e.g. data blocking, to be applied directly to parallel algorithms. NUMA algorithms are typically more efficient and easier to design than those based on MPI and/or OpenMP.
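Data blocking, mentioned above, restructures loops so that a small tile of data is reused many times while it is still in a fast level of the hierarchy. The following is a minimal sketch of the idea for matrix multiplication; it is not taken from any of the HPCx codes, the function names and block size are illustrative, and pure Python will not show the cache benefit that a compiled implementation would.

```python
def matmul_naive(a, b):
    """Reference triple loop over square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_blocked(a, b, block=2):
    """Same product, computed one block x block tile at a time."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                # All work below touches only three small tiles; in a
                # compiled code 'block' is chosen so they fit in cache.
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c


# Small integer-valued test matrices, so floating-point results are exact.
a = [[float(i + j) for j in range(5)] for i in range(5)]
b = [[float(i * j + 1) for j in range(5)] for i in range(5)]
assert matmul_blocked(a, b, block=2) == matmul_naive(a, b)
```

The same tiling can be applied one level up, with a tile held in a node's local memory rather than in cache, which is the sense in which a sequential optimisation carries over to the parallel case.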

Another challenge is the expression and management of concurrency in the application. How much parallelism is needed and at what level of granularity? David Bailey (as early as 1997) showed that the minimum level of concurrency (Lc) needed to sustain a given level of performance, P, on a single processor is: Lc=P x Lm, where Lm is the memory latency on the processor node. For a TeraOps computer built with commodity (100ns) DRAM memory, Lc = 100,000.

A more detailed analysis of the computer determines how much of this concurrency must be coarse or fine grain. For example, if each processor in the TeraOps machine has a peak speed of 1GigaOp, then within each processor the algorithm must provide a fine-grain concurrency of at least 100 to support 1GigaOp. The additional factor of 1,000 must come from coarse-grain parallelism.
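The figures in this argument follow directly from the formula Lc = P x Lm. The sketch below simply evaluates it for the values quoted in the text; the function and variable names are illustrative.

```python
def required_concurrency(ops_per_second: float, latency_seconds: float) -> float:
    """Lc = P x Lm: number of operations that must be in flight to keep a
    machine of rate P busy while each memory access takes Lm seconds."""
    return ops_per_second * latency_seconds


DRAM_LATENCY = 100e-9  # 100ns commodity DRAM, as quoted in the text

total = required_concurrency(1e12, DRAM_LATENCY)  # whole TeraOps machine
fine = required_concurrency(1e9, DRAM_LATENCY)    # one 1GigaOp processor
coarse = total / fine                             # remaining factor across processors

# Rounded to suppress floating-point noise in the products.
print(round(total), round(fine), round(coarse))
```

The total of 100,000 splits into a factor of 100 that each processor must find within its own instruction stream and a factor of 1,000 that must come from running processors in parallel.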

Another question is how coarse is coarse-grain? Apply the same formula, where the latency, Ln, is now the latency for accessing remote memory. Using 10us for current networks and 256MB/sec for bandwidth, short messages experience the full latency and the formula gives 100,000. For large messages the average latency per reference approaches 1/bandwidth, which brings the coarse-grain figure down to ~312 on the IBM P4+, so on this system large messages must be the rule.
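The article does not state the processing rate or the size of a remote reference used to reach these two numbers. As a hedged reconstruction, both quoted figures are reproduced if one assumes a rate of 10 GigaOp/s and 8-byte (64-bit) words, with 256MB/sec read as 256 x 10^6 bytes per second. All three of those choices are assumptions of this sketch, not values given in the talk.

```python
REMOTE_LATENCY = 10e-6  # 10us network latency, from the text
BANDWIDTH = 256e6       # 256MB/sec from the text; decimal megabytes ASSUMED
WORD_BYTES = 8          # ASSUMPTION: one 64-bit word per remote reference
RATE = 1e10             # ASSUMPTION: 10 GigaOp/s, chosen to match the quoted figures

# Short messages: every remote word pays the full network latency.
short_message_grain = RATE * REMOTE_LATENCY

# Large messages: latency is amortised, so each word costs about
# WORD_BYTES / BANDWIDTH seconds on average.
large_message_grain = RATE * WORD_BYTES / BANDWIDTH

print(round(short_message_grain), round(large_message_grain, 1))
```

With these assumed inputs the short-message case gives 100,000 and the large-message case gives 312.5, matching the figures in the text; different assumptions about rate or word size would shift both numbers proportionally.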

The above analysis tells us that fine grain parallelism is essential to obtain efficient serial execution and to avoid a mismatch between processor speed and that of the memory subsystem. Granularity of coarse grain parallelism is determined by the ratio: "single processor speed divided by the average latency of remote memory references". Latency is determined by the characteristics of both the communication hardware and the algorithm used.

Given the different objectives of fine and coarse grain parallelism, is it reasonable to combine them into a single programming model? The above discussion was dominated by considering different levels of memory hierarchy, so the same mechanisms (e.g. blocking of data to increase reuse) apply to both. Currently, portable parallel environments do not provide this unity. Manually expressed coarse grain parallelism (e.g. with MPI or Global Arrays) relies upon compilers or libraries (e.g. BLAS) to take care of fine grain parallelism.

Using the theoretical background described above, Martyn and his team identified and selected a number of large-scale application codes for the full treatment. These included: codes that are insensitive to the communication fabric, e.g. computational engineering and DNS methods; environmental modelling (POLCOMS); codes requiring concurrency management, i.e. migration from replicated to distributed data: DL-POLY (molecular simulation) and CRYSTAL (electronic structure); optimisation of communication collectives (e.g. MPI_ALLTOALLV and CASTEP); memory driven approaches; scientific drivers suited to capability computing; enhanced sampling and replica methods; and so on.

Their preliminary results show that they were able to improve the codes to deliver 8-16% of the peak performance. The maximum performance achieved was 1Teraflop/s, for a 1,000-atom SiC super-cell (256x256x256 mesh) run on CPMD, a code developed at IBM Zurich from the original Car-Parrinello code in 1993, using all 1,280 IBM P4 processors of the HPCx system. The mixed approach was instrumental in obtaining these results; larger SMPs and faster switches should deliver better results.

Apart from the examples mentioned above, the speakers presented results from CFD simulations and from enzyme and cell membrane protein interactions, along with some of the work done in developing new software technologies that reformulate the classical quantum mechanical methods to achieve linear scaling. This new method, incorporated in the Cambridge Serial Total Energy Package (CASTEP), claims to enable simulation of 20,000 atoms in less than a day on an HPCx phase 2 size system.

The results presented in this forum demonstrate the importance of capability computing, enabling scientific simulation practitioners to tackle larger, more realistic problems and reducing the time to completion by at least an order of magnitude (from, say, a year to one month), bringing it within a "reasonable" timeframe. From the evidence, vector parallel systems are way ahead in the productivity stakes, at least an order of magnitude more productive than their scalar cousins.
Chris Lazou

EnterTheGrid - PrimeurMagazine

James Stewartstraat 248

1325 JN Almere

The Netherlands

http://EnterTheGrid.com

mailto:primeur@hoise.com

© EnterTheGrid - PrimeurMonthly