
EnterTheGrid - Primeur Live!

EnterTheGrid - Primeur is the premier Grid and Supercomputing information source in the world.

News digest 24 June 2004
Terascale Computing Facility at Virginia Tech to optimize operating environment on System X
Heidelberg 24 June 2004

Dr. Srinidhi Varadarajan from the Terascale Computing Facility at Virginia Tech gave a talk on the team's experiences with the operating environment on System X. Since Virginia Tech is a core partner in the upcoming National Lambda Rail network in the United States, the institution has to provide high performance networking capabilities to tie its computational facilities into national computational Grids. This requires a Computational Sciences and Engineering (CSE) programme as well as computational facilities to match it.

The Terascale Computing Facility (TCF) is already based on cutting-edge technologies and is ready to operate as a liaison for Virginia Tech's research expertise, as the speaker explained. In this way, it will foster mutually beneficial co-operation between the internal and external research communities, as well as IT support units. TCF will also support simultaneous use for both experimental and production research and collaborative multi-site research activities. Dr. Varadarajan stressed that this CSE programme has to be seen as a long-term initiative and that the current supercomputer system will be followed by a second one in 2006.

The speaker outlined that Virginia Tech aimed for ten teraflops of sustained performance for its CSE programme, which explains why the selected hardware architecture was baptised System X. The supercomputer had to achieve the highest performance possible on a limited budget. The system itself cost only $5.2 million, the facilities upgrade $1 million, and the UPS and back-up power generators another $1 million. As such, Dr. Varadarajan considers System X to be one of the cheapest supercomputers.

The architecture consists of 1100 nodes, each based on dual 2 GHz Apple G5 CPUs with a peak double precision floating-point performance of 8 Gflops per processor, for a total peak performance of 17.6 Teraflops. The speaker explained that each node has 4GB of main memory and 160GB of serial ATA storage, giving 176TB of total secondary storage. Five head nodes serve for compilations and job start-up and there are four management nodes. Currently the system is being upgraded to a G5 Xserve 1U server platform.
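
The aggregate figures follow directly from the per-node numbers quoted in the talk. The short Python check below is only a sketch of that arithmetic; the constant and function names are illustrative and do not come from the talk.

```python
# Per-node figures as quoted in the talk.
NODES = 1100
CPUS_PER_NODE = 2
PEAK_GFLOPS_PER_CPU = 8      # double precision peak per processor
DISK_GB_PER_NODE = 160       # serial ATA storage per node


def peak_tflops(nodes=NODES, cpus=CPUS_PER_NODE, gflops=PEAK_GFLOPS_PER_CPU):
    """Theoretical peak of the whole machine in Tflops (1 Tflops = 1000 Gflops)."""
    return nodes * cpus * gflops / 1000


def total_storage_tb(nodes=NODES, disk_gb=DISK_GB_PER_NODE):
    """Total secondary storage in TB, using decimal units (1 TB = 1000 GB)."""
    return nodes * disk_gb / 1000


print(peak_tflops())        # 1100 * 2 * 8 Gflops
print(total_storage_tb())   # 1100 * 160 GB
```

Both results match the 17.6 Teraflops and 176TB reported by the speaker.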

As for the primary communications architecture, the system is based on Infiniband technology, provided by Mellanox, and uses a switched network in which each node connects into the network at 20 Gbps full duplex bandwidth. Twenty-four 96-port switches are organised in a "fat-tree" topology providing 46+ Tbps switching capacity, as the speaker explained. The secondary communications architecture consists of a Gigabit Ethernet management backplane for control, job start-up and "typical" IP traffic. It is based on five Cisco 4500 series switches with 240 Gigabit Ethernet ports per switch and a managed fabric with an integrated wire-speed IP routing engine.
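
One way to arrive at the quoted switching capacity is to multiply the number of switches by the ports per switch and the per-port bandwidth. The talk did not say how the "46+ Tbps" figure was derived, so the calculation below is an assumption that happens to reproduce it; the names are illustrative.

```python
# Infiniband fabric figures as quoted in the talk.
IB_SWITCHES = 24
IB_PORTS_PER_SWITCH = 96
IB_GBPS_PER_PORT = 20        # full duplex bandwidth per node connection

# Gigabit Ethernet management backplane figures as quoted in the talk.
ETH_SWITCHES = 5
ETH_PORTS_PER_SWITCH = 240


def aggregate_switching_tbps(switches=IB_SWITCHES,
                             ports=IB_PORTS_PER_SWITCH,
                             gbps=IB_GBPS_PER_PORT):
    """Sum of all port bandwidths in Tbps (assumed meaning of 'switching capacity')."""
    return switches * ports * gbps / 1000


def management_ports(switches=ETH_SWITCHES, ports=ETH_PORTS_PER_SWITCH):
    """Total Gigabit Ethernet ports available on the management backplane."""
    return switches * ports


print(aggregate_switching_tbps())   # 24 * 96 * 20 Gbps expressed in Tbps
print(management_ports())           # 5 * 240 ports
```

Under this reading the Infiniband fabric adds up to 46.08 Tbps, and the 1200 Ethernet ports are enough for the 1100 compute nodes plus the head and management nodes.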

Virginia Tech has a data centre of 9000 square feet, with 3000 square feet reserved for the TCF and 1500 square feet set aside for a warehouse and loading dock, as the speaker told the audience. It has a dual redundant network backbone. The Access Grid is located in the adjacent building. The TCF has received 1.5 MW of new power. The data centre has more than 2 million BTUs of cooling capacity using Liebert's extreme density cooling. The system uses rack-mounted heat exchangers with R-134A refrigerant and an overhead chiller unit.

As far as the software is concerned, System X runs stock Mac OS X 10.2.7 with MPI parallel communication libraries. Dr. Varadarajan explained that Apple provides optimized BLAS and FFT libraries. The TCF team is working with Kazushige Goto on full BLAS libraries. Goto's matrix multiply (DGEMM) efficiency is 84.1 percent and his routines powered the Linpack benchmark. Furthermore, the software includes C and C++ optimizing compilers and a Fortran 95/90/77 compiler.

Dr. Varadarajan also addressed some major operating system issues. The TCF team is working on reducing message transfer latency, for which new memory management strategies are needed. Next, there is the question of choosing the best operating system for HPC platforms: should it be monolithic or microkernel? Another problem is the scheduling noise caused by distributed memory cluster architectures with independent schedulers on each node. In addition, the system has to be reliable, but how can you make applications run on clusters composed of thousands of nodes and make them resilient to system failures at the same time?

The system provides a new cache-optimized memory manager for scientific applications, as the speaker explained. It is written as a kernel extension and provides contiguous physical addresses, thus improving TLB hit ratios. The Goto BLAS libraries use this to improve the performance of matrix operations. The memory manager also allows applications to allocate pre-pinned memory to reduce the latency of send and receive operations. Ongoing work consists of developing a complete OS memory manager which always provides pinned contiguous memory.

Dr. Varadarajan noted that the system's Linpack performance improved by 15 percent through a combination of memory manager optimizations. Common cases are handled in the bottom layers of MPI since the system does not yet have a memory manager that allocates pre-pinned contiguous pages. Therefore, to send a message on a zero-copy interconnect, one needs to pin the memory, send the message and unpin it. These pin and unpin operations are expensive. But if messages are sent from the same VM address, one can use an address cache and avoid the repeated pin/unpin operations. The TCF team modified the algorithm by adding a new LRU message cache. In addition, to optimize for MPI datatypes that cause a data copy within MPI, the team also added free list management by reusing freed data pointers with pinned pages.
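
The idea of an address cache with LRU eviction can be illustrated with a small Python sketch. This is not the TCF team's implementation, which lives inside the MPI layer; the class name PinCache and the fake_pin and fake_unpin stand-ins are hypothetical and only count how often the expensive operations would be called.

```python
from collections import OrderedDict


class PinCache:
    """LRU cache of pinned buffers, keyed by (address, length).

    pin and unpin are caller-supplied callables standing in for the
    expensive operating system operations (hypothetical in this sketch).
    """

    def __init__(self, capacity, pin, unpin):
        self.capacity = capacity
        self._pin = pin
        self._unpin = unpin
        self._entries = OrderedDict()   # oldest entry first

    def acquire(self, addr, length):
        """Return a handle for a pinned buffer, pinning only on a cache miss."""
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)      # mark as most recently used
            return self._entries[key]
        handle = self._pin(addr, length)
        self._entries[key] = handle
        if len(self._entries) > self.capacity:
            _, evicted = self._entries.popitem(last=False)  # least recently used
            self._unpin(evicted)
        return handle


# Stand-ins that record calls instead of touching real memory.
pins, unpins = [], []


def fake_pin(addr, length):
    pins.append((addr, length))
    return (addr, length)


def fake_unpin(handle):
    unpins.append(handle)


cache = PinCache(capacity=2, pin=fake_pin, unpin=fake_unpin)
for addr in [0x1000, 0x1000, 0x2000, 0x1000, 0x3000]:
    cache.acquire(addr, 4096)

print(len(pins), len(unpins))   # pin and unpin calls for five sends
```

In this run five sends need only three pin calls and one unpin call, because repeated sends from the same address reuse the cached pinned buffer.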

The dilemma of whether to use a monolithic kernel or a microkernel operating system is not easy to solve since both have a number of pros and cons, as the speaker outlined. Monolithic kernels are efficient in time and space, but it is hard to extend existing functionality and to replace major portions of the operating system, such as the scheduler. Microkernels, by contrast, are easy to extend and it is easy to replace any portion of the system, but the architecture relies on messages, leading to a huge number of context switches.

In fact, the ideal HPC operating system has to provide only a basic layer of services including memory management, scheduling, file systems, and I/O with one application per processor. Applications just "know" the physical memory footprint and rarely swap. The two main tasks for the scheduler are the high priority application and a low priority system management task, according to the speaker.

Now both monolithic kernels and microkernels tend towards a "middle ground": monolithic kernels incorporate kernel processes, which are essentially a microkernel construct. Microkernels merge commonly used services into monolithic layers, thereby replacing messages with function calls. Still, monolithic kernel tasklets are not necessarily under scheduler control and they may not have their own address space.

On gang scheduling as a way to avoid scheduling noise, Dr. Varadarajan told the audience that distributed memory cluster architectures run an independent operating system on each node. Independent schedulers on each node lead to scheduling noise, meaning that scheduling is not synchronized across nodes. This can be solved by using a common scheduler across the entire system, but gang scheduling needs a global clock, which is expensive on distributed memory cluster architectures. On this subject, significant work on modelling scheduling performance on HPC platforms was done at the Performance and Architecture Lab (PAL) at Los Alamos National Laboratory by Petrini, Kerbyson, and Hoisie.
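
A simple probability model shows why unsynchronized schedulers hurt at scale. The sketch below assumes an application that synchronizes all nodes at every step, so one pre-empted node delays the whole step. The interruption probability of 0.001 per node and step is an invented illustrative value, and the independence between nodes is a modelling assumption, not something stated in the talk.

```python
def fraction_of_steps_delayed(nodes, p_interrupt, synchronized):
    """Fraction of application steps delayed by system activity.

    nodes        -- number of nodes taking part in each synchronized step
    p_interrupt  -- assumed probability that one node is pre-empted in a step
    synchronized -- True if system tasks run at the same moment on all nodes
    """
    if synchronized:
        # All nodes pay the cost together, so only those steps are affected.
        return p_interrupt
    # Independent schedulers: the step is delayed if ANY node is pre-empted.
    return 1.0 - (1.0 - p_interrupt) ** nodes


NODES = 1100          # node count of System X, as quoted in the talk
P_INTERRUPT = 0.001   # assumed value, for illustration only

independent = fraction_of_steps_delayed(NODES, P_INTERRUPT, synchronized=False)
coscheduled = fraction_of_steps_delayed(NODES, P_INTERRUPT, synchronized=True)

print(f"independent schedulers: {independent:.3f}")
print(f"synchronized scheduling: {coscheduled:.3f}")
```

Under these assumptions roughly two thirds of all steps are delayed when the nodes schedule independently, against one in a thousand when system activity is synchronized.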

The speaker explained that OS X offers a significant advantage for approximating gang scheduling: it provides a near real-time priority level, and a task running at this priority can only be pre-empted by an interrupt. The HPC application runs at this high priority and cannot be pre-empted by other tasks. An inbound message from a management node, which causes an interrupt from the Ethernet controller, is used to periodically give scheduling cycles to the system management task. The head node hence acts as a simple global clock for a primitive gang scheduling algorithm. Since scheduling control is critical, Dr. Varadarajan believes that an HPC operating system design is easier to achieve with a base microkernel architecture.

To handle the reliability issue, the TCF team developed a transparent fault tolerance system called Déjà vu for engineering reliability into large-scale supercomputers. Virginia Tech is leading the collaboration with PSC and ISR.

Déjà vu is being ported to the G5 platform, and will be deployed at the TCF in a project funded by NSF. The system received a provisional patent and the work is being commercialized through California Digital Corporation. According to the speaker, the salient features of Déjà vu include a kernel-independent, transparent checkpoint, recovery and migration system, and incremental and non-blocking checkpointing. It integrates user-initiated and system-initiated checkpointing and supports process migration for failure recovery and administrative control. It is based on a new model to achieve global state consistency.
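
Incremental checkpointing, in general terms, means saving only the state that changed since the previous checkpoint. The sketch below illustrates that general idea by comparing per-page digests; it is not how the system described in the talk works internally, which the speaker did not detail. The page size, class name and functions are assumptions made for the example.

```python
import hashlib

PAGE_SIZE = 4096   # assumed page size for the illustration


def split_pages(memory, page_size=PAGE_SIZE):
    """Cut a bytes-like memory image into fixed-size pages."""
    return [bytes(memory[i:i + page_size]) for i in range(0, len(memory), page_size)]


class IncrementalCheckpointer:
    """Remembers a digest per page and emits only the pages that changed."""

    def __init__(self):
        self._digests = {}   # page index -> digest at the last checkpoint

    def checkpoint(self, memory):
        """Return {page_index: page_bytes} for pages changed since the last call."""
        delta = {}
        for index, page in enumerate(split_pages(memory)):
            digest = hashlib.sha256(page).digest()
            if self._digests.get(index) != digest:
                delta[index] = page
                self._digests[index] = digest
        return delta


def restore(deltas):
    """Rebuild the memory image by replaying the deltas in order."""
    pages = {}
    for delta in deltas:
        pages.update(delta)
    return b"".join(pages[i] for i in sorted(pages))


memory = bytearray(4 * PAGE_SIZE)          # four zeroed pages
ckpt = IncrementalCheckpointer()
first = ckpt.checkpoint(memory)            # full image on the first checkpoint
memory[PAGE_SIZE + 10] = 1                 # touch one byte in page 1
second = ckpt.checkpoint(memory)           # only the modified page

print(sorted(first), sorted(second))
```

The second checkpoint contains a single page instead of four, and replaying both deltas with restore reproduces the current memory image.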

Dr. Varadarajan concluded his talk by telling the audience about the research topics the Computational Sciences and Engineering programme is currently addressing: nanoscale electronics, quantum chemistry, computational and biochemistry, aerodynamics through multi-disciplinary design optimization, cell cycle modelling, molecular statics, computational acoustics, computational fluid dynamics, computational electromagnetics, optimal design and control, wireless systems modelling, micro-array experiment management, and large-scale network emulation. The team also works on experimental systems to test fault tolerance and migration. This includes queuing systems, schedulers, distributed operating systems/DSM, parallel filesystems, middleware for computational Grids, and authentication and security systems.

Leslie Versweyveld

EnterTheGrid - Primeur

James Stewartstraat 248

1325 JN Almere

The Netherlands

http://EnterTheGrid.com

mailto:primeur@hoise.com

© EnterTheGrid - Primeur Live!