In the technology field, vendors are naturally inclined to enthuse about the merits of their new products. The usual suspects can be expected to claim that their transistor-based systems deliver optimal solutions for various application domains. Similarly, NEC and Cray, whose high-capability systems are based on high memory bandwidth and inter-processor communication, would highlight productivity and higher sustained performance for large, complex applications as their forte.
Others would argue that the HPC business battlefield is shifting to the design of specialised chipsets using the Intel Itanium family of chips, or to building clusters from other commodity chips and third-party networks, and that the capability top end is of marginal importance. Questions such as these should be at least partially answered when Hans Meuer and Erich Strohmaier describe the latest TOP500 list.
In the application areas, many scientific and technical presentations are grouped into sessions. One of these sessions deals with requirements for HPC operating systems. Some new developments in operating systems for HPC are challenging conventional wisdom. In the late 1990s, the Cornell Theory Center (CTC) embarked upon a unique enterprise in the area of high performance computing. Instead of following the mainstream and adopting some form of Unix/Linux as its operating platform, CTC made a conscious decision and chose Windows NT Server.
Dr. Gerd Heber, senior research associate in the Cornell Fracture Group at the Cornell Theory Center, Ithaca, New York, USA, is giving a talk about their experiences. I asked Gerd a series of questions in an interview; below is an extract, with a few of his answers to give you a flavour of his talk.
Question 1: The conventional wisdom is to use highly tuned operating systems with rich functionality, but focused on scientific and technical applications, as a vehicle for future developments. Yet you at Cornell chose Windows NT Server. Would you briefly explain the rationale behind this decision?
Unlike operating systems for real-time processing or embedded systems, the operating systems used in scientific and technical computing are fairly general purpose. Except for, say, I/O operations or multithreading, any OS is more or less "in the way" of an application. The systems in the Windows Server family are tuned enough to get a cluster onto the TOP500 list or to the top of the TPC benchmarks. Tests performed in house by us, or by others, have shown that on identical hardware, applications running on a Windows Server based platform perform as well as or better than on other operating systems.
Question 2: As I remember, in the mid-nineties Cornell was one of the NSF supercomputing centres, operating the largest IBM SP system in the world. A switch to Windows could not have been easy. How did your users react to this radical departure from the mainstream?
The switch was indeed not easy, and the prevailing initial reaction from users (including myself) was skepticism and, in some cases, hostility. From a user's perspective, what made it difficult was the different development environment and the lack of certain libraries and tools.
For example, we had a considerable investment in project management using make and CVS, and we had no intention of changing this. The C++ compiler that shipped with Visual Studio 6.0 five years ago was a decent C compiler, but did not deserve to be called a C++ compiler. After years of development with Kuck & Associates' KCC compiler, we spent several months looking for and testing all kinds of C++ compilers, just to get our code built for Windows. Fortunately, the code had been tested extensively on the SP and performed correctly once built. Tools like Cygwin and GCC were indispensable in those early days.
Question 3: As one is aware, multi-scale systems throw up many problems that differ from those of medium-sized systems. What, for example, were the key challenges while developing Windows NT for its new role?
Making sure that all necessary things are in place for users, developers, and administrators was perhaps the key challenge. There were quite large Windows installations (Windows domains of thousands of servers) out there when we made the transition, but running them as HPC clusters with all their specific requirements had not been attempted yet. On the other hand, the administrative staff at CTC consisted of fabulous AIX administrators, who had only a vague idea of doing things the "Windows way". What followed was, for both users and administrators, a "soul searching process" at the end of which both arrived at the same conclusion: "You must change your life." (Rilke). Emulating or mimicking UNIX on Windows gets you only so far.
Question 4: To successfully solve multi-scale, multi-physics application problems, system robustness, fault tolerance and computation integrity (checkpoint restart) are essential. For example, in a system with 2000 nodes, if each node fails once every two years, the system as a whole fails about three times a day, and most calculations run much longer than the interval between failures. So how are you tackling these challenges?
All our work is done using industry standard software and hardware. Despite the lack of out-of-the-box support for checkpointing, we think application-level checkpointing is the most promising approach. Of course, this requires more or less intervention from the user, but at the same time it minimizes the amount of data associated with a checkpoint. The Intelligent Software Systems project at the Cornell CS department implemented a very convenient compiler-based approach for MPI and OpenMP programs. In my own work, I use databases to store sufficient application state and history information, which also allows restarting on a different number of nodes.
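To make the numbers in the question and the idea in the answer concrete, here is a minimal Python sketch. It is only an illustration and not the Cornell implementation: the interview does not say which database or schema is used, so SQLite stands in for the database, and all function and table names are hypothetical. The failure estimate assumes that nodes fail independently at a constant rate.

```python
import sqlite3


def system_failures_per_day(nodes, years_between_node_failures):
    """Expected failures per day for the whole system.

    Assumption: nodes fail independently, each once per given number of years.
    """
    return nodes / (years_between_node_failures * 365.0)


def save_checkpoint(conn, step, values):
    """Store the global state by element index, independent of node layout."""
    with conn:  # one transaction: a crash leaves the previous checkpoint intact
        conn.execute("CREATE TABLE IF NOT EXISTS state "
                     "(idx INTEGER PRIMARY KEY, value REAL)")
        conn.execute("CREATE TABLE IF NOT EXISTS meta "
                     "(key TEXT PRIMARY KEY, value INTEGER)")
        conn.execute("DELETE FROM state")
        conn.executemany("INSERT INTO state VALUES (?, ?)", enumerate(values))
        conn.execute("INSERT OR REPLACE INTO meta VALUES ('step', ?)", (step,))


def load_checkpoint(conn, num_nodes):
    """Return (step, partitions), splitting the state into num_nodes chunks."""
    step = conn.execute(
        "SELECT value FROM meta WHERE key = 'step'").fetchone()[0]
    values = [v for (v,) in
              conn.execute("SELECT value FROM state ORDER BY idx")]
    base, extra = divmod(len(values), num_nodes)
    partitions, start = [], 0
    for rank in range(num_nodes):
        size = base + (1 if rank < extra else 0)  # spread the remainder
        partitions.append(values[start:start + size])
        start += size
    return step, partitions


# 2000 nodes, one failure per node every two years: 2000 / 730 per day.
print(round(system_failures_per_day(2000, 2), 2))  # 2.74

# Checkpoint ten values, then restart on three nodes instead of the original layout.
conn = sqlite3.connect(":memory:")
save_checkpoint(conn, step=42, values=[0.5 * i for i in range(10)])
step, parts = load_checkpoint(conn, num_nodes=3)
print(step, [len(p) for p in parts])  # 42 [4, 3, 3]
```

Because the checkpoint is keyed by global element index rather than by node rank, the same stored state can be repartitioned for any node count at restart, which is the property the answer describes. Only the state the application chooses to save is written, which keeps the checkpoint small.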
Answers to questions 5, 6, 7, 8, 9 and so on will be published after the talk is given at ISC2004.
The other sessions at ISC2004 are also of great interest, which is why I'll be there in June. So, I urge you to register now and participate in this august conference. Also, Heidelberg is a pleasant city, the food is good and the conference social events are fun. The abstracts of all presentations, details of the programme, profiles of speakers and on-line registration with special advance registration fees can be found at http://www.isc2004.org