"Grid-ready" LSF 4.0 released by Platform

Toronto 28 Feb 00 Toronto-based Platform Computing has released version 4.0 of its LSF suite of resource management tools. The most significant additions are the extensive reporting and accounting facilities in the newly added LSF Analyzer. Beneath the surface, many of the improvements concern scalability, and it is these in particular that make LSF "Grid-ready". Computational grids are a new paradigm: an Internet-based infrastructure that can make computational power available to anyone at any time. Resource management tools like LSF are needed on the Grid to direct computational traffic and to keep the billing information on who uses which resource.

Systems like LSF are currently used to manage clusters of machines. Users submit jobs, and LSF ensures that each job is executed on a suitable machine and that the computational load is balanced over the available resources.
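To make the idea of load balancing concrete, here is a minimal sketch in Python. It is not how LSF itself decides; the `Host` class, the slot counts and the "lowest relative load first" rule are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical machine in a cluster (not an LSF data structure)."""
    name: str
    slots: int                                   # jobs the host may run at once
    running: list = field(default_factory=list)  # jobs currently placed here

    def load(self) -> float:
        # Relative load: fraction of the host's slots that are in use.
        return len(self.running) / self.slots


def dispatch(job: str, hosts: list) -> Host:
    """Place a job on the least loaded host that still has a free slot."""
    candidates = [h for h in hosts if len(h.running) < h.slots]
    if not candidates:
        raise RuntimeError("no free slot; the job would have to wait in a queue")
    # Ties are broken by host name so the result is deterministic.
    target = min(candidates, key=lambda h: (h.load(), h.name))
    target.running.append(job)
    return target


hosts = [Host("hostA", slots=2), Host("hostB", slots=4)]
placements = [dispatch(f"job{i}", hosts).name for i in range(6)]
```

With these assumed numbers the larger machine ends up with four of the six jobs and the smaller one with two, so both are filled in proportion to their capacity.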

Initially, clusters contained tens of machines. They have since grown to hundreds and even thousands of machines, with thousands or tens of thousands of jobs in the system concurrently. To analyse what is happening in a large cluster, for instance to see whether there are bottlenecks in the resources or which projects take what amount of computing time, an analysis tool like LSF Analyzer is used. As the amount of data involved grows, collecting and analysing it can itself require considerable resources.

An important enhancement in LSF 4.0 is LSF Analyzer, which is also capable of handling large environments. The reporting capabilities have been improved as well. LSF Analyzer can generate graphical reports, for instance per user, group, cluster or machine. As such it can be a powerful tool for an ASP (Application Service Provider) to determine whether its customers get what they expect and where there is room to improve the service. Apart from the 21 standard pre-defined reports, Analyzer can generate data that can be used by standard business intelligence packages and in common formats, such as Word, Oracle and tab-delimited plain text. Report generation can be scheduled, and hence performed at a time that is suitable and fits the scheduling policy of the cluster.
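The kind of accounting report described here can be illustrated with a small Python sketch. The record layout, the user and project names and the two helper functions are invented for the example; the article does not describe the actual format of LSF Analyzer's data.

```python
import csv
import io
from collections import defaultdict

# Hypothetical accounting records: (user, project, cpu_seconds).
records = [
    ("alice", "chipsim", 3600),
    ("bob", "chipsim", 1800),
    ("alice", "layout", 600),
]


def cpu_by(records, key_index):
    """Sum CPU seconds per user (key_index=0) or per project (key_index=1)."""
    totals = defaultdict(int)
    for rec in records:
        totals[rec[key_index]] += rec[2]
    return dict(totals)


def to_tab_delimited(totals, header):
    """Render the totals as tab-delimited plain text, sorted by name."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for name in sorted(totals):
        writer.writerow([name, totals[name]])
    return buf.getvalue()


per_project = cpu_by(records, 1)
report = to_tab_delimited(per_project, ["project", "cpu_seconds"])
```

A tab-delimited file of this kind can be loaded by most spreadsheet and database tools, which is presumably why it appears among the export formats.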

There are also other improvements in the LSF 4.0 product suite aimed at making better use of large clusters. "Scalability and availability were of major concern to us in this release", said David Wilmering, director of product management at Platform. "In 4.0, for instance, we made the Master Daemon, which controls the LSF cluster, multithreaded, which makes it easier to resume service in case of a malfunction." The dispatching of large numbers of jobs has also been improved in version 4.0. "We now have the possibility to easily define groups of jobs, for instance a thousand jobs using the same programme but each with slightly different parameters, and submit them with one keystroke. LSF will intelligently schedule the whole group, creating no more files and programme images than necessary", elaborated Wilmering.
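The job groups Wilmering describes can be pictured as one stored description from which each member is derived on demand. The Python sketch below shows that idea only; the `JobGroup` class, the `./simulate` programme and its `--case` option are hypothetical and do not represent LSF's own interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobGroup:
    """One record describes the whole group (hypothetical, not LSF's format)."""
    name: str
    command: str   # the single programme shared by every member
    size: int      # number of members in the group

    def member_command(self, index: int) -> list:
        """Build the command line for one member from its index alone."""
        if not 1 <= index <= self.size:
            raise IndexError(f"member {index} is outside 1..{self.size}")
        # "--case" is an invented option standing in for the varying parameter.
        return [self.command, "--case", str(index)]


group = JobGroup("sweep", "./simulate", 1000)
first = group.member_command(1)
last = group.member_command(1000)
```

Because only the group record is stored, a thousand members cost no more bookkeeping than one until they are actually dispatched.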

Availability has improved with LSF 4.0, in that it can now be reconfigured without bringing it down. Hosts can be added and removed, and scheduling algorithms changed, without bringing the system to a halt, which is very important for large clusters.

A cluster is controlled by a Master. Of course the Master can go down, or the machine on which it runs can become disconnected from the cluster. When this happens, all the other computers start a negotiation process to elect a new Master. This works well with a small number of machines, but with thousands of them it can take a while. "That is why in LSF 4.0 we have included the option to designate which hosts are allowed to enter into the negotiation process", Wilmering said. "When, say, only 20 hosts are allowed into the election process, it proceeds much quicker. When all twenty are down, you have a major problem anyway, and bringing up LSF is probably not your major concern."
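Why a short candidate list speeds things up can be seen in a small Python sketch. The article does not say how the negotiation is actually decided, so the rule used here (take the first candidate, in a fixed priority order, that is still up) is purely an assumption, as are the host names.

```python
def elect_master(candidates, is_up):
    """Return the first designated candidate that is up, or None if all are down.

    Assumption: candidates are listed in priority order. Only hosts on this
    list are ever asked, so the cost depends on the length of the list and
    not on the size of the cluster.
    """
    for host in candidates:
        if is_up(host):
            return host
    return None


candidates = ["m1", "m2", "m3"]   # hypothetical designated hosts
down = {"m1"}                     # the old Master has failed
new_master = elect_master(candidates, lambda h: h not in down)
```

If every candidate is down the function returns `None`, which corresponds to the situation Wilmering describes as a major problem in its own right.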

In a real computational grid, there are many computing service centres, ASPs and potential users that could all be in contact, exchanging information and doing (computational) work. Exactly how this should work in real life is still a research topic. Large multi clusters within multinationals could, however, provide a glimpse of the future. Texas Instruments runs the largest LSF-based "multi cluster". It consists of over 30 separate clusters with several thousand machines, hooked up together in one big multi cluster. Each local cluster mainly runs its own jobs, but also allows others to use its resources, with some restrictions of course. This way, not only is each cluster used efficiently locally, but the work is also balanced among the clusters to a high degree.
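The "local first, then borrow under restrictions" behaviour can be sketched in a few lines of Python. The site names, the `export_limit` field and the placement rule are assumptions made for illustration; they do not describe the Texas Instruments installation or LSF's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical view of one cluster inside a multi cluster."""
    name: str
    free_slots: int
    export_limit: int   # assumed restriction: slots offered to other clusters
    exported: int = 0   # slots currently lent out


def place(home: str, clusters: dict):
    """Run the job at home if possible, otherwise borrow a slot elsewhere."""
    local = clusters[home]
    if local.free_slots > 0:
        local.free_slots -= 1
        return local.name
    # Remote clusters qualify only while they have room and export quota left.
    remotes = [c for c in clusters.values()
               if c.name != home
               and c.free_slots > 0
               and c.exported < c.export_limit]
    if not remotes:
        return None   # the job would wait in its home cluster
    target = max(remotes, key=lambda c: (c.free_slots, c.name))
    target.free_slots -= 1
    target.exported += 1
    return target.name


clusters = {
    "site_a": Cluster("site_a", free_slots=1, export_limit=1),
    "site_b": Cluster("site_b", free_slots=5, export_limit=1),
}
results = [place("site_a", clusters) for _ in range(3)]
```

In this run the first job stays at home, the second is forwarded to the other site, and the third has to wait because that site's assumed export quota is used up, even though it still has free slots for its own work.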

With all the LSF enhancements, Platform seeks to position itself as the "Application Resource Management" software provider of the computational grid era.

 


Ad Emmen
