To connect the systems, a Hitachi SR8000 at Tsukuba, Japan, and Cray T3E systems at Pittsburgh, USA (PSC), Manchester (CSAR) and Stuttgart (HLRS), several ATM Permanent Virtual Circuits (PVCs) were to be set up. These PVCs were meant to provide guaranteed bandwidth and latency, something standard Internet connections do not offer. Latency ranged from 20 msec for the HLRS-CSAR connection to 78 msec for the HLRS-PSC connection. Bandwidth ranged from 0.5 Mbit/s for the CSAR-PSC connection to 1.6 Mbit/s for the HLRS-PSC connection.
Although the total computing power of the combined systems was over 2 Tflop/s, the relatively slow connections made special programming techniques such as latency hiding necessary.
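To see why latency hiding matters on links like these, a rough cost model helps. The sketch below is illustrative only: the 78 msec and 1.6 Mbit/s figures are the HLRS-PSC values reported above, while the message size, the compute time per step and all function names are assumptions made for the example, not details of the actual codes.

```python
# Rough cost model of latency hiding. Link figures are taken from the article;
# message size and compute time are hypothetical.

def transfer_time(size_bits, latency_s, bandwidth_bps):
    """Time to deliver one message: fixed latency plus time on the wire."""
    return latency_s + size_bits / bandwidth_bps

def step_time_blocking(compute_s, comm_s):
    """Compute first, then sit idle until the communication completes."""
    return compute_s + comm_s

def step_time_overlapped(compute_s, comm_s):
    """Start the communication, then compute while it is in flight."""
    return max(compute_s, comm_s)

# HLRS-PSC link as reported: 78 msec latency, 1.6 Mbit/s bandwidth.
comm = transfer_time(size_bits=80_000, latency_s=0.078, bandwidth_bps=1.6e6)
compute = 0.2  # hypothetical seconds of local work per step

print(f"communication per step: {comm:.3f} s")
print(f"blocking step:          {step_time_blocking(compute, comm):.3f} s")
print(f"overlapped step:        {step_time_overlapped(compute, comm):.3f} s")
```

Under these assumed numbers the communication cost disappears behind the computation, which is the effect latency hiding aims for. In an MPI program the overlap would typically be obtained with non-blocking sends and receives.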
One of the main problems Costen reported was the enormous effort needed to set up the connections. For instance, the Manchester-Stuttgart connection took two weeks and over 100 e-mails. Despite hard work, the UK academic network organisation did not succeed in getting the connection between the UK and the USA up and running until one day after the SC99 conference had closed.
The centres used a special MPI implementation developed by HLRS to run applications in parallel. Apart from the relatively slow connections between the machines, the differences in architecture and even the performance differences between nodes of the Cray T3E machines posed problems. This load balancing problem was left to the application to solve.
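One common way for an application to handle such imbalance is to hand each machine a share of the work proportional to its speed. The following is a minimal sketch of that idea, not the method the applications here actually used; the function name `partition` and the relative speeds are hypothetical.

```python
# Static load balancing: split work in proportion to relative machine speed.
# The speeds below are made-up placeholders, not measured values.

def partition(total_items, speeds):
    """Return a dict mapping each host to its number of work items."""
    total_speed = sum(speeds.values())
    # Round each proportional share down to a whole number of items.
    shares = {h: int(total_items * s / total_speed) for h, s in speeds.items()}
    # Hand out the items lost to rounding, fastest hosts first.
    leftover = total_items - sum(shares.values())
    for host in sorted(speeds, key=speeds.get, reverse=True)[:leftover]:
        shares[host] += 1
    return shares

speeds = {"HLRS": 1.0, "CSAR": 1.5, "PSC": 1.5}  # hypothetical relative speeds
shares = partition(1000, speeds)
print(shares)
```

A static split like this only helps when the speed differences are known in advance; if they vary during a run, the application would have to re-measure and redistribute work.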
One of the applications run on this combined metacomputer was a Computational Fluid Dynamics code that simulated the Crew Rescue Vehicle of the new international space station. It used 1536 nodes on the Cray T3Es in Manchester, Stuttgart and Pittsburgh.
Other tested applications included the Jodrell Bank De-dispersion Pulsar Search Code, and a Molecular Dynamics code.
According to Costen, the set-up showed that applications can be run successfully over a distributed metacomputer. Hence, it will be developed further. It is planned, for instance, to extend the Jodrell Bank Pulsar Code so that experimental data can be processed while the pulsar measuring experiment is in progress, rather than storing it and post-processing it later. This way, when the experiment needs recalibration or intervention, this can be detected immediately, rather than wasting precious time on a valuable resource.