Impact of groundbreaking Information Technology on the Sanger Centre

Munich, 30 November 2000. One of the most important aids in studying genomes is high-end computing. Phil Butcher, head of Information Technology at the Sanger Centre for genome research, presented the centre's enormous storage and computing challenges, as well as its current use of clusters of Alpha processors.


One of the most critical issues is data storage. Total capacity grew from 3 TByte in 1998 to 4 TByte in October 1999, 8 TByte in April 2000 and 22 TByte in November 2000. The capacity of RAID (Redundant Array of Inexpensive Disks) storage has increased over the past years in line with the ramp-up of the sequencing projects. To assemble the human genome, the scientists keep all their archived and near-line data on-line.
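The growth between the quoted dates can be made explicit with a little arithmetic. The sketch below uses only the four totals given above; the names `storage_tbyte` and `growth_factors` are illustrative and not from the talk.

```python
# Storage totals quoted in the article, in TByte.
storage_tbyte = [
    ("1998", 3),
    ("October 1999", 4),
    ("April 2000", 8),
    ("November 2000", 22),
]


def growth_factors(series):
    """Return (from_label, to_label, factor) for each consecutive pair."""
    return [
        (a_label, b_label, b_value / a_value)
        for (a_label, a_value), (b_label, b_value) in zip(series, series[1:])
    ]


for start, end, factor in growth_factors(storage_tbyte):
    print(f"{start} -> {end}: x{factor:.2f}")
```

Read this way, the store doubled in the six months to April 2000 and then grew by a further factor of 2.75 in the seven months to November 2000.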

The other factor is the computer systems architecture. Sanger has built compute farms from the beginning. For many years, the centre used network storage, well before SAN and NAS (Storage Area Network and Network Attached Storage) became popular acronyms. It implemented loosely coupled clusters to maximise the use of all the systems and to distribute the workload efficiently. From the desktop workgroup systems, users submit their batch jobs via LSF (Load Sharing Facility) to the front-end compute servers or the compute server farms.
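The idea of spreading batch jobs over a loosely coupled farm can be illustrated with a toy dispatcher. This is only a sketch of least-loaded placement, not LSF's actual scheduling algorithm; the function `dispatch`, the job costs and the node names are all hypothetical.

```python
import heapq


def dispatch(jobs, nodes):
    """Assign each (job_name, cost) to the node with the least accumulated load.

    Toy model only: real batch systems also weigh queues, priorities,
    memory and licence constraints, which are ignored here.
    """
    heap = [(0.0, name) for name in nodes]  # (accumulated load, node name)
    heapq.heapify(heap)
    placement = {}
    for job, cost in jobs:
        load, node = heapq.heappop(heap)  # currently least-loaded node
        placement[job] = node
        heapq.heappush(heap, (load + cost, node))
    return placement


if __name__ == "__main__":
    jobs = [("blast-1", 3.0), ("blast-2", 1.0), ("assembly-1", 1.0)]
    print(dispatch(jobs, ["node-a", "node-b"]))
```

The priority queue keeps the least-loaded node at the front, so each placement costs O(log n) in the number of nodes.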

Phil Butcher mentioned some limitations. NFS access is slow and data sizes are increasing, so the store has to grow and the scientists need larger memory configurations. The management overhead also rises with the number of nodes. The centre is therefore turning to Fibre Channel/Memory Channel Tru64 clusters for better disk I/O (Fibre Channel), scalability (multi-CPU, multi-terabyte) and improved manageability (single system image), so that whole clusters are managed as single entities.

Each project has its own dedicated machines, which others can use when they are idle. The Ensembl project recently installed a DS10L Alpha farm consisting of small, pizza-box-like, rack-mountable workstations. Each of 8 racks holds 40 DS10L units, 320 in total: 1 U high, EV6 (466 MHz), 320 GB memory and 19.2 GB internal disk. This is equivalent to 10 GS320 systems, with a peak performance of 326 GFlop/s. It delivers capacity for more than 500,000 BLAST searches per day.
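The farm figures above can be checked with simple arithmetic. The sketch below uses only the rack count, the units per rack and the quoted daily BLAST capacity; the per-node throughput is a derived lower bound (the article says "more than"), not a figure from the talk.

```python
RACKS = 8
NODES_PER_RACK = 40
UNIT_HEIGHT_U = 1                  # each DS10L is 1 U high
BLAST_SEARCHES_PER_DAY = 500_000   # quoted as "more than", so a lower bound

nodes = RACKS * NODES_PER_RACK
rack_units_per_rack = NODES_PER_RACK * UNIT_HEIGHT_U
searches_per_node_per_day = BLAST_SEARCHES_PER_DAY / nodes

print(f"nodes in farm:             {nodes}")
print(f"rack units used per rack:  {rack_units_per_rack}")
print(f"searches per node per day: {searches_per_node_per_day}")
```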

The BLAST farm consists of 440 nodes. There is a large-scale assembly and sequencing server, as well as servers for SNP, mapping, informatics, sequence data processing, and pathogens. With LSF, Sanger can use many of its 700+ compute nodes as a "single" Sanger Compute Engine for modular supercomputing. The aggregate peak performance can be estimated at 700 to 800 GFlop/s. Assuming 60 percent of peak is sustained, Sanger would rank between 30th and 40th in the current Top500 list of supercomputers.
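The sustained figure implied by that estimate follows directly from the quoted numbers. A minimal sketch, where the function name `sustained_range` is illustrative:

```python
def sustained_range(peak_low, peak_high, efficiency):
    """Scale a peak-performance range (GFlop/s) by an assumed efficiency."""
    return peak_low * efficiency, peak_high * efficiency


# 700 to 800 GFlop/s aggregate peak, 60 percent of peak assumed sustained.
low, high = sustained_range(700.0, 800.0, 0.60)
print(f"estimated sustained performance: {low:.0f} to {high:.0f} GFlop/s")
```

That is, the ranking estimate rests on roughly 420 to 480 GFlop/s of sustained performance.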

Phil Butcher listed the current projects, all of which are compute-intensive. The Sanger Centre therefore has to scale up by a factor of five and deal with the physical limitations. This will involve thousands of CPUs, a large number of PC farm nodes and also high-end, large-memory SMP (symmetric multiprocessing, i.e. shared-memory) configurations. In addition, the centre needs 50 to 100 TByte of storage.

For the medium term, Mr. Butcher plans to replace the Memory Channel interconnect with 200 MByte/s Quadrics switches, developed by Quadrics Supercomputer World Ltd. This system is built from 8-way crossbar chips and has an end-to-end latency of 3 microseconds from the user application. The network can connect up to 256 SMP nodes. Phil Butcher will also connect individual clusters into one switching network of Sierra Clusters.
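To get a feel for what the quoted latency and bandwidth mean for message passing, one can apply the common first-order model, transfer time = latency + size / bandwidth. This is a simplification, not a Quadrics specification; it assumes 1 MByte = 10^6 bytes, and the function names are illustrative.

```python
LATENCY_S = 3e-6              # 3 microseconds end-to-end, as quoted
BANDWIDTH_BYTES_PER_S = 200e6  # 200 MByte/s, assuming 1 MByte = 10**6 bytes


def transfer_time(size_bytes):
    """First-order estimate: fixed latency plus size divided by bandwidth."""
    return LATENCY_S + size_bytes / BANDWIDTH_BYTES_PER_S


def effective_bandwidth(size_bytes):
    """Bytes per second actually achieved for one message of this size."""
    return size_bytes / transfer_time(size_bytes)


# Message size at which half the link bandwidth is reached in this model.
half_bandwidth_size = LATENCY_S * BANDWIDTH_BYTES_PER_S

for size in (64, 600, 1_000_000):
    print(f"{size:>9} bytes: {transfer_time(size) * 1e6:10.2f} us, "
          f"{effective_bandwidth(size) / 1e6:7.1f} MByte/s")
```

In this model, messages much smaller than about 600 bytes are dominated by latency, while messages of a megabyte or more run close to the full link bandwidth.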

The immediate future will see a storage area network of 7.5 TByte installed to enable disk mirroring and controller-to-controller snapshots. It will be connected to the Sanger facilities. In addition, the centre will set up institute-to-institute clustering and thus a closer collaboration between Sanger and the European Bioinformatics Institute (EBI), which brings the need for site-wide shared clusters.

In the long term, wide area clusters will be necessary for large-scale collaboration. Grid technology, that is, global distributed computing, will arrive and allow international cluster collaborations with other scientific institutes. Phil Butcher stated that the Sanger Centre is keen to keep abreast of this emerging technology of global compute engines. For an overview of the Sanger Centre's activities, see this issue's article Visiting the Sanger Centre and Wellcome Trust.

Uwe Harms
