Sanger Centre extends its Compaq equipment by an additional $3.2 million for human genome research

Cambridge 18 Jan 00 The Sanger Centre, located in Hinxton, Cambridge, UK, was founded by the Wellcome Trust and the Medical Research Council. It is a genome research center that aims to further the knowledge of genomes. For this work it uses about 250 clustered Compaq Alpha systems and large-memory Alphas running the Tru64 Unix operating system, together with the StorageWorks storage subsystem as a high-availability system for the huge amount of data it produces. In December 1999 Sanger invested 3.2 million US dollars and installed another 83 Alpha processors and more RAID storage.

To meet the changing demands of its scientists, Sanger decided to build a scalable and flexible system: compute farms with system-independent storage and loosely coupled clusters, a data network and data storage systems. In addition, the system has to remain open to new, emerging technologies. The center supports more than 450 users, who access a clustered network of 250 Alpha systems (EV5 533 MHz) running Tru64 Unix, a multiprocessor 8400 with 12 EV5/6 processors serving as the large-memory system with 4 GB (external BLAST - sequence similarity), and dual-processor EV6 DS20 systems. The center has also installed a Linux-based 48-node Intel Pentium farm for internal BLAST. All the components - the clusters, the RAID storage systems, the front-end compute servers and the various desktop workgroup systems - are connected via ATM.

Using LSF (Load Sharing Facility from Platform Computing), the center combines all 250 Alpha servers into an enterprise cluster that acts as a single Sanger compute engine. LSF balances the application load between the compute farms and the front-end servers, allowing more efficient use and management of the system and its applications.
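The load-balancing idea can be illustrated with a small sketch. It is not LSF's actual scheduling policy or API; it only shows a greedy "send the next job to the least-loaded host" dispatch, and the host names, job names and cost units are hypothetical.

```python
import heapq

def dispatch(jobs, hosts):
    """Greedily assign each job to the host with the smallest load so far.

    jobs  -- list of (job_name, cost) pairs; cost is an abstract work unit
    hosts -- list of host names (hypothetical, for illustration only)
    Returns a dict mapping host name -> list of job names in dispatch order.
    """
    # Min-heap of (accumulated_load, host_name); ties break on host name.
    heap = [(0, host) for host in hosts]
    heapq.heapify(heap)
    placement = {host: [] for host in hosts}
    for name, cost in jobs:
        load, host = heapq.heappop(heap)      # least-loaded host right now
        placement[host].append(name)
        heapq.heappush(heap, (load + cost, host))
    return placement

def host_loads(jobs, placement):
    """Sum the job costs placed on each host."""
    cost = dict(jobs)
    return {host: sum(cost[j] for j in names) for host, names in placement.items()}

# Hypothetical workload: three BLAST searches and one assembly run.
JOBS = [("blast1", 5), ("blast2", 3), ("assembly", 8), ("blast3", 2)]
HOSTS = ["alpha01", "alpha02"]

if __name__ == "__main__":
    placement = dispatch(JOBS, HOSTS)
    print(placement)
    print(host_loads(JOBS, placement))
```

A real load-sharing system also takes memory, queues, priorities and host state into account; the sketch only captures the basic balancing step.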

Storage management is of great importance. The center uses two types of RAID devices: a Network Appliance server with 300 GB, and Compaq's StorageWorks storage subsystem with 4 TB. StorageWorks systems are highly available, with dual heads, controllers and power supplies, so there is no single point of failure, and they are easily expandable. Another 200 GB is stored on non-RAID systems. All 4.5 TB of data is completely backed up over a 3-day cycle using incremental, full and online dumps. The data flow is as follows: gel ordering, sequencing, data extraction, assembly and the finished sequence. The computed gel files are stored in a repository and archived to tape (6 TB), with another 6 TB held on nearline tape. About 0.75 TB is reserved for gel files, 3 TB for project data and 0.25 TB for project-specific data. Storage demand increased from about 100 GB in 1994 to 400 GB in 1996, 750 GB in 1997, 1.75 TB in 1998 and 4.5 TB in 1999. As the teams scale up, they use larger datasets in their projects, from 1 to 4 GB.
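The storage figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes decimal units (1 TB = 1000 GB), and the names CAPACITY_GB, DEMAND_GB, total_tb and annual_growth are chosen for illustration.

```python
# Disk capacities quoted in the article, in GB (assuming 1 TB = 1000 GB).
CAPACITY_GB = {
    "Network Appliance RAID": 300,
    "StorageWorks RAID": 4000,
    "non-RAID systems": 200,
}

# Total storage demand by year, in GB, as quoted in the article.
DEMAND_GB = {1994: 100, 1996: 400, 1997: 750, 1998: 1750, 1999: 4500}

def total_tb(capacity_gb):
    """Total capacity in TB across all storage pools."""
    return sum(capacity_gb.values()) / 1000

def annual_growth(demand_gb, start, end):
    """Compound annual growth factor of demand between two years."""
    years = end - start
    return (demand_gb[end] / demand_gb[start]) ** (1 / years)

if __name__ == "__main__":
    print(f"total capacity: {total_tb(CAPACITY_GB)} TB")
    print(f"growth 1994-1999: x{annual_growth(DEMAND_GB, 1994, 1999):.2f} per year")
```

Under these assumptions the three storage pools add up to the 4.5 TB quoted, and demand slightly more than doubled each year on average between 1994 and 1999.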

The huge amount of data is handled within a hierarchical and system-independent storage architecture, a separate storage area network (SAN). Storage and computing components have their own networks, which can be extended independently.

At the end of 1999 the Sanger Centre bought additional Compaq Alpha computers for 3.2 million dollars, in preparation for the next major discovery. The machines were installed last month: 12 four-way AlphaServer ES40 systems with an aggregate performance of about 50 GFlop/s, another two two-way ES40 systems and 34 uniprocessor systems. The aggregate peak performance at Sanger now adds up to more than 370 GFlop/s. This would place the Sanger Compaq Alpha complex in the fifties of the current TOP500 list. In addition, disk storage grew by 3 TB.

The IT Future

The center plans to upgrade to Tru64 Unix version 5.0 and to move to tightly coupled clustering for parallel processing. To that end it plans to install the Quadrics switch and software with the ES40 systems. As a result, the center will move to multiprocessor systems with much larger memory. Other topics are Fibre Channel and a cluster-wide file system.

Human Genome Project at Sanger

The human genome research programme at the Sanger Centre maps, sequences and interprets the structure and function of the genome. Sequencing breaks the human genome (DNA) into many small fragments so that the order of the individual bases can be determined. The genome is over-sampled so that the fragments can be matched and reassembled in the correct order. Each human cell contains 23 pairs of chromosomes, unique to each individual, each about 0.001 mm across, and the DNA double helix contained within is about 1 m long. As the Centre is a non-profit organisation, all results are public. Scientists can access chromosome sequencing project information, sequence data and map data. A very fast sequence similarity (BLAST) search service on a per-chromosome or whole-genome basis is available at Sanger.
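The principle of over-sampling and reassembly can be shown with a toy example. This is a minimal sketch of a greedy overlap merge, not the assembly software used at Sanger; the fragment strings, function names and the minimum overlap of 3 bases are assumptions chosen for illustration, and real assemblers must also cope with sequencing errors, repeats and both DNA strands.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b.

    Overlaps shorter than min_len are ignored and reported as 0.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the list of contigs left when no overlap >= min_len remains.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments that are fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more; contigs stay separate
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for n, f in enumerate(frags) if n not in (best_i, best_j)]
        frags.append(merged)
    return frags

# Hypothetical overlapping reads sampled from one short sequence.
READS = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]

if __name__ == "__main__":
    print(greedy_assemble(READS))
```

Because every base in the toy sequence is covered by overlapping reads, the three fragments merge back into a single contig; without that over-sampling the reads would not overlap and their order could not be recovered.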

Human Chromosome 22 Sequence Completed

As reported in the journal Nature (volume 402) on 2 December 1999, the sequence of human chromosome 22 has been completed in an international collaboration. The article can be found on Sanger's web site. Knowledge of the complete genomic DNA sequence of an organism allows a systematic approach to defining its genetic components. The genomic sequence provides access to the complete structures of all genes, including those without known function, their control elements and, by inference, the proteins they encode, as well as all other biologically important sequences. The scientists' ultimate goal is to obtain the complete sequence of the entire human genome. The Chromosome 22 sequence consists of 12 contiguous segments spanning 33.4 megabases and contains at least 545 genes and 134 pseudogenes. This is the first time a human chromosome has been sequenced, and it includes the largest continuous sequence determined so far from any organism. The work gives scientists a real insight into the way genes are arranged along a strand of DNA and how they might be controlled. This opens the way for advances in medical diagnosis and treatment, as Chromosome 22 is implicated in the workings of the immune system, congenital heart disease, schizophrenia, mental retardation and several cancers including leukaemia. In the next step, the remaining 2 billion base pairs of DNA that make up the rest of the human genome are to be decoded and the other 22 human chromosomes mapped. Work continues to complete the estimated 3 billion letters that make up the whole of the human genome, roughly 100 times the size of Chromosome 22. A first working draft will be available by spring 2000, and the highly accurate, finished form before 2003.

www.sanger.ac.uk

 


Uwe Harms
