Visiting the Sanger Centre and Wellcome Trust

Munich, 30 November 2000. Although it was late in November, the Sanger Centre, located at Hinxton Hall near Cambridge on the so-called Genome Campus, showed itself from its sunniest side. It is a world-leading genomics research centre, primarily funded by the Wellcome Trust, and plays a key role in sequencing the human genome. The scientists at Sanger have generated more finished genomic sequence than any other organisation, public or private. The centre is now moving to play an equally important role in the post-genomic era, in functional genetics and informatics. Dr. Richard Durbin of the Sanger Centre gave a short introduction, describing the centre, its international activities, and its current projects.


The Sanger Centre was founded in 1993 and moved into its new, purpose-built main building in 1996. It forms part of the Wellcome Trust Genome Campus, which also includes the European Bioinformatics Institute (EBI), an outstation of the European Molecular Biology Laboratory (EMBL), and the UK Human Genome Mapping Project Resource Centre (HGMP-RC), which is funded by the UK Medical Research Council. The Human Cancer Genome Project is also located at Sanger. Sanger's objective is to further the knowledge of the biology of organisms, in particular through large-scale sequencing and analysis of their genomes.

The Sanger Centre is a major contributor to the Human Genome Project (HGP), the international collaboration to decode the human genome. It is responsible for about one-third of the sequence data for the HGP, working on chromosomes 1, 6, 9, 10, 13, 20, 22 (finished) and X. Sanger also sequences the genomes of pathogens (disease-causing organisms) and model organisms, and provides in vitro and in silico data on gene expression.

Currently, about 570 staff work on projects ranging from Streptococcus equi to the human genome. About two-thirds of this work is on human genome sequencing projects. A genome is the complete set of inheritable (i.e. genetic) instructions required to make an organism. The instructions are genes, or functions. They are contained in chromosomes, which are long chains of deoxyribonucleic acid (DNA).

Richard Durbin discussed the molecular genetic paradigm. The biology and medicine of human beings are determined by the interaction of the products of our genes with each other, and with the environment and pathogens. Therefore, if we know all the genes, we can approach biology and medicine from two sides: genes and molecules on the one hand, and observed outcomes on the other.

The Human Genome Project started in concept in 1985, in principle in 1990, and in earnest in 1995. The sequencing of three billion bases constituted an unprecedented technical as well as logistical challenge for biology. The international public and charity funding agencies are committed to providing the sequence in the public domain, free for research and commercial use. It is accessible from all over the world, and researchers now have to find the meaningful bits. Today, 90 percent of the human genome sequence is publicly available; it is usable and already widely used.

Thirty percent of the sequence is in finished form, at high-accuracy archival reference quality. The centre aims to complete essentially the entire genome to this standard by 2003. The team is proud of the quality of the data: there is hardly one error in 10,000 to 100,000 bases. Richard Durbin listed some other Sanger Centre research programmes:

  • the pathogen sequencing programme
  • the human genetics programme to study genetic variation (SNPs) and find disease genes
  • the cancer genome project
  • informatics, including support of data collection, analysis and presentation of results, and development of methodology, algorithms and data resources

In a typical month, the centre generates about 30,000,000 bases of raw human sequence per day. This is the basic data, which includes overlaps and repeated sequences. The capacity is about 100,000 reads per day across all of the machines. That means an additional 40 Gbyte of storage per day.
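
As a rough illustration of what these throughput figures imply, the short Python sketch below derives per-read averages from the three numbers quoted above. It assumes decimal units (1 Gbyte = 10^9 bytes) and, as a simplification, that the full read capacity is spent on human raw sequence; the function and variable names are hypothetical and not taken from any Sanger software.

```python
# Figures quoted in the article; decimal units are an assumption.
BASES_PER_DAY = 30_000_000            # raw human sequence generated per day
READS_PER_DAY = 100_000               # read capacity across all machines
STORAGE_BYTES_PER_DAY = 40 * 10**9    # 40 Gbyte, assuming 1 Gbyte = 10^9 bytes


def per_read_averages(bases, reads, storage_bytes):
    """Return (average bases per read, average bytes stored per read)."""
    return bases / reads, storage_bytes / reads


bases_per_read, bytes_per_read = per_read_averages(
    BASES_PER_DAY, READS_PER_DAY, STORAGE_BYTES_PER_DAY
)

print(f"average bases per read:   {bases_per_read:.0f}")
print(f"average storage per read: {bytes_per_read / 1000:.0f} kbyte")
```

Under these assumptions a read averages a few hundred bases while occupying several hundred kilobytes, so the stored data is far larger than the base sequence alone.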

The genomic information can be used to identify genetic factors involved in common disease, to assess individual risks, to develop new drugs, to improve diagnosis, such as tumour classification, and to increase the understanding of basic biological science. In human genetics, the goal is to find genetic variations and use them to discover human genetic factors in disease. In cancer, the goal is to find the genetic changes that take place during the disease, to aid in diagnosis and cure.

Richard Durbin mentioned the strong relationship between disease and genetics. Asthma, for example, has a heritability of 60 percent; for bone mineral density the figure is 60 - 80 percent; and for insulin-dependent diabetes mellitus it is 50 - 90 percent. These figures for the heritability of common complex traits and diseases come from twin studies. Over 100 genes involved in disease have so far been identified.

In the Sanger Pathogen Programme, the centre sequences pathogen genomes and begins functional studies based on the sequences. Microbial pathogens have genomes that can be, or have been, sequenced. Newly found genes are targets for new drugs, provided the protein is specific to the pathogen. Genes that are involved in the interaction between host and pathogen are candidates for vaccines. Sanger has so far completed the sequences of M. tuberculosis (TB), S. typhi, C. jejuni (food poisoning), M. leprae, N. meningitidis (meningitis) and Y. pestis, and is still sequencing 18 other bacteria and parts of seven protozoa, including those that cause malaria and sleeping sickness.

At present, Sanger is studying the genetics of model organisms, for example the mouse as a model mammal, the zebrafish as a model vertebrate and the nematode worm as a very simple animal. The Sanger Centre is also a member of the SNP Consortium. SNP stands for single nucleotide polymorphism: "single nucleotide" means a single base, and "polymorphism" refers to more than one form. The consortium consists of 13 companies, for example Beecham, Novartis, IBM and Motorola, together with the Wellcome Trust and 5 research centres. The aim is to find the points in the human genome where variation exists within the population, i.e. the SNPs. The primary targets were met six months early.

Information technology supports data collection and analysis, and presents the results. It also has to develop methods, algorithms, and data resources. The informatics specialists therefore build databases and tools to help manage this information. Ensembl is a new computational project to identify the genes in the human genome sequence and keep them connected to other information resources as knowledge develops.

The other problem is the sheer amount of data: 3 GBytes for the reference human genome sequence, 200 GBytes for analysis results, 3.2 TBytes for mouse genomes and 10 TBytes for human genome skims. Today, genomic science requires High-Performance Computing (HPC), as the rate of acquisition of genomic data increases fourfold per year, from 2 MBytes in 1995 to 3 TBytes in 2000. The future needs more compute power, more sophisticated data management, and better algorithms for real engineering.
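
The growth figures in the previous paragraph can be put side by side with a small calculation. The Python sketch below is only an illustration under stated assumptions: decimal units (1 MByte = 10^6 bytes, 1 TByte = 10^12 bytes), exactly five years of compounding from 1995 to 2000, and hypothetical function names. It computes the constant yearly factor implied by the two quoted endpoints and, separately, what strict fourfold yearly growth from the 1995 figure would give.

```python
# Endpoints quoted in the article; decimal units are an assumption.
START_BYTES = 2 * 10**6     # 2 MBytes in 1995
END_BYTES = 3 * 10**12      # 3 TBytes in 2000
YEARS = 5                   # 1995 to 2000


def annual_growth_factor(start, end, years):
    """Constant yearly multiplier that takes `start` to `end` in `years`."""
    return (end / start) ** (1 / years)


def project(start, factor, years):
    """Value after compounding `start` by `factor` once per year."""
    return start * factor ** years


implied_factor = annual_growth_factor(START_BYTES, END_BYTES, YEARS)
fourfold_result = project(START_BYTES, 4, YEARS)

print(f"yearly factor implied by the endpoints: {implied_factor:.1f}")
print(f"fourfold growth for {YEARS} years gives: {fourfold_result / 10**9:.2f} GBytes")
```

The two results differ considerably, so the quoted rate and the quoted endpoints are best read as separate rough indications of how quickly storage needs are rising.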

Uwe Harms
