Book of life further unravelled in ultra-rapid sequencing with Ensembl software

Munich, 30 November 2000. Since June 26, 2000, the human genome has been sequenced in draft form and is freely available. The "book of life" is very large and still comes out in fragments, as a working draft of 3.1 billion letters. Tim Hubbard from the Sanger Centre for genomics in Hinxton Hall expects that there will be no definitive sequence for at least 3 years. However, scientists want to use the human genome now, and their reasons for accessing it differ. The "bench" biologist checks whether a gene of interest has been sequenced, asks which genes lie in a particular region, and links the genome to other resources. The research bioinformatician wants a data set of human genomic DNA and a protein data set.

The human genome draft sequence of more than 3,000,000,000 bases, denoted with the four letters A, C, G, T, representing the nucleotides which make up DNA in humans and other species, has already piled up over 22 terabytes of data on the Sanger Centre's hard disk drives. It is expected to rise to 50-100 terabytes within 2 or 3 years as researchers investigate how the three-billion-piece "parts list" embedded in our chromosomes determines the way we develop, age and fall victim to disease.

Jim Mulliken, one of Tim Hubbard's colleagues at the Sanger Centre, discussed the Single Nucleotide Polymorphism (SNP) Consortium and the assembly of the whole genome. Given a new sequence, the problem is: "where does it match the draft?" To find matches quickly, Jim Mulliken's team developed the Sequence Search and Alignment using Hash Algorithm, called SSAHA. It is excellent for locating highly similar matches, but needs lots of memory. Compaq's Tru64 Unix allows Sanger to load the entire genome into the memory of one computer, which gives the best performance.
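The general idea of a hash-based sequence search can be sketched in a few lines of Python. This is only an illustration of the principle, not the SSAHA code itself: the function names, the word length of 4 and the toy sequences are assumptions made for this example, and the sketch indexes every overlapping word, which may differ from what SSAHA does. It also shows why memory matters: the whole index of the reference has to be held in memory for the lookups to be fast.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-letter word in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def find_candidate_offsets(index, query, k):
    """Count, for each implied alignment offset, how many query words agree with it."""
    votes = defaultdict(int)
    for qpos in range(len(query) - k + 1):
        # A word found at reference position rpos implies the query starts at rpos - qpos.
        for rpos in index.get(query[qpos:qpos + k], ()):
            votes[rpos - qpos] += 1
    # Best-supported offsets first; ties broken by smaller offset.
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Toy data (hypothetical, for illustration only).
reference = "ACGTACGTTAGCCGATAGGCTTAACG"
index = build_index(reference, k=4)
hits = find_candidate_offsets(index, "TAGCCGATAG", k=4)
best_offset, support = hits[0]
print(best_offset, support)
```

In this toy case all seven 4-letter words of the query agree on the same offset, so the query is placed at position 8 of the reference. A highly similar but not identical sequence would still collect most of its votes at one offset, which is why this kind of search suits near-exact matches.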

The SNP Consortium generates 8 million sequences of 500 bp each. SSAHA-SNP aligns these to the human genome in 16 hours on one Compaq Alpha CPU with 16 GB main memory. This is sufficiently fast to allow re-analysis of the entire data set as more of the genome becomes available. A parallel version is being developed to use all 4 CPUs in an ES40 with 16 GB to attain better performance.

In the meantime, shotgun sequencing of large genomes has become common. A 10x shotgun sequencing of the 35 Mbp malaria genome generates 700,000 sequences of about 500 bp. SSAHAAssemble assembles these sequences in one hour on a single Compaq Alpha CPU within 1 GB memory. In the future, the Sanger Centre will assemble larger genomes, including the mouse genome, which is about the same size as the human genome.
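The read count quoted for the malaria genome follows directly from the coverage: 10x coverage of 35 Mbp means 350 Mbp of raw sequence, which at roughly 500 bp per read gives 700,000 reads. A minimal sketch of that arithmetic, with a helper name chosen for this example:

```python
def reads_needed(genome_size_bp, coverage, read_length_bp):
    """Number of reads required to cover a genome `coverage` times over."""
    return genome_size_bp * coverage // read_length_bp


# Figures from the article: 35 Mbp genome, 10x coverage, ~500 bp reads.
malaria_reads = reads_needed(35_000_000, 10, 500)
print(malaria_reads)
```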

In turn, Tim Hubbard discussed the Ensembl Project, which is developing a software system that produces and maintains automatic annotation of eukaryotic genomes. Ensembl is a joint initiative of the Sanger Centre and the European Molecular Biology Laboratory (EMBL)-European Bioinformatics Institute (EBI). It will enable researchers to analyse new sequence data within 24 hours. Ensembl services will consist of:

  • Pre-calculated genome analysis: location of genes; and integration with other genomic data
  • Interactive services: Blast sequence search services; and interactive Web displays

In July 2000, the Wellcome Trust announced a grant of 8 million British pounds over five years to the Sanger Centre and EMBL-EBI. The group will therefore expand to 30 people. The compute resources will be increased to hundreds of dedicated CPUs. One step is the 320 DS10L compute farm.

The human genome is too complex for any organisation to have a monopoly of ideas or data. Tim Hubbard presented the Ensembl approach as being as open as possible at every level. As in the IT open source community, the project uses open standards:

  • open cvs repository cf. Linux
  • relational database; downloadable flatfiles; on-line schema
  • object model; standard interface making it easy for others to build custom applications on top of Ensembl data
  • open discussion of design

Many companies and academics are represented on the mailing list, and the code is being actively developed externally. In comparative genomics, the requirements are extreme. The genomes of human, mouse, or zebrafish are large, and comparing two of them is effectively a squared problem. The estimate for mouse/human is around 40,000 CPU days. This means 100 CPUs running 100 percent of the time for about 400 days, which is more than a full year.

More details on the SNP Consortium can be found at http://snp.cshl.org. The Ensembl Project can be reached at http://www.ensembl.org.


Uwe Harms
