The Awful Truth
Munich, 30 November 2000. Tim Hubbard brought both good and bad news about the genomes. He also discussed the Ensembl Project, which is developing a software system that produces and maintains automatic annotation of eukaryotic genomes.
The human genome is available now, and it is completely free. It is very large and is being released in fragments as a working draft. Hubbard expects that there will be no definitive sequence for at least 3 years, but people want to use the human genome now, and their reasons for accessing it differ. The "bench" biologist wants to know whether his gene has been sequenced and which genes lie in the same region, and wants to connect the genome to other resources. The research bioinformatician wants a data set of human genomic DNA and a protein data set.
Ensembl Project
Ensembl is a joint Sanger and EMBL-EBI project to develop an automatic annotation system for large eukaryotic genomes. Researchers will then be able to analyse new sequence data within 24 hours. The Ensembl services will be:
- Pre-calculated genome analysis: location of genes, and integration with other genomic data
- Interactive services: Blast sequence search services, and interactive web displays
Ensembl Funding
The Wellcome Trust announced in July a grant of 8 million British pounds over 5 years to Sanger and EMBL-EBI. The group will therefore expand to 30 people. The compute resources will be increased to hundreds of dedicated CPUs; one step is the 320 DS10L compute farm.
Ensembl as Open Source
The human genome is too complex for any organisation to have a monopoly of ideas or data. Tim Hubbard presented the Ensembl approach: "Be as open as possible, at every level." As in the IT open source community, the project uses open standards:
- open CVS repository, cf. Linux
- relational database: downloadable flatfiles; online schema
- object model: standard interfaces make it easy for others to build custom applications on top of Ensembl data
- open discussion of design
Many companies and academics are represented on the mailing list, and the code is being actively developed externally. In comparative genomics the requirements are extreme. The genomes (human, mouse, zebrafish) are large, and comparing two of them is effectively a squared problem. The estimate for mouse/human is around 40,000 CPU days, which means roughly 100 CPUs running 100% of the time all year.
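The arithmetic behind that estimate can be checked with a short sketch. The function names below are illustrative assumptions, not part of Ensembl; the only figures taken from the talk are the 40,000 CPU days and the 100 CPUs. The last function illustrates what "a squared problem" implies under the simplifying assumption that the cost of comparing two genomes grows with the product of their sizes.

```python
def wall_clock_days(cpu_days: float, cpus: int) -> float:
    """Days of elapsed time if the work spreads perfectly over `cpus` CPUs."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return cpu_days / cpus


def cpu_days_per_year(cpus: int, days_per_year: int = 365) -> int:
    """CPU days delivered by `cpus` CPUs running 100% of the time for a year."""
    return cpus * days_per_year


def pairwise_cost_factor(size_factor: float) -> float:
    """Assumed cost growth when both genomes grow by `size_factor`.

    Assumption: cost is proportional to the product of the two genome sizes.
    """
    return size_factor * size_factor


if __name__ == "__main__":
    # Figures quoted in the talk: about 40,000 CPU days on about 100 CPUs.
    print(wall_clock_days(40_000, 100))   # elapsed days on 100 CPUs
    print(cpu_days_per_year(100))         # CPU days that 100 CPUs deliver per year
    print(pairwise_cost_factor(2))        # cost factor if both genomes double
```

On these assumptions, 100 CPUs deliver 36,500 CPU days in a year, so the quoted 40,000 CPU days corresponds to slightly more than a year of continuous running, which agrees with the "around" in the estimate.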
The Ensembl Project can be reached at http://www.ensembl.org.
Uwe Harms