The draft sequence of the human genome runs to more than 3,000,000,000 bases, each denoted by one of the four letters A, C, G and T for the nucleotides that make up DNA in humans and other species. It has already piled up over 22 terabytes of data on the Sanger Centre's hard disk drives. That figure is expected to rise to 50-100 terabytes within two or three years as researchers investigate how the three-billion-piece "parts list" embedded in our chromosomes determines the way we develop, age and fall victim to disease.
Jim Mulliken, one of Tim Hubbard's colleagues at the Sanger Centre, discussed the Single Nucleotide Polymorphism (SNP) Consortium and the assembly of the whole genome. Given a new sequence, the problem is: "where does it match the draft?" To find matches quickly, Jim Mulliken's team developed the Sequence Search and Alignment by Hashing Algorithm, called SSAHA. It is excellent for locating highly similar matches, but needs a lot of memory. Compaq's Tru64 Unix allows Sanger to load the entire genome into a single computer, which gives the best performance.
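The idea of hash-based matching can be illustrated with a minimal sketch. This is not the SSAHA code itself: it simply indexes every k-letter word of a reference sequence in a hash table and lets the words of a query vote for the offset at which they line up. The function names, the toy sequences and the word length k=4 are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Map every k-letter word in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def best_offset(index, query, k):
    """Return (offset, votes) for the reference offset supported by the most
    query words, or None if no word of the query occurs in the index."""
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        # .get() avoids creating empty entries in the defaultdict on a miss.
        for rpos in index.get(query[qpos:qpos + k], ()):
            # A query word at qpos matching the reference at rpos implies
            # that the query starts at reference position rpos - qpos.
            votes[rpos - qpos] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


# Toy data (hypothetical): the query is an exact copy of reference[9:19].
reference = "ACGTTGCAAGGCTTACGATCGGATCCTAGA"
index = build_kmer_index(reference, k=4)
print(best_offset(index, "GGCTTACGAT", k=4))  # (9, 7)
```

Each lookup is a single hash-table access, which is why this style of search is fast for highly similar sequences. In this sketch the index stores a position for every word in the reference, which also hints at why a whole-genome index calls for a great deal of memory.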
The SNP Consortium generates 8 million sequences of 500 bp each. SSAHA-SNP aligns these to the human genome in 16 hours on one Compaq Alpha CPU with 16 GB of main memory. This is fast enough to allow re-analysis of the entire data set as more of the genome becomes available. A parallel version is being developed to use all 4 CPUs in an ES40 with 16 GB to attain better performance.
In the meantime, shotgun sequencing of large genomes has grown common. A 10x shotgun sequencing of the 35 Mbp malaria genome generates 700,000 sequences of about 500 bp each. SSAHAAssemble assembles these sequences in one hour on a single Compaq Alpha CPU within 1 GB of memory. In the future, the Sanger Centre will assemble larger genomes, including the mouse genome, which is about the same size as the human genome.
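The read count quoted for the malaria genome follows from the coverage arithmetic: the total number of bases sequenced is the genome size times the coverage, divided by the read length. A short sketch, in which the function name is a hypothetical one chosen for illustration:

```python
def reads_for_coverage(genome_size_bp, coverage, read_length_bp):
    """Number of reads needed so that the total bases sequenced equal
    coverage x genome size, rounded up to a whole read."""
    total_bases = genome_size_bp * coverage
    # Ceiling division using integers only.
    return -(-total_bases // read_length_bp)


# Figures from the text: 35 Mbp genome, 10x coverage, ~500 bp reads.
print(reads_for_coverage(35_000_000, 10, 500))  # 700000
```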
In turn, Tim Hubbard discussed the Ensembl Project, which is developing a software system that produces and maintains automatic annotation on eukaryotic genomes. Ensembl is a joint initiative of Sanger and the European Molecular Biology Laboratory (EMBL)-European Bioinformatics Institute (EBI). It will enable researchers to analyse new sequence data within 24 hours. Ensembl services will consist of:
- Pre-calculated genome analysis: location of genes and integration with other genomic data
- Interactive services: BLAST sequence search services and interactive Web displays