Fast Genomic Sequence Search

Munich, 30 November 2000. One of the current projects concerns fast sequence search and its computational requirements. Jim Mulliken from Sanger discussed the topic's background, the SNP Consortium, and the assembly of the whole genome.

The human genome draft sequence contains more than 3,000,000,000 bases, denoted by the letters A, C, G and T. Given a new sequence, the problem is to find where it matches the draft. To solve this, Mulliken's team developed SSAHA (Sequence Search and Alignment by Hashing Algorithm), which finds matches quickly. It is well suited to locating highly similar matches, but it needs a lot of memory. Compaq's Tru64 Unix allows the team to load the entire genome into the memory of a single computer, which gives the best performance.
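The hash-based search described above can be illustrated with a small sketch. This is a simplified illustration of the general k-mer hashing idea, not the SSAHA implementation: the function names, the toy reference string, and the word length k = 4 are all assumptions chosen for the example. The reference is cut into non-overlapping words that are stored in a hash table, every word of the query is looked up, and the reference offset that collects the most hits is reported.

```python
from collections import Counter, defaultdict


def build_kmer_index(reference, k):
    """Hash every non-overlapping k-letter word of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(reference) - k + 1, k):
        index[reference[pos:pos + k]].append(pos)
    return index


def best_match_offset(query, index, k):
    """Return the reference offset supported by the most word hits, or None."""
    votes = Counter()
    # Slide over the query one base at a time and look each word up in the table.
    for qpos in range(len(query) - k + 1):
        for rpos in index.get(query[qpos:qpos + k], ()):
            # Hits from the same alignment share the same (reference - query) offset.
            votes[rpos - qpos] += 1
    if not votes:
        return None
    offset, _count = votes.most_common(1)[0]
    return offset


# Toy data (hypothetical): the query is a copy of reference[9:23].
reference = "ACGTTGCATGCCGATAGCTTAGGCTAACGT"
index = build_kmer_index(reference, k=4)
print(best_match_offset("GCCGATAGCTTAGG", index, k=4))  # prints 9
```

The table holds an entry for every word of the reference, which is why an approach of this kind is fast for highly similar matches but memory-hungry for a genome of three billion bases.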

Single Nucleotide Polymorphism

The SNP Consortium generates 8 million sequences of 500 bp each. SSAHA-SNP aligns these to the human genome in 16 hours on one Compaq Alpha CPU with 16 GB of main memory. This is fast enough to allow re-analysis of the entire dataset as more of the genome becomes available. A parallel version is being developed that uses all 4 CPUs of an ES40 with 16 GB to improve performance further.
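To put these figures in perspective, the throughput they imply can be worked out directly. The sketch below uses only the numbers quoted above; the 4-CPU estimate assumes perfect linear scaling, which is an idealisation and not a reported result.

```python
def throughput(reads, read_bp, hours):
    """Return (reads per second, bases per second) for a run of the given length."""
    seconds = hours * 3600
    return reads / seconds, reads * read_bp / seconds


reads_per_s, bases_per_s = throughput(reads=8_000_000, read_bp=500, hours=16)
print(round(reads_per_s))   # about 139 sequences aligned per second
print(round(bases_per_s))   # about 69444 bases per second

# Assumption: ideal linear speed-up on 4 CPUs; real scaling is usually lower.
ideal_hours_4cpu = 16 / 4
print(ideal_hours_4cpu)     # 4.0
```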

Sequence Assembly of the Whole Genome

Meanwhile, shotgun sequencing of large genomes is becoming common. Sequencing the 35 Mbp malaria genome at 10x shotgun coverage generates 700,000 sequences of about 500 bp each. SSAHAAssemble assembles these sequences in one hour on a single Compaq Alpha CPU within 1 GB of memory. In the future the team plans to assemble larger genomes, including the mouse genome, which is about the same size as the human genome.
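The read count quoted for the malaria genome follows from the coverage arithmetic: the number of reads is the coverage times the genome size, divided by the read length. A minimal check, with a hypothetical helper name and a read length of exactly 500 bp assumed:

```python
def reads_needed(genome_bp, coverage, read_bp):
    """Number of reads required to cover the genome `coverage` times (rounded up)."""
    total_bases = genome_bp * coverage
    return (total_bases + read_bp - 1) // read_bp


print(reads_needed(genome_bp=35_000_000, coverage=10, read_bp=500))  # prints 700000
```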

The SNP Consortium can be found at http://snp.cshl.org/index.htm.


Uwe Harms
