In 2001, Novo Nordisk's annual turnover amounted to 3198 million euro. The R&D budget of 534 million euro supports the intensive application of biotechnological techniques such as gene arrays, proteomics and metabonomics. Dennis Madsen described how proprietary as well as public domain data are used in tasks ranging from gene mining and sequence annotation to protein structure analysis. In the end, the various data sources have to be integrated into a database federation.
Needless to say, these procedures involve processing gigabytes of data from biological sequence and structure databases. Whereas biological sequence annotation with BLAST is computationally relatively light, gene mining demands far more computing power, although according to the speaker it is not a common task at Novo Nordisk.
At present, the company has a Linux cluster at its disposal. However, the amount of data is doubling every six to nine months, which strains the current computational resources to the point that solutions have to be sought in other directions, such as load balancing of batch jobs. Each batch job consists of many sub-jobs. At Novo Nordisk, few users submit jobs, and re-running a job poses no problem.
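To put the doubling rate in perspective, the short calculation below converts a doubling time in months into a yearly growth factor. It is only an arithmetic illustration; the function name `annual_growth_factor` is introduced here and does not come from the talk.

```python
def annual_growth_factor(doubling_months: float) -> float:
    """Return how many times the data volume multiplies in 12 months,
    assuming steady exponential growth with the given doubling time."""
    return 2 ** (12 / doubling_months)


if __name__ == "__main__":
    # The two ends of the range quoted in the talk.
    for months in (6, 9):
        factor = annual_growth_factor(months)
        print(f"doubling every {months} months -> x{factor:.2f} per year")
```

Under this assumption of steady growth, the data volume grows roughly 2.5-fold to 4-fold per year.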
The Linux cluster strategy addresses user queries by taking full control of N compute nodes, as Dennis Madsen showed. The sequence databases are split into N parts, and each part is distributed to an individual node. The same job is run on all nodes, with the server pushing the jobs onto the nodes. This should be changed to an active pull by the nodes. The combined result is stored on a server disk. The scalability limits of this system are unknown.
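The split-and-pull idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Novo Nordisk's implementation: threads stand in for compute nodes, a substring match stands in for a real sequence search such as BLAST, and all names (`split_database`, `search_part`, `run_pull_cluster`) are hypothetical.

```python
import queue
import threading


def split_database(records, n_parts):
    """Split the records into n_parts contiguous, nearly equal chunks."""
    size, extra = divmod(len(records), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(records[start:end])
        start = end
    return parts


def search_part(part, query):
    """Toy stand-in for a sequence search: plain substring match."""
    return [seq for seq in part if query in seq]


def run_pull_cluster(records, query, n_nodes):
    """Run the same query against every database part.

    Worker threads (the hypothetical "nodes") pull parts from a shared
    queue instead of having a server push work onto them.
    """
    work = queue.Queue()
    for idx, part in enumerate(split_database(records, n_nodes)):
        work.put((idx, part))

    results = {}
    lock = threading.Lock()

    def node():
        while True:
            try:
                idx, part = work.get_nowait()  # active pull by the node
            except queue.Empty:
                return  # no work left
            hits = search_part(part, query)
            with lock:
                results[idx] = hits

    threads = [threading.Thread(target=node) for _ in range(n_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Merge the partial results in database order (the combined result).
    merged = []
    for idx in sorted(results):
        merged.extend(results[idx])
    return merged


if __name__ == "__main__":
    db = ["ACGT", "TTGA", "GGAC", "ACGG", "CCCC"]
    print(run_pull_cluster(db, "AC", 3))
```

With a pull model, a fast node simply fetches the next part when it is done, so the server does not need to know how loaded each node is.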
As far as Grid applications are concerned, commercial solutions already exist. The sandbox strategy is a characteristic example. Here, the idle computer time of desktop users within the organisation is used without their actually knowing for which tasks and purposes. The network traffic is controllable. The jobs are tailored to each node's compute power and up-time. This Grid approach performs well for jobs that are compute-intensive and require minimal data, but performs rather poorly for sequence searches.
The sandbox approach, however, has no control over whether the nodes are up or down. Some people do tend to switch their computers off during lunch time. A "divide and conquer" policy, on the other hand, does not depend on an "all-nodes-up" strategy. The databases are updated each week, whenever network traffic is triggered by regular desktop user activity. The question remains whether to rely on portable or stationary computers to have active nodes available, since many people are now turning to portable computers for meetings and the like.
Dennis Madsen concluded his talk by stressing that technology alone is not enough to solve the computing power problems of health care companies, since many other factors are involved, such as the willingness of people to lend their idle computer time for scientific purposes. Still, he thinks the problem might be solvable, and in a certain way may already be solved in those cases where idle computer power is used successfully. In other applications, this type of Grid solution may always remain just a dream.