C-DAC's GenomeGrid Portal enables researchers to smoothly run bioinformatics applications

Dresden, 26 June 2007. The International Supercomputing Conference, held this year in Dresden, Germany, organized a Scientific Day on June 26, with a series of selected research paper presentations spread over two parallel sessions. Dr. Rajendra R. Joshi, Group Co-ordinator of the Bioinformatics Group at the Centre for Development of Advanced Computing (C-DAC) in Pune, India, highlighted in his session talk the building, infrastructure, functioning, and use of the GenomeGrid Portal to the ISC'07 audience. Based on a four-layered architecture with a Globus-enabled Grid set-up, this portal provides easy access to distributed resources, supporting the analysis of genome sequence data by means of the Smith-Waterman algorithm, among others.

Dr. Joshi introduced the audience to the importance of high-throughput techniques for the life sciences, supporting the sequencing of DNA, the analysis of gene expression with micro-arrays, the profiling of proteins via high-throughput mass spectrometry, the study of protein-protein interactions, and the analysis of whole-cell responses. Between 1982 and 2005, GenBank grew to 56 gigabases from 52 million sequences. Currently, there are 467 complete published genome projects: 31 of these focus on archaeal, 414 on bacterial, and 22 on eukaryal genomes.

The analysis of biomedical data is highly complex and is being performed on a large scale, ranging from the building blocks of DNA sequences, through protein structures and protein-protein interactions in metabolism, to the human physiology characteristics that are studied in cellular biology, biochemistry, neurobiology, endocrinology, and related disciplines. All this research leads to the creation of genetic maps to better understand the alignments in DNA sequences, and to shed light on the genetic variants in individual patients and the polymorphisms that play a role in epidemiology.

Given the fact that the analysis is stretching from genomes, gene products, their structure and function, over pathways and physiology, to the study of entire populations and their evolution, and eventually to ecosystems as a whole, the speaker explained how computational biology in the high-throughput era is faced with a series of multi-faceted challenges at the levels of science, algorithms, data integration, and computation. This involves a computational evolution from "trivially parallel" bioinformatics computing to the "massively parallel" computing of molecular biophysics data in complex systems at the Terascale.

To give the audience an idea about the computational and networking needs in bioinformatics, Dr. Joshi presented the following table:
Problem Component | Computing Speed | Network | Storage
Genome Assembly | >10 TeraFlops sustained to keep up with expected sequencing rates | 155 Mb/s to 622 Mb/s | 300 TB per genome
Protein Structure Prediction | >100 TeraFlops per protein set in one microbial genome | 622 Mb/s | Petabytes
Classical Molecular Dynamics | 100 TeraFlops per DNA-protein interaction | 2.4 Gb/s | 10s of Petabytes
First Principles Molecular Dynamics | 1 PetaFlops per reaction in enzyme active site | 10 Gb/s | 100s of Petabytes
Simulations of Biological Networks | >1 TeraFlops for simple correlation analyses of small biological networks | 100 Gb/s | 1000s of Petabytes

As such, it becomes clear that a single CPU is not sufficient to perform complex tasks involving computationally intensive bioinformatics algorithms: phylogenetic analysis for different species; multiple prokaryotic and eukaryotic genome comparisons; sequence analysis codes like ClustalW and Smith-Waterman for larger data sets; motif finding tools like MEME; proteomics studies; and molecular modelling codes like AMBER. All these tasks require the use of parallel computing when we take into account the enormous amount of genome data made available through the GOLD on-line database. In fact, 460 genomes have been completed and 1,748 genome sequencing projects are in progress, according to the speaker.
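Sequence analysis codes like Smith-Waterman illustrate why parallel resources matter: the algorithm fills a dynamic programming matrix whose size is the product of the two sequence lengths, so large databases quickly become expensive. Below is a minimal, self-contained sketch of the scoring recurrence (traceback omitted); the match, mismatch, and gap values are illustrative assumptions, not the parameters used on the portal:

```java
public class SmithWaterman {
    // Illustrative scoring scheme (an assumption, not from the talk).
    static final int MATCH = 2, MISMATCH = -1, GAP = -1;

    // Returns the best local alignment score between two sequences.
    static int score(String a, String b) {
        int[][] h = new int[a.length() + 1][b.length() + 1];
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int diag = h[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
                int up = h[i - 1][j] + GAP;
                int left = h[i][j - 1] + GAP;
                // Local alignment: scores never drop below zero.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(score("ACACACTA", "AGCACACA")); // prints 12
    }
}
```

On the Grid, this computation is split across nodes; the sketch only shows the single-CPU core of the algorithm.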

This is where Grid computing comes in, since it allows highly integrated, easy resource sharing with localized system administration. Especially in the case of bioinformatics, Grid computing enables the sharing of bioinformatics data from different sites by creating a virtual organisation of the data. For tasks that require computational resources beyond those available to researchers at a single location, the Grid constitutes an ideal platform. Depending on the intensity of the code and the size of the dataset, the job can be submitted either to any single Grid node or across several nodes, as Dr. Joshi stated.

To prove to the audience that Grid computing is surfing the waves in bioinformatics, the speaker cited a host of projects that are already benefiting from the power the Grid holds in store:

  1. Open Bioinformatics Grid - OBIGrid - is a new infrastructure for bionetwork research by the Bioinformatics Group, RIKEN Genomic Sciences Center
  2. BioGrid is an initiative from the Ministry of Education in Japan, with Osaka University and other relevant institutions, to develop computer Grid technology to meet the IT needs of the biology and medical science domains
  3. ThaiGrid is currently funded by the National Research Council of Thailand (NRCT), the Commission on Higher Education, and the Ministry of Education
  4. K*Grid in Korea is an optimized computing Grid system development for molecular simulation based bio- and nanomodelling simulations
  5. ApGrid is an Asia-Pacific Partnership for Grid Computing with potential partners from Japan, Korea, Australia, Thailand, Taiwan, Singapore, the United States, Canada, China, Hong Kong, Vietnam etc.
  6. TeraGrid includes a Biology and Biomedical TeraGrid Science Gateway provided by the Renaissance Computing Institute.

The GenomeGrid Portal developed by C-DAC at Pune in India is being deployed on the GARUDA Grid infrastructure. GARUDA is a collaboration of science researchers and experimenters on a nationwide Grid of computational nodes. It connects 45 research laboratories distributed among 17 cities across India. GARUDA provides 10 TFlops of computing power in the Grid and at least 100 TB of storage, possibly up to a Petabyte. Its backbone connectivity is evolving to several Mbits/sec., according to the speaker.

GenomeGrid is an initiative towards Grid-enabling bioinformatics applications and building a Grid portal for them. The portal simplifies access to a highly complicated supercomputing Grid, providing a user-friendly web interface that enables bioinformatics experts to utilize the maximum amount of available resources and allows the sharing of data. The web interface provides access to Grid-enabled sequence analysis tools including FASTA, BLAST, Smith-Waterman and ClustalW, and molecular modelling tools like AMBER.

GenomeGrid has a layered architecture in which three core services are provided by the Globus Toolkit. Resource management is provided by the Grid Resource Allocation Manager (GRAM), which is responsible for a set of resources operating under the same site-specific allocation policy. It takes job specifications in the form of Resource Specification Language (RSL) scripts and provides a building block for developing resource brokers, the speaker explained. The information services are taken care of by the Monitoring and Discovery Service (MDS), and for data management there is GridFTP.
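By way of illustration, a minimal RSL script for such a GRAM submission could look as follows; the executable path, arguments, and processor count here are hypothetical, not taken from the portal:

```
& (executable = "/opt/bio/bin/swsearch")
  (arguments = "query.fasta" "swissprot")
  (count = 4)
  (jobtype = mpi)
  (directory = "/home/grid/jobs")
  (stdout = "sw.out")
  (stderr = "sw.err")
```

GRAM parses these attribute-value pairs and hands the request to the local scheduler of the selected site.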

The Grid Security Infrastructure (GSI) is based on public-key cryptography and uses the Secure Sockets Layer (SSL) protocol for mutual authentication between two systems in the Grid environment. In addition, it provides single sign-on and delegation. The Java Commodity Grid (CoG) Kit application programming interface (API) offers a high-level interface to the services delivered by the Globus-enabled Grid test bed in the fourth layer. Using the jglobus API, the following classes were developed for the GenomeGrid:

  • JobRun takes the RSL scripts for the application-specific scheduler in the second layer and submits the job to the remote systems using the GRAM API of the Java CoG Kit.
  • MDSData gathers the available idle CPU information from the systems in the Grid using the API and transfers this information to the application-specific scheduler in the second layer. Using this information, the application-specific scheduler selects the best system depending on the user input and spawns the job on the respective system.
  • GridFTP uses the GridFTP API provided by the jglobus libraries to transfer a file between systems in the Grid environment. It takes the destination hostname and the path of the file to transfer as arguments and sends the file to the respective system.

Dr. Joshi further elaborated on the GenomeGrid components and told the audience that the MySQL Relational Database Management System is used for storing user-specific details such as username and e-mail address, for storing job-specific details for accounting purposes, and for maintaining the queues. Java Server Pages and the Java API have been applied to develop the user interface through which the user can utilize the Grid environment. An application-logic layer was built by means of the Java API. The entire GenomeGrid package was deployed in the Jakarta-Tomcat Web Container on the server machine.

In a nutshell, the GenomeGrid Portal provides an interface to use the Grid resources remotely. A great advantage is that the portal hides the complexity associated with an HPC system and Grid resources. Additional features include secure job transactions; an application-dependent job scheduler; job management tools to check the status of a job and to cancel it; and dynamic job creation on the fly based on user inputs. No supplementary software is required on the client/user machine, except for visualization software.

The speaker pointed out that, in order to make the portal more efficient, an application-specific scheduler has been included in the portal package. It gathers the input provided by the user from the interface and, depending on the input parameters, selects the best system for executing the job and submits the job to the respective system. The application-specific scheduler is divided into the following sub-components: a resource predictor, a resource selector, a resource mapper, a job manager, and an automatic checker.

The algorithm used for the application-specific scheduler follows three steps:

  • Step 1: collecting input from the user
  • Step 2: depending on the input parameters, it takes a decision on the minimum number of processors required to submit the job
  • Step 3: from the data collected by the MDSData class, it checks whether the available resources are sufficient to submit the job:
    • If available resources are sufficient to run the job, it prepares the RSL script and submits the job to the best resource.
    • If the resources are insufficient, it stores the job details in the queue and waits for the availability of resources.
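The steps above can be sketched in plain Java. Everything in this sketch, including the resource records, the CPU threshold rule, and the class names, is an illustrative assumption rather than C-DAC's actual implementation:

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

public class AppScheduler {
    // A simplified view of what the MDSData class might report per Grid node (assumption).
    record Resource(String host, int idleCpus) {}

    // Jobs waiting for resources (Step 3, insufficient-resources branch).
    private final Queue<String> pending = new ArrayDeque<>();

    // Step 2: decide the minimum processor count from the input size (illustrative rule).
    static int minProcessors(long querySizeBytes) {
        return querySizeBytes > 1_000_000 ? 8 : 4;
    }

    // Steps 1-3: pick the best resource and build an RSL script, or queue the job.
    String schedule(long querySizeBytes, List<Resource> resources) {
        int needed = minProcessors(querySizeBytes);
        Resource best = null;
        for (Resource r : resources) {
            // Among nodes with enough idle CPUs, prefer the least loaded one.
            if (r.idleCpus() >= needed && (best == null || r.idleCpus() > best.idleCpus())) {
                best = r;
            }
        }
        if (best == null) {
            pending.add("query:" + querySizeBytes); // wait for resources to free up
            return null;
        }
        // Prepare the RSL script for submission (actual GRAM submission omitted).
        return "&(executable=\"/opt/bio/bin/swsearch\")(count=" + needed
                + ")(host=" + best.host() + ")";
    }

    int queuedJobs() { return pending.size(); }
}
```

In the real portal the JobRun class would then hand the prepared RSL to GRAM; here the returned string merely stands in for that step.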

The speaker showed the audience that the GenomeGrid Portal is running MPICH-G2, a Grid-enabled implementation of the Message Passing Interface (MPI 1.2.7). This allows multiple machines, potentially of different architectures, to be coupled to run MPI applications. MPICH-G2 converts the data from one format to another depending on the receiving system. It uses Globus services to couple the machines, as well as TCP for inter-machine communication and vendor-specific MPI for intra-machine communication.

Dr. Joshi provided the audience with the results of a few applications running with MPICH-G2. For the Smith-Waterman algorithm, the team modified the programme to work in a heterogeneous Grid environment, compiled the modified code, and kept the respective executables on each system. The modified programme was executed with a 3 KB query sequence (HIV-EP2 enhancer-binding protein) against a database of 91.52 MB (SwissProt). FASTA was equally modified to work in a heterogeneous Grid environment, and the modified code was executed with a query sequence of 1.51 MB against a database of 494 MB (mam.fas).

The speaker concluded by indicating future directions for the GenomeGrid Portal. The team is considering replicating the database management and wants job submission to be based on the location of the database. Further work needs to be done on Grid-enabling other widely used bioinformatics tools and on developing the Grid portal for them. The team is also thinking about workflows to connect various applications and to build a complete pipeline for genome analysis, using Grid nodes at the backend.

Authors of the selected ISC'07 paper on "GenomeGrid: A Framework for Bioinformatics Applications on a Computational Grid" are Murali Konnipati, Janaki Chintalapati, Krithika R., Vishwajit Labh, Amit Saxena, Rashmi Mahajan, and Dr. Rajendra Joshi.


Leslie Versweyveld
