Two new software programmes that help address that challenge have recently earned silver-level compatibility certification from the National Cancer Institute's cancer Biomedical Informatics Grid, also known as caBIG. The programmes improve the process of identifying cancer biomarkers from gene expression data.
Developed by May Dongmei Wang and her team in the Wallace H. Coulter Department of Biomedical Engineering at Georgia Tech and Emory University, the programmes - caCORRECT and omniBioMarker - remove noise and artifacts, and identify and validate biomarkers from micro-array data. Funding to develop the programmes was provided by the National Institutes of Health, the Georgia Cancer Coalition, Microsoft Research and Hewlett-Packard.
"Certification by caBIG means the tools can be easily used by everyone in the cancer community to improve approaches to cancer detection, diagnosis, treatment and prevention", stated May Dongmei Wang, an associate professor in the Coulter Department and a Georgia Cancer Coalition Distinguished Cancer Scholar.
caBIG is a collaborative information network that enables researchers, physicians, and patients to share data, tools and knowledge to accelerate the discovery of new approaches that they hope will ultimately improve cancer patient outcomes. To become caBIG-certified, caCORRECT and omniBioMarker passed a rigorous set of requirements, assuring the cancer research community that the software tools are of high quality and interoperable with all other caBIG-certified systems for nationwide deployment.
caCORRECT - chip artifact CORRECTion - is a software programme that improves the quality of collected micro-array data, ultimately leading to improved biomarker selection. Widely used Affymetrix micro-arrays contain thousands of probes, each consisting of a 25-base oligonucleotide sequence, which are used to detect mRNA expression levels.
"Once someone has collected micro-array data, it is important to run quality control on it and remove any problematic points of data that could highlight incorrect biomarkers when analysed", explained May Dongmei Wang, who is also director of the biocomputing and bioinformatics core in the Emory-Georgia Tech National Cancer Institute Center for Cancer Nanotechnology Excellence (CCNE).
Since each micro-array chip contains thousands of spots, it is easy for a few spots to become marred by artifacts and noise. These unusable portions are typically the result of experimental variation between laboratory technicians, or of handling errors that leave scratches, edge effects and bubble effects in the data.
caCORRECT removes the noise and artifacts from the data, while retaining high-quality genes on the array. The software can also effectively recover lost information that has been obscured by artifacts.
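The article does not describe how caCORRECT detects or repairs artifacts, so the following is only a generic illustration of the idea, not the tool's actual method. It assumes that several replicate chips are available, flags a probe value as a likely artifact when its modified z-score (based on the median and the median absolute deviation across replicates) exceeds a threshold, and replaces it with the replicate median. The function name, the threshold of 3.5 and the sample intensities are all hypothetical.

```python
from statistics import median


def correct_artifacts(chips, threshold=3.5):
    """Flag and replace outlying probe values across replicate chips.

    chips: list of replicate chips, each a list of probe intensities
           (all chips must have the same number of probes).
    Returns (corrected_chips, flagged), where flagged is a list of
    (chip_index, probe_index) pairs. The input is not modified.
    """
    corrected = [list(chip) for chip in chips]
    flagged = []
    for p in range(len(chips[0])):
        values = [chip[p] for chip in chips]
        centre = median(values)
        # Median absolute deviation: a spread estimate that a single
        # scratch or bubble cannot inflate.
        mad = median(abs(v - centre) for v in values)
        if mad == 0:
            # No usable spread estimate; conservatively flag nothing.
            continue
        for c, v in enumerate(values):
            score = 0.6745 * abs(v - centre) / mad  # modified z-score
            if score > threshold:
                flagged.append((c, p))
                corrected[c][p] = centre  # fill in from the replicates
    return corrected, flagged


# Hypothetical intensities: 5 replicate chips, 3 probes each.
chips = [
    [100, 50, 200],
    [102, 51, 198],
    [98, 49, 2000],   # probe 2 obscured by an artifact
    [900, 52, 201],   # probe 0 obscured by an artifact
    [101, 50, 199],
]
corrected, flagged = correct_artifacts(chips)
print(flagged)
```

Median-based statistics are used here because a mean and standard deviation would themselves be distorted by the very artifacts being searched for.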
In collaboration with Andrew N. Young, an associate professor in pathology and laboratory medicine at Emory University School of Medicine and clinical laboratory director at Grady Health System, May Dongmei Wang and graduate students Todd Stokes, Martin Ahrens and Richard Moffitt validated the caCORRECT software. A large-scale survey of public data and data from Andrew N. Young's laboratory demonstrated the ability of caCORRECT to assess and improve the quality of a wide array of datasets.
"caCORRECT is a quality assurance tool that allows researchers to utilize and trust imperfect experimental micro-array data that they spent a tremendous amount of time and money to generate", added May Dongmei Wang. "caCORRECT improves the downstream analysis of micro-array data and should be used before conducting biomarker selection, therapeutic target studies, or pathway analysis studies in bioinformatics and systems biology."
Once the quality of the data is assured with caCORRECT, researchers can use the caBIG-certified omniBioMarker software to identify and validate biomarkers from the high-throughput gene expression data.
Candidate cancer biomarkers are typically genes expressed at different levels in cancer patients compared to healthy subjects. omniBioMarker searches these groups of patient data for genes with the highest potential for accurately determining whether a patient has cancer. However, because individual genes are not expressed independently, the software also identifies groups of genes that act in concert.
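As a rough illustration of ranking genes by how differently they are expressed in the two groups, the sketch below scores each gene with the absolute Welch t-statistic and sorts from most to least discriminating. The article does not name the statistics omniBioMarker uses, so this metric is an assumption, as are the gene names and expression values. The sketch scores genes one at a time and does not attempt the grouping of co-acting genes that the software also performs.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two samples with possibly unequal variances."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(cancer, healthy):
    """Return gene names sorted by decreasing |t| between the two groups.

    cancer, healthy: dicts mapping gene name -> list of expression values.
    """
    scores = {g: abs(welch_t(cancer[g], healthy[g])) for g in cancer}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical expression values for three genes, four subjects per group.
cancer = {
    "GENE_A": [9.0, 9.5, 10.0, 9.2],
    "GENE_B": [5.0, 5.2, 4.9, 5.1],
    "GENE_C": [7.0, 6.0, 8.0, 7.5],
}
healthy = {
    "GENE_A": [5.0, 5.5, 5.2, 4.8],
    "GENE_B": [5.1, 5.0, 5.2, 4.9],
    "GENE_C": [6.5, 7.0, 6.0, 7.2],
}
ranking = rank_genes(cancer, healthy)
print(ranking)
```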
The advantage of the omniBioMarker software is that it fine-tunes biomarker selection to a particular dataset or clinical problem based on prior biological knowledge, applying analysis parameters specific to each clinical problem. The parameters are considered optimal when the software selects genes that are already known to be relevant biomarkers, based on clinical observations and laboratory experiments reported in the literature and public databases. With the parameters tuned in this way, the software then finds new potential biomarkers for experimental validation.
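The knowledge-driven tuning described above can be sketched as follows, under stated assumptions: several candidate ranking metrics are tried, the one whose top-ranked genes recover the most already-known biomarkers is kept, and the remaining top-ranked genes are reported as new candidates. The two metrics, the function names, the gene labels and the data are hypothetical, and the actual parameters that omniBioMarker tunes are not given in the article.

```python
from statistics import mean, stdev


def mean_difference(a, b):
    """Absolute difference of group means; ignores within-group noise."""
    return abs(mean(a) - mean(b))


def signal_to_noise(a, b):
    """Difference of means scaled by the within-group spread."""
    return abs(mean(a) - mean(b)) / (stdev(a) + stdev(b))


def rank(cancer, healthy, metric):
    """Gene names sorted from highest to lowest metric score."""
    return sorted(cancer, key=lambda g: metric(cancer[g], healthy[g]), reverse=True)


def choose_metric(cancer, healthy, metrics, known, top_k):
    """Pick the metric whose top_k genes contain the most known biomarkers.

    Returns (name_of_best_metric, novel_candidates_in_its_top_k).
    """
    def recovered(name):
        top = rank(cancer, healthy, metrics[name])[:top_k]
        return len(known.intersection(top))

    best = max(metrics, key=recovered)
    top = rank(cancer, healthy, metrics[best])[:top_k]
    return best, [g for g in top if g not in known]


# Hypothetical data: K1 is a known biomarker, N1 is noisy, C1 is unreported.
cancer = {
    "K1": [6.0, 6.1, 5.9, 6.0],
    "N1": [10.0, 2.0, 12.0, 4.0],
    "C1": [8.0, 8.1, 7.9, 8.0],
    "B1": [5.0, 5.3, 4.7, 5.0],
}
healthy = {
    "K1": [5.0, 5.1, 4.9, 5.0],
    "N1": [4.0, 6.0, 2.0, 8.0],
    "C1": [6.5, 6.6, 6.4, 6.5],
    "B1": [5.1, 4.9, 5.2, 5.0],
}
metrics = {"mean_difference": mean_difference, "signal_to_noise": signal_to_noise}
best, novel = choose_metric(cancer, healthy, metrics, known={"K1"}, top_k=2)
print(best, novel)
```

In this toy dataset the raw mean difference is misled by the noisy gene N1, whereas the noise-scaled metric places the known biomarker K1 in its top two, so that metric is kept and C1 emerges as the candidate for experimental validation.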
May Dongmei Wang, graduate student John Phan and Andrew N. Young tested the ability of the software to identify biomarkers in clinical renal cancer micro-array data. The researchers selected renal cancer for study because it has several distinct subtypes, which can appear in the same person in varying degrees and must be treated according to the diagnosed subtype to maximize treatment success. The results indicate that integrating prior laboratory and clinical knowledge with the micro-array data improves biomarker selection.
"Using omniBioMarker to create an optimal metric for ranking and identifying novel biomarkers reduces the number of false discoveries, increases the number of true discoveries, reduces the required time for validation and increases the overall efficiency of the process", noted May Dongmei Wang.
Since receiving caBIG silver-level compatibility certification for caCORRECT and omniBioMarker, May Dongmei Wang and her team have been working on getting two more software programmes certified - Q-IHC, a tool that analyses and quantifies multi-spectral images such as quantum dot-stained histopathological images, and omniVisGrid, a Grid-based tool that visualizes data and analysis processes of micro-arrays, biological pathways and clinical outcomes.
This work was funded by grant numbers R01CA108468, P20GM072069 and U54CA119338 from the National Institutes of Health (NIH). The content is solely the responsibility of the principal investigator and does not necessarily represent the official view of the NIH.