"As a high-performance data warehouse and large-scale data-mining environment, the new system allows the Feinstein Biorepository to manage vast amounts of information derived from the collection, processing and analysis of large numbers of biological specimens. A data-management system of this magnitude is unprecedented within academic research facilities", explained Anthony Ingraffea, CTC's acting Director.
The Biorepository at the Feinstein Institute was built in 1998 and has grown to store hundreds of thousands of human samples of different types, such as serum, plasma, DNA, cells, tissues and tumours, along with extensive amounts of associated data, to support many large scientific studies. Both control and disease-affected samples are collected and managed along with clinical, laboratory and bioinformatics data.
One segment of sample analysis that has grown dramatically in the past six months is the identification of single nucleotide polymorphisms, or SNPs. SNPs are DNA sequence variations that occur when a single nucleotide in a genome sequence is altered. SNPs make up about 90 percent of all human genetic variation, and scientists believe they may predispose people to a disease or influence their response to a drug. Currently, researchers at the Feinstein Institute are generating approximately 8 to 10 million SNP genotypes each day, and they anticipate accumulating 3 billion or more SNP genotypes over the next year.
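As a rough check on how the daily figure relates to the yearly one, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes genotyping runs at a constant rate on every day of a 365-day year, which the article does not state, and the function and constant names are hypothetical.

```python
# Reported daily throughput range (from the article), in SNP genotypes per day.
DAILY_LOW = 8_000_000
DAILY_HIGH = 10_000_000

# Assumption (not from the article): genotyping runs every day of a
# 365-day year at a constant rate.
DAYS_PER_YEAR = 365


def projected_annual_genotypes(daily_rate: int, days: int = DAYS_PER_YEAR) -> int:
    """Return the genotypes accumulated over `days` at a constant daily rate."""
    return daily_rate * days


low = projected_annual_genotypes(DAILY_LOW)
high = projected_annual_genotypes(DAILY_HIGH)
print(f"Projected yearly total: {low:,} to {high:,} genotypes")
# prints "Projected yearly total: 2,920,000,000 to 3,650,000,000 genotypes"
```

Under these assumptions the projected range of roughly 2.9 to 3.7 billion genotypes is in line with the anticipated three billion or more.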
"The difficulties in managing and manipulating these very large datasets required the creation of a new data centre capable of high-performance data management", stated Robert Lundsten, Biorepository Director. "Management of research-subject annotation is also quickly becoming a high-performance computing issue", he added.
The Feinstein Biorepository informatics system includes a Unisys ES7000 symmetric multiprocessing (SMP) server with four 64-bit Intel Itanium 2 processors; the machine is expandable to 64 processors and 256 GB of RAM. The system is unique in that it runs Microsoft Windows Server 2003 Enterprise Edition R2. The platform was designed to run SQL Server 2005 64-bit Enterprise Edition. Data is written through four host bus adapters directly to an EMC CLARiiON CX300 RAID disk array. The computing centre also has an assortment of in-house 32-bit applications running on Dell PowerEdge servers and Dell PowerVault disk arrays.
Creating an efficient data-management environment is the first step in developing an effective data-mining environment. "CTC's experience in data-management design was very helpful", Robert Lundsten emphasised. "They know how to design systems and databases that optimize performance in a Microsoft Windows and SQL 2005 environment."
CTC is an interdisciplinary research centre at Cornell University focused on providing cyberinfrastructure resources for research and education; these resources include high-performance and data-intensive computing hardware and expertise, visualization, and K-12 outreach. Scientific and engineering projects supported by CTC represent a wide variety of disciplines, including bioinformatics, behavioural and social sciences, computer science, engineering, geosciences, mathematics, physical sciences, and business. Business and industry also draw on CTC to stay at the cutting edge of high-performance computing, storage, and database management technologies.
Located in Manhasset, New York, the Feinstein Institute for Medical Research is part of the North Shore-Long Island Jewish Health System, New York State's largest health care network. The Feinstein is among the top six percent of all institutions nationally that receive funding from the National Institutes of Health. Building on its strengths in immunology and inflammation, oncology and cell biology, human genetics, and neurodegenerative and psychiatric disorders, its goal is to understand the biological processes that underlie various diseases and translate this knowledge into new tools for diagnosis and treatment.