Just as sequences of letters in a language fall into patterns that make them understandable, sequences of amino acids in proteins can be read to understand their structure, dynamics, and function. Sequences of amino acids and their constituents can be thought of as syllables or words that have particular properties.
A deeper understanding of the relationship between protein structure, dynamics and function can help to extract information hidden in the gene sequences of genomes, which may, in turn, help develop drugs to fight disease. Today, there is great societal demand to understand and treat degenerative diseases, many of which are based on defective triggers for protein shape and interactions.
The project's principal investigators are Raj Reddy, Carnegie Mellon's Simon University Professor of Computer Science and Robotics, and Judith Klein-Seetharaman, assistant professor of pharmacology at the University of Pittsburgh Medical School, who also holds an appointment at Carnegie Mellon's Language Technologies Institute (LTI).
"The Human Genome Project and related genome sequencing efforts have provided a wealth of data, which has stirred great hopes for increasing our understanding and treating of disease or for mimicking nature's inventions in nanomachine design", stated Judith Klein-Seetharaman. "But the precise relationship between a primary sequence and the structure, dynamics and function of the encoded proteins is one of the most fundamental unanswered questions in biology."
"The computational biolinguistics project promises to provide novel views and approaches to solving these challenges that would not be obvious without thinking in terms of the analogy between language and biology."
Carnegie Mellon will be the central site for the computational biolinguistics project. Its scientists will supply all of the necessary computational and language modelling technologies. Other partners will provide the bulk of biological and proteomic research and the laboratories where experimental work will take place.
There is also an industrial component to the project. MathWorks Inc. of Natick, Massachusetts, will work with Carnegie Mellon scientists to enhance its MATLAB mathematical software to better support computational biolinguistics research. Medstory Inc., based in Burlingame, California, which deals with drug innovation informatics, will focus on the clinical and drug development relevance of computational discoveries made under this programme.
Professors Reddy and Klein-Seetharaman, together with Language Technologies Institute director and computer science professor Jaime Carbonell and LTI associate professors Ronald Rosenfeld and Yiming Yang, have been doing preliminary work in computational biolinguistics for nearly two years. By applying statistical language modelling technologies to genome sequences, they have been able to detect protein fragment signatures from pathogens.
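To give a rough sense of what statistical language modelling on protein sequences can look like, the following Python sketch treats overlapping runs of amino acids as "words" and compares a query fragment against two small bigram models. It is an illustration only, not the method used by the Carnegie Mellon team: the class `BigramModel`, the function `signature_score`, the choice of add-one smoothing and the toy sequences are all assumptions made for this example, and the sequences are invented rather than taken from real proteins.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def ngrams(sequence, n):
    """Return every overlapping n-residue 'word' in a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class BigramModel:
    """Hypothetical bigram model over residues with add-one smoothing."""

    def __init__(self, sequences):
        self.pair_counts = Counter()     # counts of residue pairs, e.g. "KL"
        self.context_counts = Counter()  # counts of the first residue of each pair
        for seq in sequences:
            for pair in ngrams(seq, 2):
                self.pair_counts[pair] += 1
                self.context_counts[pair[0]] += 1

    def log_prob(self, sequence):
        """Sum of log P(next residue | previous residue) over the sequence."""
        total = 0.0
        for pair in ngrams(sequence, 2):
            numerator = self.pair_counts[pair] + 1
            denominator = self.context_counts[pair[0]] + len(AMINO_ACIDS)
            total += math.log(numerator / denominator)
        return total


def signature_score(query, target_model, background_model):
    """Log-likelihood ratio; positive means the query resembles the target set."""
    return target_model.log_prob(query) - background_model.log_prob(query)


# Toy, made-up sequences for illustration only (not real proteins).
target_set = ["MKKLLKKLLKKM", "KKLLKKLLKKLL"]
background_set = ["MGSTAGSTAGST", "AGSTGSTAAGST"]

target = BigramModel(target_set)
background = BigramModel(background_set)

print(ngrams("MKKLL", 3))
print(signature_score("LLKKLLKK", target, background))
print(signature_score("GSTAGSTA", target, background))
```

In this toy setting, a fragment built from the residue pairs common in the target set scores above zero, while one built from the background pairs scores below zero. A real analysis would need far larger training sets, longer contexts and careful statistical validation.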
The computational biolinguistics grant is one of more than 300 announced by the National Science Foundation as part of its Information Technology Research (ITR) programme. This year, the National Science Foundation awarded a total of $144 million in new grants under the programme.