GriPhyN initially aims at giving scientists an integrated toolset to efficiently
access and analyse the vast amounts of data expected to flow from the world's
most ambitious physics and astronomy experiments, but it also could have
applications in the business world and elsewhere, said Paul Avery, lead scientist
and University of Florida professor of physics.
"We need to plan for these experiments now, because we can't wait until they
start," Avery said. "A personal computer today can do about a billion operations
per second. The overall computing power we need is about 1 million times more
than that."
GriPhyN represents a major addition to the broad-based research and development
projects that are developing the concept of a ubiquitous computational grid,
such as the NSF's Partnerships for Advanced Computational Infrastructure (PACI)
programme. GriPhyN will focus on the data-intensive problems characteristic of
the largest physics experiments. In order to handle these problems, existing
grid concepts must be extended so that computational, data-handling, and network
resources that transport the data can be co-scheduled in a co-ordinated fashion.
According to project co-leader Ian Foster, Professor in Computer Science at the
University of Chicago and Associate Director of the Mathematics and Computer
Science Division of Argonne National Laboratory, GriPhyN will contribute to the
efforts of the two PACI partnerships, the National Computational Science
Alliance (Alliance) and the National Partnership for Advanced Computational
Infrastructure (NPACI), by developing new tools for data-intensive grid
computing and by bringing new research communities into the Grid
infrastructure.
GriPhyN will utilise and build upon two technologies developed by PACI partner
institutions: Globus, developed by Alliance partner Argonne in collaboration
with USC's ISI, an NPACI partner; and Condor, developed by Alliance partner
Wisconsin. Globus is an integrated set of software components designed to
support Grid applications. Condor is a high-throughput computing system that
offers computing power by capturing cycles on idle machines. GriPhyN will push
further development of Globus and Condor as tools used in high-throughput
managed data access and processing across wide area networks. It will also build
on the work of the Particle Physics Data Grid Project, led by Caltech and
Stanford Linear Accelerator Center, and will involve collaboration with the
European Union's DataGrid project, led by CERN.
GriPhyN involves more than a dozen institutions nationally and will pioneer a
new concept called virtual data, in which the entire resources of a scientific
collaboration become a single vast computing and storage system.
"Results will be computed only if and when needed," Foster said. "Much of the
time, the result you need will already have been computed by one of your
colleagues and the system will know where to find it."
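The idea Foster describes can be illustrated with a minimal sketch, assuming a simple in-memory catalogue. The names VirtualDataCatalog, get and mean_energy are hypothetical and are not part of GriPhyN, Globus or Condor; the sketch shows only that a result is derived on first request and looked up afterwards.

```python
from typing import Callable, Dict, Tuple


class VirtualDataCatalog:
    """Hypothetical catalogue mapping a request to an already-computed result."""

    def __init__(self) -> None:
        self._results: Dict[Tuple[str, int], float] = {}
        self.computations = 0  # how many times a result was actually derived

    def get(self, dataset: str, run: int,
            derive: Callable[[str, int], float]) -> float:
        key = (dataset, run)
        if key not in self._results:
            # Not yet computed by anyone: derive it now and record it.
            self.computations += 1
            self._results[key] = derive(dataset, run)
        # Otherwise the catalogue already knows where the result is.
        return self._results[key]


def mean_energy(dataset: str, run: int) -> float:
    # Stand-in for an expensive analysis step (illustrative only).
    return float(len(dataset) * run)


catalog = VirtualDataCatalog()
first = catalog.get("collider-events", 7, mean_energy)   # computed on demand
second = catalog.get("collider-events", 7, mean_energy)  # found, not recomputed
```

In a real data grid the catalogue would be shared across institutions and would record where a result is stored rather than hold it in memory; this sketch leaves that out.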
The initiative initially will benefit four physics experiments that will explore
the fundamental forces of nature and the structure of the universe.
Two experiments at the European Laboratory for Particle Physics near Geneva will
search for the origins of mass using the Large Hadron Collider, which will
become the world's highest-energy particle collider when it begins operation in
2005. The Laser Interferometer Gravitational-wave Observatory, based in
Louisiana and Washington, will probe the gravitational waves of pulsars,
supernovae and other phenomena. The Sloan Digital Sky Survey, conducted from
Apache Point Observatory in New Mexico, by the University of Chicago and other
institutions, is carrying out a massive automated survey of the stars.
Each of these experiments will produce huge amounts of data that scientists
located at different institutions around the world will want to search and
manipulate. Genomics is another example of a major area of science where data
volumes are increasing much faster than analysis capabilities, Foster said. So
large are the data collections that scientists anticipate they will be measured
in petabytes, each of which is a quadrillion bytes.
The world's most powerful supercomputers today can store and process data
measured in terabytes. A terabyte is a trillion bytes, or a thousand
gigabytes. One petabyte is roughly the amount of data that can be contained on a
million personal computer hard drives. A PC hard drive contains approximately a
gigabyte, which equals a billion bytes.
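The unit relationships quoted above can be checked with a few lines of arithmetic. This is a sketch using decimal (power-of-ten) units and the article's own figure of roughly one gigabyte per PC hard drive; the variable names are illustrative.

```python
GIGABYTE = 10**9    # a billion bytes
TERABYTE = 10**12   # a trillion bytes
PETABYTE = 10**15   # a quadrillion bytes

# The article's approximate capacity of a PC hard drive.
PC_DRIVE_BYTES = 1 * GIGABYTE

gigabytes_per_terabyte = TERABYTE // GIGABYTE
drives_per_petabyte = PETABYTE // PC_DRIVE_BYTES

print(gigabytes_per_terabyte, "gigabytes per terabyte")
print(drives_per_petabyte, "one-gigabyte drives per petabyte")
```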
By tapping into the computer power of multiple institutions around the world, a
computational data grid could significantly boost both storage and calculating
capacity. The result will not reside at one location or one supercomputer, but
rather will be spread throughout the institutions, much like power plants
connected to an electrical grid.
"The electrical grid is a useful analogy, because users ranging from individuals
to large organisations will consume computing and data resources in greatly
differing amounts and they will not care where those resources are located,"
Avery said.
Scientists will need not only access to the data but also the ability to carve
out chunks of it and manipulate the chunks to produce results. Because of their
size or the available computing power, the movement of these data chunks around
the network will have to be scheduled at different times, a task that will
require a kind of "intelligent" network.
"A worldwide community of perhaps thousands of physicists want to be able to
have their combined computer, storage and network resources used as a single
computing engine to solve their problems," Foster said. "This requires new
technology that can co-ordinate potentially thousands of processors, petabytes
of storage and a variety of high-speed and low-speed networks and cause them to
operate in some sense as a single analysis engine."
Although intended initially for science, GriPhyN could also prove useful for
large business applications, Avery said. For example, companies with multiple
sales outlets don't always store sales data in one central location. But
marketers hoping to identify consumer buying habits may wish to comb through
all of the company's sales data to find them.
"There's a huge amount of interest in the technology that would allow companies
to actually study these large archives of commerce data," Avery said.
The $11.9 million NSF grant is for research and development only, with no money
for hardware, Avery said. The researchers will seek a total of $70 million in
NSF grants for further research and equipment to build the system. Research and
construction likely would take place simultaneously, with a target completion
date of 2005, he said.
The institutions participating in GriPhyN are the University of Florida;
University of Chicago; University of Southern California; California Institute
of Technology; Harvard University; Indiana University; Johns Hopkins University;
Northwestern University, University of California, Berkeley; University of
California, San Diego; University of Illinois, Chicago; University of
Pennsylvania; Stanford University; University of Wisconsin, Madison; University
of Wisconsin, Milwaukee; and the University of Texas, Brownsville.
For more information about GriPhyN, see http://www.griphyn.org/. For a complete
list of Information Technology Research awards and project abstracts, see
http://www.itr.nsf.gov/.