Peskin's research, and his software called PULSE 3D, center on his conviction that simulated real-time heartbeats will lead to better treatment of heart disease. Instead of relying on extensive testing and exploratory surgery, with long delays in obtaining critical information, doctors will someday view a simulation of a patient's heartbeat from the relative comfort of a medical office, using a common desktop computer linked to a larger and very fast computer. A real-time view of a virtual heart will provide doctors with more precise evaluations of heart actions and enable effective tests of replacement parts for diseased hearts.
The heart is a complex muscle that delivers blood throughout the body with machine-like precision. Scientists in turn can better understand the machinery of the heart by using another machine, the supercomputer. Charles Peskin, his colleague Dr. David McQueen, and other researchers constantly seek improvements in the operation of both machines. Their goal of real-time virtual modelling of the body will be reached only with a vast increase in power and speed from these specialized computers, which are already the fastest computing devices ever created.
The simulated heart, however, renders even the supercomputer inadequate and poses a daunting supercomputing problem. The fluid dynamics of blood, the heart's hundreds of fibers, and its regulation of carbon dioxide and oxygen require an extremely complex simulation. The task requires a new type of supercomputer, based on a distinctly different architecture, which raises the ceiling of computational power and processing speed.
The design of conventional supercomputers contributes to the need for this new model of computing. Despite huge advances in computing technology, speed, and performance, even supercomputers cannot run some applications effectively. Peskin's PULSE 3D and other similar modelling applications demand the power of hundreds of thousands of individual computer processors working in parallel. While such massively parallel computers exist, to date they have proved inadequate to the challenge of simulated heartbeats. After years of research and complex programming, current supercomputers still take a week to produce a single, three-dimensional image of a heartbeat. That timeframe is not acceptable; scientists realize that simulated hearts must beat several times, and in a series, before the virtual heart is of therapeutic value. The current state of medical and supercomputing technology means that these successful but very slow models are, says David McQueen, "interesting but not useful".
Even though today's animation of the heart is still not as realistic and timely as they would like, scientists are optimistic that the speed and visual quality they seek will become available in a new type of supercomputer manufactured by the Tera Computer Company of Seattle, Washington. Tera computers rely on an innovative architecture designed around multiple threads (many lines of independent work) in a programme. Known as an MTA machine (for its multithreaded architecture), Tera's product works more efficiently than massively parallel computers, requires less programming, and will lead to detailed and fast simulations of everything from the human body to car crashes.
The proprietary architecture of an MTA machine is an advance on the majority of computer designs developed over the last 50 years. Today's computers suffer from the so-called "von Neumann bottleneck", a problem of memory usage that is a fundamental issue in computing.
During the 1940s and 1950s, mathematician John von Neumann designed and built a stored-programme computer. His machine, and many subsequent computers, utilized memory that stored both instructions and data for the calculations. A processor interpreted the instructions and completed the calculations. The processor both retrieved the information from memory and stored the computational results. Reflecting von Neumann's extraordinarily broad view of computational theory, his computer remains a tribute to his genius.
Many modern computers, still relying on the von Neumann design, spend time moving information between memory and processor. In the last 50 years, the speed of processors and the capacity of memory have increased substantially. The time needed to move information between a computer's processor and the machine's memory, however, has not decreased to the same degree. This delay, known as memory latency, hobbles the performance of supercomputers. Given the physical limitations of integrated circuitry and the speed at which electrons can travel through it, a single processor can attain only a certain level of performance. No matter how fast its clock speed, the processor sits idle while waiting for memory, instead of performing productive work.
Consequently, the vastly increased computational power promised by a conventional parallel processing supercomputer goes unrealized as its multiple processors must wait for the information stored in the computer's relatively slow memory to catch up. The specialized, high-performance applications that rely on this architecture have two key characteristics: they are unstructured and dynamic. Unstructured codes have irregular data access patterns, which makes them particularly frustrating to run on cache-based architectures that attempt to hide latency by making repeated use of the data. Dynamic applications have workloads that vary during the runtime of the computation, requiring dynamic balancing of these workloads, a costly and formidable task on distributed memory machines.
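The difficulty that irregular access patterns pose for caches can be illustrated with a small sketch. It is not taken from any Tera or SDSC code; it assumes a hypothetical least-recently-used cache of 64 lines holding 8 words each, and compares a regular sweep through memory with random accesses spread over a million words.

```python
import random
from collections import OrderedDict


def hit_rate(addresses, cache_lines=64, line_size=8):
    """Fraction of accesses served by a small LRU cache (sizes are assumptions)."""
    cache = OrderedDict()  # line tag -> None, ordered from least to most recently used
    hits = 0
    for addr in addresses:
        tag = addr // line_size          # which cache line this word belongs to
        if tag in cache:
            hits += 1
            cache.move_to_end(tag)       # mark the line as most recently used
        else:
            cache[tag] = None            # fetch the line from slow memory
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(addresses)


random.seed(0)
n = 100_000
memory_words = 1_000_000

regular = list(range(n))                                        # structured sweep
irregular = [random.randrange(memory_words) for _ in range(n)]  # unstructured access

regular_rate = hit_rate(regular)
irregular_rate = hit_rate(irregular)
print(f"regular access hit rate:   {regular_rate:.3f}")
print(f"irregular access hit rate: {irregular_rate:.4f}")
```

In this toy model the regular sweep finds seven of every eight words already in the cache, while the irregular pattern almost never does, so nearly every access pays the full memory latency.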
A Tera machine, however, overcomes these obstacles and drives the ghost of John von Neumann out of the supercomputer. In its state-of-the-art production facilities in Seattle's historic Pioneer Square, Tera is perfecting a proprietary supercomputing architecture that virtually eliminates latency and delivers the full potential of parallel processing. Tera's multithreaded processor architecture significantly increases computational speed through a vastly simpler design. MTA systems with up to eight processors have been built so far. The company has an order for a 16-processor system and plans to double the size of the MTA systems every six months. Tera's plans call for systems with 64 or 128 processors by the year 2001.
How does multithreading differ from, and improve upon, other types of supercomputer architecture? Most supercomputers are mono-threaded, meaning each processor works on only one thread of computation at a time. That thread, and the processor with it, remains idle while information moves between processor and memory. An MTA machine, however, implements multithreading, a means of exploiting the parallelism inherent in an application. These threads run on an entirely new kind of processor with 128 hardware streams, as opposed to the single stream of traditional processors. The processor can switch between threads at every clock tick with no computational cost. At each tick a new instruction is issued and the threads already in the instruction pipeline move forward. The processor is connected through a high-bandwidth network to memory, so that while some threads may be waiting for memory operations, others are still performing productive work. This system tolerates even very long latency times, so with sufficient parallel work, the processor is never idle.
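A toy cycle-by-cycle model shows why many hardware streams can hide memory latency. This is a simplified sketch and not a description of the real MTA pipeline: it assumes that every instruction is followed by a memory wait of a fixed, hypothetical 100 cycles, and that the processor issues at most one instruction per cycle from whichever stream is ready. Only the figure of 128 streams comes from the article.

```python
def utilization(streams, latency, cycles=10_000):
    """Fraction of cycles in which the processor issues an instruction.

    Assumed model: after a stream issues, it must wait `latency` cycles
    (a memory operation) before it is ready to issue again.
    """
    ready_at = [0] * streams  # cycle at which each stream may issue again
    issued = 0
    start = 0                 # round-robin starting point for the search
    for cycle in range(cycles):
        for k in range(streams):
            s = (start + k) % streams
            if ready_at[s] <= cycle:
                ready_at[s] = cycle + latency  # stream now waits on memory
                issued += 1
                start = (s + 1) % streams      # switch streams at no cost
                break                          # one instruction per cycle
    return issued / cycles


LATENCY = 100  # hypothetical memory latency in cycles

single = utilization(streams=1, latency=LATENCY)
partial = utilization(streams=50, latency=LATENCY)
full = utilization(streams=128, latency=LATENCY)
print(f"  1 stream : {single:.2f}")
print(f" 50 streams: {partial:.2f}")
print(f"128 streams: {full:.2f}")
```

Under these assumptions a single stream keeps the processor busy for only 1 cycle in 100, while 128 streams keep it busy on every cycle, provided the application supplies enough independent threads to fill them.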
As the hardware increases the number and effectiveness of the computations, a correspondingly lesser effort is required of programmers. Charles Peskin, for example, notes that he and David McQueen wrote a tremendous amount of extra programming code for PULSE 3D, so that the application functions within the limitations imposed by a parallel processing computer. The structure of memory in these computers requires the allocation of data to multiple processors and specific memory locations. This data must then be recombined into a single set of data. Such workarounds are not necessary on an MTA machine. Tera's innovative hardware architecture allows simpler programming. Relying on hardware solutions to programming problems, the user of a Tera machine writes less code to achieve a better programme.
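The extra code that distributed memory demands can be seen even in a very small example. The sketch below is not PULSE 3D code; it smooths a list of numbers by averaging each value with its neighbours, first in a shared-memory style and then in a distributed-memory style that must scatter chunks (with copied boundary cells), compute locally, and gather the results. The function names are hypothetical, and the "workers" are simulated one after another rather than run in parallel.

```python
def smooth_shared(data):
    """Shared-memory style: any worker may read any neighbour directly."""
    n = len(data)
    return [
        (data[max(i - 1, 0)] + data[i] + data[min(i + 1, n - 1)]) / 3
        for i in range(n)
    ]


def smooth_distributed(data, workers):
    """Distributed-memory style: scatter chunks, compute locally, gather."""
    n = len(data)
    bounds = [(w * n // workers, (w + 1) * n // workers) for w in range(workers)]
    pieces = []
    for lo, hi in bounds:
        # Scatter: each worker gets a private copy of its chunk plus one
        # boundary ("halo") cell from each neighbouring chunk.
        left = data[max(lo - 1, 0)]
        right = data[min(hi, n - 1)]
        local = [left] + data[lo:hi] + [right]
        # Compute: the worker sees only its private copy.
        pieces.append(
            [(local[j - 1] + local[j] + local[j + 1]) / 3
             for j in range(1, len(local) - 1)]
        )
    # Gather: recombine the pieces into a single result.
    return [value for piece in pieces for value in piece]


samples = [float((i * 7) % 11) for i in range(40)]
shared_result = smooth_shared(samples)
distributed_result = smooth_distributed(samples, workers=4)
print(shared_result == distributed_result)
```

Both versions give the same answer, but the second needs bookkeeping for chunk boundaries, copied edge cells, and recombination. That bookkeeping grows quickly for three-dimensional, time-varying data.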
A current Tera customer, the San Diego Supercomputer Center (SDSC), utilized an eight-processor MTA system to port PULSE 3D to a Tera machine. Other applications being ported to the MTA, either by SDSC or Tera, include MSC Software's NASTRAN, a structural analysis code used extensively by the automobile and aerospace industries; Livermore Software's LS-DYNA, which can simulate physical occurrences such as car crashes and metal stamping; GAUSSIAN 98, a computational chemistry application used in molecular modelling; and MPIRE (for Massively Parallel Interactive Rendering Environment), a powerful graphics and animation application that visualizes complex phenomena. Tera expects that each of these applications will run significantly faster on an MTA machine than on any other computer system.
MPIRE, for example, faces the same limitations imposed on PULSE 3D by parallel processing supercomputers. As MPIRE spreads many individual pieces of data across the multiple processors and recombines the data in an image of, for example, a nebula (clouds of gas or dust, or a group of stars), the process breaks down. "It's suddenly not so simple", says Greg Johnson of SDSC, "to take the (data), break it up into contiguous pieces, and distribute it to all the processors, have them do their thing, and then combine the results in a nice orderly fashion". With the parallel processing machine, notes Johnson, "there's all kinds of data locality and load balancing issues, that vary based on viewpoint. So things get ugly very quickly".
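The load balancing issue Johnson mentions can be illustrated with made-up numbers. The sketch below is not MPIRE code; it assumes 64 rendering tasks whose costs are hypothetical, with the expensive ones clustered in one region of the image, as might happen for a particular viewpoint. It compares a fixed, up-front split of the tasks with a scheme in which each idle worker takes the next task from a common queue.

```python
import heapq


def static_makespan(costs, workers):
    """Finish time when tasks are split into equal contiguous blocks up front."""
    n = len(costs)
    return max(
        sum(costs[w * n // workers:(w + 1) * n // workers])
        for w in range(workers)
    )


def dynamic_makespan(costs, workers):
    """Finish time when each idle worker takes the next task from one queue."""
    finish = [0] * workers  # min-heap of the time at which each worker is free
    for cost in costs:
        free_at = heapq.heappop(finish)      # the worker that is idle first
        heapq.heappush(finish, free_at + cost)
    return max(finish)


# Hypothetical task costs: 48 cheap tasks, then 16 expensive ones in one region.
costs = [1] * 48 + [20] * 16
workers = 4

static_time = static_makespan(costs, workers)
dynamic_time = dynamic_makespan(costs, workers)
print(f"static split : {static_time} time units")
print(f"dynamic queue: {dynamic_time} time units")
```

With the fixed split, one worker receives all the expensive tasks and the others sit idle. A common queue evens out the work, but on a distributed-memory machine it also means moving data between processors as the viewpoint changes.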
A Tera MTA machine solves this problem with innovations in memory usage. A conventional computer uses memory to hold the instructions and data necessary to conduct its computations and complete its tasks. Typically, memory exists vertically (in a logical, not physical sense) in a virtual hierarchy. Data can exist at any level in this arrangement and is often stored in caches that place important or recurring data close to the processor for quicker computation. Data stored in memory is always moving up and down the hierarchy and the actions of individual processors are linked to specific memory locations.
This arrangement works well on platforms that run less demanding programmes than virtual real-time models of the world around us. But programmes like MPIRE, with complex images based on data undergoing constant recalculation, require shared memory to function efficiently.
An MTA machine solves this memory problem with a single global memory arrangement called shared memory that is flat, not vertical. Each processor has equal access to all of the memory. The bandwidth of the network that connects processors and memory expands in direct proportion to the number of processors, as does the memory size. All processors communicate with memory in effectively equal times, creating a flat shared memory.
Consequently, an image such as the gaseous nebula described by Johnson is rendered whole, eliminating the major task of recombining its scattered data. One of the innovations of Tera's multithreaded architecture is that Tera can offer scalable shared memory to potentially thousands of processors. Systems built to the von Neumann model, by contrast, have been unable to scale flat shared memory to more than a few processors.
Global shared memory allows an MTA machine to conduct volume rendering using a perspective viewing model (as opposed to polygon-based imaging, which does not present images such as nebulae so clearly). Volume rendering, the processing of multi-gigabyte volume data sets at near-interactive speeds, adds realism to computer graphics by adding three-dimensional qualities such as shadows and variations in colour and shading. Consequently, says Johnson, "on the MTA, the processors simply perform the rendering calculations, but because the volume is stored in shared memory we can throw away the stage where processors have to combine different pictures showing each little piece of the volume to get a final image".
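The structure Johnson describes, in which every worker reads one shared volume and no compositing stage is needed, can be sketched as follows. This is not MPIRE code: the volume is a small synthetic blob, and instead of the perspective viewing model the sketch uses a much simpler orthographic maximum-intensity projection, in which each pixel takes the largest value along a straight ray. Python threads stand in for processors here to show the structure only, not to demonstrate a speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

SIZE = 16  # hypothetical volume of 16 x 16 x 16 samples


def density(x, y, z):
    """Synthetic blob (an assumption): densest at the centre of the volume."""
    c = (SIZE - 1) / 2
    return max(0.0, 1.0 - ((x - c) ** 2 + (y - c) ** 2 + (z - c) ** 2) / c ** 2)


# One volume and one image, both visible to every worker.
volume = [[[density(x, y, z) for z in range(SIZE)]
           for y in range(SIZE)]
          for x in range(SIZE)]
image = [[0.0] * SIZE for _ in range(SIZE)]


def render_row(x):
    """Cast one ray per pixel along z and write the result into the shared image."""
    for y in range(SIZE):
        image[x][y] = max(volume[x][y])


with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(render_row, range(SIZE)))  # wait for every row to finish

# The image is complete at this point; no partial pictures need combining.
print(f"centre pixel: {image[SIZE // 2][SIZE // 2]:.3f}")
print(f"corner pixel: {image[0][0]:.3f}")
```

Because each worker writes its own rows into the one shared image, the final picture exists as soon as the last worker finishes, which is the step Johnson says can be thrown away on a distributed-memory machine only at the cost of a separate combining stage.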
From star gazers to heart surgeons, the ultimate value of MTA machines lies in their ability to run specialized applications faster and with outstanding visual clarity. For example, automobile designers rely on LS-DYNA for the virtual testing of car crashes and improvements in automobile safety. Tera expects that an MTA machine can dramatically increase the speed and visual quality of a simulated car crash. The designers see more crucial information on screen, wreck fewer test cars, work faster, and simulate their work at a lower cost. When a future, more powerful MTA machine runs PULSE 3D, researchers will be able to visualize, easily and repeatedly, the action of a replacement valve that corrects a heart defect. Other contributions of MTA to heart research may include improved modelling of the electrical activity that controls and coordinates the heartbeat, and improved depictions of the blood clotting process.
From reducing test costs to allowing quick and repeatable views of human organs, the new architecture of supercomputers, especially Tera's MTA, promises exciting advances in the understanding of our world. By implementing its multithreaded architecture, Tera has set the stage for the next era in computing.
This article appears in VMW Magazine with the kind permission of Dick Russell from Tera Computer Company, Seattle, Washington, and the author, Jeff Hickey. The article was first published by HPCwire during SC'99.
Jeff Hickey is a technical writer with The Write Stuff of Seattle, Washington, USA, which is a provider of technical communication, translation and staffing services. You can contact Jeff Hickey at email@example.com.