|
The hardware vendors' dominant strategy for high-end computing on the way to Teraflops-scale supercomputer systems was to integrate COTS hardware components with supporting system software and tools that facilitate coordinated concurrent operation and parallel programming, while also selling these computers in the commercial market.
This strategy failed to provide a solution to the challenges of efficient and easily programmable high-performance computing, because the components of these COTS architectures are not designed to support large-scale parallel computing. They neither reflect a scalable execution model nor include mechanisms for efficient parallel computation; they represent only the physical integration and interconnection of independent sequential processing elements. The software therefore has to provide the paradigm, methods, and tools for achieving effective, programmable parallel computing. In most cases this has proven difficult, time-consuming, and error-prone, while often yielding low efficiency.
Performance Challenges and Opportunities
Sterling listed several factors that affect efficiency, including:
- Latency
- Overhead
- Contention
- Starvation
Latency is the number of cycles required to transfer a request to a remote resource (and back); it affects the utilisation of critical resources. Cache systems attempt to avoid latency, while other methods, as in the Earth Simulator, partially hide it. Latency itself can be predicted, but the additional delays caused by contention, in which multiple requesting sources compete for shared resources (e.g. memory bank conflicts), arise at run time and cannot always be determined in advance. Starvation occurs when a processor has insufficient work, owing either to a lack of programme parallelism or to poor load balancing. Conventional processor architectures incorporate little functionality to address these problems, which leaves software and algorithmic techniques as the only recourse.
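As a rough illustration of starvation caused by poor load balancing, the following Python sketch compares a static block partition of tasks with a greedy assignment that always hands the next-largest task to the least-loaded processor. The task costs, the function names, and the idle-fraction measure are illustrative assumptions, not something taken from the talk.

```python
import heapq

def block_assign(costs, num_procs):
    """Static partition: each processor gets one contiguous chunk of tasks."""
    chunk = -(-len(costs) // num_procs)  # ceiling division
    return [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(num_procs)]

def greedy_assign(costs, num_procs):
    """Give each task, largest first, to the currently least-loaded processor."""
    loads = [0] * num_procs
    heapq.heapify(loads)
    for cost in sorted(costs, reverse=True):
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return sorted(loads)

def idle_fraction(loads):
    """Share of total processor time spent waiting for the busiest processor."""
    makespan = max(loads)
    return 1 - sum(loads) / (len(loads) * makespan)

# Hypothetical workload: eight tasks of decreasing cost on four processors.
task_costs = [8, 7, 6, 5, 4, 3, 2, 1]
block_loads = block_assign(task_costs, 4)
greedy_loads = greedy_assign(task_costs, 4)

print("block: ", block_loads, round(idle_fraction(block_loads), 2))
print("greedy:", greedy_loads, round(idle_fraction(greedy_loads), 2))
```

In this toy workload the block partition leaves a large share of processor time idle, whereas the greedy assignment balances the loads exactly. Real workloads rarely balance this cleanly, and the sketch ignores latency, overhead, and contention altogether.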
The floating-point unit, once the critical and most expensive component, is now one of the least expensive to fabricate in VLSI. An FPU can take up as little as 1% of the die area of an entire chip (or less), whereas half the die area of a modern microprocessor chip is consumed by cache.
Innovation in Architecture
Thomas Sterling discussed the streaming architecture being developed at Stanford University and the TRIPS architecture under development at the University of Texas. When temporal locality is low or absent, so that data access patterns fall into the category of touch-once data, the operations are best performed as close to the memory as possible, and the FPUs may be merged directly onto the DRAM dies. Here, latency and memory bandwidth are the most important factors, and the merging of logic and memory is referred to as processor in memory, or PIM. He mentioned projects such as DIVA at USC ISI, PIM-lite at the University of Notre Dame, and the MIND architecture at the California Institute of Technology as an advanced class of general-purpose PIM architectures.
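The argument for computing next to the memory on touch-once data can be made concrete with a toy data-movement count. The Python sketch below sums values spread over several memory banks in two ways: on the host, where every word crosses the memory interface, and near the memory, where only one partial sum per bank is moved. The bank layout and function names are hypothetical and do not model DIVA, PIM-lite, or MIND.

```python
def host_side_sum(banks):
    """Sum on the processor: every word crosses the memory interface once."""
    words_moved = sum(len(bank) for bank in banks)
    total = sum(value for bank in banks for value in bank)
    return total, words_moved

def near_memory_sum(banks):
    """Sum inside each bank first: only one partial result per bank is moved."""
    partials = [sum(bank) for bank in banks]  # computed next to the data
    return sum(partials), len(partials)

# Hypothetical layout: four banks of 1000 words each, every word touched once.
banks = [list(range(i * 1000, (i + 1) * 1000)) for i in range(4)]

host_total, host_words = host_side_sum(banks)
pim_total, pim_words = near_memory_sum(banks)
print(host_total, host_words)  # same total, 4000 words moved
print(pim_total, pim_words)    # same total, 4 words moved
```

Both versions produce the same result; only the traffic between memory and processor differs, which is the quantity a PIM design aims to reduce for data with no reuse.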
The Cascade project by Cray Inc., in support of the DARPA High Productivity Computing Systems programme, is intended to deliver sustained Petaflops-scale performance. This system comprises a potentially large set of interconnected "Locales", each incorporating a heavyweight processor (HWP) with many tightly coupled FPUs and a number of PIM chips with multiple memory/processor nodes (lightweight processors, or LWPs) on each device.
Cascade is a shared-memory architecture: any HWP or LWP can access any word within the entire system. With hardware support for thread context switching and message processing, near-fine-grain processing can be made efficient.
Future Software Roles and Relationships
A major component of the software system will be the run-time system, which provides fine-grain manipulation of resources and data in rapid response to both application requirements and operating system support. The run-time system establishes a new relationship with the compiler to make the best use of the knowledge available at each level of the decision tree. Sterling discussed some of the issues involved in extracting more information from the code and using it in the decision tree.
The new relationship between hardware architecture and software support (compiler and run-time system) calls for a new autonomous intelligence to manage and control system operation. The executing application code must be fully virtualised from the physical hardware. Some of the tasks of autonomous computing, such as isolating and correcting faults, can be transferred to the software. He proposed a control decision tree that acquires information from the user, the programme, the compiler, and the hardware at execution time to determine the best choices in resource allocation and task scheduling. An emergent run-time software system, comprising synergistic agents with introspective threads, will play an increasingly important role. |
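A control decision tree of the kind described above might look like the following minimal Python sketch. It is purely illustrative: the `Hints` fields, the `LOW_LOCALITY` threshold, and the placement labels are hypothetical and are not taken from the talk or from the Cascade design.

```python
from dataclasses import dataclass
from typing import Optional

LOW_LOCALITY = 0.2  # hypothetical threshold for "touch-once" behaviour

@dataclass
class Hints:
    temporal_locality: float           # compiler estimate of data reuse, 0.0 to 1.0
    ready_tasks: int                   # programme: tasks currently ready to run
    idle_lwp_nodes: int                # hardware state observed at execution time
    user_choice: Optional[str] = None  # explicit user directive, if any

def choose_placement(h: Hints) -> str:
    """Walk the tree: user first, then compiler, then programme and hardware."""
    # Level 1 (user): an explicit directive overrides everything else.
    if h.user_choice is not None:
        return h.user_choice
    # Level 2 (compiler): data with reuse benefits from the cache-based processor.
    if h.temporal_locality >= LOW_LOCALITY:
        return "heavyweight"
    # Level 3 (programme and hardware): run near memory only if enough
    # near-memory nodes are idle to take all ready tasks.
    if h.idle_lwp_nodes >= h.ready_tasks:
        return "lightweight"
    return "heavyweight"

print(choose_placement(Hints(temporal_locality=0.05, ready_tasks=4, idle_lwp_nodes=8)))
print(choose_placement(Hints(temporal_locality=0.05, ready_tasks=4, idle_lwp_nodes=0)))
print(choose_placement(Hints(temporal_locality=0.90, ready_tasks=4, idle_lwp_nodes=8)))
```

The point of the sketch is only the ordering of information sources: each level of the tree consults the party best placed to answer, and hardware state is read last because it is known only at execution time.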