You can apply at the website http://wwwphp.lrz-muenchen.de/kursanmeldung/.
Details for each tutorial:
Cache Based Iterative Algorithms
Lecturer: Prof. Ulrich Rüde, Lehrstuhl für Informatik,
Universität Erlangen
Date: March 21, 2002 (Thursday)
Contents:
In order to mitigate the effect of the growing gap between the high
execution speed of modern RISC CPUs and the comparatively poor main memory
performance, computer architectures nowadays comprise several additional
levels of smaller and faster cache memories. These cache modules are located
physically between the processor and main memory.
Efficient program execution, i.e. high MFLOPS rates, can only be achieved if
the codes respect the underlying hierarchical memory design. Unfortunately,
today's compilers are still far away from automatically performing code
transformations like the ones we apply in order to achieve remarkable
speedups. As a consequence, much of this optimization effort is left to the
programmer.
In this tutorial, we will first discuss the underlying hardware properties
and then present both data layout optimizations, like array padding for
example, and data access optimizations, like e.g., loop blocking. The
application of these techniques to iterative numerical schemes --- like
Gauss-Seidel and multigrid --- can significantly enhance their cache
performance and thus reduce their execution times on a variety of machines.
We will consider both structured and unstructured grid computations.
Algorithms for Parallel Computers:
Classification, Design Principles and Implementation
Lecturer: Prof. Marian Vajtersic Technische Universität München,
Institut für Informatik,
Lehrstuhl für Rechentechnik und Rechnerorganisation
Date: March 22, 2002 (Friday)
Contents:
The efficiency of the exploitation of the computing power of
state-of-the-art parallel high-performance computers is heavily dependent on
their algorithmic and programming equipment. The development of algorithms
for these systems can be seen as a necessary prerequisite to obtain
satisfactory performance figures when solving large scientific-computing
applications. It is a known fact that for a parallel computer, the
difference between the performance of a good and bad algorithm may be easily
of a factor 1000 while this gap is much smaller for serial machines.
This course is devoted to the area of parallel algorithms. Its main goal is
to present some design principles and selected algorithmic solutions for two
of the most frequently used parallel computational models. We will not
concern ourselves too closely with the architecture and memory organization
of the machine, instead the intention is to extract the algorithmic kernel.
Also programming issues are not in the center of our interest when speaking
about the implementation, but rather a language-agnostic description of the
data distribution and interprocessor communication during the computational
process.
Selected algorithms are primarily from the problem area of parallel linear
algebra. This is not because of historical reasons (the first parallel
algorithms and computers where devoted to the iterative solution of partial
differential equations), however the solution of the majority of scientific
computing applications requires the use of linear algebra operations.
After some motivational arguments for exploiting parallelism, a brief
overview of the parallel computing models will be introduced. Basic
construction principles of parallel algorithmics will be the content of the
next block. We will show that the design of a parallel algorithm does not
necessarily imply straightforward modification of its serial counterpart,
but with some intricacy also original parallel methods can be developed.
Discussing the simple problem of a summation of a finite number of vector
components, the data-parallel model will be illustrated considering
different interconnection topologies. Furthermore, parallelization of
Cannon's, Winograd's and Strassen's matrix-multiplication techniques will be
presented for this parallel computational model. An efficient approach for
histogram evaluation will provide an example for bit-level SIMD (Single
Instruction Multiple Data) computing.
The final block brings some algorithms designed for MIMD (Multiple
Instruction Multiple Data) parallel computers. We will show how a system of
linear algebraic equations with block matrices can be solved by the
parallelized LU and matrix-decomposition methods. For a solution of the
computationally intensive problem of SVD (Singular Value Decomposition), the
two-sided block-Jacobi method will be chosen as a good candidate for
MIMD-parallelization. Two approaches will be compared: the first one takes a
given prescribed ordering for parallel solving the block-subproblems, while
the latter one selects the subproblems in a greedy manner. The
implementation results will accompany this discussion. As a concluding
example, the application of the parallel Jacobi-based SVD solver will be
presented.