This section compares the runtime performance achieved by code compiled from the SAC specification of MG, as outlined in the previous section, with that of the FORTRAN-77 reference implementation coming with the NAS benchmark suite NAS Source. The settings of the following experiments are exactly the same as described in PDE1 Evaluation. Similar to the PDE1 benchmark, timing is restricted to multigrid iterations and, thus, ignores startup and finalization overhead. Several size classes are defined by the benchmark rules, two of which are selected for the following experiments:
The figure shows the single processor runtime performance achieved by SAC and by FORTRAN-77 for both size classes. In fact, the FORTRAN-77 program outperforms the compiled SAC code by 29.6% and by 23.0% for size classes W and A, respectively. Despite its considerably higher level of abstraction, the SAC specification achieves runtime performance characteristics which are in the same range as the rather well-tuned low-level FORTRAN-77 reference implementation. Moreover, performance differences diminish with increasing problem size.
A closer look at the FORTRAN implementation reveals that to a large extent a tricky hand optimizations of the stencil operation is mainly responsible for its superiour runtime performance. Symmetry properties in the definitions of the 4 different stencils used in relaxation operations are exploited not only within single iterations, but computations are also shared across iterations. Whereas a 27-point stencil, as used in the benchmark, in general, incurs 27 multiplications and 26 additions per iteration, the number of multiplications may be reduced to only 4 by taking into account that actually just 4 different weights occur in the stencil definition. This is realized by both FORTRAN and SAC. However, the FORTRAN code additionally stores results of computations shared among different iterations due to such symmetries in auxiliary buffers and, thus, reduces the number of additions to values between 12 and 20 depending on the concrete stencil.
This explains why in previous performance evaluations on this benchmark, SAC has been reported to outperform FORTRAN-77 by about 10%. Whereas preceding investigations were based on version 1.0 of the NAS benchmark suite, the experiments described here involve the most recent version 2.3, which comes with significantly improved reference implementations.
The figure above shows the runtime performance of SAC code implicitly compiled for multithreaded execution. Relative to sequential execution, speedups of up to 4.8 and up to 7.6 are achieved for size classes W and A, respectively. In fact, size class A is the smallest one intended for benchmarking, whereas size class W is specifically designed for program development on uniprocessor PCs and workstations.
The main scalability limitation arises from the repeated reduction of the grid size during the V-cycle. Whereas multithreaded execution pays for relatively large grids on the top end of the V-cycle, runtime overhead increasingly reduces its benefit for smaller grids on the bottom. Below a threshold grid size, all operations are actually performed sequentially to avoid excessive overhead. This sequential kernel of the NAS-MG benchmark limits its scalability, but its actual impact on performance decreases with growing initial grid size. Therefore, considerably higher speedups are achieved for size class A than for size class W.
Although, this problem is algorithm-inherent rather than implementation-specific, its impact on runtime performance is actually increased by dynamic memory management, as employed by SAC. Since the absolute overhead incurred by it is invariant against grid sizes involved, it is negligible for large grids but shows a growing performance impact with decreasing grid size. As a consequence, the sequential kernel of NAS-MG, where operations are performed on very small grids, is considerably more expensive with dynamic memory management in SAC than it is with a static memory layout in a low-level FORTRAN-77 implementation.
To investigate the effect of the SAC-specific memory manager, on the runtime performance of this benchmark, the experiment is repeated with private heap management disabled. Its outcome is shown above. Whereas single processor performance is hardly affected by the choice of the memory allocator, it has a significant impact on scalability. Maximal speedups relative to sequential execution drop from 4.82 to 3.55 for class W and from 7.63 to only 4.63 for class A.
However, the performance degradation may not be attributed to scalability limitations in the design of the SOLARIS standard allocator. In fact, all memory allocations and de-allocations in compiled code are performed in single-threaded execution mode. It is merely their general performance, which increases the time spent in sequential operations on small grids at the bottom of the V-cycle.