Evaluation [SaC-Home]

This section compares the runtime performance achieved by code compiled from the SAC specification of NAS-FT, as outlined in the previous section, with that of the FORTRAN-77 reference implementation coming with the NAS benchmark suite. Unlike PDE1 and NAS-MG, complete program execution times are relevant for NAS-FT. Again, the benchmark size classes W and A are chosen, which are defined as follows:

Class W: grid size 128 x 128 x 32 and 6 iterations,
Class A: grid size 256 x 256 x 128 and 6 iterations.


Single processor performance of NAS-FT


Speedups achieved by multithreaded execution

The figure above shows sequential execution times for SAC and FORTRAN-77. For both size classes investigated, the FORTRAN-77 reference implementation outperforms the SAC program by factors of about 2.8. To a large extent, this can be attributed to dynamic memory management overhead caused by the recursive decomposition of argument vectors when computing 1-dimensional FFTs. In contrast, the FORTRAN-77 implementation performs all 1-dimensional FFTs in place. Doing so not only renders memory management obsolete, but also saves a lot of data copying and improves cache utilization. Apart from this, the runtime performance of the FORTRAN-77 reference implementation is further improved by manually applying optimization techniques such as array padding and tiling.


Multithreaded runtime performance without SACPHM

As dynamic memory management overhead turns out to dominate the runtime performance of the SAC implementation, it is desirable to quantify the effect of the SAC-specific memory manager. The figure above shows the outcome of repeating the experiment with private heap management disabled. In fact, slowdowns of factors greater than 3 are encountered in sequential execution for both size classes. However, unlike NAS-MG, most memory management actually occurs during execution in multithreaded mode. In fact, the slowdown factor grows with every additional processor engaged. Comparing program execution times achieved by using 10 processors, private heap management accounts for a speedup of about 65 for both size classes investigated.