Ongoing and Future Projects
Algebraic With-Loop-Folding [AWLF]
Lead Developers: Robert Bernecky
The Algebraic With-Loop Folding (AWLF) project extends the capability of the With-Loop Folding optimization to operate on AKD arrays. To do this, it has to perform symbolic analysis of WL bounds and array index sets and their intersections. AWLF enables -ecc, as that improves the analysis quality. As a fringe benefit, CF has been extended to remove, whenever feasible, guards that can be proven always satisfied, thereby improving run-time performance when compiling with -check c.
Current Status: Works on odd-numbered days. No support for non-unit stride; no support for non-unit width. Solving of the symbiotic expressions used for the above symbolic analysis is problematic in several ways. The primary problem is that there is no way to tell, at present, that an analysis is complete. This can lead to AWLF failure to operate, and/or to nugatory hypercube slicing.
- Extrema are not passed into LACFUNs, at present. This inhibits the quality of AWLF within LACFUNs and their sub-functions. Work is underway to fix this (rbe).
- Extrema are not generated for FOR-loop induction variables. Work is underway to fix this (Artem). See sacdev email for 2012-03-14/15.
- SAA passing of AVIS_DIM/SHAPE information into LACFUNs is non-optimal. In particular, two arrays with the same shape will have their shape information passed as two distinct N_id nodes, thereby losing the vital information about their shapes matching. Furthermore, if the shape(s) are represented outside the LACFUN as N_array nodes, that information is also lost. Both of these inhibit operation of AWLF in LACFUNs.
- LIR is broken, as evidenced by Bug #870 and friends. This has a disasterous effect on performance of any code involving for-loops.
Algebraic With-Loop-Folding of non-WL[AWLF]
Lead Developers: Robert Bernecky
The Algebraic With-Loop Folding (AWLF) project extends the capability of the With-Loop Folding optimization One problem we have observed in AWLF, specifically in the apex/logd3/logd.sac benchmark, is that WLSIMP eliminates a degenerate WL, replacing it by a sel() into a producerWL that may be otherwise unused. This results in degraded performance, because the producerWL generates a large intermediate result, but we only use one element of that result.
Current Status: Nothing done yet. (2012-03-22)
- Treat the sel() as if it were a consumerWL, and extend AWLF (AWLFI, AWLF) to perform AWLF on the sel().
An Eclipse plug-in for SaC
Lead Developers: Volkmar Wieser, TBD
The Eclipse project provides a multi-platform framework for creating IDEs (Integrated Development Environments). It has been successfully used for many different programming languages. A key to its success is the clear separation between the non-language related core facilities and the language specific parts which are kept in so-called plug-ins. This concept allows new IDEs to be created with rather small programming effort by efficiently making use of the core functionality of Eclipse.
The central idea of the project is to make use of the Eclipse infrastructure to generate an IDE for SaC. A good starting point for doing so is the web-site of Eclipse. It contains several tutorials on how to create new plug-ins for Eclipse.
Current Status: The basic plugin has been created; some polishing is still needed.
- Handling of C-macros
- Integration of the typechecker
- support for auto-correction
MT CUDA Integration
Lead Developers: Miguel Sousa Diogo, TBD
The ultimate goal is to achieve automatic heterogeneous computation, having a host and a accelerator cooperate in computing with-loops. This would currently be achieved by basically merging the MT and CUDA backends of the SaC compiler, so at the moment one can think of the host as being a multicore machine and the accelerator as being a Nvidia graphics card. Kernels are unsuited for MT, and vice-versa. Dividing with-loops between host and device would leave computation results distributed across device and host memory. The available SaC primitives can only copy whole arrays, and are thus unsuitable. Literature suggests that dynamic scheduling is more effective on heterogeneous systems. With dynamic scheduling we can no longer define all data transfers statically, like in the CUDA backend. Memory transfers are bad, we want to make as few and as short as possible.
Current Status: As of 29/03/2012 Generating CUDA and MT with-loop versions is currently implemented by duplicating the with-loops early. The copy is unmarked for CUDA and both are placed inside a conditional, each on a different branche. The condition itself is a dummy for now. Regular transformations from the CUDA and MT backend then operate on their own separate copies, resulting in both different version of the same with-loop.
- Implement distributed variables, and their control structure. For simplicity, arrays are to be split into blocks of some fixed size, each block tracked separately using a simple MSI scheme.
- Implement conversion functions for distributed variables.
- Introduce conversion functions before array accesses, and after writes to arrays.
- Assign a worker thread to manage CUDA.
- Place whole CUDA/host with-loop conditionals inside SPMD functions.
- Change dummy condition to a check for the CUDA management thread.
- Place the dynamic scheduler loop inside each branch, encapsulating the with-loop and required conversion functions.
For more info, check paper on tex repository under projects/cudahybrid/tfp12.
N-bit Integer Support for Improved Concurrency
Lead Developers: Artem Shinkarov, TBD
Many applications that deal with whole numbers do not require the full range provided by the default integer type. While this may be negligible in cases where only a few such numbers are needed, when dealing with large arrays of them the situation is radically different. Using “oversise” data types does not only constitute a waste of memory space but it also increases the pressure on memory bandwidth and it limits the potential gains from vectorised operations.
The development of a compilation strategy for m-dimensional arrays of n-bit integers poses a major challenge when attempting to target a wide range of architectures with vector instructions including mainstream architectures such as X86 and Sparc as well as novel multicores such as the Cell Broadband Engine or Intel's Larrabee processor.
Current Status: Artem looked into this and ended up doing layout transformations for vectorisation. This work should build on Artem's vectorisation and introduce SIMD in a register support.
- Integration of new n-bit data types
- translation schemes for SIMD in a register
- implementation of SIMD in a register; including padding for n-dimensional arrays
- performance evaluation
Optimising the optimisation cycle
Lead Developers TBD
Over the years many standard and non-standard optimizations have been integrated into the SaC compiler sac2c. Most of the optimizations benefit from or even rely on the application of other optimizations. As there exist mutual dependencies, the current compiler applies its optimizations by means of several subsequent optimization cycles. However, the dependencies between the individual optimizations have never been fully analysed. Instead the order of their application has been chosen in a rather ad-hoc fashion.
Another problem of the existing optimization cycle is the fact that the optimizations have been developed over a rather long period of time during which some of the side-conditions (such as restrictions on the AST etc) have changed. As a consequence, some of the optimizations rely on assumptions that are only partially true in today's setting, e.g., some well-formedness assumptions do not hold anymore which – strictly speaking – may require some extensions of these optimizations.
The challenge of this project is to analyse all optimisations for correctness in the given setting and for relationship to other optimizations. Building up on that information, correctness-issues should be resolved and an attempt to optimize the order of optimization application should be undertaken.
Current Status: nothing done yet.
- Analyse all optimisations for correctness in the given setting and for relationship to other optimizations.
- investigate possible approaches for optimising the cycle.
User-Defined Hybrid Type Systems with Subtyping on Data-Indexed Type Families
Lead Developers: Stephan Herhut, Santanu Dash, TBD
The combination of subtyping and partial evaluation has proven in the context of SaC that it combines nicely rather generic program specifications with a high level of static type guarantees and very efficient runtime performance. To increase the versatility of static type guarantees, more sophisticated type definitions and the ability to specify subtype relations explicitly are needed.
The ideal candidate will have a strong interest in novel approaches towards dynamic type systems, and, ideally some background in modern type systems, compilation techniques, and functional programming in general.
Current Status: Some of the theoretical foundations of this work have been investigated in the team already and led to Stephan's PhD thesis. Santanu has performed ground breaking work on computing lubs in lattices more efficiently.
- Development of the type system
- Development of syntactical support
- Implementation of type inference
- Evaluation of case studies
Visualising Array Programming by Spreadsheets
Lead Developers: TBD
Array programming as provided by languages such as APL, J, or SaC allows for a high level of abstraction. Most operations of these languages are applicable to arrays of arbitrary rank (number of axes) and arbitrary shape (extends wrt. individual axes). This generality of operators combined with the linearised form of traditional programming style creates difficulties for novice array programmers as not only values of expressions but their shapes as well have to be kept in mind. A graphical programming tool for displaying arrays and for composing simple non-recursive array operations would help overcoming this problem as the shapes and the content of arbitrary intermediate expressions could be visualised.
The idea of this project is to display individual arrays in separate windows in a spreadsheet-style manner. These windows should be linked by means of functional array expressions that define how one window's content is computed from one or several other window's contents. The graphical user interface preferably is created as a Java applet (see java.sun.com for details and tutorials) as this eases portability. While the graphical parts of this tool have to be created from scratch, the underlying array programming machinery can directly make use of one of the existing array programming languages. The compiled array programming language SaC due to its side-effect free core language as well as its C interface seems to be particularly suitable for this task.
Current Status: Nothing done yet
Alternative Storage Layouts for Improved Vectorisation
Lead Developers: Artem Shinkarovs (2015)
The arrival of multi-core processors in the mainstream forebodes a general shift into distributed computing. An effective utilization of such systems requires the software to be used on such systems to consist of chunks of work that can be operated on independently and locally. For large data structures such as arrays a reorganisation of the way the data is stored in memory may have a vast effect on the localality of data and, thus, on the runtime efficiency that can be achieved.
This project investigates the effectiveness of alternative layouts by using the high-performance array language SaC. The intended approach is to re-write several SaC programs in a way that mimicks alternative layouts. Extensive performance comparisons will then give an indication of the effectiveness of the layouts investigated.
Current Status: finished.
A Universal Benchmarking Tool
Lead Developers: Maxime Castelein and Julien Viera (2005)/ Daniel Rolls (2008)
prototype has been developed by Maxime and Julien. However, first experiences indicate room for some improvements to make the tool applicable on a larger scale. Dan tries to put these into action.
Benchmarking programs is always a time consuming task, in particular, when comparing programs specified in different programming languages: comparable versions of the same program need to be collected, compilers need to be installed, compiler flags need to be identified, program parameters need to be adjusted to the performance / capabilities of the executing hardware etc.
The situation becomes even more annoying if measurements need to be redone on a slightly different platform as it usualy requires all ad-hoc mechanisms for measuring, such as makefiles or scripts, to be adjusted.
The core idea of this project is to develop an automated benchmarking system which keeps all relevant measurement parameters in a database and automatically deploys measurements whenever a new benchmark or executing platform is added.
Current Status: finished.
Concurrent Dynamic Compilation Based on Statistical Data
Lead Developers: Tim van Deurzen (2010), Heinz Wiesinger (2016)
It is well known that compiler optimisations can benefit vastly from statistical information gathered at runtime. Traditionally, this is done by utilising profiling data from previous program executions. With the newest trend towards manycore systems it becomes feasible to apply such optimisations while the program is still running.
The challenge in this project is to bring together profiling based optimisations with just in time compilation techniques in a non-disruptive way.
Current Status: Heinz is working on the second revision which permits specialisations to persist.
Needed Work: ??
High-Performance Distributed Computing on Graphics Processors Using Nvidia's CUDA
Lead Developers: Jing Guo (2012)
Most modern desktop machines contain very powerful multiprocessor sub-systems on their graphics cards. Nvidia has now developed a freely accessible development environment named CUDA. It enables C programmers to conveniently program Nvidia's line of graphics processors.
The goal of this project is to investigate the suitability of the CUDA system as a back-end target of the high-performance compiler sac2c which compiles programs written in SaC into C programs.
The intended approach is to take a few prototypical C program kernels (less than 100 lines of code each), to translate these into the CUDA framework, and to analyze the potential performance gains in depth. The chosen kernels should represent the sac2c compiler output for the most important language constructs of SaC so that the performance analyzes can be used as an indication of the potential performance benefits.
Current Status: finished.
Introducing Controlled Forms of Non-Determinism -- Side-Effects within With-Loops
Lead Developers: Theo van Klaveren (2006), Torben Gerhards (2007)
The central language for data parallelism in SaC, the With-Loop, is side-effect-free. Although this is essential when it comes to concurrent executions, in some situations, this restriction is rather inconvenient. Examples for such situations include:
dbug output within the body of a With-Loop pixel-wise output of computer generated images arrays of random numbers In all these cases, we are not interested in the exact order the output / side-effect happens. Instead, we want as much concurrency as possible. However, the non-determinism should be confined within With-Loop boundaries.
Current Status: finished.
MT for LPEL
Lead Developers: Jaroslav (“Jara”) Sykora (2013)
The goal is to use LPEL instead of Pthread to enable several SaC instances to smoothly share resources between them and, ultimately, dynamically redistribute them.
Current Status: finished.