APT Advanced Processor Technologies Research Group

COMPILER AND RUNTIME SUPPORT FOR HETEROGENEOUS PROGRAMMING

James Clarkson

Abstract

Over the last decade, computer architectures have changed dramatically, leaving us in a position where nearly every desktop computer, laptop, server or mobile phone has at least one multi-core processor. These systems are also likely to include at least one many-core processor that is used for accelerating graphical applications, called a General-Purpose Graphics Processing Unit or GPGPU. By properly utilising these two different processors, a software developer could achieve up to two orders of magnitude improvement in performance and/or energy efficiency. Unfortunately, these improvements are often inaccessible to developers due to the combined complexity of understanding both the hardware architecture and the software needed to program it. It is this problem of inaccessibility that is explored within this thesis, with the goal being to determine whether it is possible to develop a programming language that allows an application to dynamically adapt to the system it is executing on. One of the salient issues is that a large amount of prior art is built atop a closed-world assumption: that the code and the devices it is to execute on are both known ahead of time and are fixed. This assumption is becoming increasingly unworkable due to the proliferation of heterogeneous hardware. For instance, developers can now run applications in public clouds or on mobile devices, contexts where it is difficult to anticipate what hardware an application is executing on, and where it is likely that some form of hardware acceleration exists. Handling the uncertainty of not knowing what hardware is available until runtime is a fundamental problem for the more statically compiled languages, such as C, C++ and FORTRAN. In these languages, the closed-world assumption is obvious: a single processor architecture is assumed so that a single binary executable can be produced.
The aim of this thesis is to determine whether it is possible to create a programming language that is able to target modern heterogeneous systems without requiring any closed-world assumptions about either the number or the types of hardware accelerator they contain. Consequently, this thesis introduces and evaluates Tornado: the first truly dynamic programming framework for modern heterogeneous architectures. The implementation of Tornado is unique as it comprises three co-designed components: (1) the Tornado API, which is designed to decouple the code that decides on which device code should execute (the co-ordination logic of the application) from the code that defines the computation (the computation logic of the application); (2) the Tornado Virtual Machine, which provides a layer of virtualisation between the application and the underlying architecture of the heterogeneous system; and (3) the Tornado Runtime System, a dynamic optimising compiler that converts code written using the Tornado API into a format consumed by the Tornado Virtual Machine. Tornado has a number of distinguishing features that are a direct result of combining these three key components. One of these features is the optimisation of co-ordination logic by the Tornado Runtime System, which allows Tornado to automatically minimise the cost of data movement in complex processing pipelines that span multiple devices. Another is dynamic configuration: the ability to have the Tornado Runtime System dynamically re-compile the application at runtime to use a different hardware accelerator, parallelisation scheme, or device setting.
During the evaluation, Tornado is tested across thirteen unique hardware accelerators: five multi-core processors, a discrete many-core accelerator, three embedded GPGPUs, and four discrete GPGPUs. It is shown that a complex real-world application, called Kinect Fusion, can be written in Tornado once and executed across all of these devices. Moreover, this portable implementation written in Tornado is able to achieve a maximum speed-up of 55× on an NVIDIA Tesla K20m GPGPU. However, if a little portability can be sacrificed, more specialised code can be written that produces a speed-up of 166× on the same device. Tornado is also compared against OpenCL, the state of the art in heterogeneous programming languages, where the specialised implementations of Kinect Fusion run 22% slower and, in the best case, experience a speed-up of 14× (although this is in an unrealistic scenario). This level of performance translates to speed-ups over the original Java application of between 18× and 150×. Finally, Tornado has been open-sourced so that the reader is able to verify the claims made by this thesis and start writing their own hardware-accelerated Java applications: https://github.com/beehive-lab/Tornado.
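
The separation between co-ordination logic and computation logic described in the abstract can be illustrated with a minimal Java sketch. This is not the Tornado API: the `Device` interface, the `SERIAL` and `PARALLEL` stand-ins, and the `selectDevice` and `runSaxpy` helpers are hypothetical names introduced only for illustration, and the "devices" are ordinary CPU execution strategies rather than real accelerators. The sketch only shows the idea that the kernel is written once, while the choice of where it runs is deferred until runtime.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class CoordinationSketch {

    /** Hypothetical stand-in for a hardware device; NOT part of the Tornado API. */
    interface Device {
        String name();
        void run(int n, IntConsumer kernel);
    }

    /** Executes the kernel for each index in order on the calling thread. */
    static final Device SERIAL = new Device() {
        public String name() { return "serial"; }
        public void run(int n, IntConsumer kernel) {
            for (int i = 0; i < n; i++) {
                kernel.accept(i);
            }
        }
    };

    /** Executes the kernel for each index using a parallel stream. */
    static final Device PARALLEL = new Device() {
        public String name() { return "parallel"; }
        public void run(int n, IntConsumer kernel) {
            IntStream.range(0, n).parallel().forEach(kernel);
        }
    };

    // Computation logic: defines what is computed for one index,
    // with no reference to the device it will run on.
    static void saxpy(float alpha, float[] x, float[] y, int i) {
        y[i] = alpha * x[i] + y[i];
    }

    // Co-ordination logic: the device is chosen at runtime, based on
    // what the machine reports, rather than being fixed at compile time.
    static Device selectDevice() {
        return Runtime.getRuntime().availableProcessors() > 1 ? PARALLEL : SERIAL;
    }

    // Runs the same kernel on whichever device is supplied.
    // Works on a copy of y so the caller's input is left untouched.
    static float[] runSaxpy(Device device, float alpha, float[] x, float[] y) {
        float[] out = y.clone();
        device.run(out.length, i -> saxpy(alpha, x, out, i));
        return out;
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] y = {10f, 20f, 30f, 40f};
        Device device = selectDevice();
        float[] result = runSaxpy(device, 2f, x, y);
        System.out.println(device.name() + ": " + Arrays.toString(result));
    }
}
```

Because `saxpy` never names a device, swapping `SERIAL` for `PARALLEL` changes only the co-ordination code; the thesis applies the same principle to real heterogeneous hardware, with the additional compilation and data-movement machinery that this sketch omits.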

The thesis is available as PDF (**MB).