R6: Distributed HPC for DL

Research line leader: Henri Bal (VU)

The R6 challenges are:

Understand how several important DL libraries map onto large-scale, many-core-based distributed computing infrastructures.
How can this knowledge be used to optimize DL applications and DL network graphs?
How can hardware heterogeneity be exploited to improve the efficiency of distributed DL applications?
Design methods and tools for Deep Learning as a service for various use cases.

High Performance Computing (HPC) is indispensable for large-scale DL. Large DL problems can often be addressed only by distributing the computations and data over many compute nodes in a cluster, supercomputer, or cloud facility. Unfortunately, this approach greatly complicates programming, especially because many users are experts in a particular application domain but not in parallel or distributed systems. HPC infrastructures have also become increasingly complex and diverse over the past decade. A variety of new many-core processors are suitable for DL (e.g., NVIDIA’s Volta and Intel’s Knights Landing), but each requires different and highly complicated optimizations. Worse, realistic HPC infrastructures (e.g., the SURFsara Cartesius supercomputer used in this program) contain a mix of different types of compute nodes and many-core accelerators and have thus become heterogeneous.

It is therefore important to bridge this knowledge gap and make advanced HPC infrastructures more accessible to DL experts, ideally by presenting high-performance DL as a service to users. In this research line we will study the mapping of DL libraries onto distributed computing architectures, aiming to better understand which architecture is best suited to which library and application. This research will teach us how to deal with heterogeneous distributed infrastructures and may even enable us to benefit from heterogeneity by mapping different program parts to different types of machines to improve energy or computational efficiency.

More concretely, we will study a set of frequently used DL libraries for which distributed implementations exist (Google TensorFlow, MXNet, Caffe) and analyze their performance on important many-core machines (from Cartesius and DAS-5) for large EDL applications and data sets (from Schiphol Airport, Astron, FEI, ING, etc.). We will use these insights to optimize applications, for example by restructuring the DL network graphs to exploit the properties of the underlying infrastructure.