UDEL GPU Programming Hackathon, 2016

DAY 1, May 02 – 2016

<for more details on the hackathon>

Today, May 02 2016, the GPU Programming Hackathon, co-organized with Oak Ridge National Laboratory, kicked off at the University of Delaware. Dr. Eric Nielsen from NASA Langley gave an invited talk on FUN3D, a large-scale computational fluid dynamics solver for complex aerodynamic flows seen in a broad range of aerospace (and other) applications.

Six teams are participating in this hackathon, which aims to a) educate teams to program GPUs using high-level directive-based programming models such as OpenACC and OpenMP, b) train teams to accelerate their codes on GPUs or CPUs, and c) provide teams with a clear roadmap on how to tap into the massive potential that GPUs offer.

Several mentors from NVIDIA/PGI, UTK, Cornell, ORNL and UDEL with extensive programming experience are on site at UDEL to work with the teams and help them meet their goals.

Teams are paired up with mentors. Goals are set. White boards are filled up with equations, tables, figures and what not! Codes are being migrated to ORNL’s TITAN (world’s second largest supercomputer) as we speak!

It’s Showtime, folks!!

The participating teams are:

  • NASA Langley

FUN3D is a large-scale computational fluid dynamics solver for complex aerodynamic flows seen in a broad range of aerospace (and other) applications. FUN3D solves the Navier-Stokes equations on unstructured meshes using a node-based finite volume spatial discretization with implicit time advancement. FUN3D is approximately 800,000 lines of code.

FUN3D is predominantly written in Fortran 200x. The code can make use of a broad range of third-party libraries, depending on the application. At a minimum, an MPI implementation must be available, as well as one of the supported partitioning packages (Metis, ParMetis, or Zoltan).

FUN3D is used across the country on a wide range of systems, from desktops to large HPC resources. The code has been run on up to 80,000 CPU cores on TITAN. Select kernels have been ported to GPU, with the majority of effort to date spent on the workhorse linear solver (multicolor point-implicit). Some of these ports use OpenACC and some use CUDA Fortran, through an ongoing collaboration with NVIDIA.

FUN3D is widely used to support major national research and engineering efforts, both within NASA and among groups across U.S. industry, the Department of Defense, and academia. A past collaboration with the Department of Energy received the Gordon Bell Prize. Some applications that FUN3D currently supports include:

– NASA aeronautics research, spanning fixed-wing applications, rotary-wing vehicles, and supersonic boom mitigation efforts.
– Design and analysis of NASA’s new Space Launch System.
– Analysis of re-entry deceleration concepts for NASA space missions, such as supersonic retro-propulsion and hypersonic inflatable aerodynamic decelerator systems.

– Development of commercial crew spacecraft at companies such as SpaceX.
– Timely analysis of vehicles and weapons systems for U.S. military efforts around the world.
– Efficient green energy concept development, such as wind turbine design and drag minimization for long-haul trucking.

  • Brookhaven National Lab’s Lattice QCD

Lattice QCD is a numerical simulation approach to solving the high-dimensional non-linear problems of the strong interaction, and it is an indispensable tool for nuclear and particle physics. The heart of Lattice QCD is Monte Carlo simulation, in which the dominant numerical cost (>90%) is matrix inversion of the type Ax=b or A^+ A x = b. Because of this high cost, Lattice QCD simulations typically run on massively parallel computers and on PC clusters with up to hundreds of nodes.
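To make that cost concrete, here is a minimal sketch of a conjugate gradient solve for a system of the form A^+ A x = b. It is not taken from any Lattice QCD code: it assumes a small, dense, real-valued matrix (so A^+, read here as the conjugate transpose, is simply the transpose), whereas real Lattice QCD operators are complex, sparse and distributed. The function names (`matvec`, `cg_normal`) are illustrative only.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_normal(A, b, tol=1e-12, max_iter=1000):
    """Solve (A^T A) x = b by conjugate gradient without forming A^T A.

    Assumption: A is real and has full column rank, so A^T A is
    symmetric positive definite and CG applies.
    """
    At = transpose(A)

    def apply_M(v):
        # Two matrix-vector products per iteration: this is the hot spot.
        return matvec(At, matvec(A, v))

    x = [0.0] * len(b)
    r = list(b)          # residual b - M x for the initial guess x = 0
    p = list(r)          # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol * tol:
            break
        Mp = apply_M(p)
        alpha = rr / dot(p, Mp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, Mp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 3.0]]
    print(cg_normal(A, [1.0, 2.0]))
```

Almost all of the work in each iteration sits in the two matrix-vector products inside `apply_M`, which is why the matrix-vector kernel is the natural target for acceleration.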

The particular code the team aims to port to GPUs using OpenACC is the newly engineered Grid library: http://www.github.com/paboyle/Grid. It is written in C++11 at the top level, with a vectorized data layout and SIMD intrinsics targeting current and upcoming Intel CPUs with long vector registers. Right now it uses OpenMP for threading and MPI for communications. The code has a total of about 60,000 lines and is still evolving, but the Dslash compute kernel (mainly matrix-vector multiplications) that is needed in the matrix inversions is relatively localized.

Right now Grid runs on PC clusters with Intel CPUs. It achieves about 25% of peak single-node performance in single precision on the Cori Phase I machine with Intel Haswell CPUs (~600 GFlops/node) and is one of the best Lattice QCD CPU codes available to date. Turning on communications drops the performance to about 1/4 of the single-node performance.

Grid is a new piece of code still under development, so the current user community is limited to the developers and early users. It is expected that Grid will eventually be used by many users in the lattice QCD community on the CORAL machines, and on the exascale computers further down the road. To ensure the widest user base, however, it needs to be portable across different platforms. Hence the interest in using OpenACC to port it to GPUs.

Grid’s constructs were designed with portability in mind. The OpenMP pragmas are contained in macros, which can be replaced with OpenACC pragmas on the first pass. It will be interesting to see how much more tuning is required to achieve good performance on the GPUs.

  • National Cancer Institute – CBIIT Team

The application determines RNA structure using data from small-angle x-ray scattering experiments. The application has been optimized for CPU performance and parallelized with OpenMP and MPI. Preliminary explorations with GPU technologies have been performed, and a several-fold speedup is expected on GPUs. With the increasing need for RNA structures in biological applications and the growing availability of instrument data, the code is expected to have a much broader impact across a large biophysical structure and molecular modeling community.
The primary application runs locally on the Biowulf cluster (non-GPU) and on the Mira system at Argonne National Laboratory.

  • UDEL CIS Dr. John Cavazos’s team

The application takes a graph-based representation of a program and detects whether that program is malicious. If it is, the application categorizes it into the appropriate family of malicious code based on its characteristics. In other words, this application analyzes program similarity, with a particular focus on static program graphs. There is a very large set of graphs (hundreds of thousands, even up to millions) on which similarity analysis needs to be performed. The algorithm currently runs on multi-core CPUs, and the goal is to port it to the GPU. A minimum of 78% efficiency is achieved across 16 CPU cores, and a similar efficiency is expected on the GPU since the problem is embarrassingly parallel. The code uses Python and C++.
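As a quick check on what that efficiency figure means: parallel efficiency is commonly defined as speedup divided by the number of workers. Assuming that definition (the team's exact definition is not stated here), 78% efficiency on 16 cores corresponds to a speedup of about 12.5× over a single core. The helper names below are illustrative.

```python
def speedup_from_efficiency(efficiency, workers):
    """Speedup implied by a parallel efficiency, assuming
    efficiency = speedup / workers."""
    return efficiency * workers


def efficiency_from_times(t_serial, t_parallel, workers):
    """Parallel efficiency from measured wall-clock times."""
    return t_serial / (t_parallel * workers)


if __name__ == "__main__":
    # Figures from the text: 78% efficiency across 16 CPU cores.
    print(f"{speedup_from_efficiency(0.78, 16):.2f}x")
```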


  • UDEL Dr. Michael Klein’s Chem & Biomolecular Engineering Team

This application is called the Kinetic Model Builder. As its name implies, it is used to build models based on chemical engineering principles. To obtain a model of an engineering system, the user must supply the following inputs: a reaction network (e.g. A->B, BC), properties of the species involved, reactor type and conditions, and chemical kinetic parameters. From these inputs, the application creates a set of ordinary differential equations based on microkinetics and solves them using the C-based variable-order differential equation solver (CVODES) from Lawrence Livermore National Laboratory. If the user has data for the output of the engineering system (e.g. output concentrations), the kinetic parameters can then be optimized via an adaptive simulated annealing (ASA) algorithm written at Caltech.
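For the simplest network mentioned above, A->B, and assuming first-order mass-action kinetics with a single rate constant k (an assumption; the rate law is not given here), the model reduces to dA/dt = -k·A and dB/dt = k·A. The sketch below integrates it with a fixed-step fourth-order Runge-Kutta scheme purely as a stand-in for CVODES, which is an adaptive variable-order solver and is not called here. All function names are hypothetical.

```python
import math


def rhs(state, k):
    """Right-hand side for A -> B with assumed first-order kinetics."""
    a, _b = state
    rate = k * a
    return (-rate, rate)


def rk4_step(state, k, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(s, d, h):
        return tuple(si + h * di for si, di in zip(s, d))

    k1 = rhs(state, k)
    k2 = rhs(shifted(state, k1, dt / 2), k)
    k3 = rhs(shifted(state, k2, dt / 2), k)
    k4 = rhs(shifted(state, k3, dt), k)
    return tuple(
        s + dt / 6 * (d1 + 2 * d2 + 2 * d3 + d4)
        for s, d1, d2, d3, d4 in zip(state, k1, k2, k3, k4)
    )


def simulate(a0, b0, k, t_end, steps):
    """Integrate concentrations of A and B from t = 0 to t_end."""
    dt = t_end / steps
    state = (a0, b0)
    for _ in range(steps):
        state = rk4_step(state, k, dt)
    return state


def analytic_a(a0, k, t):
    """Closed-form concentration of A, used to check the integrator."""
    return a0 * math.exp(-k * t)


if __name__ == "__main__":
    a, b = simulate(1.0, 0.0, k=0.5, t_end=2.0, steps=200)
    print(a, b, analytic_a(1.0, 0.5, 2.0))
```

In a parameter-fitting loop, a function like `simulate` would be called once per candidate set of kinetic parameters, which is where the solve time adds up for larger models.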

Everything in the Kinetic Model Builder is written in C++ with object-oriented design in mind and runs on the CPU. The focus is on optimizing the code. There are 7,433 lines of ASA code, which was written in C at Caltech and has been adapted with more C++ characteristics for use in the Kinetic Model Builder here at UD. In terms of resources, it uses only ~20% of the CPU and 20 MB of memory, with no noticeable memory leaks. Time to solution is the issue: it scales linearly with model size, so larger models take longer to solve. The team envisions a performance gain from using GPUs.

  • UDEL ECE Dr. Guang Gao’s Team

This application, from the physics domain, is a simple iterative solver that uses a 5-point stencil. The goal is to extend the team’s open-source dataflow-based runtime system, DARTS, to support heterogeneous architectures with both CPU and GPU tasks. Specifically, the team is coding a generic 5-point stencil benchmark for the runtime system that runs on both CPU and GPU. The runtime system is about 10,000 lines of code, and there are several implementations of the stencil benchmark, each around 500 LOC.
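For readers unfamiliar with the pattern, a 5-point stencil couples each interior grid point to its four nearest neighbours. The sketch below is not DARTS code. It assumes the common Jacobi form in which each interior point is replaced by the average of its four neighbours, with fixed boundary values; the weights used by the team's benchmark are not given here. Function names are illustrative.

```python
def jacobi_sweep(grid):
    """Apply one 5-point Jacobi sweep and return a new grid.

    Boundary values are held fixed; each interior point becomes the
    average of its north, south, west and east neighbours (assumed weights).
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (
                grid[i - 1][j] + grid[i + 1][j]
                + grid[i][j - 1] + grid[i][j + 1]
            )
    return new


def solve(grid, tol=1e-6, max_iters=10000):
    """Iterate sweeps until the largest interior change drops below tol."""
    rows, cols = len(grid), len(grid[0])
    for iteration in range(1, max_iters + 1):
        new = jacobi_sweep(grid)
        diff = max(
            abs(new[i][j] - grid[i][j])
            for i in range(1, rows - 1)
            for j in range(1, cols - 1)
        )
        grid = new
        if diff < tol:
            return grid, iteration
    return grid, max_iters


if __name__ == "__main__":
    # 5x5 grid: boundary fixed at 1.0, interior starts at 0.0.
    start = [[1.0] * 5 for _ in range(5)]
    for i in range(1, 4):
        for j in range(1, 4):
            start[i][j] = 0.0
    result, iterations = solve(start)
    print(iterations, result[2][2])
```

Within one sweep every interior point reads only the old grid, so all updates are independent of each other. That independence is what makes the pattern a natural fit for both multi-core CPUs and GPUs.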

The runtime uses C++ and relies on the open-source libraries HWLOC and Intel TBB. The application uses both CPU and GPU, and the goal is to enable the runtime system to use both at the same time, exploiting the whole machine.

The expectation is to get higher performance than with a classical CPU-only approach. Right now the team is seeing ~3× speedups (compared with sequential execution time) using CUDA. The goal is to improve this to at least ~6× if possible (the peak performance obtained on selected input sizes using OpenMP).


