UDEL GPU Programming Hackathon, 2016

DAY 5 (LAST DAY) May 06 – 2016 (Scroll down for Day 1, 2, 3, 4 updates)

<for more details on the Hackathon>


OMG! I cannot begin to describe what a tremendous experience it was to host these several fabulous teams for a WEEK at UDEL. HATS OFF TO THE MENTORS FROM NVIDIA, PGI, ORNL, UTK & CORNELL. If you would like a 90-second overview of this week-long hackathon, check out:

Are you curious how intense the week was? (We had a 7 PM reservation for beer night at a local restaurant. It was close to 8 PM and nobody had stopped hacking or left the room!) :-)! Well, thankfully our reservation wasn’t cancelled when we finally got there :-).


One of the teams, from Chemical & Biomolecular Engineering and with quite a limited background in computer science, moved from Windows to Linux this week (and naturally got a round of applause for just that) and already noticed improvement in their numbers. #CSforall matters!!!


Teams that hadn’t used OpenACC before participating in the Hackathon picked up the high-level directive-based model pretty quickly, moved code to TITAN, and even started to optimize and observed some speedup! #Directives matter! For one of the other teams the jury was still out on OpenACC vs. CUDA vs. X; more testing and investigation to do.

Other random notes: One technique to reduce launch overhead is to merge multiple memcpys into one by allocating a single array and doing pointer math. Sometimes the CPU can be the bottleneck. Oh, and this shouldn’t be surprising: to get better speedup on the GPU, you might want to consider moving smaller kernels back to the CPU. The size of the data MATTERS to get the BEST out of a GPU.
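The memcpy-merging trick looks roughly like this. A minimal C sketch (plain malloc/memcpy standing in for the CUDA allocation/transfer calls, and all names here are illustrative): pack three logical arrays into one contiguous buffer and carve out views with pointer arithmetic, so one bulk transfer replaces three separate ones.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch: instead of three separate transfers for a[], b[], c[],
 * pack them into one contiguous buffer and carve out sub-arrays with
 * pointer arithmetic, so a single copy (one launch/transfer) suffices.
 * Plain malloc/memcpy stand in here for cudaMalloc/cudaMemcpy. */
typedef struct {
    double *buf;          /* one contiguous allocation of 3*n doubles */
    double *a, *b, *c;    /* views into buf, n doubles each */
    size_t  n;
} packed_arrays;

int packed_init(packed_arrays *p, size_t n)
{
    p->buf = malloc(3 * n * sizeof *p->buf);
    if (!p->buf) return -1;
    p->n = n;
    p->a = p->buf;          /* first n elements */
    p->b = p->buf + n;      /* next n elements  */
    p->c = p->buf + 2 * n;  /* last n elements  */
    return 0;
}

/* One bulk copy replaces three per-array copies. */
void packed_copy(packed_arrays *dst, const packed_arrays *src)
{
    memcpy(dst->buf, src->buf, 3 * src->n * sizeof *src->buf);
}
```

On a real GPU code the same layout lets a single cudaMemcpy (or one OpenACC data clause) move all three arrays at once.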

THINK MEMORY FIRST! It’s all about memory. Platforms are only getting more and more complex with deeper and deeper memory hierarchies. Plenty of research to do. Ph.D. students: are you listening?

If you want to make real progress at a hackathon, you are better off breaking down a large code (several thousand lines) into several sub-problems.

If it is legacy code that is too optimized for CPU, you are better off starting at a less mature point for a GPU implementation.

Depending on code characteristics, it may take quite an amount of refactoring to benefit from GPUs.

Sometimes you may have a mini app working OK with correct results and satisfactory performance, but you may not see a similar outcome on a real app.

Migrating legacy code is tough, time-consuming, and energy-draining, with no guarantee of success. However, you ‘cannot’ afford to NOT be in the game. Architectures are changing – RAPIDLY. The applications ‘have’ to catch up. Refactoring has to be seriously considered as an option.

Report compiler bugs. Workarounds are quick fixes, not a permanent solution.

The team with a highly complicated C++ code with regular data access patterns (originally designed for OpenMP + MPI + SIMD on Intel KNL) is, after going through the GPU programming experience, now hopeful and quite determined to move their code to GPUs.

Once the mentors are assigned, make sure you bring them up to speed on the algorithm, its complexity, and your expectations.

While waving good-bye, one of the teams asked, “Could we have a 2-week hackathon?” :-)! My eyes lit up! And just that made my day, and that of Fernanda Foertter (my co-organizer from Oak Ridge National Lab)!!!

Programming an exascale machine is a challenge, but with training events like these, I think it is a pleasant challenge!!!


DAY 4, May 05 – 2016 (Scroll down for Day 1, 2, and 3 updates)


The Plateau of Enlightenment !!

More bugs and fixes. Eureka moments. Aha moments! “Hey – why didn’t I try this on Day 1?” moments! Hopefully these pictures give you an idea of the mood in the room 🙂

Mixed feelings and experiences about using pinned/managed memory, but the bottom line is that exploring the best strategy to use memory/cache efficiently is the key to success! Other tips: expose enough parallelism to saturate the device, keep data copies back and forth between CPU and GPU to a minimum, or at least overlap communication with computation.
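The “minimize copies” tip usually means wrapping several kernels in one data region. A minimal OpenACC sketch (function name and values are mine; without an OpenACC compiler the pragmas are simply ignored and the loops run serially on the CPU):

```c
#include <stddef.h>

/* Sketch: one OpenACC data region around two compute regions, so the
 * array is copied to the device once and stays resident between kernels,
 * instead of being shuttled back and forth around each one. */
void scale_then_shift(float *x, size_t n, float s, float t)
{
    #pragma acc data copy(x[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++)
            x[i] *= s;            /* kernel 1: no copy-back after this */

        #pragma acc parallel loop
        for (size_t i = 0; i < n; i++)
            x[i] += t;            /* kernel 2: reuses device-resident data */
    } /* x is copied back to the host once, at the end of the region */
}
```

Without the outer data region, each parallel loop would imply its own copy-in/copy-out of x.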

You may use the ‘async’ clause on a parallel or kernels construct to launch work in a queue asynchronously – say, to execute loops asynchronously; it also helps with pipelining. However, if the operations were already saturating the device, do not expect the ‘async’ clause to be too helpful – in other words, do not expect operations to interleave.

Other optimizations that were tried and tested included tiling, nested gangs, loop fission. Some students continued to explore better ways to program on multicore using OpenACC.
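Of the optimizations listed, loop fission is the easiest to show in a few lines. A sketch (the two functions are mine, for illustration): splitting one loop body into two simpler loops can give each loop a friendlier memory access pattern and make it easier for the compiler to parallelize or vectorize.

```c
#include <stddef.h>

/* Before fission: one loop with two updates per iteration. */
void fused(float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] * 2.0f;   /* update 1 */
        b[i] = b[i] + a[i];   /* update 2 uses the new a[i] */
    }
}

/* After fission: same element-wise results, two separate loops, each
 * streaming through one array at a time. */
void fissioned(float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = a[i] * 2.0f;
    for (size_t i = 0; i < n; i++)
        b[i] = b[i] + a[i];
}
```

Fission is legal here because each iteration only touches its own index; with cross-iteration dependences the transformation would need more care.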

‘Hero’ profiler tools – nvprof and TAU, among others. Without these tools helping identify optimization opportunities, we would be nowhere! Profilers help identify the causes of performance limitations: is it memory bandwidth? Compute resources? Latency issues? One of the teams was digging deeper into tuning low-level parameters by leveraging nvprof output.

By popular request from the teams, an hour was dedicated to learning more about TAU. Although I didn’t record this talk, check out a similar talk from the Extreme Scale Computing Program: https://www.youtube.com/watch?v=5ClHXzKF5Fo

And we were sugar high by the end of the day! 🙂

DAY 3, May 04 – 2016 (Scroll down for Day 1 and 2 updates)


New day, New beginning, New ideas and strategies = Slope of Hope !!!


Some codes turned up interesting bugs and corner cases, which have been reported and filed. These usually make the best case studies for improving compiler implementations.



Fine-tuned, manageable kernels with reduced LOC have now been ported to GPUs using OpenACC. They are performing better than on CPUs, and the team is looking to further improve performance by exploring the gang, worker, and vector levels of parallelism. Another team is investigating how to overlap communication with computation.
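For readers new to OpenACC, exploring gang/vector parallelism looks something like this. A sketch (function name and the num_gangs/vector_length values are illustrative starting points, not tuned numbers; omit them and the compiler picks defaults; non-OpenACC compilers ignore the pragmas entirely):

```c
/* Sketch: explicitly mapping a nested loop onto OpenACC's gang and
 * vector levels. The outer loop is spread across gangs, the inner loop
 * across vector lanes within a gang. */
void saxpy_2d(int rows, int cols, float alpha, const float *x, float *y)
{
    #pragma acc parallel loop gang num_gangs(256) vector_length(128)
    for (int i = 0; i < rows; i++) {
        #pragma acc loop vector
        for (int j = 0; j < cols; j++)
            y[i * cols + j] += alpha * x[i * cols + j];
    }
}
```

Tuning usually means sweeping these values (and trying worker-level parallelism) while watching the profiler, rather than guessing once.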


Codes with deeper and deeper nested templates could, to an extent, be tackled with the motto ‘comment and conquer’. The team is now moving on and considering looking into a smaller C++ code. One of the other teams is exploring OpenACC’s interoperability with OpenMP and is even considering using 2 GPUs. Larger datasets matter here!

If you don’t already know, OpenACC codes can run on multicore CPUs (note: use PGI 15.10 onwards if you want to run your OpenACC code on a multicore platform). Don’t miss PGI’s Michael Wolfe’s article “OpenACC for Multicore CPUs”.

A team with a chemical and biomolecular engineering background that had never used CUDA or programmed GPUs has now profiled their code to find the ‘hot spots’, ported the code to the TITAN supercomputer, and is already seeing some speedup! Isn’t that fascinating?!


DAY 2, May 03 –  2016 (Scroll down for Day 1 updates)


So what was Day 2 like?

Let’s start with a math equation, shall we? 🙂

Programmers’ patience tested!

Profilers like TAU and nvprof are every team’s best friends at this point. One team’s kernel is over 5K LOC, suffering from latency and poor data access patterns. Another team is working on C++ codes, and as observed in past hackathons, it’s been a challenge to use OpenACC on such deeply nested and heavily templated codes. One of the other teams is already seeing ~1.2x speedup on GPUs using OpenACC compared with OpenMP.

Some of the optimizations the teams have been using include loop reorganization, kernel splitting, and flattening call structures in C++ codes. Some are restructuring their codes to use the memory hierarchy and caches more efficiently.

Corner cases are being reported to the compiler developers and this sort of feedback is really important to improve OpenACC compiler implementations!!

Here you go with a screen full of errors – well, actually, they don’t even fit on one screen!

Blue screen of despair 😉



DAY 1, May 02 – 2016


Today, May 02 2016, the GPU Programming Hackathon, co-organized with Oak Ridge National Laboratory, kicked off at the University of Delaware. Dr. Eric Nielsen from NASA Langley gave an invited talk on FUN3D, a large-scale computational fluid dynamics solver for complex aerodynamic flows seen in a broad range of aerospace (and other) applications.

6 teams are participating in this hackathon, which aims to meet several kinds of expectations: a) educate teams to program GPUs using high-level directive-based programming models such as OpenACC and OpenMP; b) train teams to accelerate their codes on GPUs or CPUs; c) provide teams with a clear roadmap on how to tap into the massive potential that GPUs can offer.

Several mentors from NVIDIA/PGI, UTK, Cornell, ORNL, and UDEL with extensive programming experience are on site at UDEL to work with the teams and help them meet their desired expectations.

Teams are paired up with mentors. Goals are set. White boards are filled up with equations, tables, figures, and what not! Codes are being migrated to ORNL’s TITAN (the world’s second largest supercomputer) as we speak!

It’s Showtime, folks!!

The participating teams are:

  • NASA Langley

FUN3D is a large-scale computational fluid dynamics solver for complex aerodynamic flows seen in a broad range of aerospace (and other) applications. FUN3D solves the Navier-Stokes equations on unstructured meshes using a node-based finite volume spatial discretization with implicit time advancement. FUN3D is approximately 800,000 lines of code.

FUN3D is predominantly written in Fortran 200x. The code can make use of a broad range of third-party libraries, depending on the application. At a minimum, an MPI implementation must be available, as well as one of the supported partitioning packages (Metis, ParMetis, or Zoltan).

FUN3D is used across the country on a wide range of systems, from desktops to large HPC resources. The code has been run out to 80,000 cores (CPU) on TITAN. Select kernels have been ported to the GPU, with the majority of effort to date spent on the workhorse linear solver (multicolor point-implicit) – some in OpenACC, some in CUDA Fortran, through an ongoing collaboration with NVIDIA.

FUN3D is widely used to support major national research and engineering efforts, both within NASA and among groups across U.S. industry, the Department of Defense, and academia. A past collaboration with the Department of Energy received the Gordon Bell Prize. Some applications that FUN3D currently supports include:

– NASA aeronautics research, spanning fixed-wing applications, rotary-wing vehicles, and supersonic boom mitigation efforts.
– Design and analysis of NASA’s new Space Launch System.
– Analysis of re-entry deceleration concepts for NASA space missions, such as supersonic retro-propulsion and hypersonic inflatable aerodynamic decelerator systems.

– Development of commercial crew spacecraft at companies such as SpaceX.
– Timely analysis of vehicles and weapons systems for U.S. military efforts around the world.
– Efficient green energy concept development, such as wind turbine design and drag minimization for long-haul trucking.

  • Brookhaven National Lab’s Lattice QCD

Lattice QCD is a numerical simulation approach for solving the high-dimensional non-linear problems of the strong interaction, and an indispensable tool for nuclear and particle physics. The heart of Lattice QCD is Monte Carlo simulation, in which the dominant numerical cost (>90%) is matrix inversion of the type Ax = b or A†Ax = b. Due to the high numerical cost, Lattice QCD simulations typically run on massively parallel computers and on PC clusters of up to hundreds of nodes.
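Those Ax = b solves are typically done with Krylov iterative methods such as conjugate gradient (CG), which applies the matrix repeatedly rather than inverting it. A minimal real-valued CG sketch for a symmetric positive-definite system (dense matrix here for simplicity; this is the textbook algorithm, not Grid’s actual solver, which works on a sparse stencil operator):

```c
#include <stddef.h>

/* Textbook conjugate gradient for a symmetric positive-definite
 * n x n matrix A, solving A x = b. Dense matvec for simplicity. */
static double dot(const double *u, const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += u[i] * v[i];
    return s;
}

static void matvec(const double *A, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        y[i] = 0.0;
        for (size_t j = 0; j < n; j++) y[i] += A[i * n + j] * x[j];
    }
}

/* x holds the initial guess on entry (e.g. zeros) and the solution on
 * exit; r, p, Ap are caller-provided scratch vectors of length n. */
void cg_solve(const double *A, const double *b, double *x,
              double *r, double *p, double *Ap,
              size_t n, int max_iter, double tol)
{
    matvec(A, x, Ap, n);
    for (size_t i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(r, r, n);
    for (int k = 0; k < max_iter && rr > tol * tol; k++) {
        matvec(A, p, Ap, n);                       /* dominant cost */
        double alpha = rr / dot(p, Ap, n);
        for (size_t i = 0; i < n; i++) { x[i] += alpha * p[i];
                                         r[i] -= alpha * Ap[i]; }
        double rr_new = dot(r, r, n);
        double beta = rr_new / rr;
        for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
        rr = rr_new;
    }
}
```

In lattice QCD the matvec is the Dslash operator mentioned below, which is exactly why that kernel dominates and is the prime porting target.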

The particular code the team aims to port to GPUs using OpenACC is the newly engineered Grid library: http://www.github.com/paboyle/Grid. It is written in C++11 at the top level, with a vectorized data layout and SIMD intrinsics targeting current and upcoming Intel CPUs with long vector registers. Right now it uses OpenMP for threading and MPI for communication. The code totals about 60,000 lines and is evolving, but the Dslash compute kernel, mainly matrix–vector multiplications, that is needed in the matrix inversions is relatively localized.

Right now Grid runs on PC clusters with Intel CPUs, achieves about 25% peak single-node performance in single precision on the Cori Phase I machine with Intel Haswell CPUs (~600 GFlops/node) and is one of the best Lattice QCD CPU codes available to date. Turning on communications drops the performance down to about 1/4 of the single-node performance.

Grid is a new piece of code still under development, so the current user community is limited to the developers and early users. It is expected that Grid will eventually be used by many in the lattice QCD community on the CORAL machines, and on the exascale computers further down the road. To ensure the widest user base, it will need to be portable across different platforms; hence the interest in using OpenACC to port it to GPUs.

Grid was constructed with portability in mind. The OpenMP pragmas are contained in macros, which can be replaced with OpenACC pragmas on a first pass. It will be interesting to see how much more tuning is required to achieve good performance on GPUs.
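The macro approach is worth sketching, since it is what makes the “first pass” cheap. A hypothetical example in C (these are not Grid’s actual macro names; Grid’s real macros live in C++ headers): the threading pragma lives in one macro, so switching programming models is a one-line build change rather than a whole-codebase edit.

```c
/* Hypothetical sketch of the macro-based portability approach: pick the
 * parallelization pragma at compile time via a single macro. With
 * neither symbol defined, the code falls back to a serial loop. */
#if defined(USE_OPENACC)
#  define PARALLEL_FOR _Pragma("acc parallel loop")
#elif defined(USE_OPENMP)
#  define PARALLEL_FOR _Pragma("omp parallel for")
#else
#  define PARALLEL_FOR /* serial fallback */
#endif

/* Every hot loop in the code uses the macro instead of a raw pragma. */
void axpy(int n, double a, const double *x, double *y)
{
    PARALLEL_FOR
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```

Compile with -DUSE_OPENMP (and -fopenmp or equivalent) for CPUs, or -DUSE_OPENACC with an OpenACC compiler for GPUs; the loop bodies never change.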

  • National Cancer Institute – CBIIT Team

The application determines RNA structure using data from small-angle X-ray scattering experiments. The application has been optimized for CPU performance and parallelized with OpenMP and MPI. Preliminary explorations with GPU technologies have been performed, and several-fold speedups are expected on GPUs. With the increasing need for RNA structures in biological applications and the availability of instrument data, the code is expected to have a much broader impact across the large biophysical structure and molecular modeling community.
The primary application runs locally on the Biowulf cluster (non-GPU) and on the Mira system at Argonne National Laboratory.

  • UDEL CIS Dr. John Cavazos’s team

The application takes a graph-based representation of a program and detects whether that program is malicious; if it is, it categorizes it in the appropriate family of malicious code based on its characteristics. This application analyzes program similarity, with a focus on static program graphs. There is a very large set of graphs (hundreds of thousands, even up to millions) on which similarity analysis needs to be performed. The algorithm currently runs on the CPU and makes use of multiple CPU cores; the goal is to port it to the GPU. A minimum of 78% efficiency is achieved across 16 CPU cores, and a similar efficiency is expected on the GPU since the problem is embarrassingly parallel. The code uses Python and C++.


  • UDEL Dr. Michael Klein’s Chem & Biomolecular Engineering Team

This application is called the Kinetic Model Builder. As its name might imply, it is used to build models based on chemical engineering principles. In order for a user to obtain a model of their engineering system, the following inputs must be given: a reaction network (e.g. A->B, BC), properties of the species involved, reactor type and conditions, and chemical kinetic parameters. With the aforementioned inputs, the application creates a set of ordinary differential equations based on microkinetics and solves them using the C-based variable-order differential equation solver (CVODES) from Lawrence Livermore National Laboratory. If the user has data for the output (e.g. output concentrations) of their engineering system, then optimization of the kinetic parameters may be performed via an adapted simulated annealing (ASA) algorithm written at Caltech.

Everything in the Kinetic Model Builder is written in C++ with object-oriented design in mind and runs on the CPU. The focus would be on optimizing the code. There are 7,433 lines of ASA code; ASA was originally written in C at Caltech but has been adapted with more C++ characteristics for its use in the Kinetic Model Builder here at UD. In terms of performance, it uses only ~20% of the CPU and 20 MB of memory, with no noticeable memory leaks. The time to find a solution becomes an issue and scales linearly with model size, with larger solution times for larger models. A performance gain is envisioned using the GPU.

  • UDEL ECE Dr. Guang Gao’s Team

This application, from the physics domain, is a simple iterative solver that uses a 5-point stencil. The goal is to extend the team’s open-source dataflow-based runtime system, DARTS, to support heterogeneous architectures containing CPU and GPU tasks. Specifically, the team is coding a benchmark of a generic 5-point stencil for their RTS that runs on both CPU and GPU. The runtime system is about 10,000 lines of code, and there are several implementations of the stencil benchmark, each around 500 LOC.
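For reference, the computational core of a generic 5-point stencil sweep is tiny. A sketch (a plain Jacobi-style sweep; the function name and averaging weights are illustrative, not DARTS’s actual benchmark code):

```c
#include <stddef.h>

/* One Jacobi-style sweep of a 5-point stencil on an ny x nx grid stored
 * row-major: each interior point becomes the average of its four
 * neighbors; boundary points are left untouched in 'out'. */
void stencil_sweep(const double *in, double *out, size_t nx, size_t ny)
{
    for (size_t i = 1; i < ny - 1; i++)
        for (size_t j = 1; j < nx - 1; j++)
            out[i * nx + j] = 0.25 * (in[(i - 1) * nx + j] +   /* north */
                                      in[(i + 1) * nx + j] +   /* south */
                                      in[i * nx + j - 1]  +    /* west  */
                                      in[i * nx + j + 1]);     /* east  */
}
```

An iterative solver calls this repeatedly, swapping the in/out buffers each sweep; the interesting part of the DARTS work is scheduling these sweeps as dataflow tasks across CPU and GPU, not the sweep itself.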

The runtime uses C++ and relies on open source libraries HWLOC and Intel TBB. The application uses both CPU and GPU and the goal is to make the runtime system be able to use both at the same time exploiting the whole machine.

The expectation is to get higher performance than with a classical CPU-based (and CPU-only) approach. Right now the team is seeing ~3× speedups (compared with sequential execution time) using CUDA. The goal is to improve this to at least ~6× if possible (the peak performance observed with OpenMP on selected input sizes).