OpenACC Tutorial + Workshop

Thank you for taking the time to read my blog on UDEL’s GPU Hackathon and checking out the article 🙂

After the success of the GPU Hackathon (linked to the video) in early May this year, NVIDIA and PGI decided to come back to the University of Delaware campus to host lectures and workshops on OpenACC from June 7th through June 9th, 2016.

Yay!!!! Thank you NVIDIA/PGI! 🙂

Tutorial on Day 1 was open to all registrants.

Presented by NVIDIA team members Mathew Colgrove, Abel Brown and Barton Fiske, the workshop introduced programming techniques using OpenACC and included topics such as optimization and profiling methods for GPU programming. The teams used PGI OpenACC compilers.

Barton Fiske of NVIDIA kicked off the 3-day workshop by sharing how NVIDIA has emerged as the world leader in visual computing, along with the different programs NVIDIA offers for academia, including the teaching kit.

The lectures and hands-on exercises were given by Mathew Colgrove of NVIDIA’s PGI compiler team, covering the use of OpenACC on GPU-accelerated systems. GPUs, as you know, are among the most pervasive parallel computing platforms, used by over 300,000 developers worldwide.

Attendees included faculty, undergraduate and post-graduate students, and research scientists from Computer & Information Sciences, Electrical and Computer Engineering, Chemical and Biomolecular Engineering, and Mechanical Engineering at the University of Delaware, as well as faculty and students from Prof. Tomasz Smolinski’s team at Delaware State University and from Prof. Haklin Kimm’s team at East Stroudsburg University, Pennsylvania.

When you are teaching CS to non-CS students, don’t you feel you are on top of the world? 🙂 I do! Interdisciplinary research is so important!

OpenACC compilers can also target multicore platforms. Yes, they can. Read more.

So if you want to try the OpenACC programming model on your quad-core or dual-core laptop, you can simply download the OpenACC Toolkit (free for academia), which includes the popular PGI Accelerator Fortran/C Compiler and developer tools for acceleration with OpenACC (this is just in case you do not have a PGI compiler license yet). More useful resources along with online course materials.

 

Mini-workshop on Day 2 and Day 3: 

A team of 4 from East Stroudsburg University (ESU) and a team of 6 from Delaware State University (DSU) worked on parallelizing their codes, which implemented an Evolutionary Algorithm, a Dynamic Programming Algorithm and a Satellite Image Processing Algorithm.

The team from ESU had been using MATLAB for image processing and had been trying to use OpenMP and OpenACC directives. To them, it seemed impossible to use directives for their image processing code.

Aakashdeep Goyal, a CS Master’s student from ESU, says: “The workshop was not only limited to discuss the OpenACC framework but also provided a background study of the various existing parallel processing alternatives through open discussions.”

So this is the part I enjoy the most about the Hackathon as well as the workshop. It’s just a brilliant forum to brainstorm ideas on the white board with mentors and a bunch of eager-to-learn participants. This group learnt that “libraries” are the way to go!! The NVIDIANs helped the team use OpenCV libraries instead of MATLAB, and the team was able to integrate OpenCV with OpenMP on Ubuntu 14.04, using Eclipse as the development environment. Since the code was in MATLAB to begin with, the team spent both days converting the code to OpenCV.

Now that the team has undergone vigorous training on OpenACC and knows how to use OpenCV, they plan to use OpenACC directives with the C++ OpenCV code and, later on, CUDA. The aim is to test the non-parametric regression model along with other filtering algorithms for edge detection and linkage using the OpenACC directives. The team is confident of having a working OpenACC code within the next several weeks. (This sounds positive so stay tuned for updates :-)!)

Another algorithm, presented by one of their team members, Zuqing Z Li, was the Dynamic Programming algorithm. This is a classic wavefront-based problem! Every cell depends on its neighboring cells, which makes it a very interesting problem: unless you fully compute the upper triangle, you cannot compute the cells of the leading diagonal, and so on and so forth. Other research groups have used CUDA to exploit wavefront parallelization. So we discussed with the team some of the CUDA strategies that could be translated to OpenACC, and the team is looking forward to implementing some of those strategies and probably even using MPI + OpenACC across nodes.
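
To make the wavefront idea concrete, here is a minimal Python sketch. It assumes a longest-common-subsequence recurrence, where each cell reads only its left, upper and upper-left neighbours; the team’s actual recurrence is not described in this post, so the function names and the recurrence itself are purely illustrative. Cells on the same anti-diagonal do not depend on each other, so the inner loop of the wavefront version is the part that could, in principle, run in parallel on a GPU.

```python
def _cell(table, a, b, i, j):
    """Cell (i, j) reads only its left, upper and upper-left neighbours."""
    if a[i - 1] == b[j - 1]:
        return table[i - 1][j - 1] + 1
    return max(table[i - 1][j], table[i][j - 1])


def lcs_row_major(a, b):
    """Reference version: fill the DP table row by row (sequential order)."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            table[i][j] = _cell(table, a, b, i, j)
    return table[n][m]


def lcs_wavefront(a, b):
    """Fill the same table one anti-diagonal (i + j == d) at a time."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):
        # Every cell on diagonal d reads only diagonals d - 1 and d - 2,
        # so the iterations of this inner loop are independent of each other.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            table[i][j] = _cell(table, a, b, i, j)
    return table[n][m]
```

Both functions fill the same table and return the same answer; only the visiting order changes. The price of the wavefront order is that the amount of parallel work grows and then shrinks as the diagonal sweeps across the table.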

The team from DSU brainstormed parallelization of an Evolutionary Algorithm. These are algorithms inspired by the biological model of evolution. The Genetic Algorithm (GA) is the most common type of Evolutionary Algorithm. The team came to the workshop with the bulk of the evolutionary library already written, and their goal was to learn ways to parallelize the algorithm. As the slide presented by Prof. Tomasz Smolinski shows, the library, written in C++, has been in development since 1997 (lots of legacy code!!!)

The team’s goal was to transplant their Multi-Objective Evolutionary Algorithms (MOEA) library onto the GPU platform. The library is application-agnostic, and has been successfully utilized in various domains, including computational modeling of neurons, signal decomposition, and mining for association rules in large data sets. Ultimately, the library will be the engine behind their open-source application, called NeRvolver, which will allow users from all over the world to generate and analyze neuronal models through a web interface.

The team spent most of Day 2 brainstorming with Tristan Vanderbruggen and Robert Searles, mentors from the University of Delaware and expert programmers of accelerators, about how to manage moving data between the host and the device.

Ahaa moment !! After several white board sessions, the conclusion was that the new code would create the initial genotypes on the GPU, after which crossover and mutation would occur. Then these individuals would be sent to the simulator, which returns the fitness values of these models to the GPU. On the GPU, they also hoped to store their archive of elite models, which would be updated throughout the simulation.

But wait a minute, that was not all of it. There was yet another challenge: the size of the archive would change over time and become larger than the population (i.e. the size of each generation), so how should the appropriate space be allocated on the device???

Well – I guess they were glad that they have identified the challenge! 🙂 Sometimes finding the problem can be a challenge (Now, how many of you have experienced that!! ;-))
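
The generation loop sketched on the white board can be illustrated in plain Python. This is only a sketch under loud assumptions: a single-objective `fitness` callable stands in for the external simulator, every name and parameter is hypothetical, and capping the archive at a fixed `archive_capacity` (so that the space could be allocated once, up front) is just one possible answer to the allocation question, not necessarily what the team ended up doing.

```python
import random


def evolve(fitness, genome_len=8, pop_size=20, generations=30,
           archive_capacity=50, seed=0):
    """Toy evolutionary loop with a bounded elite archive (illustrative only)."""
    rng = random.Random(seed)
    # Create the initial genotypes (the step the team planned to run on the GPU).
    population = [[rng.random() for _ in range(genome_len)]
                  for _ in range(pop_size)]
    archive = []  # (fitness, genome) pairs, best first
    for _ in range(generations):
        # Stand-in for the simulator: score every individual.
        scored = [(fitness(genome), genome) for genome in population]
        # Merge into the archive, then trim it to the fixed capacity so the
        # storage it needs never exceeds what was reserved for it.
        archive = sorted(archive + scored, key=lambda pair: pair[0], reverse=True)
        archive = archive[:archive_capacity]
        # Build the next generation from the best archived individuals.
        parents = [genome for _, genome in archive[:pop_size]]
        population = []
        while len(population) < pop_size:
            mom, dad = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = mom[:cut] + dad[cut:]                      # one-point crossover
            k = rng.randrange(genome_len)
            child[k] = min(1.0, max(0.0, child[k] + rng.gauss(0, 0.1)))  # mutation
            population.append(child)
    return archive
```

Calling `evolve(sum)` treats the sum of the genes as the fitness. Note that the archive capacity (50) is deliberately larger than the population (20), mirroring the situation described above.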

By the end of Day 3, with Mat Colgrove’s help, the team had a working OpenACC C++ code of the algorithm!!! The code had several compute kernels, indicating that it was compute-intensive and could benefit from GPUs using OpenACC.

Although they are at the beginning of the tunnel at the moment, Karla M Miletti, an undergraduate student in the CIS department at DSU, is hopeful of taking this to the next level. She says:

“Before Wednesday we were simply hopeful we could use GPU’s to optimize our algorithm since the evolutionary library actually passes control to a simulator (such as Neuron) which usually runs sequentially. However I think we managed to find a good application of OpenACC and high performance computing to our evolutionary algorithm. Eventually we hope to figure out how to parallelize the simulator”.

 

UDEL GPU Programming Hackathon, 2016

DAY 5 (LAST DAY) May 06 – 2016 (Scroll down for Day 1, 2, 3, 4 updates)

<for more details on the Hackathon>

To SUMMARIZE:

OMG! I cannot begin to describe what a tremendous experience it was to have these several fabulous teams for a WEEK at UDEL. HATS OFF TO MENTORS FROM NVIDIA, PGI, ORNL, UTK & CORNELL. If you would like to get a 90-second overview of this week-long hackathon, check out:

Are you curious how intense the week was? (We had a 7PM reservation for beer night at a local restaurant. It was close to 8PM and nobody stopped hacking or left the room!) :-)!  Well thankfully our reservation wasn’t cancelled when we finally got there :-).

One of the teams, from Chemical & Biomolecular Engineering and with quite a limited background in Computer Science, moved from Windows to Linux this week (and naturally got a round of applause for just that) and already noticed improvement in their numbers. #CSforall matters!!!

Teams that hadn’t used OpenACC before participating in the Hackathon picked up the high-level directive-based model pretty quickly, moved code to TITAN and even started to optimize and observed some speedup! #Directives matter! For one of the other teams, the jury was still out on OpenACC vs CUDA vs X. More testing and investigation to do.

Other random notes: Techniques to reduce launch overhead – merge multiple memcpy into one by allocating one array and doing pointer math. Sometimes CPU can be the bottleneck? Oh – and this shouldn’t be surprising. To get better speedup on GPU you might want to consider moving smaller kernels to CPU.  Size of the data MATTERS to get the BEST out of GPU.
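
As an illustration of the “one array plus pointer math” tip, the sketch below packs several arrays into one contiguous buffer and hands out zero-copy views at computed offsets. It is written in Python for brevity, and the `pack` and `view` helpers are hypothetical names, not part of any GPU API. In a real GPU code, the single buffer would then be moved with one transfer instead of several.

```python
from array import array


def pack(arrays):
    """Copy several lists of floats into one contiguous buffer.

    Returns the buffer plus the (offset, length) of each input inside it.
    """
    layout = []
    offset = 0
    for values in arrays:
        layout.append((offset, len(values)))
        offset += len(values)
    buffer = array("d", [0.0] * offset)  # a single allocation for everything
    for (start, length), values in zip(layout, arrays):
        buffer[start:start + length] = array("d", values)
    return buffer, layout


def view(buffer, layout, index):
    """Zero-copy window onto the index-th packed array (the 'pointer math')."""
    start, length = layout[index]
    return memoryview(buffer)[start:start + length]
```

For example, `pack([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])` yields a six-element buffer with layout `[(0, 2), (2, 1), (3, 3)]`, and writing through `view(buffer, layout, 1)` changes `buffer[2]` directly.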

THINK MEMORY FIRST! It’s all about memory. Platforms are only getting more and more complex with deeper and deeper memory hierarchies. Plenty of research to do. Ph.D. students: are you listening?

If you want to make some real progress at a Hackathon, you are better off breaking down a large code (several thousands of LOC) into several sub-problems.

If it is legacy code that is too optimized for CPU, you are better off starting at a less mature point for a GPU implementation.

Depending on code characteristics, it may take quite an amount of refactoring to benefit from GPUs.

Sometimes you may have a mini app working OK with correct results and satisfactory performance, but you may not see a similar outcome on a real app.

Migrating legacy code is tough, time consuming, energy draining with not a lot of hope. However you ‘cannot’ afford to NOT be in the game. Architectures are changing – RAPIDLY. The applications ‘have’ to catch up. Refactoring has to be an option to be seriously considered.

Report compiler bugs. Workarounds are quick fixes and not a permanent solution.

A team with a highly complicated C++ code with regular data access patterns (originally designed for OpenMP + MPI + SIMD on Intel KNL) is, after going through the GPU programming experience, now hopeful and quite determined to move their code to GPUs.

Once the mentors are assigned, make sure you bring them up to speed on the algorithm, complexity, and expectations.

While waving good-bye, one of the teams said, “Could we have a 2-week hackathon?” :-)! My eyes lit up! And just that made my and Fernanda Foertter’s (my co-organizer from Oak Ridge National Lab) day!!!

Programming exascale machines is a challenge, but with such training events, I think it is a pleasant challenge!!!

DAY 4, May 05 – 2016 (Scroll down for Day 1, 2, and 3 updates)

<for more details on the Hackathon>

The Plateau of Enlightenment !!

More bugs and fixes. Eureka moments. Aha moments! Hey – why didn’t I try this on Day 1, moments! And hope these pictures give you an idea of the mood in the room 🙂

Mixed feelings and experiences about using pinned/managed memory, but the bottom line is that finding the best strategy to use memory/cache efficiently is the key to success! Other tips include: expose enough parallelism to saturate the device, and keep data copies between the CPU and GPU to a minimum, or at least overlap the communication with computation.

You may use the ‘async’ clause on a parallel or kernels construct to launch work in a queue asynchronously, for example to execute loops asynchronously; this also helps with pipelining. However, if the operations were already saturating the device, do not expect the ‘async’ clause to help much; in other words, do not expect the operations to interleave.
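
A back-of-the-envelope model shows why overlapping helps, and when it does not. The sketch below is an idealized two-stage pipeline in Python with hypothetical function names. It assumes the data is split into equal chunks, that copying the next chunk can fully overlap with computing on the current one, and that there is no launch or synchronization overhead; real devices will do worse.

```python
def serial_time(chunks, copy_time, compute_time):
    """No overlap: every chunk is copied, then computed, one after another."""
    return chunks * (copy_time + compute_time)


def pipelined_time(chunks, copy_time, compute_time):
    """Ideal overlap: the copy of chunk k+1 hides behind the compute of chunk k."""
    if chunks == 0:
        return 0.0
    # The first copy and the last compute cannot be hidden; in between,
    # each step costs as much as the slower of the two stages.
    return copy_time + (chunks - 1) * max(copy_time, compute_time) + compute_time
```

With 4 chunks, 1 time unit per copy and 3 per compute, this model gives 16 units without overlap and 13 with it. When one stage dominates (say 0.1 per copy and 10 per compute), the two numbers are almost the same, because there is very little left to hide.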

Other optimizations that were tried and tested included tiling, nested gangs, loop fission. Some students continued to explore better ways to program on multicore using OpenACC.

‘Hero’ profiler tools – nvprof, TAU among others. Without these tools helping identify optimization opportunities, we would be nowhere! Profilers help identify causes for performance limitations; is it due to memory bandwidth? Compute Resources? Latency issues? One of the teams was digging deeper into tuning low-level parameters by leveraging nvprof output.

By popular request from the teams, an hour was dedicated to learning more about TAU. Although I didn’t record this talk, you can check out a similar talk from the Extreme Scale Computing Program: https://www.youtube.com/watch?v=5ClHXzKF5Fo

And we were sugar high by the end of the day! 🙂

DAY 3, May 04 – 2016 (Scroll down for Day 1 and 2 updates)

<for more details on the Hackathon>

New day, New beginning, New ideas and strategies = Slope of Hope !!!

 

Some codes exposed interesting bugs and corner cases, which have been reported and filed. These are usually considered the best case studies for improving the compiler implementations.

 

Fine-tuned, manageable kernels with reduced LOC are now ported to GPUs using OpenACC. They are performing better than on CPUs, and the team is looking to further improve the performance by exploring gang, worker and vector levels of parallelism. Another team is investigating how to overlap communication with computation.

Codes with deeper and deeper nested templates could, to an extent, be tackled with the motto ‘comment and conquer’. The team is now moving on and considering looking into a smaller C++ code. One of the other teams is exploring OpenACC’s interoperability with OpenMP and even considering using 2 GPUs. Larger datasets matter here!

If you don’t already know, OpenACC codes can run on multicore (Note: Use PGI 15.10 onwards if you want to run your OpenACC code on a multicore platform). Do not miss PGI’s Michael Wolfe’s article on “OpenACC for Multicore CPUs”.

A team with a chemical and biomolecular engineering background that had never used CUDA or programmed GPUs has now profiled their code to find the ‘hot spots’, ported the code to the TITAN supercomputer, and is already seeing some speedup! Isn’t that fascinating?!

DAY 2, May 03 –  2016 (Scroll down for Day 1 updates)

<for more details on the hackathon>

So what was Day 2 like?

Let’s start with a math equation, shall we? 🙂

Programmers’ patience tested!

Profilers like TAU and nvprof are every team’s best friends at this point. One team’s kernel is over 5K LOC, showcasing low latency and a poor data access pattern. Another team is working on C++ codes and, as observed in past hackathons, it’s been a challenge to use OpenACC on such codes that are deeply nested and heavily templated. One of the other teams is already seeing ~1.2x speedup on GPUs using OpenACC compared with OpenMP.

Some of the optimizations that the teams have been using include loop reorganization, kernel splitting, and flattening call structures for C++ codes. Some are restructuring their codes and trying to use the memory and caches efficiently.

Corner cases are being reported to the compiler developers and this sort of feedback is really important to improve OpenACC compiler implementations!!

Here you go with a screen full of errors – well actually doesn’t fit within a screen!

Blue screen of despair 😉

 

DAY 1, May 02 – 2016

<for more details on the hackathon>

Today, May 02 2016, the GPU Programming Hackathon, co-organized with Oak Ridge National Laboratory, kicked off at the University of Delaware. Dr. Eric Nielsen from NASA Langley gave an invited talk on FUN3D, a large-scale computational fluid dynamics solver for complex aerodynamic flows seen in a broad range of aerospace (and other) applications.

6 teams are participating in this hackathon, which aims to meet several kinds of expectations: a) educate teams to program GPUs using high-level directive-based programming models such as OpenACC and OpenMP; b) train teams to accelerate their codes on GPUs or CPUs; c) provide teams with a clear roadmap on how to tap into the massive potential that GPUs can offer.

Several mentors from NVIDIA/PGI, UTK, Cornell, ORNL and UDEL with extensive programming experience are on site at UDEL to work with the teams and help them meet their expectations.

Teams are paired up with mentors. Goals are set. White boards are filled up with equations, tables, figures and what not! Codes are being migrated to ORNL’s TITAN (the world’s second largest supercomputer) as we speak!

It’s Showtime, folks!!

The participating teams are:

  • NASA Langley

FUN3D is a large-scale computational fluid dynamics solver for complex aerodynamic flows seen in a broad range of aerospace (and other) applications. FUN3D solves the Navier-Stokes equations on unstructured meshes using a node-based finite volume spatial discretization with implicit time advancement. FUN3D is approximately 800,000 lines of code.

FUN3D is predominantly written in Fortran 200x. The code can make use of a broad range of third-party libraries, depending on the application. At a minimum, an MPI implementation must be available, as well as one of the supported partitioning packages (Metis, ParMetis, or Zoltan).

FUN3D is used across the country on a wide range of systems, from desktops to large HPC resources. The code has been run on up to 80,000 cores (CPU) on TITAN. Select kernels have been ported to GPU, with the majority of effort to date spent on the workhorse linear solver (multicolor point-implicit). These ports use some OpenACC and some CUDA Fortran, through an ongoing collaboration with NVIDIA.

FUN3D is widely used to support major national research and engineering efforts, both within NASA and among groups across U.S. industry, the Department of Defense, and academia. A past collaboration with the Department of Energy received the Gordon Bell Prize. Some applications that FUN3D currently supports include:

– NASA aeronautics research, spanning fixed-wing applications, rotary-wing vehicles, and supersonic boom mitigation efforts.
– Design and analysis of NASA’s new Space Launch System.
– Analysis of re-entry deceleration concepts for NASA space missions, such as supersonic retro-propulsion and hypersonic inflatable aerodynamic decelerator systems.

– Development of commercial crew spacecraft at companies such as SpaceX.
– Timely analysis of vehicles and weapons systems for U.S. military efforts around the world.
– Efficient green energy concept development, such as wind turbine design and drag minimization for long-haul trucking.

  • Brookhaven National Lab’s Lattice QCD

Lattice QCD is a numerical simulation approach to solving the high-dimensional non-linear problems in strong interactions. Lattice QCD is an indispensable tool for nuclear and particle physics. The heart of Lattice QCD is Monte Carlo simulations, in which the dominating numerical cost (>90%) is matrix inversion of the type Ax=b or A^+ A x = b. Due to the high numerical cost, Lattice QCD simulations typically run on massively parallel computers and on PC clusters of up to hundreds of nodes.
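
To give a feel for what such a “matrix inversion” looks like in code, here is a small pure-Python sketch that solves a system of the second type with the conjugate gradient method. Several things here are assumptions rather than facts from the team’s description: A^+ is read as the (conjugate) transpose of A, the matrix is real and dense, and conjugate gradient is simply a common iterative choice for such symmetric positive-definite systems. This post does not say which solver Grid actually uses, and all function names are made up.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_normal(matrix, rhs, tol=1e-12, max_iter=1000):
    """Solve (A^T A) x = rhs by conjugate gradient, for a real, non-singular A."""
    a_t = transpose(matrix)

    def apply_normal(v):
        # Apply A, then A^T, without ever forming A^T A explicitly.
        return matvec(a_t, matvec(matrix, v))

    x = [0.0] * len(rhs)
    r = list(rhs)          # residual for the initial guess x = 0
    p = list(r)            # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr <= tol:      # squared residual norm is small enough
            break
        mp = apply_normal(p)
        alpha = rr / dot(p, mp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * mi for ri, mi in zip(r, mp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x
```

Almost all of the time goes into the repeated matrix-vector products inside `apply_normal`, which is why a localized matrix-vector kernel is the natural first target for a GPU port.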

The particular code that the team aimed to port to GPUs using OpenACC is the newly engineered Grid library: http://www.github.com/paboyle/Grid. It is written in C++11 at the top level, with a vectorized data layout and SIMD intrinsics targeting current and upcoming Intel CPUs with long vector registers. Right now it has OpenMP for threading and MPI for communications. The code has a total of about 60,000 lines and is evolving. However, the Dslash compute kernel (mainly matrix-vector multiplications) that is needed in the matrix inversions is relatively localized.

Right now Grid runs on PC clusters with Intel CPUs, achieves about 25% peak single-node performance in single precision on the Cori Phase I machine with Intel Haswell CPUs (~600 GFlops/node) and is one of the best Lattice QCD CPU codes available to date. Turning on communications drops the performance down to about 1/4 of the single-node performance.

Grid is a new piece of code still under development, so the current user community is limited to the developers and early users. It is expected that eventually Grid will be used by many users in the lattice QCD community on the CORAL machines, and on the exascale computers further down the road. But to ensure that it will have the widest user base, it needs to be portable across different platforms. Hence the interest in using OpenACC to port it to GPUs.

The constructs of Grid have portability in mind. The OpenMP pragmas are contained in macros, which can be replaced with OpenACC pragmas on the first pass. It will be interesting to see how much more tuning is required to achieve good performance on the GPUs.

  • National Cancer Institute – CBIIT Team

The application determines RNA structure using data from small angle x-ray scattering experiments. The application has been optimized for CPU performance and parallelized with OpenMP and MPI. Preliminary explorations with GPU technologies have been performed, and a severalfold speedup is expected on GPUs. With the increasing need for RNA structures in biological applications and availability of instrument data, the code is expected to have a much broader impact across a large biophysical structure and molecular modeling community.
The primary application runs locally on the Biowulf cluster (non-GPU) and on the Mira system at Argonne National Laboratory.

  • UDEL CIS Dr. John Cavazos’s team

The application takes a graph-based representation of a program, detects whether that program is malicious and, if it is, categorizes it into the appropriate family of malicious code based on its characteristics. This application analyzes program similarity. In particular, the focus is on static program graphs. There is a very large set of graphs (hundreds of thousands, even up to millions) on which similarity analysis needs to be performed. The algorithm currently runs on the CPU and makes use of multi-core CPUs, and the goal is to port it to the GPU. A minimum of 78% efficiency is achieved across 16 CPU cores, and a similar efficiency is expected on the GPU since the problem is embarrassingly parallel. The code uses Python and C++.
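
For readers less used to this metric, the usual definition of parallel efficiency is speedup divided by the number of workers. Under that assumption (the team’s exact definition is not given here), the quoted figure translates into a speedup as in the small Python sketch below; the function names are hypothetical.

```python
def speedup_from_efficiency(efficiency, workers):
    """Speedup implied by a parallel efficiency on a given number of workers."""
    return efficiency * workers


def efficiency_from_speedup(speedup, workers):
    """Parallel efficiency: the fraction of ideal linear speedup achieved."""
    return speedup / workers
```

So 78% efficiency on 16 cores corresponds to a speedup of about 12.5x (0.78 × 16 = 12.48) over a single core.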

  • UDEL Dr. Michael Klein’s Chem & Biomolecular Engineering Team

This application is called the Kinetic Model Builder. As its name might imply, it is used to build models based on chemical engineering principles. In order for a user to obtain a model of their engineering system, the following inputs must be given: a reaction network (ex. A->B, BC), properties of the species involved, reactor type and conditions, and chemical kinetic parameters. With the aforementioned inputs, the application creates a set of ordinary differential equations based on microkinetics and solves them using the C-based variable order differential equation solver (CVODES) by Lawrence Livermore National Laboratory. If the user has data for the output (ex. output concentrations) of their engineering system, then optimization of kinetic parameters may be achieved via an adaptive simulated annealing (ASA) algorithm written at Caltech.

Everything in the Kinetic Model Builder is written in C++ with object-oriented coding in mind and runs on the CPU. The focus would be to optimize the code. There are 7,433 lines of ASA code. The ASA code was written in C at Caltech but has been adapted with more C++ characteristics for its use in the Kinetic Model Builder here at UD. In terms of performance, it only uses ~20% of the CPU and 20 MB of memory with no noticeable memory leaks. The time to find a solution becomes an issue and scales linearly with model size, with longer solution times for larger models. A performance gain is envisioned from using GPUs.

  • UDEL ECE Dr. Guang Gao’s Team

This application, from the physics domain, is a simple iterative solver that uses a 5-point stencil. The goal is to extend the team’s open-source dataflow-based runtime system, called DARTS, to support heterogeneous architectures with both CPU and GPU tasks. Specifically, we are coding a benchmark of a generic 5-point stencil for our RTS that runs on both CPU and GPU. The runtime system is about 10,000 lines of code, and there are several implementations of the stencil benchmarks, each around 500 LOC.
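
For readers who have not met one before, a 5-point stencil updates each interior grid cell from itself and its four direct neighbours. The Python sketch below shows one common variant, a Jacobi sweep that replaces each interior cell with the average of its four neighbours while keeping the boundary fixed. The coefficients, boundary handling and function names are assumptions made for illustration, not the team’s benchmark.

```python
def jacobi_step(grid):
    """One sweep of a 5-point stencil; boundary cells are left unchanged."""
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # write into a copy, read from the old grid
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            # Each update reads only old values, so all (i, j) are independent.
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def solve(grid, sweeps):
    """Apply a fixed number of sweeps and return the final grid."""
    for _ in range(sweeps):
        grid = jacobi_step(grid)
    return grid
```

Because every cell of a sweep depends only on the previous sweep, the two inner loops are the natural candidates to split between CPU and GPU tasks.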

The runtime uses C++ and relies on the open-source libraries HWLOC and Intel TBB. The application uses both CPU and GPU, and the goal is to enable the runtime system to use both at the same time, exploiting the whole machine.

The expectation is to get higher performance than with a classical CPU-based (and CPU-only) approach. Right now the team is seeing ~3× speedups (compared with sequential execution time) using CUDA. The expectation is to improve this to at least ~6× if possible (this is the max peak performance on selected input sizes using OpenMP).