UDEL GPU Programming Hackathon, 2016

DAY 5 (LAST DAY) May 06 – 2016 (Scroll down for Day 1, 2, 3, 4 updates)

<for more details on the Hackathon>


OMG! I cannot begin to describe what a tremendous experience it was to have these fabulous teams at UDEL for a WEEK. HATS OFF TO THE MENTORS FROM NVIDIA, PGI, ORNL, UTK & CORNELL. If you would like a 90-second overview of this week-long hackathon, check out:

Are you curious how intense the week was? (We had a 7PM reservation for beer night at a local restaurant. It was close to 8PM and nobody stopped hacking or left the room!) :-)!  Well thankfully our reservation wasn’t cancelled when we finally got there :-).


One of the teams from Chemical & Biomolecular Engineering, with quite a limited background in Computer Science, moved from Windows to Linux this week (and naturally got a round of applause for just that) and has already noticed improvement in their numbers. #CSforall matters!!!


Teams that hadn’t used OpenACC before participating in the Hackathon picked up the high-level directive-based model pretty quickly, moved their code to TITAN, and even started to optimize and observed some speedup! #Directives matter! For one of the other teams, the jury was still out on OpenACC vs CUDA vs X. More testing and investigation to do.

Other random notes: One technique to reduce launch overhead is to merge multiple memcpy calls into one by allocating a single array and using pointer math to address the pieces. Sometimes the CPU can be the bottleneck, and this shouldn’t be surprising. To get a better speedup on the GPU, you might want to consider moving the smaller kernels to the CPU. The size of the data MATTERS to get the BEST out of the GPU.
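To make the memcpy-merging note concrete, here is a minimal sketch in C with OpenACC directives. It is not code from any of the teams: the `packed3` struct and both function names are hypothetical, made up for illustration. Two input arrays live back to back in one allocation, so a single `copyin` clause moves both to the device in one transfer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper type: three logical arrays carved out of ONE
   allocation with pointer arithmetic. */
typedef struct {
    double *base;  /* start of the single packed allocation */
    double *a;     /* base + 0     */
    double *b;     /* base + n     */
    double *c;     /* base + 2 * n */
    size_t  n;     /* elements per logical array */
} packed3;

/* Allocate one buffer and set up the three views; returns 0 on success. */
int packed3_alloc(packed3 *p, size_t n)
{
    assert(p != NULL && n > 0);
    p->base = malloc(3 * n * sizeof *p->base);
    if (p->base == NULL)
        return -1;
    p->a = p->base;
    p->b = p->base + n;
    p->c = p->base + 2 * n;
    p->n = n;
    return 0;
}

/* c = a + b. Because a and b are contiguous, one copyin of 2*n elements
   replaces two separate host-to-device copies. */
void packed3_add(packed3 *p)
{
    const double *in = p->base;  /* a followed by b */
    double *out = p->c;
    size_t n = p->n;

    #pragma acc parallel loop copyin(in[0:2*n]) copyout(out[0:n])
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] + in[n + i];
}
```

Compiled without an OpenACC compiler, the pragma is ignored and the loop simply runs on the CPU, which makes it easy to check correctness before measuring anything on a GPU.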

THINK MEMORY FIRST! It’s all about memory. Platforms are only getting more and more complex with deeper and deeper memory hierarchies. Plenty of research to do. Ph.D. students: are you listening?

If you want to make some real progress at a Hackathon, you are better off breaking down a large code (several thousands of LOC) into several sub-problems.

If it is legacy code that is too optimized for CPU, you are better off starting at a less mature point for a GPU implementation.

Depending on code characteristics, it may take quite an amount of refactoring to benefit from GPUs.

Sometimes you may have a mini app working OK with correct results and satisfactory performance, but you may not see a similar outcome on a real app.

Migrating legacy code is tough, time consuming, energy draining with not a lot of hope. However you ‘cannot’ afford to NOT be in the game. Architectures are changing – RAPIDLY. The applications ‘have’ to catch up. Refactoring has to be an option to be seriously considered.

Report compiler bugs. Workarounds are quick fixes and not a permanent solution.

One team with a highly complicated C++ code with regular data access patterns (originally designed for OpenMP + MPI + SIMD on Intel KNL) is now hopeful and quite determined to move their code to GPUs after going through the GPU programming experience.

Once the mentors are assigned, make sure you bring them up to speed on the algorithm, its complexity, and your expectations.

While waving good-bye, one of the teams asked, “Could we have a 2-week hackathon?” :-)! My eyes lit up! And just that made my day and that of Fernanda Foertter (my co-organizer from Oak Ridge National Lab)!!!

Programming exascale machines is a challenge, but with such training events, I think it is a pleasant challenge!!!


DAY 4, May 05 – 2016 (Scroll down for Day 1, 2, and 3 updates)


The Plateau of Enlightenment !!

More bugs and fixes. Eureka moments. Aha moments! “Hey, why didn’t I try this on Day 1?” moments! I hope these pictures give you an idea of the mood in the room 🙂

Mixed feelings and experiences about using pinned/managed memory, but the bottom line is that finding the best strategy to use memory/cache efficiently is the key to success! Other tips: expose enough parallelism to saturate the device, keep data copies between the CPU and GPU to a minimum, or at least overlap the communication with computation.
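As a rough illustration of the "keep data copies to a minimum" tip, the sketch below wraps two loops in one OpenACC data region, so the arrays cross the bus once each instead of once per loop. The function name and its parameters are hypothetical, chosen only for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y = s * x + t computed in two separate loops that
   share a single data region. */
void scale_then_shift(const double *x, double *y, size_t n, double s, double t)
{
    assert(x != NULL && y != NULL);

    /* x is copied in once, y is copied out once, at the region boundaries. */
    #pragma acc data copyin(x[0:n]) copyout(y[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            y[i] = s * x[i];

        /* y is already on the device here, so no transfer happens
           between the two loops. */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            y[i] += t;
    }
}
```

Without the enclosing data region, each `parallel loop` would typically get its own implicit copies of the arrays, which is exactly the back-and-forth traffic the tip warns about.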

You may use the ‘async’ clause on a parallel or kernels construct to launch work on a queue asynchronously, for example to execute loops asynchronously; this also helps with pipelining. However, if the operations are already saturating the device, do not expect the ‘async’ clause to help much; in other words, do not expect the operations to interleave.
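Here is one way the ‘async’ pipelining idea could look, as a minimal sketch rather than anything the teams actually wrote. The function name, the chunking scheme and the choice of two queues are all assumptions made for illustration. Each chunk's copy-in, compute and copy-out go onto the same queue, and consecutive chunks alternate queues so that a transfer for one chunk may overlap with compute for another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunked pipeline: square every element of v in place. */
void square_chunked(float *v, size_t n, size_t chunk)
{
    assert(v != NULL && chunk > 0);

    /* Allocate device space once; data movement is done per chunk below. */
    #pragma acc data create(v[0:n])
    {
        for (size_t start = 0, k = 0; start < n; start += chunk, ++k) {
            size_t len = (n - start < chunk) ? n - start : chunk;
            int q = (int)(k % 2);  /* alternate between two async queues */

            #pragma acc update device(v[start:len]) async(q)
            #pragma acc parallel loop async(q)
            for (size_t i = start; i < start + len; ++i)
                v[i] = v[i] * v[i];
            #pragma acc update self(v[start:len]) async(q)
        }
        /* Block until all queued work has finished. */
        #pragma acc wait
    }
}
```

Whether this actually interleaves depends on the device already having idle capacity, as noted above; a profiler timeline is the way to find out.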

Other optimizations that were tried and tested included tiling, nested gangs, loop fission. Some students continued to explore better ways to program on multicore using OpenACC.

‘Hero’ profiler tools – nvprof, TAU among others. Without these tools helping identify optimization opportunities, we would be nowhere! Profilers help identify causes for performance limitations; is it due to memory bandwidth? Compute Resources? Latency issues? One of the teams was digging deeper into tuning low-level parameters by leveraging nvprof output.

By popular request from the teams, an hour was dedicated to learning more about TAU. I didn’t record this talk, but you can check out a similar talk from the Extreme Scale Computing Program: https://www.youtube.com/watch?v=5ClHXzKF5Fo

And we were sugar high by the end of the day! 🙂

DAY 3, May 04 – 2016 (Scroll down for Day 1 and 2 updates)


New day, New beginning, New ideas and strategies = Slope of Hope !!!


Some codes exposed interesting bugs and corner cases that have been reported and filed. These usually make the best case studies for improving compiler implementations.



Fine-tuned, manageable kernels with reduced LOC are now ported to GPUs using OpenACC. They are performing better than on CPUs, and the team is looking to further improve the performance by exploring the gang, worker and vector levels of parallelism. Another team is investigating how to overlap communication with computation.


Codes with deeper and deeper nested templates could, to an extent, be tackled with the motto ‘comment and conquer’. The team is now moving on and considering a smaller C++ code. One of the other teams is exploring OpenACC’s interoperability with OpenMP and even considering using 2 GPUs. Larger datasets matter here!

If you don’t already know, OpenACC codes can run on multicore CPUs (Note: use PGI 15.10 or later if you want to run your OpenACC code on a multicore platform). Do not miss the article by PGI’s Michael Wolfe, “OpenACC for Multicore CPUs”.

A team with a chemical and biomolecular engineering background that had never used CUDA or programmed on GPUs has now profiled their code to find the ‘hot spots’, ported the code to the TITAN supercomputer, and is already seeing some speedup! Isn’t that fascinating?!

