Comments on: Julia Still Not Grown Up Enough to Ride Exascale Train
https://www.nextplatform.com/2023/09/26/julia-still-not-grown-up-enough-to-ride-exascale-train/
In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.

By: Slim Albert (Sun, 01 Oct 2023 02:02:10 +0000) In reply to Hubert.

Interesting: Numenta (e.g., S. Ahmad, B.S. Cornell, M.S. & Ph.D. UIUC) champions Spatial Pooling (SP) within a Hierarchical Temporal Memory (HTM) perspective. I wonder how much of it could be implemented in memory-controller hardware (maybe with memory-hint instructions from the CPU/code), or whether it works best as a static, pre-compiled optimization.
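For readers who haven't run into HTM, the spatial-pooling step itself is small: score each column's overlap with the active input bits, then keep only the top few percent of columns (k-winners-take-all). A minimal Julia sketch with hypothetical sizes (no boosting, learning, or topology):

    # Minimal spatial-pooling sketch: hypothetical sizes, no learning or boosting.
    const N_INPUT   = 256        # width of the input bit vector (SDR)
    const N_COLUMNS = 128        # number of pooling columns
    const SPARSITY  = 0.02       # fraction of columns allowed to become active

    # Random binary connectivity: which input bits each column can see.
    synapses = rand(N_COLUMNS, N_INPUT) .< 0.5

    # Overlap score: count of active input bits a column is connected to.
    overlap(input) = [count(@view(synapses[c, :]) .& input) for c in 1:N_COLUMNS]

    # k-winners-take-all: only the top-k columns by overlap stay active.
    function spatial_pool(input)
        k = max(1, round(Int, SPARSITY * N_COLUMNS))
        winners = partialsortperm(overlap(input), 1:k; rev=true)
        active = falses(N_COLUMNS)
        active[winners] .= true
        return active
    end

    input = falses(N_INPUT); input[rand(1:N_INPUT, 20)] .= true
    println(count(spatial_pool(input)), " active columns")

The overlap-plus-top-k structure is what makes the hardware question interesting: the scoring is embarrassingly parallel, but the winner selection is a global reduction.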

By: Hubert (Fri, 29 Sep 2023 14:03:33 +0000) In reply to J. J. Ramsey.

Sounds about right (and may connect back to Eric's point in a way). Taking memory access patterns (as specified in the algorithm being compiled or JIT-ted) into account when lowering (adapting, splitting) the computational graph to the hardware is likely a key factor in eventual performance, in the face of the memory wall.

The “Numenta” outfit apparently makes even CPUs run AI/ML/HPC faster by focusing on that.
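As a down-to-earth CPU-side illustration of why access patterns matter: Julia stores arrays column-major, so ordering loops so that the first index varies fastest keeps accesses contiguous. A rough sketch (the timing macro from BenchmarkTools is assumed to be available):

    using BenchmarkTools   # assumed available, just for the timing macros

    A = rand(4096, 4096)

    # Row-major-style traversal: strided accesses, cache-unfriendly in Julia.
    function sum_rowwise(A)
        s = 0.0
        for i in axes(A, 1), j in axes(A, 2)   # j (column) varies fastest
            s += A[i, j]
        end
        return s
    end

    # Column-major traversal: contiguous accesses, matches Julia's memory layout.
    function sum_colwise(A)
        s = 0.0
        for j in axes(A, 2), i in axes(A, 1)   # i (row) varies fastest
            s += A[i, j]
        end
        return s
    end

    @btime sum_rowwise($A)   # typically several times slower...
    @btime sum_colwise($A)   # ...than the layout-aware loop order

Same arithmetic, same data, very different memory traffic; that is the kind of information a lowering pass can exploit.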

By: J. J. Ramsey (Thu, 28 Sep 2023 17:27:50 +0000)

One thing I noticed is that while the paper showed an AMDGPU kernel written in Julia, it didn’t, as far as I could tell, say much about how data was copied from the CPU to the GPU and back. For all I know, the performance lag could have come from, say, an automatic behind-the-scenes transfer that’s easy to code but not performant, similar to how Nvidia’s Unified Memory works without prefetching.
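For what it's worth, keeping transfers explicit is straightforward with AMDGPU.jl's array type. A rough sketch, assuming current AMDGPU.jl conventions (the broadcast stands in for the real kernel):

    using AMDGPU   # AMD GPU stack for Julia

    a_host = rand(Float32, 1024, 1024)

    # Explicit host-to-device copy: the cost is visible in the code, rather
    # than hidden behind a managed/unified-memory page migration.
    a_dev = ROCArray(a_host)

    b_dev = 2f0 .* a_dev .+ 1f0     # stand-in for the real kernel / stencil update

    AMDGPU.synchronize()            # make sure device work has finished

    # Explicit device-to-host copy of the result.
    b_host = Array(b_dev)

If the paper's harness instead relied on implicit migration, the transfer cost would be smeared into the kernel timings, which is exactly the ambiguity you describe.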

By: Slim Albert (Wed, 27 Sep 2023 19:33:05 +0000) In reply to Eric Olson.

I totally agree! Julia is great connective tissue for HPC and AI, and AMD has the hardware-specific expertise to help optimize its performance. I don’t know how many PhDs they have working on software infrastructure (NVIDIA has approx 300 based on Tim’s estimate), but dispatching a few to collaborate on Julia’s JIT & HAL should be well worth it in my opinion!

By: Eric Olson (Wed, 27 Sep 2023 06:59:42 +0000)

In my opinion, getting a GPU to work at all using anything but the vendor-supplied compiler is pretty amazing. Julia does this with efficient MPI communication while also supporting an interactive read-evaluate-print loop. Upon looking at the code in the paper, I'm wondering about the 6-condition if statement in kernel_amdgpu that checks whether a point is on the domain boundary. My understanding is that such conditionals make little difference for CPU calculations because of speculative execution, but they can significantly slow down GPU processing (divergent branches within a wavefront get serialized).

While it's admittedly unlikely that my 5-minute code review found anything significant, I can't help wondering whether removing those conditionals might recover some of the performance gap between the HIP version (which uses the vendor-optimized solver) and Julia.
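To make that concern concrete, here is a rough hypothetical sketch (not the paper's kernel_amdgpu; the names and the 5-point update are my own stand-ins) contrasting a per-point boundary test with an interior-only kernel over halo-padded arrays, where just a trailing bounds guard remains. The launch configuration is omitted since its exact form depends on the AMDGPU.jl version:

    using AMDGPU

    # Branchy variant: every work item tests where it sits relative to the
    # boundary, which can cause divergence within a wavefront.
    function step_branchy!(unew, u, nx, ny)
        i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
        j = (workgroupIdx().y - 1) * workgroupDim().y + workitemIdx().y
        if i > 1 && i < nx && j > 1 && j < ny
            @inbounds unew[i, j] = u[i, j] +
                0.1f0 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4f0 * u[i, j])
        end
        return nothing
    end

    # Halo variant: arrays carry a one-cell ghost layer, the kernel is launched
    # only over the nxi-by-nyi interior, and only a trailing bounds guard remains.
    function step_interior!(unew, u, nxi, nyi)
        i = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x + 1
        j = (workgroupIdx().y - 1) * workgroupDim().y + workitemIdx().y + 1
        if i <= nxi + 1 && j <= nyi + 1
            @inbounds unew[i, j] = u[i, j] +
                0.1f0 * (u[i-1, j] + u[i+1, j] + u[i, j-1] + u[i, j+1] - 4f0 * u[i, j])
        end
        return nothing
    end

Whether the branch actually costs anything here would have to be checked with a profiler, of course; this is only the shape of the experiment.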

By: Slim Albert (Wed, 27 Sep 2023 00:56:19 +0000)

Great article! It adds to TNP's 01/05/22 piece "Julia Across HPC", where Julia was tested on CPUs and GPUs, and for which the journal article is now open-access at bris.ac.uk. Today's brand-new paper, about Julia on Frontier, will be presented at SC23 in Denver (bottom-left of the 1st page), which TNP Experts will once again definitely need to attend, to bring us all the latest and greatest on HPC recipes, cookware, and master-chef competitions!

Julia is a great concept for HPC gastronomy, in my opinion. It has a decently sized user group, good performance, and hardware agnosticism. The competition is heating up in this space of effective high-level HPC/AI languages, especially with Mojo's snaky Python orientation, and it should be interesting to see if the type-inference convergence algorithm 2.0 (or newer) in Julia's JIT can match or beat explicit static typing in Mojo (where it led to a 400x speedup relative to a very naive Python matmul).
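To make the inference question concrete, a tiny sketch (function names are mine): when the compiler can converge on concrete types, the JIT emits a loop as tight as statically typed code, and @code_warntype shows whether inference succeeded.

    using InteractiveUtils   # provides @code_warntype outside the REPL

    # Type-unstable: `acc` starts as an Int and becomes a Float64 on the first
    # addition, so inference can only narrow it to a small Union.
    function matsum_unstable(A)
        acc = 0
        for x in A
            acc += x
        end
        return acc
    end

    # Type-stable: the accumulator's type follows the element type of A, so the
    # JIT emits the kind of tight loop a statically typed language would.
    function matsum_stable(A::AbstractMatrix{T}) where {T}
        acc = zero(T)
        for x in A
            acc += x
        end
        return acc
    end

    A = rand(1000, 1000)
    @code_warntype matsum_unstable(A)   # flags acc::Union{Int64, Float64}
    @code_warntype matsum_stable(A)     # everything inferred as Float64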

It is notable that in this Julia-on-Frontier paper, ORNL uses solutions of the Gray-Scott coupled pair of nonlinear reaction-diffusion PDEs to benchmark the system. These PDEs produce Turing patterns like those seen on the faceplate of the Atos BullSequana XH3000 (2-D version; TNP 02/16/22). But they use explicit time-stepping ("simple forward […] difference", Section 3.1), where Crank-Nicolson semi-implicit time-stepping with Picard fixed-point iteration may be preferable for accuracy, along with ADI, which does wonders for the spatial discretization (many very fast Thomas-algorithm tridiagonal solves, in parallel).
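For reference, the explicit scheme described in the paper boils down to something like the following serial Julia sketch (my own stand-in with illustrative parameter values, not the paper's code): a 5-point Laplacian plus a forward-Euler update of the two fields.

    # Gray-Scott reaction-diffusion, explicit forward-Euler step (serial sketch):
    #   du/dt = Du*lap(u) - u*v^2 + F*(1 - u)
    #   dv/dt = Dv*lap(v) + u*v^2 - (F + k)*v
    # Parameter values are illustrative, not the paper's.
    const Du, Dv, F, k = 0.2, 0.1, 0.04, 0.06

    # 5-point Laplacian with periodic wrap-around (unit grid spacing).
    function laplacian(A, i, j)
        n, m = size(A)
        up, down    = mod1(i - 1, n), mod1(i + 1, n)
        left, right = mod1(j - 1, m), mod1(j + 1, m)
        return A[up, j] + A[down, j] + A[i, left] + A[i, right] - 4A[i, j]
    end

    function gray_scott_step!(unew, vnew, u, v, dt)
        for j in axes(u, 2), i in axes(u, 1)   # column-major friendly loop order
            uv2 = u[i, j] * v[i, j]^2
            unew[i, j] = u[i, j] + dt * (Du * laplacian(u, i, j) - uv2 + F * (1 - u[i, j]))
            vnew[i, j] = v[i, j] + dt * (Dv * laplacian(v, i, j) + uv2 - (F + k) * v[i, j])
        end
        return unew, vnew
    end

The appeal of Crank-Nicolson plus ADI is precisely that it relaxes the dt-proportional-to-h^2 stability restriction that comes with this kind of explicit diffusion update, at the cost of tridiagonal solves.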

AMD could probably do much worse than to help improve AMDGPU.jl so that it better matches HIP (a potential 2x speedup, if I read correctly). The ease with which the two-PDE Gray-Scott solution could be implemented in Julia (vs. a single PDE in HIP) should be justification enough!
