Comments on: Building The Perfect Memory Bandwidth Beast
https://www.nextplatform.com/2023/01/24/building-the-perfect-memory-bandwidth-beast/
In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.

By: HuMo (Mon, 30 Jan 2023 00:23:32 +0000)
https://www.nextplatform.com/2023/01/24/building-the-perfect-memory-bandwidth-beast/#comment-204172
In reply to HuMo.

You can also see this graphically in tests of Xeons, EPYCs, and Ampere Altras (e.g., Google: Andrei Frumusanu ice lake sp memory latency), where apparent latency from DDR through the cache hierarchy drops from 100ns for full-random reads down to just 5ns for a linear chain (sequential reads). Apply Little’s Law to the 64B cache line with that 5ns figure and you get to 200GB/s.
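
(A back-of-the-envelope Python check of that claim; the 16-stream count below is my assumption for how much concurrency the prefetchers sustain, not a measured figure.)

    # Little's Law: bandwidth = bytes_in_flight / latency (B/ns == GB/s).
    LINE = 64  # cache-line size in bytes

    def bw_gbs(lines_in_flight, latency_ns):
        """Sustained read bandwidth in GB/s for a given concurrency and latency."""
        return lines_in_flight * LINE / latency_ns

    print(bw_gbs(1, 100.0))      # full-random dependent chain: ~0.64 GB/s
    print(bw_gbs(1, 5.0))        # linear chain at 5 ns apparent: 12.8 GB/s per stream
    print(16 * bw_gbs(1, 5.0))   # ~16 concurrent streams (assumed): ~204.8 GB/s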

By: HuMo (Sat, 28 Jan 2023 23:59:21 +0000)
https://www.nextplatform.com/2023/01/24/building-the-perfect-memory-bandwidth-beast/#comment-204080

Oh yeah! Feeding a 512-bit (64-byte) vector engine (say AVX-512) at 3.1GHz needs 200GB/s; super if the machine can provide it! With DDR5, if you can tame the RAS-CAS-mataz latency with 2×8 consecutive BL16 burst reads (512 bytes from each of the chip’s 2 on-die channels), so that the cache line is psychologically 1kB to Little’s Law (yet physically still 64B), then “Bang!” your core can ingest those 200GB/s provided by the attached mems (with Lat.=100ns and MSHR/LFB=20). The data has to be in 2 sets of sequential addresses though (which should be OK for part of a matrix row, and a column).
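
(Checking that arithmetic with Little’s Law; a minimal sketch with illustrative variable names, not anyone’s actual code.)

    # 20 misses in flight (MSHR/LFB), 100 ns latency, and an effective 1 KiB
    # transfer per miss: 2 subchannels x 8 consecutive BL16 bursts x 64 B each.
    MSHR = 20
    LAT_NS = 100.0
    EFF_LINE = 2 * 8 * 64              # 1024 B "psychological" line size

    print(MSHR * EFF_LINE / LAT_NS)    # supply: 204.8 GB/s, i.e. ~200 GB/s
    print(64 * 3.1)                    # demand: one 64 B AVX-512 load per cycle
                                       # at 3.1 GHz = 198.4 GB/s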

By: reyhan (Sat, 28 Jan 2023 10:46:05 +0000)
https://www.nextplatform.com/2023/01/24/building-the-perfect-memory-bandwidth-beast/#comment-204022

stacked memory bandwidth and that they do not run at anywhere near peak

By: Hubert (Fri, 27 Jan 2023 12:01:36 +0000)
https://www.nextplatform.com/2023/01/24/building-the-perfect-memory-bandwidth-beast/#comment-203933

The article and John’s comment are quite interesting (IMHO). Little’s Law deals with how to keep the memory system busy; one may also want to think about how to keep the cores busy. Suppose each core supports O outstanding cache misses, the cache-line size is S, the memory system has n channels each of bandwidth B, and the access latency from DDR through the cache hierarchy is L (typically 100ns). Then the number N of cores (or maybe threads) needed to keep memory busy is given by:

N = n·B·L / (O·S)

The system may then reach a throughput T that matches the aggregate memory bandwidth, T = n·B (if the cores are fast enough). If each core runs at clock frequency C and processes 64-bit data (8 bytes), then the number P of ops that the data-processing algorithm should perform per input element to keep all cores busy is:

P = 8·N·C / (n·B)

and with N=24 cores running at C=2GHz and aggregate bandwidth n·B=307GB/s, one gets a rather balanced P=1.25. But increase the number of cores or the clock rate, or decrease the memory bandwidth, and a more complex algorithm (higher P) is needed to keep the cores busy. (Disclaimer: standard misinterpretations and miscalculations apply.)
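
(A quick Python check of both formulas with the quoted numbers; a minimal sketch of the arithmetic, with illustrative variable names.)

    # N = n*B*L / (O*S): cores needed to keep the memory system busy.
    # P = 8*N*C / (n*B): ops per 8-byte input element to keep N cores busy.
    nB = 307.2e9   # aggregate memory bandwidth n*B, in bytes/s
    L  = 100e-9    # DDR-to-core load latency, in seconds
    O  = 20        # outstanding cache misses per core
    S  = 64        # cache-line size, in bytes
    C  = 2.0e9     # core clock, in ops/s

    N = nB * L / (O * S)
    P = 8 * N * C / nB
    print(N, P)    # 24.0 cores, P = 1.25 ops per element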

By: Timothy Prickett Morgan (Fri, 27 Jan 2023 01:54:24 +0000)
https://www.nextplatform.com/2023/01/24/building-the-perfect-memory-bandwidth-beast/#comment-203913

In reply to John D. McCalpin, Ph.D.

I understand why we have GPUs. But not all code can be moved easily to GPUs, and many datasets explode beyond the very limited memory capacity of HBM. Hence my thought experiment. I see what you are saying, though, about the cache hierarchy. All I know is that by adding HBM, Intel is able to boost real-world HPC app performance on certain codes by 2X, which suggests to me that bandwidth is indeed a bottleneck. If I can get 2X or 4X or 8X by adding NUMA to Sapphire Rapids HBM, not have to change codes, and still have a very large memory space, I think it might be a fair trade once the fully burdened cost of altering applications is added in.

By: John D. McCalpin, Ph.D. (Thu, 26 Jan 2023 16:43:49 +0000)
https://www.nextplatform.com/2023/01/24/building-the-perfect-memory-bandwidth-beast/#comment-203885

Many vendors have offered “high-bandwidth” server configuration options many times during the last 20+ years. They have not sold well, typically because the cost savings of using a processor with fewer active cores is not large enough relative to the total system cost: customers realized that it was better to buy the extra cores of the more “mainstream” configurations even if those cores would only be useful for a portion of the workload.

More recently, a new (and more technologically challenging) problem has dominated: cores don’t support enough memory concurrency to sustain higher bandwidth.

Little’s Law says that at a load latency of 100 ns, a core that supports 20 outstanding cache misses (64B each) can sustain a read rate of 1280 bytes / 100 ns = 12.8 GB/s.
For a “mainstream” DDR5/4800 server chip with 8 DDR5 channels (307.2 GB/s), it takes at least 24 cores to generate enough cache misses to fill the memory pipeline. This limitation applies (with different specific numbers) to any processor designed to work with a flat memory model and transparent caching.
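
(The same arithmetic in two lines of Python, for plugging in other chips’ numbers; a sketch, not part of the original comment.)

    # Per-core sustainable read bandwidth (Little's Law), then cores needed.
    per_core = 20 * 64 / 100e-9     # 20 misses x 64 B / 100 ns = 12.8 GB/s
    print(307.2e9 / per_core)       # 24.0 cores to fill 8 channels of DDR5-4800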

GPUs get around the problem by having a whole lot more small cores.

The NEC Vector Engine processors take a different approach — memory concurrency is generated by vector operations that cover many cache lines. For long vector operations, a single core in an 8-core NEC VE20B processor can sustain 300-400 GB/s of the ~1500 GB/s peak bandwidth available from the 6 HBM modules. This requires that 30-40KiB of memory transactions be in flight at all times. With 64-Byte cache lines this would require more than 650 outstanding cache misses — not something that is easy to imagine from a traditional cache-based architecture.
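
(Running that arithmetic the other way round, as a hedged sketch: concurrency needed = bandwidth × latency. The 100 ns latency is the figure used above; exact line counts shift with whatever latency the VE’s HBM actually sees.)

    # Bytes (and 64 B lines) that must be in flight to sustain a target bandwidth.
    for bw in (300e9, 400e9):                    # single-core sustained range, B/s
        in_flight = bw * 100e-9                  # bytes in flight at 100 ns latency
        print(in_flight / 1024, in_flight / 64)  # ~29-39 KiB, ~470-625 lines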
