Comments on: How China Made An Exascale Supercomputer Out Of Old 14 Nanometer Tech
https://www.nextplatform.com/2022/03/11/pondering-the-cpu-inside-chinas-sunway-oceanlight-supercomputer/
In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds.
Wed, 29 Nov 2023 20:38:45 +0000

By: Timothy Prickett Morgan https://www.nextplatform.com/2022/03/11/pondering-the-cpu-inside-chinas-sunway-oceanlight-supercomputer/#comment-216933 Wed, 29 Nov 2023 20:38:45 +0000 https://www.nextplatform.com/?p=140200#comment-216933 In reply to emerth.

Yes. Thanks for the catch.

By: emerth https://www.nextplatform.com/2022/03/11/pondering-the-cpu-inside-chinas-sunway-oceanlight-supercomputer/#comment-216930 Wed, 29 Nov 2023 19:26:55 +0000 https://www.nextplatform.com/?p=140200#comment-216930 In reply to Freddie.

“The SW26010-Pro is rated at 14.03 petaflops at either FP64 or FP32 precision and 55.3 petaflops at BF16 or FP16 precision.”

Did you mean teraflops?

By: Eric Olson https://www.nextplatform.com/2022/03/11/pondering-the-cpu-inside-chinas-sunway-oceanlight-supercomputer/#comment-182897 Tue, 15 Mar 2022 06:15:06 +0000 https://www.nextplatform.com/?p=140200#comment-182897

How similar is the Shenwei SW26010-Pro to a DEC Alpha with vector extensions?

Is there an assembler reference guide (in any language) available?

I wonder if the processor will ever be available outside China or if that is impossible due to unlicensed use of patents and other intellectual property. In either case, it would be interesting if QEMU could be made to emulate it.

By: Freddie https://www.nextplatform.com/2022/03/11/pondering-the-cpu-inside-chinas-sunway-oceanlight-supercomputer/#comment-182616 Sat, 12 Mar 2022 22:13:58 +0000 https://www.nextplatform.com/?p=140200#comment-182616

If I understand correctly, each CPU has ~14 TFLOP/s of FP64 and 307.2 GB/s of memory bandwidth. An NVIDIA V100, by comparison, has ~8 TFLOP/s of FP64 and 900 GB/s of memory bandwidth. So that is twice the FLOP/s and one-third the bandwidth, for an overall byte-to-FLOP ratio that is ~1/6th that of the GPU. Compared with the Fujitsu A64FX (~3.1 TFLOP/s and 1 TB/s of bandwidth), the ratio is ~1/15. Given that many HPC codes are memory bandwidth bound (ML applications excluded), things are no longer as impressive.
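
The arithmetic behind this comparison can be sketched in a few lines of Python. This is only an illustration using the rounded figures quoted in the comment, not verified vendor specifications; the helper name `bytes_per_flop` and the `chips` table are made up for the example. Dividing the numbers exactly puts the first ratio closer to 1/5 than 1/6, while the A64FX ratio stays near 1/15.

```python
# Peak FP64 rate (TFLOP/s) and memory bandwidth (GB/s), taken as quoted
# in the comment above; treat them as rough assumptions, not datasheet values.
chips = {
    "SW26010-Pro": (14.0, 307.2),
    "NVIDIA V100": (8.0, 900.0),
    "Fujitsu A64FX": (3.1, 1000.0),
}


def bytes_per_flop(tflops, gb_per_s):
    """Memory bandwidth available per peak FP64 operation (hypothetical helper)."""
    return (gb_per_s * 1e9) / (tflops * 1e12)


balance = {name: bytes_per_flop(*spec) for name, spec in chips.items()}
sunway = balance["SW26010-Pro"]

for name, value in balance.items():
    # The last column shows the Sunway balance as a fraction of each chip's balance.
    print(f"{name:14s} {value:.4f} bytes/FLOP  (Sunway / this chip = {sunway / value:.3f})")
```

A lower bytes-per-FLOP figure means each floating point unit has less memory bandwidth to feed it, which is why bandwidth-bound codes would see a smaller share of the peak rate.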

Bandwidth and usability may not win you first place on the TOP500, but they will help when it actually comes to doing science. This is something the US arguably learned after Roadrunner and, to a far lesser extent, Titan. Intel and NVIDIA have both been very good here recently: Intel by improving L2 cache sizes and introducing HBM on upcoming Xeons, and NVIDIA since Volta, where it dramatically improved the cache hierarchy and microarchitecture.
