Comments on: Intel Pushes Out “Clearwater Forest” Xeon 7, Sidelines “Falcon Shores” Accelerator https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/ In-depth coverage of high-end computing at large enterprises, supercomputing centers, hyperscale data centers, and public clouds. Tue, 11 Feb 2025 21:25:59 +0000 hourly 1 https://wordpress.org/?v=6.7.1 By: Michael Bruzzone https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247962 Thu, 06 Feb 2025 19:43:39 +0000 https://www.nextplatform.com/?p=145249#comment-247962 In reply to Mike Harris.

Mike Harris,

And thank you for this observation: “If customers are satisfied with the DRAM bandwidth per core of Emerald Rapids, they could use the 96 core version of Granite Rapids-AP with DDR5-6400, but that doesn’t appear to be happening. End-users and OEMs may be waiting until they can get MR-DIMMs to buy Granite Rapids. The only other explanation I can imagine is that there isn’t enough Intel 3 capacity for both customers who buy directly from Intel and for customers who buy from OEMs. There are currently no Granite Rapids servers available from Dell, HPE, Lenovo, Cisco or Supermicro.” mb

]]>
By: Michael Bruzzone https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247961 Thu, 06 Feb 2025 19:32:16 +0000 https://www.nextplatform.com/?p=145249#comment-247961 In reply to Mike Harris.

Mike Harris,

Xeon Sierra Forest and Granite Rapids at 369,652 units produced in 2024, basically to date, is based on the known channel supply volume of Emerald Rapids in its first two quarters of supply. My SF + GR estimate, based on the ER two-quarter rollout, is averaged within the percentage of known Emerald and Sapphire Rapids supply volume in 2024. It is an SF + GR supply projection based on Emerald Rapids’ first two quarters of volume recorded from open-market (eBay) sales offers.

Otherwise, there is no Granite Rapids in the channel, and Sierra Forest actual is 0.0057%. I record unit volume on Intel net take into division revenue; on gross take into revenue, the projection decreases to 189,652 units. On the higher estimate, at 18 CPUs per rack, that equals 20,553 racks. My take is that all SF + GR, no matter the volume, are Intel samples subject to forward contract agreements for Intel product meeting customer specifications.

Alternatively, are there none or few Sierra Forest and Granite Rapids because of a bad yield day?

‘Channel change in supply’ is just that: open-market product availability used as a proxy for supply volume, cumulated quarter to quarter. The difference from quarter to quarter reveals the gain or decrease in available supply. This can be relied on to determine a production metric by backing out the change in quantity from the total quantity and cumulating those quantities, which are basic production-microeconomics assessment methods.
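In code, the method described above amounts to a simple running-difference calculation. This is only an illustrative sketch with made-up quarterly numbers (the real figures come from eBay listing counts, which are not reproduced here):

```python
# Illustrative sketch (hypothetical numbers) of the "channel change in supply"
# method: open-market listings observed each quarter proxy for available
# supply, and quarter-to-quarter differences reveal the change in supply,
# which cumulates into a production estimate.

# Hypothetical open-market quantities observed at the end of each quarter.
observed = [1200, 1850, 1700, 2400]  # Q1..Q4 listings (proxy for supply)

# Quarter-over-quarter change in available supply (first quarter counts whole).
deltas = [observed[0]] + [b - a for a, b in zip(observed, observed[1:])]

# Cumulating the positive changes approximates units added to the channel;
# a decrease means supply was absorbed, not produced.
produced_estimate = sum(d for d in deltas if d > 0)

print(deltas)             # [1200, 650, -150, 700]
print(produced_estimate)  # 2550
```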

mb

]]>
By: Mike Harris https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247922 Thu, 06 Feb 2025 13:14:30 +0000 https://www.nextplatform.com/?p=145249#comment-247922 In reply to itellu3times.

I don’t think there needs to be a systolic processor inside the HBM. The arithmetic for an LLM is just low-precision multiplies (such as 8-bit) that are accumulated. One input of the multiply is a neural net weight read from DRAM, and the other is the output of the previous stage of neurons from a register bank. The reason for doing arithmetic inside the HBM stack is that it takes far more energy (>10x) to move neural net weights from DRAM to CPU cores than to do the arithmetic. The arithmetic could be done on the DRAM die (which is what Samsung appears to have done), on a logic die inside the HBM stack, or directly under the HBM stack, which would be the Intel 3 base die on Diamond Rapids. The goal is to move the neural net weights the shortest distance possible because that reduces the energy consumed. As an example, if eight 8-bit multiplies are accumulated into 24 bits inside the HBM stack, then instead of transferring 8 x 8 = 64 bits from HBM, only 24 bits would need to be transferred. This would provide more than a 2x improvement in HBM bandwidth and energy consumed. I have no evidence for this, but my guess is that NVIDIA will design a custom HBM for their exclusive use.
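The accumulate-inside-the-stack arithmetic above can be sketched in a few lines. This is an illustrative model only (the weight and activation values are made up), showing why a 24-bit signed accumulator comfortably holds eight 8-bit products and why shipping it beats shipping the raw weights:

```python
# Sketch of the in-HBM multiply-accumulate idea: eight 8-bit weight x
# activation products are summed into a single 24-bit accumulator inside
# the memory stack, so only 24 bits cross to the CPU instead of the
# 8 x 8 = 64 bits of raw weights.

weights = [12, -7, 33, 5, -128, 90, 2, -64]  # 8-bit signed weights in DRAM
activations = [3, 14, -2, 9, 1, -5, 7, 11]   # 8-bit outputs of prior layer

# Multiply-accumulate performed "near memory".
acc = sum(w * a for w, a in zip(weights, activations))

# Worst-case magnitude: 8 * 128 * 128 = 131072 = 2^17, so a 24-bit signed
# accumulator holds the sum with plenty of headroom.
assert -(1 << 23) <= acc < (1 << 23)

bits_without_pim = 8 * 8  # raw weights shipped to the CPU
bits_with_pim = 24        # one accumulator shipped instead
print(bits_without_pim / bits_with_pim)  # ~2.67x reduction in bits moved
```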

]]>
By: Mike Harris https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247912 Thu, 06 Feb 2025 12:03:24 +0000 https://www.nextplatform.com/?p=145249#comment-247912 In reply to Michael Bruzzone.

Mike Bruzzone wrote: “DCG for 2024 by product on channel change in quarterly supply;

Sierra and Granite Rapids to date = 369,652 units”

What does “product on channel change in quarterly supply” mean and where does this number come from?

Sierra Forest was launched on 6/3/24 and Granite Rapids was launched on 9/24/24. I wonder if the limited shipment of Granite Rapids so far is due to limited supplies of MR-DIMMs. Granite Rapids-AP has up to 2x more cores than Emerald Rapids but only 1.7x more DRAM bandwidth when used with DDR5-6400. If customers are satisfied with the DRAM bandwidth per core of Emerald Rapids, they could use the 96 core version of Granite Rapids-AP with DDR5-6400, but that doesn’t appear to be happening. End-users and OEMs may be waiting until they can get MR-DIMMs to buy Granite Rapids. The only other explanation I can imagine is that there isn’t enough Intel 3 capacity for both customers who buy directly from Intel and for customers who buy from OEMs. There are currently no Granite Rapids servers available from Dell, HPE, Lenovo, Cisco or Supermicro.

]]>
By: itellu3times https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247755 Wed, 05 Feb 2025 09:16:23 +0000 https://www.nextplatform.com/?p=145249#comment-247755 In reply to Mike Harris.

Move some math/logic into the HBM and it’s not really M anymore; it’s some kind of systolic thingy, or maybe we can call it a step toward a “neuromorphic” thingy. Whether any of that should be built on DRAM is yet another question. And if it’s not M, then the HB requirement maybe doesn’t matter so much either, and it can be put back outside.

Just exactly how you build such a systolic thingy has stumped the field for fifty years (or longer) but it *seems* like a good idea.

]]>
By: Slim Albert https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247648 Tue, 04 Feb 2025 13:13:09 +0000 https://www.nextplatform.com/?p=145249#comment-247648 In reply to Timothy Prickett Morgan.

Interesting link from Mike (Harris) indeed! Doing Processing-In-Memory in HBM (HBM-PIM) on AMD MI100s, Samsung saw a doubling of performance and lower power consumption (in 2023). It reminds me a bit of the researchers’ hardware-related recommendations from TNP’s DeepSeek deep dive last Monday (Jan. 27), where they did matrix multiplies in FP8 but processed master weights and gradients in FP32. Coupling those with a logarithmic perspective may suggest doing matrix multiplies in PIM as sums of biased exponents (8 bits in FP32), with an XOR of the sign bit, followed possibly by a max() to simplify accumulation. It could save valuable silicon area, and energy, if it works in both training and inference!
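The exponent-sum idea above can be sketched quickly: for two floats, the product’s sign is the XOR of the signs and its exponent is roughly the sum of the exponents, so a PIM unit could approximate a multiply with a small integer add. This is only a toy model (mantissas are discarded, so the result is a power-of-two estimate within about 4x of the true product):

```python
# Toy sketch of a "logarithmic" multiply: sign via XOR, magnitude via
# exponent addition only. Illustrative, not a real PIM datapath.
import math

def approx_log_multiply(a: float, b: float) -> float:
    """Approximate a*b using only a sign XOR and exponent addition.
    Mantissas are ignored, so the result is a power-of-two estimate."""
    sign = -1.0 if (a < 0) ^ (b < 0) else 1.0
    ea = math.frexp(abs(a))[1]  # exponent of a (mantissa in [0.5, 1))
    eb = math.frexp(abs(b))[1]
    # -2 compensates for frexp's [0.5, 1) mantissas; estimate is within 4x.
    return sign * 2.0 ** (ea + eb - 2)

exact = 3.0 * -5.0
approx = approx_log_multiply(3.0, -5.0)
print(exact, approx)  # -15.0 -8.0: right sign, power-of-two magnitude
```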

]]>
By: Michael Bruzzone https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247559 Mon, 03 Feb 2025 20:03:03 +0000 https://www.nextplatform.com/?p=145249#comment-247559 The only Intel 3 product in the channel is the 64-core 6710E, in sample volume. There is no Granite Rapids in the channel. Emerald Rapids on Intel 7 shows as highly manufacturable on supply data, but in slim volume. Entering January, Emerald’s available supply is 158% more than Turin’s but only 18.7% of Bergamo + Genoa’s.

Intel is selling Xeon on the cheap, well under AMD’s price floor, with room to raise price and still remain under AMD’s Epyc price floor. One question is why that has not happened. In 2024, Emerald’s average gross ranged from $884 to $1,312. Sapphire Rapids gained availability in Q4, up 68% quarter over quarter, at a gross of $1,571.

Intel shipped 13,745,725 DCG Xeon units and 18,598,456 NEX units in 2024, on a net basis covering all costs.

My full DCG + NEX report is in the comment string here:

https://seekingalpha.com/article/4753856-intel-q4-theres-no-good-news-here-sell-now

Mike Bruzzone, Camp Marketing

]]>
By: Timothy Prickett Morgan https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247516 Mon, 03 Feb 2025 13:44:43 +0000 https://www.nextplatform.com/?p=145249#comment-247516 Agreed on all fronts. Especially when it comes to getting AI on CPUs and LLM math closer to the HBM that is on the CPUs.

]]>
By: Mike Harris https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247514 Mon, 03 Feb 2025 13:29:01 +0000 https://www.nextplatform.com/?p=145249#comment-247514 I can think of four possible explanations for the delay in Clearwater Forest. Intel’s Xeon Diamond Rapids uses an LGA 9324 package to support 16 channels of MR-DIMMs. If Clearwater Forest uses the same package (which seems likely), the delay could be due to difficulty getting 16 channels of MR-DIMMs to run at full speed (12.8 GTransfers/sec?). That is understandably a difficult engineering challenge, but one that is required to support the high core counts. Sixteen channels of MR-DIMMs running at 12.8 GTransfers/sec would have roughly double the DRAM bandwidth of Granite Rapids-AP. A second possibility is a problem with the Foveros 3D used to connect the Intel 3 base die to the Intel 18A compute tiles. A third possibility is that Intel is capacity constrained on Intel 18A due to reduced fab spending; if the client Panther Lake processor ships in the second half of this year on Intel 18A, that would suggest 18A is healthy. A fourth possibility is that Clearwater Forest is mainly for hyperscalers, and hyperscalers have no money to buy it because they are spending all of their money on NVIDIA GPUs. Amazon, Microsoft, Google and Apple have designed their own server CPUs, so they are less interested in Clearwater Forest. Oracle has invested $1B in Ampere Computing.
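The “roughly double” bandwidth comparison above is easy to check with back-of-envelope arithmetic, assuming standard 64-bit (8-byte) DDR5 channels. Note the 16-channel 12.8 GT/s configuration is the commenter’s speculation for Clearwater Forest / Diamond Rapids; the 12-channel MRDIMM-8800 figure is the stated Granite Rapids-AP configuration:

```python
# Back-of-envelope peak-bandwidth comparison, assuming 8-byte channels.
def bandwidth_gbs(channels: int, gtps: float, bytes_per_transfer: int = 8) -> float:
    """Peak bandwidth in GB/s: channels x transfer rate x bytes per transfer."""
    return channels * gtps * bytes_per_transfer

cwf = bandwidth_gbs(16, 12.8)    # speculative: 16 channels at 12.8 GT/s
gnr_ap = bandwidth_gbs(12, 8.8)  # Granite Rapids-AP with MRDIMM-8800

print(cwf, gnr_ap, cwf / gnr_ap)  # 1638.4 844.8 ~1.94x, i.e. roughly double
```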

Falcon Shores probably got cancelled because Intel concluded it was not competitive. Intel should strengthen the AI capability of their server CPUs. This could be done by adding neural engines to the Intel 3 base die and supporting HBM3e on Diamond Rapids. Sensible enterprises don’t want to send highly confidential information in AI prompts to third parties. They want to run LLMs in their own data centers. Given the prices of NVIDIA GPUs, server CPUs that can run LLMs are a good option for these customers. To save power, arithmetic for LLMs should ideally occur inside the HBM stack and/or directly under it in the Intel 3 base die. Moving LLM weights from DRAM to CPU cores consumes far more power than LLM arithmetic.

semiconductor.samsung.com/news-events/tech-blog/hbm-pim-cutting-edge-memory-technology-to-accelerate-next-generation-ai

]]>
By: Calamity Jim https://www.nextplatform.com/2025/01/31/intel-pushes-out-clearwater-forest-xeon-7-sidelines-falcon-shores-accelerator/#comment-247258 Sat, 01 Feb 2025 06:29:25 +0000 https://www.nextplatform.com/?p=145249#comment-247258 In reply to Eric Olson.

You get my vote!

]]>