cogman10 2 hours ago

What an annoying article to read. "The AI workload of AI in a digital AI world that the AI world AI when it AIs. Also the bandwidth is higher. AaaaaaaaaIiiiiiiiiiii".

90% of the article is just finding new ways to integrate "AI" into a purely fluff sentence.

  • cogman10 an hour ago

    Ok, I should be fair, it's 4 paragraphs of fluff, 6 paragraphs of specs, then a fluff conclusion. It's almost like 2 different unrelated articles smashed into 1.

    Still makes for an annoying read.

    • ep103 an hour ago

      Sounds like the sorta thing AI would write

  • Cthulhu_ an hour ago

    AI only appears 7 times in the article's 11 paragraphs though. I mean, I'm sure it's fluffed out, my eyes glazed over and I lost interest, but still.

  • rsynnott an hour ago

    > 90% of the article is just finding new ways to integrate "AI" into a purely fluff sentence.

    I mean, to be fair, that’s half the industry right now. Hard to blame them all that much.

    • skyyler an hour ago

      >that’s half the industry right now

      Isn't that a bad thing?

Retr0id an hour ago

PAM3 is 3 levels per unit interval (~1.58 bits), not 3 bits per cycle as reported in this article. Although I suppose if you count a cycle as both edges of the clock it's 3.17 bits.
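
Rough math, as a quick Python sketch (the 3-bits-per-2-symbols mapping at the end is my understanding of how the practical encoding works, so treat that part as an assumption):

  import math

  levels = 3                       # PAM3: three voltage levels per symbol
  bits_per_ui = math.log2(levels)  # information per unit interval
  print(bits_per_ui)               # ~1.585 bits
  print(bits_per_ui * 2)           # ~3.17 bits per cycle, counting both clock edges

  # Fractional bits aren't directly usable; one way to get close is mapping
  # 3 binary bits onto 2 PAM3 symbols (3**2 = 9 >= 2**3 = 8), i.e. an
  # effective 1.5 bits per UI, or 3 bits per full clock cycle.
  print(3 / 2)                     # 1.5 bits per UI with such a code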

alberth 2 hours ago

Is I/O starvation the bottleneck with GPUs?

I didn't think it was.

  • hmottestad 2 hours ago

    Memory bandwidth is the bottleneck for LLM inference. That's my understanding at least.

    • littlestymaar an hour ago

      Isn't it only the case when inference isn't batched?

      • Tostino an hour ago

        Even in a local setting, batched inference is useful to be able to run more "complex" workflows (with multiple, parallel LLM calls for a single interaction).

        There is very little reason to optimize for just single stream inference at the expense of your batch inference performance.

    • moffkalast 2 hours ago

      That's correct, but compute matters too, to some degree. The larger the model, the more of a bottleneck memory becomes.

      There are some older HBM cards with very high bandwidth, like the Radeon Pro VII, which has 1 TB/s of bandwidth (similar to the RTX 3090 and 4090) but is notably slower at inference for smaller models since it has less compute in comparison. At least I think that was the consensus of some benchmarks people ran.

  • mmoskal 2 hours ago

    With a typical transformer on a GPU, the batch size that saturates the compute is at least in the hundreds. Otherwise (including the typical batch size of 1 for local inference) you're memory bound.
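
    Back-of-envelope in Python for where that number comes from (the GPU figures are rough placeholders, and this ignores KV-cache reads and attention FLOPs):

      # Decoding one token per sequence for a batch of size b on a dense model:
      # the weights stream from memory once per step, but the matmul FLOPs
      # scale with b, so arithmetic intensity grows linearly with batch size.
      params = 70e9                   # hypothetical 70B-parameter model
      bytes_per_step = 2 * params     # fp16 weights read once per decode step
      flops_per_seq = 2 * params      # ~2 FLOPs per parameter per token

      gpu_flops = 1e15                # ~1 Pflop/s fp16 tensor throughput (ballpark)
      gpu_bw = 3.3e12                 # ~3.3 TB/s HBM bandwidth (ballpark)

      machine_balance = gpu_flops / gpu_bw                  # FLOPs the GPU can do per byte moved
      breakeven_batch = machine_balance * bytes_per_step / flops_per_seq
      print(round(machine_balance), round(breakeven_batch)) # roughly 300 and 300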

  • corysama an hour ago

    GPUs have much more memory bandwidth than CPUs. Meanwhile, the ALU:bandwidth ratio of both GPUs and CPUs has been growing exponentially since the 90s at least. So, the FLOPs per byte required to not be starved on memory is really large at this point. We’re at the point where optimization is 90% about SRAM utilization and you worry about the math maybe at the last step.
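
    To put rough numbers on that ratio (these specs are ballpark from memory, not exact):

      # FLOPs a chip can execute per byte it can pull from DRAM ("machine balance").
      # Code that does fewer FLOPs per byte than this leaves the ALUs idle.
      chips = {
          #             ~peak FLOP/s, ~DRAM bytes/s
          "1990s CPU":   (100e6,      100e6),
          "modern CPU":  (2e12,       100e9),
          "modern GPU":  (60e12,      1e12),
      }
      for name, (flops, bw) in chips.items():
          print(f"{name}: ~{flops / bw:.0f} FLOPs per byte to stay busy")
      # prints roughly 1, 20, and 60 -- hence all the tiling/blocking to keep data in SRAM.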

  • alwayslikethis 2 hours ago

    For inference, it often is. Though for most consumer parts the bigger concern is not having enough VRAM rather than the VRAM not being fast enough. Copying from system RAM to VRAM is far slower.

ilaksh an hour ago

So it's almost twice the performance? That's great. But AI could easily use 10 times that.

Anyone heard anything about memristors being in a real large scale memory/compute product?

vdfs 2 hours ago

Not even a mention of Blockchain

  • trollbridge 2 hours ago

    Can’t attract VC money with it anymore

    • Tostino 2 hours ago

      That's good IMO. So much money wasted over the past decade.

      • mnky9800n an hour ago

        [flagged]

        • Tostino an hour ago

          I wasn't even talking about the people "investing" in crypto. Just the VC / business side of things.

          Just a massive waste, on people who had just about no plan going in other than "disrupt the status quo" and "decentralized".

blackoil 2 hours ago

If the 5090 comes with 32GB of this RAM, that should be a substantial boost over the 4090! Hope that isn't reflected in the price.

  • formerly_proven 7 minutes ago

    > Hope that isn't reflected in the price.

    lmao

    AMD has even officially announced at this point that they will not compete on high-end consumer and workstation GPUs for years to come. Intel can’t compete either (Gaudi is not general-purpose, so it has too limited an appeal for that market).

  • moffkalast 2 hours ago

    Nvidia: You're getting 2GB of VRAM and you're gonna act like you like it!

hmottestad 2 hours ago

"With this new encoding scheme, GDDR7 can transmit “3 bits of information” per cycle, resulting in a 50% increase in data transmission compared to GDDR6 at the same clock speed."

Sounds pretty awesome. I would think that it's going to be much harder to achieve the same clock speeds.

  • inportb an hour ago

    If it could really do that, then it wouldn't be DDR, right?

    • formerly_proven 6 minutes ago

      DDR just says symbols are centered on (both) edges, doesn’t say what the symbols are.

sva_ an hour ago

Trying to figure out how this compares to HBM3/e

octocop an hour ago

What does "48 Gigatransfers per second (GT/s)" mean?

  • smolder 42 minutes ago

    It reflects the data rate. Since DDR memory transfers data on both the rising and falling edges of the clock signal, DDR RAM on a 3000 MHz clock is said to make 6000 megatransfers per second in normal usage. 48 GT/s would imply a 24 GHz clock if it were normal DDR, which seems absurd.

    Edit: It seems GDDR6 is in reality "quad data rate" memory, and GDDR7 packs even more bits in per clock using PAM3 signaling, so if I'm reading this right maybe they're saying the chips can run at up to an 8 GHz base clock? 8 GHz * 6 bits per cycle * 32-bit bus / 8 bits per byte = 192 GB/s.

    Edit again: It seems I undercounted the number of bits per pin per cycle of base clock and it's more like 12 (so a 4 GHz max base clock) or more, which seems a lot more reasonable.
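
    FWIW, the per-chip number also falls out of the headline figure alone, without guessing at the base clock (assuming a "transfer" here means one bit per data pin, and the usual 32-bit-wide GDDR device):

      data_rate_per_pin = 48e9     # 48 GT/s per data pin
      pins_per_device = 32         # typical GDDR chip I/O width (assumption)
      bytes_per_sec = data_rate_per_pin * pins_per_device / 8
      print(bytes_per_sec / 1e9)   # 192.0 GB/s per chip

      # A card's total bandwidth then scales with its bus width, e.g. a
      # 384-bit bus is 12 such chips: 12 * 192 GB/s ~= 2.3 TB/s.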

grahamj 2 hours ago

Well, yeah

Any bets on when it gets renamed AIDDR? Only partly joking

  • the-rc an hour ago

    More like NeuralRAM? We have precedents. Back in the 90s, Sun and Mitsubishi came up with 3DRAM, which replaced the read-modify-write (RMW) cycle in Z-buffering and alpha blending with a single (conditional) write, moving the arithmetic into the memory chips.

  • burnte an hour ago

    DDR with Copilot!