Why can't GPUs have removable memory like PC ram?
Posted by Delicious-Farmer-234@reddit | LocalLLaMA | View on Reddit | 159 comments
Was thinking, why doesn't Intel, Nvidia, or AMD come up with the idea of being able to expand the memory? I get that GDDR6 is pricey, but if one of them were to create modules and sell them, wouldn't they be able to profit? Imagine if Intel came out with this first; I bet most of us would max out the VRAM and the whole community would push away from Nvidia and create better or comparable frameworks other than CUDA. Thoughts?
Illustrious_Matter_8@reddit
Then they cannot make a profit
CatalyticDragon@reddit
Signal integrity is better with the shorter traces which means you can push clocks higher and get better performance (bandwidth and latency). This is pretty crucial.
Cooling. It's easier to get soldered memory chips under the heatsink and cooling is also crucial.
Cost. It's cheaper to make a board with chips soldered directly to it rather than implementing some form of mechanical slot system.
germane_switch@reddit
That's true. That's why Apple Silicon is still, after 5 years, blowing everything else out of the water when it comes to performance per watt and battery life. Yet people whine that they can't upgrade RAM themselves, go figure. (However, Apple charges waaaaay too much for anything above base RAM.)
Ok_Warning2146@reddit
They are "blowing everything else out of the water when it comes to performance per watt and battery life." because they are the first to use the latest processing node despite the higher cost. Nvidia 5090 still stucks at N4P (143.7MTr/mm2) but Apple M4 is N3E (216MTr/mm2).
germane_switch@reddit
The industry is just now starting to catch up to M1. That’s five years old.
Massive-Question-550@reddit
true, they have good performance but charging an extra 2 grand for more ram is basically robbery.
Upper-Requirement-93@reddit
It's like people don't like sinking hundreds and hundreds of dollars into new gear year after year because of one bad component, go figure, such whiners lol
germane_switch@reddit
I understand the sentiment, for sure, but I have never encountered Apple's RAM going "bad." Nor any of their SSDs, for that matter. I've been using Macs since I was a kid in 1984, although I didn't buy my first until 1997. I was team Amiga. :)
tylercoder@reddit
How do motherboards compensate for signal integrity on CPUs?
DeltaSqueezer@reddit
tylercoder@reddit
What if instead of putting memory in sodimm-like sticks it was placed on a daughterboard-ish with a custom HS connection? I remember seeing some old double and triple-stacked cards where the GPUs were on one board and the memory in another.
Just asking I'm not an EE.
hishnash@reddit
You could do it if you put them on sockets like CPUs, but you would end up paying more for these sockets than you would if you just shipped with maximum-capacity memory dies.
tylercoder@reddit
Are there sockets for those BGA chips? Can it even be made?
hishnash@reddit
Building a low-noise socket that does not suffer from RF issues like signal reflection etc. costs a fortune. They are used a lot on prototype boards to test things out.
But there is no point shipping HW to consumers with these, since the socket and other PCB-related changes for each GDDR chip would cost more than the highest-capacity chip that you would be putting into it.
The point of having socketed memory would be to let you upgrade it, but that is pointless if such a board costs more than a board with the maximum capacity soldered directly to it.
eding42@reddit
Still run into issues with signal integrity, there’s a reason why GDDR chips are arranged directly around the die.
tylercoder@reddit
What about making said memory daughterboard's plug be right next to the die? Again just theorizing here, plus ewaste is a big issue and its not going away.
Thick-Protection-458@reddit
By not having memory speed as a bottleneck enough, I guess.
Because, for instance:
- Ryzen 9 5950X (don't know which segment this CPU is, but that's about order of magnitude rather than exact values) - maximum memory bandwidth is 47.68 GB/s
- Then: the M3 Max has the same 400GB/sec memory bandwidth as the top-end M1 and M2 chips
- 4090 memory bandwidth is 1008GB/s
So basically - they don't. It's just that their typical tasks tend to be more I/O-bound, compute-bound, or total-memory-size-bound rather than memory-bandwidth-bound.
Ok-Scarcity-7875@reddit
A CPU has a maximum bandwidth of around 50GB/s? How come my 7950X has a maximum write speed of around 2TB/s and a read speed of more than 4TB/s with a 131KB test size, which means it fits into the L1 + L2 cache? I have read somewhere that the L3 cache is about 1.7TB/s, so to me that is the bottleneck speed. If RAM had the same bandwidth, I guess we could get at least half of that if we are pessimistic, like 850GB/s, and up to 1.5TB/s if we are optimistic. I think it is technically no problem, but corporations don't want it for $$$reasons.
tronathan@reddit
DMA
Thick-Protection-458@reddit
Well, the cache is right there on the CPU die - so much less distance-related inductive/capacitive electrical shit happens to limit single-channel performance.
And it is probably easier to make more parallel channels for cache due to the same set of reasons.
And, well, my knowledge of electronics design is limited to an amateur level, but from what I can guess:
So while it's probably (I won't even say definitely) possible - we can't expect already-established companies to spend more resources than they would on easier integrated solutions, developing something that would at the same time be quite niche - and which would nuke their own market (because new devices would face even stronger competition from previous models)
Thick-Protection-458@reddit
Basically - that's what we think an "ideal" signal looks like.
Thick-Protection-458@reddit
And here is what it becomes.
As you can see - we need more time to realize that the input state really changed and it's not just random noise.
That time alone may be limiting at high frequencies - we at least need it to be noticeably smaller than the inverse of the frequency.
And we also need some time for the module (the RAM / CPU internal circuitry) to work, so there is a non-zero constant time we will always need to process the input.
So basically we need time like `t = t_ideal + t_we_lost_to_distortion`
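A tiny numeric sketch of that timing budget (the two times below are made-up illustrative values, not measurements from any real interface):

```python
# Illustrative only: how extra settling time from distortion eats into the clock budget.
t_ideal_ns = 0.10        # assumed time the receiver needs with a perfectly clean edge
t_distortion_ns = 0.15   # assumed extra settling time from reflections / noise on a longer path

t_total_ns = t_ideal_ns + t_distortion_ns    # t = t_ideal + t_we_lost_to_distortion
print(f"Clean signal:     ~{1 / t_ideal_ns:.0f} GHz max")   # ~10 GHz
print(f"Distorted signal: ~{1 / t_total_ns:.0f} GHz max")   # ~4 GHz
```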
darth_chewbacca@reddit
The 7950x is more in line with the M3 as both use a version of DDR5 (rather than the 5950x using ddr4).
Your point still stands, the 7950x has a memory bandwidth of 83.2GB/s, just adding some colour to your explanation.
Trisyphos@reddit
Both DDR4 and DDR5 use a 64-bit-wide bus per DIMM. It's about the number of channels that can be used. Normal desktop PCs are dual-channel, so 128 bits; some cheap laptops are only single-channel, so 64 bits; but some HEDT machines and servers are quad-channel or more.
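For a rough sense of how channel count and transfer rate combine into the figures quoted above, here's a back-of-the-envelope sketch (theoretical peaks only; the 512-bit LPDDR5 configuration for M3 Max-class chips is an assumption based on public specs):

```python
def peak_bandwidth_gbs(bus_bits: int, mt_per_s: int) -> float:
    """Theoretical peak = bytes per transfer * transfers per second, in GB/s."""
    return bus_bits / 8 * mt_per_s / 1000

print(peak_bandwidth_gbs(128, 3200))   # dual-channel DDR4-3200      -> 51.2 GB/s
print(peak_bandwidth_gbs(128, 5200))   # dual-channel DDR5-5200      -> 83.2 GB/s (the 7950X figure above)
print(peak_bandwidth_gbs(512, 6250))   # 512-bit LPDDR5 (M3 Max-ish) -> 400.0 GB/s
```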
MrMisterShin@reddit
This is correct. To have similar bandwidth to the M3 Max, you would need to use a HEDT / server-class CPU, which supports more memory channels for greater memory bandwidth.
ortegaalfredo@reddit
Apple chips since the M1 uses HBM memory, basically many chips of memory on top of each other like layers of a cake, it's the same kind memory that a H100 uses, that's why they are so expensive.
So comparing a Ryzen with a M3 is like comparing a 4090 with a H100.
Monkeylashes@reddit
lol someone drank the kool aid
BangkokPadang@reddit
They don't use HBM, though. They use LPDDR4X / 5X depending on the generation.
ZasdfUnreal@reddit
CPUs hit their cache over 90% of the time for memory reads because their data sets are tiny compared to GPU use cases.
hishnash@reddit
CPUs do not have that high a bandwidth need, and those that do use soldered on-package memory (see Apple silicon SoCs).
gahma54@reddit
The CPU uses memory caches and executing a program has a very predictable access pattern (linear/sequential). So, with a good cache algorithm and prefetching, the parts of the executing program that need to be in cache almost always are
Pavrr@reddit
Why dont we have more motherboards with onboard memory?
brotie@reddit
Er… every thin&light ultrabook and all Macs have been doing this for the better part of a decade
Megneous@reddit
The vast majority of people don't buy macs though.
brotie@reddit
What? I was just replying to someone asking why there aren’t any built in RAM options, it’s not just apple - Dell uses soldered memory on the XPS line and Lenovo on the yoga. Super common
commanderthot@reddit
Because of the maturity of DDR memory, and how poorly soldered memory motherboards would sell compared to the abundance of slotted designs.
cobbleplox@reddit
Which is funny because upgrading RAM seems a lot more practical than it usually turns out. And I'm not getting the impression many people actually ever do it. What usually happens is people get 2 banks at the limit of what their cpu/mainboard supports. And if they need more RAM, they can always plug in another two, right? Yeah well turns out 2 years later it is very hard to find another kit with all the same properties. And 4 banks suddenly do not support the same speed that 2 banks did. So you have to throw away your old kit and replace it. But hm. Shouldn't it then at least be faster? There are all these faster modules now! But my stuff doesn't support it.... I mean it seems like a waste to still invest in DDR4, right? And so on. You just end up thinking about your entirely new system. Very few scenarios where you actually need that bigger RAM so much now that you would bother. Best scenario I can think of is actually that the RAM was damaged and then it's very good that you can replace it.
unculturedperl@reddit
Take it one step further, and put memory on CPUs. Intel has already done this of course, https://en.wikipedia.org/wiki/MCDRAM
furrykef@reddit
Caching takes a lot of the sting out of low memory speeds. Wouldn't work so well for a GPU, where you often have to access very large amounts of memory in a very short time.
Massive-Question-550@reddit
That isn't the case; if you look at the 3000 vs 4000 RTX cards, they increased the L2 cache from a measly 5MB in the 3080 to a whopping 64MB in the 4080. Clearly there was a reason for the ~12x jump or they wouldn't have done it.
noiserr@reddit
AMD did that first with Infinity Cache in the RDNA2 GPUs. Nvidia followed suit with the 40xx by increasing L2 cache by 16 times.
It allowed RDNA2 GPUs to use narrower memory bus but achieve similar performance. RDNA2 was also more power efficient than Nvidia's 30xx series.
Thing is, I don't think LLMs benefit from the cache as much. I could be wrong or perhaps the software isn't optimized to take advantage of the cache. But when you compare 3090 to 4090 in LLM performance, 4090 should be much faster than it is.
Massive-Question-550@reddit
I imagine it could have some minor usage. For example, I don't think an AI goes through every single parameter for every task it is assigned, so it would make sense that very commonly accessed parameters could be stored in the cache, but in the grand scheme of things it would be a pretty small increase in speed due to the cache's still very limited size.
vincentz42@reddit
Caching is not going to help LLMs at all, unfortunately. Every time you generate a token you have to read every single activated parameter and the KV cache from RAM, which is likely a full read of RAM if you are running the largest model possible. With that access pattern, cache is not going to help at all. In fact, NVIDIA's H100 has half as much cache as the RTX 4090, but >3x the memory capacity and bandwidth.
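A rough sketch of the ceiling that access pattern implies: if every token re-reads all activated weights, tokens/sec can't exceed bandwidth divided by bytes per token (KV cache and overhead ignored; the model size and bandwidth figures below are illustrative assumptions):

```python
def max_tokens_per_s(bandwidth_gbs: float, active_params_billion: float, bytes_per_param: float) -> float:
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param   # full weight read per token
    return bandwidth_gbs * 1e9 / bytes_per_token

# A hypothetical 70B dense model at 8 bits per weight (~70 GB read per token):
print(max_tokens_per_s(47.7, 70, 1.0))    # dual-channel DDR4   -> ~0.7 tok/s
print(max_tokens_per_s(1008, 70, 1.0))    # RTX 4090 GDDR6X     -> ~14 tok/s
print(max_tokens_per_s(3350, 70, 1.0))    # H100 SXM HBM3       -> ~48 tok/s
```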
Calcidiol@reddit
Common GPUs are intrinsically designed to assume they've got high-latency access to VRAM, which they try to "cover up" by launching a lot of threads / parallel jobs at once. Many of those will block waiting for RAM I/O for a "long time" relative to the processor speed, but that's OK: statistically (or as orchestrated in the algorithm / program design) there will still be plenty of threads that do have data ready to work with. So despite the "slow", "high-latency" VRAM (vs. the GPU's TFLOP/s compute speed), parallelism can often cover that up and achieve peak, sustained VRAM bandwidth while letting the stuff that can run now run now.
Processors are typically way faster than the RAM these days even if you mean VRAM; look at the benchmarks for register access vs L1/L2/L3/RAM access for a typical CPU running at multi-GHz speeds.
Fantastic-Berry-737@reddit
I don't think the comment you're responding to is fully helpful. Some old models used to come with upgradable VRAM, but the performance doesn't scale to the demands of today's GPUs.
To answer your question, DRAM isn't soldered because it doesn't need to be to minimize latency; that is handled by the CPU cache and pipelining. VRAM is optimized for bandwidth, not latency, so it has different demands.
Anyone reading this thread won't get a useful understanding of GPU computing and VRAM from a 3 sentence comment. It's three sentences, guys. Here is an excellent lecture one of the inventors of CUDA put on YouTube condensing a lesson he gave to his intern. It is gold and I would watch the whole thing.
https://www.youtube.com/watch?v=3l10o0DYJXg
In short: reject shortform learning for deep subjects, just watch the video.
Calcidiol@reddit
Latency isn't a big deal in most cases wrt. socketing. Look at the speed of "light" (transmission line signal propagation) on a PCB: so many ps/cm of distance, depending on the details. A RAM device 2cm away from the CPU, attached by a good transmission line, has the same speed-of-EM latency whether it is in a socket or not, as long as the impedance of the transmission line is the same and we're talking about similar kinds of TLs / physical materials of PCBA.
Sockets for high speed signals have to have good SI and maintain good transmission line performance up to the rated frequency / speed of the socket whether that's 1 GHz, 10 GHz, 20 GHz, etc. In that respect the concerns are no different than a non socketed PCBA design which requires SI optimization / control to keep it within acceptable limits.
So from that standpoint there's zero difference.
If you're talking about indirect effects (sockets tend to be larger than non-socketed PCB footprints / packages), then OK, sure, maybe you add a few mm or cm of routing due to the physical socket footprint, at which point you're some few ps longer in propagation delay and the speed-of-light latency increases a small amount. That's usually pretty small compared to the clock period, though it can add up in the upper N-to-NN GHz range, so maybe you end up adding an extra half cycle of latency at some point; but the sustained throughput stays the same regardless of the latency increase for a sufficiently long burst or sequential access.
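For scale, a hedged back-of-the-envelope on the flight-time point (the ~15 cm/ns figure is a typical assumption for FR4, not a measured value):

```python
prop_speed_cm_per_ns = 15.0   # assumed effective propagation speed on FR4 (~c / sqrt(dielectric constant))
extra_trace_cm = 2.0          # assumed extra routing a socket might force

extra_delay_ps = extra_trace_cm / prop_speed_cm_per_ns * 1000
clock_period_ps = 1000 / 2.0  # period of a 2 GHz memory clock

print(f"Extra flight time: ~{extra_delay_ps:.0f} ps")                               # ~133 ps
print(f"Fraction of a 2 GHz clock period: {extra_delay_ps / clock_period_ps:.0%}")  # ~27%
```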
Also the same argument can be made for simply routing / placing RAMs closer to the CPU regardless of socketing, and simply using smaller (physically and in terms of storage capacity) parts.
So sure you could achieve some high speed really small RAM by making it close to the processor physically and we do, those are registers, SRAM regions, caches, etc. But if you want a significantly capacious size of RAM ultimately you place it downstream of the CPU, allow the connection of more numerous / larger RAM ICs, and start to see benefits of expandable / socketed RAM.
There are obvious (considering this forum's topic) advantages to having quadruple the RAM size even if it runs at 2x the latency or whatever, because your latency is infinite (you can't run your program at all) if you don't have enough RAM to handle the data / program. If you can expand the RAM cost-effectively / physically to the point where you CAN run your model / program / data processing, then at least you have a workable solution, and only then do you even get to start worrying about processing speed / latency benchmarks.
Many GPUs (and CPUs these days) are too limited in RAM size to even tackle many kinds of relevant problems, and that's insoluble by trivial means unless the RAM is suitably expandable / sized.
Whether a given burst has an extra cycle of latency is a far less relevant limitation than whether you have enough RAM at all.
Thick-Protection-458@reddit
New Apple ARM macs and this new NVidia machine, no?
Guess it's just about legacy + the fact that memory speed and bandwidth are usually not exactly the bottleneck.
noiserr@reddit
AMD's new Ryzen AI Max also has this.
tylercoder@reddit
Nobody would buy them; meanwhile, AFAIK there have never been graphics cards with removable memory.
ainsleyorwell@reddit
Graphics cards with removable memory modules definitely have been made, just not recently by the major manufacturers (at least as far as consumer-grade equipment is concerned) :)
tylercoder@reddit
You'd better link me one, because I can't find any GPU with removable memory.
Massive-Question-550@reddit
https://www.quora.com/Have-there-ever-been-graphics-cards-with-user-expandable-VRAM
Thistleknot@reddit
we do
Intel and Macs have been doing this with those square chips. Same with PlayStations.
MachineZer0@reddit
Rather odd isn’t it? I see soldered memory in cheap Chromebooks and Windows S mini PCs.
Jattoe@reddit
It's only been within the last couple of years that VRAM has become so important, just wait. There will be things like this, or just as good (as in, starting out with a ton of VRAM), in the near future.
NVIDIA is coming out with a computer that has some 128GB of VRAM (shared, I believe, like Apple's), but that's still CRAZY good. Just a bit slower -- like the 20__ series of RTX.
seiggy@reddit
The 128GB of unified memory on the NVIDIA Project Digits is still only LPDDR5X, with a max theoretical bandwidth of 512GB/s. That's only about half that of a 3090, and only something like 30% of the speed of the GDDR7 on the 5090. So it's not really fair to compare Project Digits RAM with VRAM, as GDDR hasn't been that slow in several generations now.
Jattoe@reddit
To a lot of people, VRAM is the main bottleneck--it just depends on what you need done. Time personally is not a huge factor for me, if it goes from 2s to 20s (10x slower) for production
Massive-Question-550@reddit
HP is coming out with one that will be even cheaper. This is good, since RAM speed has been a terrible bottleneck: CPU makers always wanted people to pay $10k plus for an enterprise CPU just to get 4-8 channels of RAM vs a measly 2. By comparison, a 4090 has 12 channels of GDDR6X memory, which is crazy fast - nearly 10x the speed of fast dual-channel DDR5.
astralDangers@reddit
This is the answer and it has been given hundreds of times. OP needs to use the search feature before posting.
Jattoe@reddit
The conversation is novel though, because time has passed; as well, there are many opinions. Also, the old discussions are not active. I think this is half for the sake of learning what's new.
astralDangers@reddit
The answer never changes because Ohm's law is a constant. It's an issue of physics.
Even if there is a transformative innovation in materials science, it's 7-10 years minimum to bring it to market.
This question has been asked for over two decades, ever since user-installable RAM on GPUs went away.
Why GPU RAM can't be user-installable is a hardware topic, not an LLM one.
Ironically, this question can be answered by any good-sized LLM.
Jattoe@reddit
Most questions can be answered by an LLM, really, but Reddit isn't purely for utility, it's really just a medium between you and other humans. Bringing up the same topic that's been brought before can be fine in a big community, as many that will ask and will respond will not be the same people. It's not a purely technical question, it's speculative, or leads towards speculative conversation--which is novel.
AnticitizenPrime@reddit
OK, I don't know shit about shit when it comes to this hardware stuff, but what if it were engineered like CPUs currently are, as a wide flat die + socket on the board? We can hot swap/upgrade CPUs currently and they deal with the same limitations, right? Is it just an engineering issue?
Massive-Question-550@reddit
It's funny because on most 3080s and 3090s there is no VRAM cooling, which is why miners had to disassemble them to add thermal pads to make contact with the heatsink.
AD7GD@reddit
Signal integrity is also impacted by the connector itself.
Phate334@reddit
Lower-Possibility-93@reddit
Then I would have problems removing the cash from your pocket.
CertainMiddle2382@reddit
Market segmentation.
Nvidia has a monopoly. ASUS, Gigabyte don’t
malformed-packet@reddit
They used to, way back in the day; there were cards with SODIMM slots on them. The problem is this RAM isn't fast enough for most things.
tylercoder@reddit
How back? Never seen a GPU with that, even ancient SGI boards from over 30 years ago have soldered memory.
eposnix@reddit
My old Trident VG-2000 had expandable memory, up to like 1MB or something. We're talking early 90's here though.
tylercoder@reddit
Must be really old because I can't even find a pic.
BorderKeeper@reddit
You don’t just say to people that they are ancient like that :| it’s going to destroy them
tylercoder@reddit
I mean maybe he was like 8 and his dad bought him his first car, I'm not a zoomer either.
pier4r@reddit
search like: DIAMOND STEALTH 64 DRAM
tylercoder@reddit
Older than that, and it was like $10000
Aphid_red@reddit
This already exists. Everyone's comparing to regular desktop systems, but you should be looking at server systems, where you have AMD offering 24 channels of RAM across 2 sockets for 960 GB/s.
While the RAM is slower than VRAM, the most important thing to note is diminishing returns: if you make your RAM 3x faster and call it VRAM, you have more heat and signal issues versus just using a 3x larger bus. Yes, that costs more, but on the other hand regular RAM chips are relatively cheap, commodified, and so on. More capacity per interface as well: a RAM stick can have dozens of little 1GB chips on them for 64GB per stick or even more, whereas VRAM is still limited to 2-3GB per connection. Looking at the Epyc stuff; the boards cost on the order of $700 to $1500 for a full featured motherboard, including the enterprise tax.
I certainly think that, at least theoretically, a GPU using Epyc Genoa's form factor and memory setup (24 banks, 24x64-bit = 1536-bit bus width, DDR6 speed) should be possible to create. It would be attached via riser to a regular motherboard. There are a number of challenges:
Primary: A new form factor. It won't fit in your normal PC case; you'll need a larger case where they can sit either behind each other or next to each other. Alternatively, use the 2P Epyc board as a base. A 1-CPU, 1-GPU board with say 4 CPU channels and 20 GPU channels would barely fit the EATX form factor and achieve approx. 800GB/s memory bandwidth (between 4080 and 4090, or equal to the 5080) using ECC registered DDR5-5600 for stability.
Secondary: Cost. You're going to have to pay around $1K for the board, a few hundred for the graphics chip, and then $3-$5 per GB of memory you want. On the upside, with 24 channels up to 1.5TB of memory is reasonably affordable. (And for crazy prices you can go up to 6TB.) Doing that with GPUs currently costs $300,000 or more. '500GB or more of memory, attached to a graphics chip' is currently not available at all, while most of the components needed to achieve it exist.
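For what it's worth, a quick sanity check of the proposed 24-bank layout (theoretical peak only, ECC and protocol overhead ignored; DDR5-5600 as suggested above):

```python
channels = 24
bits_per_channel = 64
mt_per_s = 5600   # DDR5-5600 RDIMMs

bus_bits = channels * bits_per_channel            # 1536-bit aggregate bus
peak_gbs = bus_bits / 8 * mt_per_s / 1000
print(f"{bus_bits}-bit aggregate bus, ~{peak_gbs:.0f} GB/s theoretical peak")   # 1536-bit, ~1075 GB/s
```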
scottix@reddit
I feel like maybe the end of this year or next year will see more NPUs, and GPUs will become a thing of the past that we used to run AIs on, or only some legacy models will still use GPUs.
BlobbyMcBlobber@reddit
YourAverageDev0@reddit
because Nvidia wants to make more money
Express-Dig-5715@reddit
Mainly because of speed and signal integrity; connectors are terrible for signaling. The other thing is that the closer the memory is to the GPU/CPU, the better - you can see the advantage of this in the M4 architecture of Apple silicon. It can be pushed way faster and error-free.
Another thing is the space required. Imagine having to implement connectors for DIMM slots or something similar; even low-profile ones take a lot of space, and GPUs nowadays are quite thicc bois tbh. Nobody wants a 3U-size GPU in their PC.
GPU silicon is binned during manufacturing based on its imperfections and defects; that's why there are multiple released versions with lower specs (e.g. 3060, 3090 and so on). Some GPUs will not benefit from more RAM if the GPU itself cannot process what's stored in it. Silicon manufacturing is truly black magic and is prone to defects.
Not_Obsolete@reddit
I mean NVIDIA basically intentionally nerfs their lower spec GPUs with too small VRAM compared to other specs.
Being able to upgrade the VRAM instead of buying better, more expensive model, is not likely something they want to encourage.
Agabeckov@reddit
Well, Apple sells laptops with soldered memory and they are able to charge much more per GB compared to generic RAM.
ProfaneExodus69@reddit
The biggest issues with memory when building computers are signal strength, speed and isolation. All of those are influenced by the distance from the computing unit, the quality of the connection and interference with other circuits. Soldering the memory onto the board right next to the computing unit gives you much better performance than relying on connections that don't give you a contact as good as soldering, require extra circuits to make removable memory worthwhile, and add extra length to the circuit.
Additionally, all of that adds extra cost and design considerations. So you'd get less performance and bulkier cooling solutions at a higher price.
Imaginary_Bench_7294@reddit
In any circuit that is time sensitive, such as with memory, increasing the distance between devices increases the time for the signal.
In order to make components removable, the conductors, or traces as they're called on circuit boards, must be extended and shaped into a socket. A socket, depending on the design, can end up taking more space than mounting a component directly to the board.
Modern high speed memory chips used on graphics cards use contact points that span most of the underside of the chip. To make a socket that would accommodate the chip, include clamping or some other way to secure it, while maintaining minimal trace length, would end up requiring more surface area on the PCB.
There's a whole bunch of other things that play into it as well, such as signal integrity which includes noise, susceptibility to stray EM, cross-talk, etc. Then you also have power requirements. All conductors used in modern electronics are lossy, meaning they have resistance. The longer the wire or trace, the more resistance there is. That's one of the biggest reasons computer power supplies provide 3.3, 5 and 12v, and each device connected typically has power phases to bring the voltage to what it needs.
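A hedged Ohm's-law illustration of that last point (the resistance and current values are assumptions picked for round numbers, not data from any real board):

```python
trace_resistance_ohm = 0.005   # assumed: 5 milliohms of extra resistance from a longer power path
current_a = 60.0               # assumed: a GPU/memory rail drawing tens of amps

v_drop = current_a * trace_resistance_ohm           # V = I * R
p_loss_w = current_a ** 2 * trace_resistance_ohm    # P = I^2 * R, dissipated as heat in the copper

print(f"Voltage drop: {v_drop:.2f} V (a lot against a ~1.1-1.35 V memory rail)")   # 0.30 V
print(f"Heat dissipated in the trace: {p_loss_w:.0f} W")                           # ~18 W
```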
So, basically, interchangeable memory increases complexity, reduces performance, increases power demand which also increases waste heat, increases material costs, etc. Let's not forget about having to alter the form factor of the PCB as well, since the modules would add to the size.
While it sounds like a good idea in theory, the practical implications of switching to something like interchangeable memory modules are detrimental in every regard other than being able to swap them.
Gatgat00@reddit
One thing I hate is that the lower-tier cards get things disabled even when the parts are there. For instance, my 4070 Ti Super physically has 64MB of L2 cache but they enable it to run with only 48MB. Like, why can't I enable that?
vivificant@reddit
You technically could enable it. They're all made with the same parts but are binned by quality and the number of working transistors on the die.
Johnny4eva@reddit
It's not a new idea. Look up S3 Virge from 25 years ago. The Wikipedia page has a picture of it. You could double the memory of the card from 2MB to 4MB (yes, MB not GB, this was a loooong time ago) using socketable 512KB chips.
CubicleHermit@reddit
A bit before that, some Macs from the early/mid 1990s had VRAM SIMM modules.
Amusingly, the first Google search result for it comes up with retro modules for sale on Tindie: https://www.tindie.com/products/siliconinsider/rainbowram-6x-256kb-68-pin-vram-simm-macintosh/
6 SIMMs for a total of 1.5MB of VRAM :) Early 1990s (that model was 1991) vs. mid-late 1990s...
hishnash@reddit
bandwidth, and space. It is very hard to make a high bandwidth interface (as needed by GPUs) work with a socket.
In the end you would end up paying more for the engineering of the socket than you would if you maxed out the memory capacity and soldered it in. To get the needed signal integrity for high-bandwidth memory you can't just use the board-edge socket you have for DDR; it would be much more like a CPU socket with a load of gold-plated pin grid arrays. Once you build one of these for each GDDR chip you are spending more on the socket hardware than you would if you just shipped the board with the maximum-capacity GDDR chips you can buy.
maifee@reddit
Business
UnreasonableEconomy@reddit
this is probably the real reason.
Live_Bus7425@reddit
Not really. If it were easy to do, competitors could come up with a product that gives customers the thing they really need.
ot13579@reddit
You can buy modded 48gb 4090s on ebay. Chinese companies are doing it.
Live_Bus7425@reddit
You can also buy an SSD with 8TB of memory. Even Chinese companies are doing it...
Massive-Question-550@reddit
8tb ssds exist, samsung makes them for example.
Live_Bus7425@reddit
yeah. But does it have anything to do with GPU's with removable memory? Those modded 48gb cards also don't have removable memory.
Also, I am 100% sure you can make a GPU with removable memory - but will it be fast? The answer is no.
Massive-Question-550@reddit
I'm not sure it's a general rule that socketable RAM can't be fast; for example, the fastest DDR5 RAM speed was 12000 MT/s while GDDR6 ranges from 14000 to 24000 MT/s, and I would expect to see DDR6 RAM in the next 2-3-ish years.
Live_Bus7425@reddit
Compare latency and bandwidth between DDR5 (came out in 2020) and GDDR6 (2018).
UnreasonableEconomy@reddit
AMD? Intel?
AMD and Nvidia aren't competing.
AMD takes the mid market and data center CPU market and Nvidia's got the top end and the data center GPU market.
Intel fell off a cliff, not sure where they are right now.
Live_Bus7425@reddit
I think that AMD and Intel are competing with NVidia. If they could introduce a mechanical system for customers to easily upgrade VRAM, that would be very popular. It's just not possible for many reasons that are more in the "physics" realm than business...
UnreasonableEconomy@reddit
that is and has been bullshit for quite a while now. there's no reason why vram couldn't be socketed, like a cpu for example. It would make hardware more expensive, sure, but the "physics" argument is a lie.
Live_Bus7425@reddit
I'm sorry, but the CPU is fundamentally different. I started writing a long post, but I feel like you will probably downvote it without reading it and say that I am just a dumb redditor. Besides, people learn better when they do their own research than when they are given facts on a plate. So if you do care about the answer, research the following topics:
1. What is the latency between CPU and RAM? What about latency between GPU and VRAM?
2. What is the max bandwidth between CPU and RAM? What about max bandwidth between GPU and VRAM? Compare DDR5 to GDDR6.
3. What is the difference in tasks that CPU and GPU are working on? Which one would suffer more from increased latency?
UnreasonableEconomy@reddit
I haven't downvoted any of your posts, but I'm not gonna argue with you on this. There's a million reasons you can invent why it "can't" work. If you lack the creativity and engineering mindset required to imagine what would need to be changed to make it work, I don't see a way to discuss this with you. I'm sorry for that. But have a nice day anyways!
Live_Bus7425@reddit
No worries. Its always good when two human beings don't agree, but can live in harmony and don't resort to insults. You have a nice day as well!
Tempotempo_@reddit
If I had to bet, I'd say:
Soldered => short distance from GPU to memory
Short distance GPU/memory => information arrives quicker, less loss
Less loss => less error correction
Less error correction => more effective power
Information arrives quicker => shorter memory wait times
Shorter memory wait times => less wasted clock hits
Soldered VRAM => more difficult to expand, therefore spit out the $$$$$$$$$$ if you need more
I’m not completely certain about the significance of the loss and corruption when it comes to GPUs, but it’s a physical limitation nonetheless.
Please correct me if I said something incorrect or superfluous.
Massive-Question-550@reddit
What stops board makers from putting the RAM slots on the back, directly behind the CPU? That would be a much shorter distance.
Tempotempo_@reddit
My guess is that the slots would add some latency and signal degradation due to the added connectors.
Also, to have low error rates on super fast memory like GDDR6 and GDDR7, you'd need to be as close to the GPU as possible. I wouldn't be surprised if the next generation of GPUs is entirely made of SoCs.
Also, you'd need to cool this VRAM. If you put the modules behind the chip and close enough to it, depending on the orientation of the card, it will either heat up the chip or receive its heat. It's a massive problem.
All of these design issues would make the cards more expensive.
Finally, most people have enough VRAM for their use-cases (gaming, some CAD...). Only AI users/devs want a shitton of VRAM, but companies and the most "serious" experimenters (those who can affect their company's decisions on AI) have the $$ to buy professional (and more reliable) cards / servers / bigass shelves full of H200s, which only leaves the (sizeable) niche of people who want to do AI for fun.
As we know, these days NVIDIA isn't really desperate for money, at least not on the prosumer/enthusiast market.
Massive-Question-550@reddit
Seeing the triple offering by Nvidia, AMD, and now HP of fast unified-memory, AI-capable systems, I think they definitely recognize the demand and are making these machines surprisingly affordable compared to buying a stack of 4090s.
Appropriate-Sort2602@reddit
Latency: the closer the memory is to the chip, the less time it takes to retrieve the info. Some designs even prefer GPU memory on the chip itself.
Throughput: Unlike CPUs, GPUs process a lot of computations in parallel. So they need lots of channels. That's why GPU memory is GDDR7 while RAM is only DDR5.
In spite of all this, if an application exceeds the GPU memory, it is offloaded to RAM. So in a way you can say you already have upgradeable memory.
eding42@reddit
You’re wrong about point 2, the names have nothing to do with the bandwidth or number of channels.
Apart_Reflection905@reddit
Same reason laptop manufacturers solder on ram. $500 base model nobody wants, 8-16 more gigs of ram, $2500.
Yes speed, integrity, yadda yadda. Valid for games. Not so much for LLMs that require 20+ gigs of vram. 1 less token a second is a hell of a lot better than not being able to run it.
Dr_Karminski@reddit
I'm curious if adding a retimer chip to GDDR or HBM modules can enable a pluggable GDDR/HBM memory module design.
Minute_Attempt3063@reddit
Speed, mainly.
The RAM that you put on your motherboard won't be as quick as VRAM. And the longer the distance to the die, the slower it gets.
Massive-Question-550@reddit
Makes you wonder why board makers don't make some models with RAM on the back so it can be millimeters away from the CPU instead of centimeters.
Nosdarb@reddit
Like some laptop mother boards?
Though even when it's on the reverse, it's not usually that close.
Minute_Attempt3063@reddit
Because it looks ugly
Delicious-Farmer-234@reddit (OP)
Makes sense, thanks
Ancient-Car-1171@reddit
Why would they do something like that? Not only would it hurt performance, but also their profit.
Substantial-Bid-7089@reddit
vram needs low latency, it can't travel down a bus
InterstitialLove@reddit
On the one hand, you have a single really smart guy building an intricate LEGO display. You have a warehouse across town with every kind of LEGO he might need, and there's a shelf full of pieces right next to him, and there's a driver who tries to predict what he might run out of next and brings it from the warehouse to the shelf.
If the warehouse is too small, you can switch to a different one without disturbing the LEGO master too much. The best way to speed it up is to give the master (well trained) assistants or give him caffeine.
On the other hand, you have a factory floor with a thousand guys all making identical LEGO models from simple instructions. They don't stop to think as much, they just go and go. Each guy has a shelf, but you can't have thousands of drivers going back and forth to the warehouse, and too few drivers would slow down the whole affair.
In this case, you really wanna have the warehouse as close as possible to the factory floor. This setup just goes through way more LEGO pieces per minute, so transportation is a bottleneck.
5477@reddit
Removable DDR5 RAM is far slower than graphics memory. For example, GDDR7 memory operates at around 32000 MT/s, while DDR5 is around 6000 MT/s. The fastest removable LPDDR5X CAMM modules, which have better signal integrity, top out at around 9000 MT/s.
This means you'd need a much wider memory bus to achieve the same bandwidth, meaning higher cost.
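To put numbers on the bus-width point (per-pin rates from the comment above; the 256-bit GPU bus is an assumed mid-range example):

```python
gddr7_mt_s = 32000    # per-pin transfer rate quoted above
ddr5_mt_s = 6000
gpu_bus_bits = 256    # assumed mid-range GPU memory bus

gddr7_gbs = gpu_bus_bits / 8 * gddr7_mt_s / 1000
ddr5_bits_needed = gddr7_gbs * 1000 * 8 / ddr5_mt_s
print(f"GDDR7 on a {gpu_bus_bits}-bit bus: ~{gddr7_gbs:.0f} GB/s")                                               # ~1024 GB/s
print(f"DDR5 width needed to match: ~{ddr5_bits_needed:.0f} bits (~{ddr5_bits_needed / 64:.0f} DIMM channels)")  # ~1365 bits, ~21 channels
```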
Just_Maintenance@reddit
Speed.
CPUs have a 128-bit memory bus, with memory running at up to 8.5GT/s (and most PCs actually run at up to ~6GT/s).
The RTX 4090 has a 384-bit bus running at 21GT/s. The 5090 has a 512-bit bus at 28GT/s.
To achieve those speeds you need extremely short traces and perfect contact.
mycall@reddit
What if Nvidia used 3D VRAM?
Delicious-Farmer-234@reddit (OP)
Wow that's crazy !
Relevant-Ad9432@reddit
idk the details, but seeing apple and several windows laptops (zephyrus g16), it seems like soldered ram is a lot faster
Calcidiol@reddit
Correlation isn't causation. Though there are various kinds of particularly fast RAM that aren't commonly socketed, that doesn't mean they can't be, just that historically they most typically were not. q.v.
https://www.tomshardware.com/news/samsung-unveils-lpcamm-up-to-128gb-of-ddr5-in-60-less-space
https://www.ifixit.com/News/95078/lpcamm2-memory-is-finally-here
https://www.corsair.com/us/en/explorer/diy-builder/memory/what-are-camm2-and-lpcamm2/
Chiff_0@reddit
I mean, you still can upgrade memory, I think there are custom 24gb RTX 2080ti’s out there, but Nvidia is actively making it harder because Vram is a big selling point. Imagine having a 48gb rtx 3090, theoretically you can, but there are issues with the drivers.
a_beautiful_rhind@reddit
Someone posted 48gb 4090s not too long ago.
Chiff_0@reddit
Wait, it works?!
a_beautiful_rhind@reddit
They might be transferring the whole thing to a new PCB and have hella surface mount tools so I don't know if you can home game more memory with a hot air station alone.
Chiff_0@reddit
I don’t think the problems are with installing memory chips themselves, it’s possible at home and the 3090 has 24 locations where you can mount them. The problem is on the software side. The fact that somebody got it to work shows how lazy Nvidia actually is and how they’re purposefully keeping the ram sizes small.
a_beautiful_rhind@reddit
There is the bios and also lack of address lines on the PCB. That was the issue with the 3090, it would be unable to access the memory.
Southern_Sun_2106@reddit
Too many barriers set up over the decades that are preventing new players from entering the market and doing things that would potentially result in market share loss by the existing players. That's why we only have a handful of companies in that space, and things have not changed like in a looong while. The question is, what other nations can challenge this system? They are our only hope for increased competition and radically new approaches.
Now memory is part of the business strategy for Nvidia. Asking them to change anything there is like trying to persuade apple not to solder SSD and RAM.
goingsplit@reddit
the better question is: why aren't there discrete GPUs without memory, and mobos with GPU sockets to pool shared memory for a bunch of GPUs?
fuckingpieceofrice@reddit
Physics.
deadzol@reddit
We’ll never beat the latter, but maybe there’s hope to overcome the former. 😂
Jattoe@reddit
You smart alec
bangbangracer@reddit
Signal integrity and latency. Every mm of trace and connector makes things more complicated and more difficult to time perfectly.
Honestly, after having grown up in the era of fast RAM and slow RAM being normal parts of computers, stuff like GPUs having fixed RAM amount or Apple putting the RAM right on the CPU substrate makes sense to me.
Doormatty@reddit
Ex-Amiga as well?
bangbangracer@reddit
You'll have to pry my Amiga 1200 from my cold dead hands.
Doormatty@reddit
I miss my Amiga 500. Wish I had never gotten rid of it.
treksis@reddit
business
Autobahn97@reddit
There is a new pluggable memory standard called CAMM (Compression Attached Memory Module). It supports higher-speed connections but is currently not fast enough for the high-bandwidth memory (HBM) that a GPU uses; maybe this standard will evolve in the future to be much faster, or GPU cards will be architected to have both HBM and a secondary cache of CAMM.
a_beautiful_rhind@reddit
There is a new form factor for DDR where this might work but they have zero incentive to make it a reality.
Lord_of_Many_Memes@reddit
You can probably plug in DDR via PCIe on the motherboard, but HBM is placed too close to the chip to be modularized.
Won3wan32@reddit
latency
Small-Fall-6500@reddit
The GPU is the removable memory, what do you mean?
/s
Vlad_The_Impellor@reddit
You're not wrong.
Small-Fall-6500@reddit
For LLM inference, basically it is actually the case, yeah. That's what my 3090 is lol.
Nvidia or AMD or even Intel could (and should) probably sell "GPUs" like this, with some cheap / old VRAM slapped onto a pcb with a crappy, tiny chip that gives just enough performance to utilize the VRAM for inference.
Chelono@reddit
that "cheap" VRAM for businesses is essentially the entire current CDNA lineup for AMD. No reason for them to cut into that business by making affordable consumer cards... The in between are clamshell mode GPUs, but those are sold in the Pro lineups (e.g. W7900: \~$3000. Doesn't have matrix cores like the CDNA lineup so it is essentially a crappy chip compared to CDNA for AI).
76vangel@reddit
Everything can be standardized. Manufacturers simply don't want it. You need more than 16 GB? Buy an ultra-expensive 3090/4090/5090. Or an AMD card with all its problems and abysmal performance in AI and ray tracing.
BloodSoil1066@reddit
At some point it may become useful to have edge compute that can run huge pre-trained models in the home, but slowly. So you'll be able to buy a low-clocked CUDA processor and 128GB of RAM. It's unlikely that we will ever find an application that is OK with running slowly, unless it's filling in tax forms or something.
But right now, people either want gaming GPUs or Training Compute
easythrees@reddit
AMD used to have a professional card that allowed SSDs to be added on for extra storage, not sure if they still do.
Existing_Freedom_342@reddit
The only real valid answer is this: business. There is no other. Any technical argument made here is possible to get around (see, I'm not saying it's easy, but it's possible). Expensive and cheap laptops do this with RAM, even though the criteria are different. Just remember that planes (a metal object that weighs tons) fly.
arghcisco@reddit
It might have been feasible up to GDDR5, if JEDEC members saw a business case for it, but they didn’t. Allowing the memory to be socketed makes everything about the standard more expensive, because both sides of the memory bus require additional logic and firmware to deal with tiny imperfections in the socket which make the electrical connections different lengths, and to share data pins so you can have more than one chip per memory board.
Regular memory makes these trade offs because of how many chips are required to make a practical computer around a CPU and the lower bandwidth requirements, but GPUs need so much bandwidth that sharing a bus isn’t practical because it would make the GPU pin count explode, jacking the price up for a feature most users don’t really need.
For GDDR6+, it resembles HBM more than regular DDR memory, so any socket solution would either be too expensive for consumer products, or unable to be replaced by the end user without expensive special tools, due to the precision and cleanliness required.
For example, a test socket for GDDR6 costs over $100 per chip, and even then it requires handling the chips with gloves and won’t be reliable without cleaner than normal air.
The_GSingh@reddit
Speed and $$$$.
SamSausages@reddit
$
charleysilo@reddit
Removable RAM is like buying a generic part for your car - it might work, but it can bring issues, manufacturing defects and some significant trade-offs. I.e., keeping the chip on the board means quality control and direct programming/use cases/installation tuned for specific use cases. I.e., it's faster, cheaper, and better for the GPU manufacturing process. It can also be installed much closer to the processor - fewer latency etc. issues. Heat distribution and case design play into this as well.