Been out of the loop - Will this work for EXO/MLX?
Posted by NoUsual5150@reddit | LocalLLaMA | 3 comments
Had to sell my AI server and am down to an M4 Macbook Air 16GB.
If I were to buy a used M1 Air with 16GB (run it headless) and connect the two via EXO + Thunderbolt...would it be possible to run a (19.6GB) Qwen 3.5-27B-Q5_K_M.gguf at or around 10 tokens per second?
I have been out of the loop for over a year and trying to see if this proposed configuration would work.
Longjumping_Crow_597@reddit
My initial instinct when reading this was a definite no, you won't be able to achieve this. But as I was writing out the analysis, it actually seems potentially doable.
Neither MacBook has Thunderbolt 5, which is required for RDMA. So you'll be doing ordinary pipeline parallelism over TCP/IP.
The M1 Air only has 68.3 GB/s of memory bandwidth. As a quick back-of-the-envelope calculation, the theoretical maximum TPS for a dense model is the memory bandwidth divided by the size of the model. Even the M4 Air has only 120 GB/s. Imagine you could fit the entire model on the M4 Air - you'd get 120/19.6 = 6.122 TPS. In practice you'll run part of the model on the M1 Air and part on the M4 Air. Let's be optimistic and say you can fit 12 GB on the M4 Air, leaving 19.6 - 12 = 7.6 GB on the M1 Air. In that case, the total time for one forward pass would be 12/120 + 7.6/68.3 = 0.211 sec, or 1/0.211 = 4.73 TPS. That's the theoretical maximum.
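The arithmetic above can be sketched in a few lines. This is a toy, bandwidth-only model using the numbers from this comment (the 12 GB split and both bandwidth figures are the assumptions stated above; interconnect and compute costs are ignored):

```python
# Back-of-the-envelope decode-speed estimate for a dense model split
# across two Macs via pipeline parallelism. Numbers are the thread's
# assumptions: 19.6 GB of weights, 120 GB/s (M4 Air) and 68.3 GB/s
# (M1 Air) memory bandwidth.

MODEL_GB = 19.6   # Q5_K_M weights
BW_M4 = 120.0     # GB/s
BW_M1 = 68.3      # GB/s

def tps(gb_on_m4: float) -> float:
    """Theoretical max tokens/sec with `gb_on_m4` GB of weights on the M4 Air.

    Each decoded token must stream every weight from memory once, so the
    per-token time is the sum of (GB on device) / (device bandwidth, GB/s).
    Interconnect latency and compute are ignored (optimistic).
    """
    gb_on_m1 = MODEL_GB - gb_on_m4
    return 1.0 / (gb_on_m4 / BW_M4 + gb_on_m1 / BW_M1)

print(f"all on M4 (hypothetical): {tps(MODEL_GB):.3f} TPS")  # ~6.122
print(f"12 GB on M4, 7.6 on M1:  {tps(12.0):.3f} TPS")       # ~4.73
```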
Now, that's with no spec. decoding. With spec. decoding (you'd likely want to run the draft model on the M4 Air), you could potentially push this above 10 tok/sec. Of course, with spec. decoding the speedup depends on what's being generated.
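For a rough sense of how much spec. decoding could buy: a toy throughput model, assuming (hypothetically) an i.i.d. per-draft-token acceptance probability `alpha` and near-free drafting (the draft model runs on the otherwise lightly loaded M4 Air). The expected-tokens-per-pass formula is the standard one from the speculative decoding literature; the `alpha` values and `gamma=4` draft length are illustrative assumptions, not measurements:

```python
# Toy speculative-decoding throughput model. `alpha` is an assumed
# average probability that the target model accepts a draft token;
# `gamma` is the number of tokens drafted per verification pass.

def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass when the
    draft proposes `gamma` tokens: (1 - alpha^(gamma+1)) / (1 - alpha)."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

BASE_TPS = 4.73  # bandwidth-bound estimate for the split model above

for alpha in (0.6, 0.8):
    boost = expected_tokens_per_pass(alpha, gamma=4)
    print(f"alpha={alpha}: x{boost:.2f} -> ~{BASE_TPS * boost:.1f} tok/s")
```

With these (assumed) acceptance rates the estimate lands around 10-16 tok/s, which is why the ">10 tok/sec" figure is plausible but heavily dependent on what's being generated.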
Creepy-Bell-4527@reddit
Do you mean using Thunderbolt as a 40 Gbps TCP/IP backbone for Exo? Because you won't get Thunderbolt 5 or RDMA on either of those devices.
NoUsual5150@reddit (OP)
Yes. Ok, so this is a dumb idea?