Llama 405B running locally!

[-]

nomorebuttsplz@reddit

What was the tokens per second once you added the 3090?

[-]

kryptkpr@reddit

exo looks like a cool distributed engine, and with MLX it looks like performance is really good. This is a 4-bit quant? So you're pushing something like 500 GB/s through the GPUs, which is close to saturated!
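
A rough way to sanity-check that bandwidth figure: when a dense model generates one token at a time, every weight is read about once per token, so total memory traffic is roughly model size times tokens per second. The sketch below plugs in the ~229 GB and ~2.5 tok/s figures quoted elsewhere in this thread. The function name is made up for illustration, and the estimate ignores KV-cache reads and network time.

```python
def aggregate_bandwidth_gb_s(model_size_gb: float, tokens_per_second: float) -> float:
    """Approximate memory traffic for dense decoding: all weights read once per token."""
    return model_size_gb * tokens_per_second


# Figures quoted in the thread: ~229 GB for the 4-bit model, ~2.5 tok/s.
estimate = aggregate_bandwidth_gb_s(229, 2.5)
print(f"roughly {estimate:.1f} GB/s summed across all devices")
```

This is a total over the whole cluster, not what any single GPU has to sustain.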

[-]

ifioravanti@reddit (OP)

4-bit. I'm now trying to add an NVIDIA 3090 to the cluster using tinygrad.

[-]

Short-Sandwich-905@reddit

Keep us posted, I wonder how that will impact tokens/sec

[-]

Euphoric_Contract_96@reddit

Hi, are we able to scp the downloaded models from one machine to another? scp is usually faster than downloading them one by one on each machine. Thanks a lot!
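
Copying an already-downloaded model between machines is usually just a recursive copy of the model directory. Here is a minimal sketch, assuming the models sit in the default Hugging Face cache (`~/.cache/huggingface/hub`) and that the other machine looks in the same place; whether exo does so is an assumption, not something confirmed in this thread. The host name is hypothetical, and rsync is swapped in for scp only because it can resume an interrupted transfer.

```python
import os
import shlex


def build_rsync_command(model_dir, remote_host, remote_dir):
    """Build an rsync-over-ssh command that copies model_dir into remote_dir."""
    # Expand "~" locally; the remote path is left relative to the remote home directory.
    local_path = os.path.expanduser(model_dir)
    # -a keeps the directory tree and symlinks; -P keeps partial files and shows progress.
    return ["rsync", "-aP", local_path, f"{remote_host}:{remote_dir}"]


cmd = build_rsync_command(
    "~/.cache/huggingface/hub/models--mlx-community--DeepSeek-V2.5-MLX-AQ4_1_64",
    "user@mac-studio.local",  # hypothetical host
    ".cache/huggingface/hub/",
)
print(shlex.join(cmd))  # paste the printed command into a terminal to run it
```

A plain `scp -r` with the same source and destination should do the same job, just without the resume behaviour.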

[-]

ifioravanti@reddit (OP)

153.56 TFLOPS! Linux with 3090 added to the cluster!!! https://preview.redd.it/5vr48uvg20pd1.png?width=2000&format=png&auto=webp&s=5870e572e29cb9d3c941f3ddbec42379d1db071e

[-]

Did you have any trouble with CUDA out of memory errors when adding Nvidia to the cluster? I got Exo working great when using just Mac machines but I haven't gotten it to work correctly with Mac machines plus Linux/Nvidia

[-]

Thomas27c@reddit

How are you connecting them together? Wi-Fi, Ethernet, USB/Thunderbolt?

[-]

Short-Sandwich-905@reddit

Dial-up

[-]

MoneyPowerNexis@reddit

Telegraph

[-]

min2qaz@reddit

Pigeons

[-]

Kenny741@reddit

Smoke signals

[-]

Shoddy-Tutor9563@reddit

Messenger on a horse

[-]

pmp22@reddit

wood screws

[-]

ifioravanti@reddit (OP)

wifi

[-]

toodimes@reddit

Bluetooth

[-]

visionsmemories@reddit

🤣

[-]

MoffKalast@reddit

The factory must grow.

[-]

Evolution31415@reddit

https://preview.redd.it/qwv3kt25w0pd1.png?width=1185&format=png&auto=webp&s=8f25d1655182492cf4a56f284c24e682b0c95c90 Can we add a 4x5090 rig, my lord?

[-]

quiettryit@reddit

Loved that game!

[-]

ProtoSkutR@reddit

This first value... is 400GB, seems too high: sudo sysctl iogpu.wired_lwm_mb=400000
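
The `_mb` suffix suggests the value is in megabytes, so 400000 works out to about 400 GB, which is more than the RAM of any Mac mentioned in this thread. Below is a small sketch of that unit check. The helper names and the idea of comparing against installed RAM are illustrative assumptions, not something taken from the exo or MLX docs.

```python
def mb_to_gb(value_mb: int) -> float:
    """Convert a *_mb sysctl value to GB (decimal, 1 GB = 1000 MB)."""
    return value_mb / 1000


def exceeds_ram(value_mb: int, installed_ram_gb: float) -> bool:
    """True if the setting asks for more memory than the machine physically has."""
    return mb_to_gb(value_mb) > installed_ram_gb


wired_lwm_mb = 400000  # value from the command quoted above
print(mb_to_gb(wired_lwm_mb))          # size in GB
print(exceeds_ram(wired_lwm_mb, 192))  # compare against a 192 GB machine
```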

[-]

JacketHistorical2321@reddit

How far out is llama.cpp support?

[-]

ortegaalfredo@reddit

Perhaps you could try deepseek-v2.5. It scores about the same as 405B, sometimes surpassing it, but is much faster. I bet you could hit 30 t/s on that setup. Too bad the deepseek arch is so poorly supported.

[-]

spookperson@reddit

I experimented with deepseek-v2 and deepseek-v2.5 today in both exo (mlx-community 4-bit quants) and llama.cpp's rpc-server mode (Q4_0 GGUFs). I have an M3 Max MacBook with 64 GB of RAM and an M1 Ultra Studio with 128 GB of RAM (not the highest-end GPU core model though). I was only able to get 0.3 tok/s out of exo using MLX (and that was over Ethernet/USB-Ethernet). But on llama.cpp RPC it ran at 3.3 tok/s at least (though it takes a long time for the GGUF to transfer, since it doesn't look like there is a way to tell the rpc-server that the GGUFs have already been loaded on all the machines in the cluster). It could be that I have something wrong with my exo or MLX setup. But I can run Llama 3 8B with MLX at 63+ tok/s for generation, so I dunno what is going wrong. Kind of bums me out, since I was hoping to run a big MoE quickly in a distributed setup.

[-]

Evening-Detective976@reddit

Hey u/spookperson, I'm one of the repo maintainers. This is unusual. I'm getting >10 tok/s on Deepseek v2.5 across my two M3 Max MacBooks. My suspicion is that it is going into swap. Make sure you also run the `./configure_mlx.sh` script that I just added, which will set some configuration recommended by awni from MLX. Could you also run mactop (https://github.com/context-labs/mactop) to check if it is going into swap? Many thanks for trying exo!
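
For a quick scripted check alongside mactop, macOS also reports swap usage through `sysctl vm.swapusage`. This is a minimal sketch that parses that output. The sample line reflects the format as I understand it (`total = ...  used = ...  free = ...`), so treat the regex as an assumption and adjust it if your machine prints something different. The function names are made up for illustration.

```python
import re
import subprocess

# Multipliers to convert the unit suffix printed by sysctl into MB.
UNIT_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def parse_swap_used_mb(sysctl_output: str) -> float:
    """Extract the 'used' figure from `sysctl vm.swapusage` output, in MB."""
    match = re.search(r"used = ([\d.]+)([KMG])", sysctl_output)
    if match is None:
        raise ValueError(f"unexpected vm.swapusage output: {sysctl_output!r}")
    return float(match.group(1)) * UNIT_TO_MB[match.group(2)]


def current_swap_used_mb() -> float:
    """Query the OS for current swap usage (macOS only)."""
    out = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    ).stdout
    return parse_swap_used_mb(out)


# Assumed sample of the output format, used here so the parser can be tried anywhere.
sample = "vm.swapusage: total = 2048.00M  used = 1536.25M  free = 511.75M  (encrypted)"
print(parse_swap_used_mb(sample))
```

Calling `current_swap_used_mb()` before and during a generation run and watching whether the number climbs is one way to spot swapping.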

[-]

spookperson@reddit

Thank you u/Evening-Detective976 - that is super helpful! mactop is a great utility, I hadn't seen that before. I think you are probably right about going into swap. And I appreciate you adding Deepseek 2.5 in the latest commits!! I'll test again today.

[-]

spookperson@reddit

Hmm - ok, so I pulled the latest exo from GitHub on both machines. I ran pip install to get the latest package versions. I rebooted both the Ultra and the MacBook. Then I ran the configure_mlx.sh script on both machines and started mactop and exo. I can confirm that there is no swapping now, but I'm only seeing 0.5 to 0.7 tok/s when running mlx-community/DeepSeek-V2.5-MLX-AQ4_1_64 (which is better than what I had yesterday at least!). I noticed on Twitter that people are saying macOS 15 gets better performance on larger LLMs. So I'll try updating the OS and try exo again.

[-]

Evening-Detective976@reddit

Updating the OS might be it! Also, I just merged some changes that should fix the initial delay that happens with this model in particular, since it involves code execution.

[-]

spookperson@reddit

Ok good news! Two things made a big difference. When I re-created a whole new virtual Python environment and just pip installed the latest exo (and nothing else) - I got the cluster up to 3 tok/s (from 0.5) on DeepSeek-V2.5-MLX-AQ4_1_64 - so something strange was happening with dependencies/environment between exo and a former mlx install. Then when I got both machines upgraded to macOS 15 - the cluster is now at 12-13 tok/s!! Thanks again for all your help u/Evening-Detective976

[-]

Evening-Detective976@reddit

That is good news! I'll update the README to suggest using MacOS 15. Please let me know if you run into any more issues or have suggestions for improvements!

[-]

Expensive-Paint-9490@reddit

In my real-world experience Llama 405B is way better than DeepSeek. Which is hardly surprising, considering it's a dense model vs a MoE half its size.

[-]

ResearchCrafty1804@reddit

Indeed, deepseek-V2.5 being MoE would run much faster and its performance is on par with Llama-405b

[-]

dogcomplex@reddit

Any idea what kind of network traffic that's producing between devices, and latency? This is fascinating, especially if we could adapt it into swarm training over the internet...

[-]

chrmaury@reddit

I have the M2 Ultra Mac Studio with 192 GB of RAM. Do you think I can get this running with just the one machine?

[-]

Maristic@reddit

I ran the K2 version (with llama.cpp) on my Mac Studio, and it did work, but it was pretty glacial.

[-]

ifioravanti@reddit (OP)

Nope, you need at least 229GB of RAM to run the q4 version and the q2_k on ollama requires 149GB
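
As a back-of-the-envelope check on those numbers, dividing each footprint by the nominal 405 billion parameters gives the effective bits per weight, and comparing the footprint against installed RAM shows which quant could fit on one machine. This is only a sketch: the function names are made up, and it ignores KV cache, OS overhead, and any limit on how much RAM the GPU is allowed to use.

```python
PARAMS = 405e9  # nominal parameter count for a "405B" model


def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Effective bits stored per parameter for a given memory footprint (decimal GB)."""
    return size_gb * 1e9 * 8 / params


def fits_in_ram(size_gb: float, ram_gb: float) -> bool:
    """Naive check: does the footprint fit in installed RAM at all?"""
    return size_gb < ram_gb


for name, size_gb in [("q4", 229), ("q2_k", 149)]:
    print(name, round(bits_per_weight(size_gb), 2), "bits/weight,",
          "fits in 192 GB:", fits_in_ram(size_gb, 192))
```

On these figures only the q2_k footprint is below 192 GB, and it leaves little headroom.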

[-]

Roidberg69@reddit

How do the benchmarks of q2 compare with fp16 and with 70b fp16?

[-]

claythearc@reddit

I’ve been running q2 70b locally on a 40gb card and it’s a waste of time compared to q4. It’s not apples to apples but I assume there’s some correlation.

[-]

kao0112@reddit

is it quantized?

[-]

ifioravanti@reddit (OP)

yes 4bit

[-]

Aymanfhad@reddit

Wow 2.5 t/s is playable

[-]

MoffKalast@reddit

On the other hand 30.43 sec to first token with only 6 tokens in the prompt is uh... not great. But still it's impressive af that it even runs.
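
To put the two numbers together, the wall-clock time for a reply is roughly the time to first token plus the remaining tokens divided by the generation rate. This is a small sketch using the 30.43 s and 2.5 t/s figures from this thread; it assumes the rate stays constant, which it may not for long outputs, and the function name is illustrative.

```python
def response_time_s(n_tokens: int, ttft_s: float = 30.43, tokens_per_s: float = 2.5) -> float:
    """Estimated seconds to receive n_tokens: first-token latency plus steady generation."""
    return ttft_s + n_tokens / tokens_per_s


for n in (50, 100, 500):
    print(n, "tokens:", round(response_time_s(n), 1), "s")
```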

[-]

nero10579@reddit

I mean it's on wifi interconnect lol

[-]

estebansaa@reddit

Very cool! I'm wondering whether there is some business model in a farm of Mac Studios doing a lot more tok/s.

[-]

quiettryit@reddit

For the cost of hardware I'll just pay a subscription, still cool though!

[-]

askchris@reddit

Would Exo work for turning say 10 CPU only laptops into a viable cluster for running 70B to 405B LLMs (extremely slowly)?

[-]

GreatBigJerk@reddit

You can even use Android and iOS devices, so probably!

[-]

drosmi@reddit

I umm might have enough hardware to do this…. So cool.

[-]

Thomas27c@reddit

This is really cool and inspiring thanks for sharing. I would love to try using exo to pool my devices processing power together.

[-]

fallingdowndizzyvr@reddit

It's easy to pool devices with llama.cpp. I do it everyday.

[-]

spookperson@reddit

Any advice/thoughts on llama.cpp multi-device pooling vs exo? I'm curious about speeds. I imagine exo has fewer quant options.

[-]

fallingdowndizzyvr@reddit

I don't know anything about exo so I can't comment on that. RPC llama.cpp works pretty well although there is definitely a penalty in performance using it. But it's a work in progress. A change from a week or so ago made it up to 40% faster than it was.
