Greetings from Shenzhen: We built a NAS chassis that runs DeepSeek R1 70B locally (20 t/s with an internal 4090). Feedback welcome!

Posted by Maleficent_Cap9844@reddit | LocalLLaMA | 23 comments

https://reddit.com/link/1p8aul3/video/qh2bk1u0pu3g1/player

Hi everyone,

I currently work here in China (Shenzhen) on a small hardware team called Harbor. We sit pretty much right at the source of the supply chain, and we spent the last year solving a problem that had been bugging us ourselves:

We wanted to host large LLMs (like Llama 3 or DeepSeek) locally without shipping our data off to the cloud. But the options were lousy:

Mac Studio: Very expensive, and you're locked into the Apple ecosystem (no CUDA).
Server rack: Too loud for a home office.
Standard PC: Draws too much power at idle and isn't a real NAS.

So we developed a prototype on site: a NAS chassis that sits compactly on a desk but has room for an internal full-size GPU (up to an RTX 4090).

The tech (benchmarks): Over the last few days we stress-tested the thing with various models (setup: Nexus chassis + Ryzen 7 PRO 8845HS + internal RTX 4090):

DeepSeek R1 70B (deep reasoning): A stable ~20 tokens/second. That's fast enough for fluid real-time chat with GPT-4-level intelligence. (Further down there is also another test with an AMD Radeon PRO W7900 at 12 tokens/second.)
32B models (high speed): Here we almost always get ~40 tokens/second. The text practically flies across the screen, faster than you can read.
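
As a rough sanity check on the 70B figures, here is a minimal sketch that assumes token generation is memory-bandwidth-bound, i.e. every generated token streams all weights from VRAM once. The parameter count, the ~4.85 bits per weight (typical of a Q4_K_M-style 4-bit quant) and the bandwidth values (stock vendor spec sheets) are assumptions, not measurements from this setup.

```python
# Rough decode-speed ceiling under a memory-bandwidth-bound assumption:
# tokens/s <= memory bandwidth / bytes of weights read per token.

PARAMS = 70.6e9         # assumed parameter count of a Llama-style 70B model
BITS_PER_WEIGHT = 4.85  # assumed average for a Q4_K_M-style 4-bit quant

# Assumed peak memory bandwidth in bytes/s (stock vendor specifications).
BANDWIDTH = {
    "RTX 4090": 1008e9,
    "Radeon PRO W7900": 864e9,
}

# Tokens/second reported in the post for the 70B model.
REPORTED = {
    "RTX 4090": 20.0,
    "Radeon PRO W7900": 12.0,
}


def weight_bytes(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in bytes."""
    return params * bits_per_weight / 8


def decode_ceiling(bandwidth: float, model_bytes: float) -> float:
    """Upper bound on tokens/s if each token reads all weights once."""
    return bandwidth / model_bytes


model_bytes = weight_bytes(PARAMS, BITS_PER_WEIGHT)
ceilings = {gpu: decode_ceiling(bw, model_bytes) for gpu, bw in BANDWIDTH.items()}

for gpu, ceiling in ceilings.items():
    share = REPORTED[gpu] / ceiling
    print(f"{gpu}: ceiling ~{ceiling:.1f} t/s, "
          f"reported {REPORTED[gpu]:.0f} t/s ({share:.0%} of ceiling)")
```

Under these assumptions both ceilings land in the low twenties, so the reported speeds are only plausible for a roughly 4-bit quantized model, not for 16-bit weights.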

Because the Ryzen 7 runs very efficiently, there is enough thermal headroom to keep the GPU cool in the compact case (we designed separate air chambers).

Now to the actual point (and why I'm posting here): We launched on Kickstarter a week ago and, hand on heart, the start has been rather slow so far (not to say pretty sluggish).

We're engineers rather than marketing pros. Maybe we explained the problem poorly, or the price of the barebone kit ($799) puts people off because they think there's no CPU in it (there is; the Ryzen 7 is soldered on).

I'd like your honest opinion: Is the all-in-one concept (NAS + AI server) interesting to you, or would you rather build something like this yourself from individual parts? Are we missing anything obvious that would keep you from backing?

The link is in the comments. I'm grateful for any brutal feedback so we can still turn things around.

Best regards from China

Evening_Ad6637@reddit

My honest opinion, and don't take this personally: Something is completely wrong with your description. You say Llama-3.3-70B, unquantized, on an RTX 4090. That's not possible. Unquantized, this model is about 150 GB, and the RTX 4090 only has 24 GB.

However, your screenshots show that you're not using an RTX 4090 at all, but an AMD Radeon PRO W7900 with 48 GB.

Also, you're using ollama (which is a bad idea when it comes to performance, by the way), and the model specified there is provided by ollama as a Q4_K_M quant, which is another reason not to use ollama. Their model naming is intentionally misleading (https://ollama.com/library/deepseek-r1:70b).

Otherwise, regardless of the criticism above: Having a compact case that fits a 35 cm long, three-slot-wide GPU is a nice idea, but honestly, that's about it. I would probably only be interested in the case and wouldn't spend more than 100 euros on it. As an average consumer, I don't need a huge amount of hard drive space, and I actually prefer to put together the motherboard, CPU, RAM, etc. myself. On Kickstarter, it says it comes with 2×16 GB of RAM. That could be seen as a bad joke in this sub. A workstation should really have 128 GB, with 64 GB as the absolute minimum.

From a business perspective, the whole thing would be too little and too small for me.

Personally, I feel like your idea is kind of neither fish nor fowl.
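
The size argument above can be checked with rough arithmetic. The following is a minimal sketch, not an exact accounting: the parameter count and the bits-per-weight averages for each format are assumptions, and it counts weights only (no KV cache or runtime buffers).

```python
# Approximate weight footprint of a 70B-class model at different precisions,
# compared against the VRAM of the two cards discussed in this thread.

PARAMS = 70.6e9  # assumed parameter count for a Llama-3.3-70B-class model

# Assumed average bits per weight for each storage format.
BITS_PER_WEIGHT = {
    "FP16/BF16 (unquantized)": 16.0,
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
}

# VRAM capacities in GB as quoted in the thread.
VRAM_GB = {"RTX 4090": 24, "Radeon PRO W7900": 48}


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight size in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def fits(size_gb: float, vram_gb: float) -> bool:
    """True if the weights alone fit; real usage needs extra headroom."""
    return size_gb < vram_gb


for fmt, bits in BITS_PER_WEIGHT.items():
    size = weights_gb(PARAMS, bits)
    verdict = ", ".join(
        f"{gpu}: {'fits' if fits(size, cap) else 'too large'}"
        for gpu, cap in VRAM_GB.items()
    )
    print(f"{fmt}: ~{size:.0f} GB ({verdict})")
```

With these assumptions the 16-bit weights come out in the same ballpark as the ~150 GB quoted above, and only the 4-bit quant fits on the 48 GB card.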

Maleficent_Cap9844@reddit (OP)

I'm also trying to figure out the comparison. For example, the UGREEN NASync DXP4800 now costs 599 USD during Cyber Monday. However, from a specs perspective, apart from the GPU, the Nexus is better in almost every dimension: 8 GB ECC vs. up to 96 GB, 1× 10GbE + 1× 2.5GbE vs. 2, a stronger CPU, etc. So I'm really trying to figure out how to show the value. Just to be clear, I'm not trying to defend myself here :D I'm really just trying to figure out where exactly the issue is.

Evening_Ad6637@reddit

I understand what you mean. I think the problem is that it's difficult to position this product as an LLM powerhouse. It's a powerful, compact NAS server and looks really nice - there's no doubt about that.

But mentioning LLMs in this setting, especially a heavyweight (dense) beast like an unquantized 70B Llama, seems pretty misleading to me. I mean, it's basically impossible to run this model on top of this hardware-foundation since there's no PCIe card with that much capacity.

But even if you were to quantize it to Q8_0, you'd still have to pay around $7,500 for an RTX Pro 6000 Blackwell to be able to run the model, and then only with low context.
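
How much context is left after loading Q8_0 weights depends on the KV cache size per token. The sketch below is an estimate under stated assumptions: a Llama-3-70B-style attention layout (80 layers, 8 KV heads, head dimension 128), an FP16 KV cache, 96 GB of usable VRAM, and a hypothetical 2 GB allowance for runtime buffers. None of these values come from the thread, and the result shifts considerably if any of them change.

```python
# Estimate how many tokens of KV cache fit next to Q8_0 weights on a 96 GB card.

PARAMS = 70.6e9        # assumed parameter count
Q8_0_BITS = 8.5        # 32 int8 weights plus one fp16 scale per block
CARD_BYTES = 96e9      # assumed usable VRAM of the card
OVERHEAD_BYTES = 2e9   # hypothetical allowance for compute buffers

# Assumed Llama-3-70B-style attention shape (grouped-query attention).
N_LAYERS = 80
N_KV_HEADS = 8
HEAD_DIM = 128
KV_BYTES_PER_VALUE = 2  # fp16 cache entries


def kv_bytes_per_token() -> int:
    """Bytes of K and V stored per token across all layers."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * KV_BYTES_PER_VALUE


def max_context_tokens(card: float, weights: float, overhead: float) -> int:
    """Tokens of KV cache that fit in the memory left after weights and overhead."""
    free = card - weights - overhead
    return max(0, int(free // kv_bytes_per_token()))


weights = PARAMS * Q8_0_BITS / 8
tokens = max_context_tokens(CARD_BYTES, weights, OVERHEAD_BYTES)

print(f"Q8_0 weights: ~{weights / 1e9:.0f} GB")
print(f"KV cache per token: {kv_bytes_per_token() / 1024:.0f} KiB")
print(f"Approximate context budget: {tokens} tokens")
```

A quantized KV cache or a smaller overhead allowance would raise the budget; a larger batch or multiple parallel sessions would lower it.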

I'm no marketing expert, but from the perspective of a consumer who is familiar with LLMs, I find that the current presentation creates quite high expectations, and I can well imagine how bad the disappointment might be afterwards.

Maleficent_Cap9844@reddit (OP)

Hey, thanks for the response. I forgot to mention this in the post: as you noted, we used an AMD Radeon PRO W7900 for this particular test, which reached around 12 tokens per second. We also ran another test with a 48 GB 4090, which reached 20 tokens per second.