Does the new Jetson Orin Nano Super make sense for a home setup?
Posted by Initial-Image-1015@reddit | LocalLLaMA | View on Reddit | 101 comments
I only use LLMs on our cluster at work and don't have anything at home yet. Does the Jetson make sense for tinkering (e.g., a local voice assistant)?
https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/nano-super-developer-kit/
Does someone know what can be run on it? Would a base mac mini be better?
kyuubi840@reddit
No. It's a dev kit meant for robotics and similar stuff. It has an ARM CPU, and it requires specially compiled torch and onnxruntime. I wouldn't be surprised if it also required specially compiled llama.cpp and others. Nvidia provides some binaries, but you need to search for them and match them to the JetPack version you're running.
If you're savvy, it can be fun to have. It's like a very performant Raspberry Pi. But I wouldn't recommend it for general tinkering.
WritingStrong8516@reddit
I'm personally considering the Jetson or an old Tesla M80. The Tesla (plus my existing PC) appears to be the more powerful option by far, but the Jetson does a decent job at just 10% of the wattage of the M80 GPU alone, let alone my whole system. What do you all think?
Regret92@reddit
Did you end up choosing?
Single_Sea_6555@reddit
What size llama3.2 did he get 21 tok/s for? (I couldn't make it out in the video.)
That might be usable, and might be worth comparing with existing 15W Intel and GPU options.
Old_Atmosphere_9406@reddit
He used both the 70B and 3B. He also added an SSD to his device.
privilegedbot-maga@reddit
How does a 70B model fit into 8GB of memory? Does ollama support some kind of offloading-to-SSD feature?
profcuck@reddit
I don't think he did use the 70B. I just watched/skimmed the video, so I could have missed it but I don't think so. Perhaps /u/Old_Atmosphere_9406 can clarify.
eleqtriq@reddit
You can see him type the command. It’s the default. It’s confirmed.
petercooper@reddit
I haven't watched his video, but I watched ETA Prime testing it and he also got about 20 tok/s. Which makes sense: the model is quantized and takes up about 2GB, and on a board with 60GB/s of memory bandwidth, 30 tok/s would be the maximum. Out of the box and unoptimized, that's not bad.
ggone20@reddit
At the end of the day it’s a Linux box. I’ve used jetson products for a while and yea just depends on your use case. 67TOPS and 6ish gb vram is plenty to do ‘damage’ at the edge.
octagonaldrop6@reddit
I have a Jetson Nano dev kit, which is basically the same thing but older/slower. I definitely would recommend these boards for tinkering. JetPack comes with everything you need to get started with AI. I was able to use it with a TensorFlow image classification model, no problem.
If all you want is to run off the shelf models in a GUI it’s probably not for you, but it’s great for tinkering. Mac Mini would definitely be better in that case and more memory = larger models.
No_Afternoon_4260@reddit
No. It's slow, has 8GB of memory that's shared with the OS, so maybe 6 or 7GB usable. This is meant for robotics (mostly vision and kinematics models), not for LLMs!
GoodbyeThings@reddit
are there any comparable options for llms and robotics?
ReadyCrazy3551@reddit
Nvidia P40 24GB. It's suitable for a home PC, or for your own server with its own AI, but there are no fans and it's a pretty high level of DIY. The regular 3090 is easier to handle, but more expensive.
muxxington@reddit
The P40s are even sold complete with a 3D-printed flange and fan. What DIY are you referring to?
ReadyCrazy3551@reddit
It's possible, but I've only seen cards with passive cooling, and even used ones are hard to find in our country.
muxxington@reddit
Ebay is full of offers like this one.
GoodbyeThings@reddit
by now you also have those DGX Spark. But they are insanely expensive :D
Thanks for sharing
No_Afternoon_4260@reddit
Depends on your budget
GoodbyeThings@reddit
I guess something like the jetson or up to 500
No_Afternoon_4260@reddit
Tbh 500 won't buy you the hardware to run any kind of meaningful llm in the way you might think.
Imho small models (<22B) are really meant for research purposes or highly specific tasks (usually through fine-tuning, which is expensive in compute).
And models that size are only useful for so much.
So at that budget you might find hardware that runs small LMs rather slowly, which forces slow iteration and thus slow working/learning.
But you can rent: expect a 3090 for less than a dollar an hour, and that's a really fast 24GB of VRAM compared to the Jetson.
The Jetson series is really meant for robotics and vision tasks.
Does that answer your question?
jeremymorgan@reddit
I agree to a point. I could run most 1B and 3B models with it and it does pretty well. I do agree with you that it's probably not its best use case.
No_Afternoon_4260@reddit
I agree; for that use case you probably won't find better at this price point.
sourceholder@reddit
Wrong. It's perfect for LLMs, assuming your robot talks the talk and walks the walk.
random_guy00214@reddit
You would want like 4 of these things. 1 for walking. 1 for LLM with tts and stt. 1 for robotic arm control. 1 for chain of thought thinking.
No_Afternoon_4260@reddit
In short these are called nano super not mega super lol
Different_Fix_2217@reddit
No, new intel cards might be the way to go, rumor is they might release a 24GB card cheap.
Funcron@reddit
Sorry for the necro, found this via Google.
This didn't age well lol
SlowThePath@reddit
My understanding is that non Nvidia cards are pretty limited for these types of tasks. You need CUDA pretty much.
GeekDadIs50Plus@reddit
Everyone here has an opinion and not one of them starts with them actually owning one.
Except me. I love my 8GB dev kit. Yes, it was a pain to set up 2 years ago. I was lucky enough to get one from a batch that had the wrong firmware pre-loaded. And the various NVIDIA sources of documentation frequently conflicted with each other.
It’s been operational for well over a year and I use it a lot. Yes, there are quirks, but when it’s doing its processing jobs, it’s rock solid.
It's Ubuntu. Once the initial setup was complete, I turned off the desktop window manager to save memory. I load a few models but tend to use YOLO-11L the most, for detection and segmentation. It's all written in Python; I use Prefect for orchestration and only rarely use Docker on it.
Overall I’m really happy with it as a utility system that supports a ton of models and has a lot of flexibility. Can’t beat the price, either.
Oh, the case? Not even required. But I printed one with my 3D printer. Not everyone has one but there are places you can order prints.
Weak-Abbreviations15@reddit
Could they be connected in series to leverage RAM gains?
SlowThePath@reddit
This is what I want to know. It seems odd that it has just 8GB at $250. It makes way more sense to me to have it be 16GB for $350 and 24GB or 36GB for $450 or $500. That would be way more compelling for me. That said, people have pointed out this thing is aimed at robotics, not LLMs.
DerFreudster@reddit
You can get the Orin Nano board at 16GB:
https://www.arrow.com/en/products/900-13767-0000-000/nvidia?gQT=2
Then load them onto a Turing Pi board.
SlowThePath@reddit
That would be sick, but it's 700$.
DerFreudster@reddit
Yeah, Nvidia love$ to rake it in.
GeekDadIs50Plus@reddit
There is also a 64GB AGX Orin module that can run on a backplane or solo. Packing 2048 GPU cores AND 12-core CPU. Granted, it’s $1,800.
GeekDadIs50Plus@reddit
I wish, but no.
DerFreudster@reddit
You could use a Turing Pi board to connect them in series.
Spare_Salad714@reddit
I'd like to know if anyone knows of a board that can run AI. I need the AI to manage some communication processes. I had my eye on this Jetson Nano Super, but I'm a bit wary, at least for starting the project, because spending 200 thousand on a board right at the start is complicated.
zahamm@reddit
This runs about 20 tps using dolphin-llama3 uncensored. Totally offline and operates as a WiFi access point: https://github.com/zhamm/jetston-ollama-offline-ai
monkkbfr@reddit
Dave Plummer says yes, you can, and it works quite well. And he proves it:
https://youtu.be/e-EG3B5Uj78?si=qcIwyC6T9FuNZdj3
Calcidiol@reddit
The thing you want for a local tinkering setup is an amd64 PC with around 64-96GB DDR5 RAM, maybe an 8-core CPU, and an Nvidia GPU, probably starting off with a (second hand) P40 or 3060 series (higher VRAM, good reputation model, ...). Good quality 650-850W PSU (you're more likely to reuse it than many other things over a 7-10 year span).
You absolutely CAN buy less good / less CPU / RAM or even GPU, maybe even look at AMD 7xxx or Intel B580 DGPUs, but I'm suggesting that "base line" not because you CANNOT get interesting results with less but because it's going to have a good probability of serving you well for years to come, have some "future proofing" where you won't regret not having bought better too soon, and you're not likely going to be hugely limited by any single thing.
The GPU limitation is VRAM which is the (by far) EASIEST limit to hit very quickly, so I suggest starting at 16 or 24 GBy VRAM size if that's affordable and seems reasonable because it's the biggest limit on the model size range, model quality you can run at all "fast" and also how much context size you can use for input text / prompts. So REALLY avoid 8-10-12 GB GPUs if possible.
Nvidia not because I am in love with them but they have better SW compatibility by far with LLM inference tools / SW and "how to" guides. It just saves hassle / time unless you're very skilled, informed, determined to jump through more hoops with other options even if they're practicable sometimes.
Why not jetson? Almost all useful LLMs are bigger than CPU cache: gigabytes, multi-gigabytes for the really very useful models. The speed at which you can stream through N gigabytes of model weights once is the speed at which your LLM generates one token of output text.
The jetson, while nice, has at most 102 GB/s RAM bandwidth. That'll be around 2x what your PC consumer type CPU/motherboard has. But it's around 1/2 or 1/4 of what you can buy in a $400 give or take DGPU, so your GPU card will be at least as fast if not 2x, 4x faster than the jetson for that reason.
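As a rough illustration of that bandwidth ceiling, here's a minimal sketch using the approximate figures cited in this thread (the model size and bandwidth numbers are ballpark assumptions, not measurements):

```python
# Rough upper bound on decode speed: each generated token requires streaming
# (roughly) the entire set of model weights through memory once.
def max_tokens_per_sec(bandwidth_gb_s: float, model_size_gb: float) -> float:
    return bandwidth_gb_s / model_size_gb

MODEL_GB = 4.5  # ballpark size of a Q4-quantized 7-8B model

for name, bw in [
    ("consumer dual-channel PC RAM", 50),   # ~half the Jetson's bandwidth, per the comment above
    ("Jetson Orin Nano Super", 102),
    ("RTX 3060 12GB", 360),
]:
    print(f"{name}: <= {max_tokens_per_sec(bw, MODEL_GB):.0f} tok/s")
```

Real-world numbers come in below these ceilings, but the relative ordering is the point.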
The jetson nano has 8GBy or maybe 16 GBy RAM IIRC. The desktop GPUs have 16-24 GB that's faster, and they've got faster compute (sometimes helpful) and are way easier to program / run software on from your PC as a development environment. Obviously the PC is easier to deal with for accessing SW / files / disk / network etc. too.
So "as a sw developer" you want a PC.
"as an embedded system / IoT gadget / appliance" tinkerer, well maybe you do want an "embedded system" for a small low power "always on" voice assistant or thing to use with "homeassistant" SW. Depends on whether you want a "pc based" voice assistant or a "screw it to the wall in the kitchen" one etc. There's nothing stopping you from using bluetooth speakers and microphones and stuff to send voice from your area of interest to / from a PC wherever your PC is.
PCs take like 10-50W or something running depending on power management etc. etc. If that's a problem for 24x7 use consider something else.
Other maybe suitable options are repurposing some second hand laptop or maybe even tablet / smart phone. Laptops (well PC ones anyway, not so much apple anything!) are easy to develop for and are going to run with less power use and have a built in "UPS" (battery). The GPU is going to sometimes be 1x or 2x the power of the jetson for a newer / better equipped laptop, maybe even worse than the jetson for an older laptop. Maybe you get intel, amd, or nvidia IGPU or DGPU on a laptop, so trade-offs.
Smart phones / tablets (android) are miserable to develop for but "possible" and then may have "just about useful" GPU / LLM capability for selected units but really you'll want to avoid or select carefully.
You can always play with voice assistant type LLM setups "in the cloud" for free using collab, kaggle, huggingface, google, aws, azure, whatever is free / free trial etc. that you can use for a year or what not. See if you like it for text / voice. Learn the SW, some python LLM API / UI / model use.
See what is interesting.
Jetson has ampere generation GPU tech so you CAN run similar cuda / LLM / inference SW on it (at the GPU level) as on a nvidia GPU on a desktop / laptop. At the OS level it'll 99% likely be using linux and either "server" configuration (text / shell / command line / container / web interface) or "limited desktop" configuration where you have a real windowing GUI as well as whatever web interface, CLI/TUI etc. you might also run.
llama.cpp is a good inference engine for model inference but things like 'homeassistant' SW will have more 'ready to download & run' assistant and server and home automation trial options. Then there are the different 'assistant / agent' SW UIs / apps you can run on top of an OpenAPI compatible (or similar 3rd party) LLM API and they'll do whatever with ASR / TTS / voice in / voice out etc. etc.
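Whichever front end you pick, they all end up talking to the model over that OpenAI-compatible API. A minimal sketch, assuming a llama.cpp server (or ollama's compatible endpoint) already running locally; the port and model name below are placeholders:

```python
import requests

# Ask a locally hosted, OpenAI-compatible endpoint a question.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",   # assumed llama.cpp server address
    json={
        "model": "local-model",  # placeholder; many local servers ignore this field
        "messages": [{"role": "user", "content": "Turn on the living room lights."}],
        "max_tokens": 128,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```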
Easier to play with random app SW on a PC + DGPU system or maybe cloud system than some embedded system SBC thing.
SlowThePath@reddit
Man, this thing really makes me want to have a small box just to run Home Assistant and have local llama voice control of Home Assistant. Having a small (ideally low power) box dedicated to running models locally is very appealing to me. As a side note, I'd also want it to control an LED array. I'd love to have something the size of this Orin Nano Super do that, but from your comment it seems like it's maybe barely possible to do all that, but not ideal. Sorry, I'm thinking out loud here, but basically I'm wondering if you know of something small that can do that, or if there is a way I can build a small PC that would be decent at running multiple models. The problem I have with the PC is that it doesn't have the pins to control the LED array, but the Nano Super does. Hmmm, what do you think?
Calcidiol@reddit
If you're looking for voice control, you can get very limited "word / command recognition" speech recognition software that will even work on non-linux / non-GPU / non-NPU type MCUs: basically a fixed set of prompts / functions "voice remote control", e.g. the following for ESP32 MCUs, though you could do similarly with the right SW on anything more powerful than that, including many raspberry pi / whatever pi / etc. boards. Raspberry pi 5 boards have a real GPU and gigabytes of RAM, so you could run real ML audio / speech / image / text models on them to an extent, and there's an AI HAT series of add-on boards that accelerate the AIML processing for the raspberry pi board, enabling a lot of pytorch / onnx / whatever-is-compatible models to run much faster. IDK about the benchmarks vs. the newer jetson nanos here (which are pretty powerful all things considered).
https://github.com/espressif/esp-sr
https://www.raspberrypi.com/news/raspberry-pi-ai-hat/
A small PC that runs multiple models could be done with a laptop with good NPU / IGPU built into the SOC CPU chip like some of the newer "AI" enabled family of AMD "mobile / AI" oriented CPUs. Some mini PC (not laptop) makers are starting to make mini PC products based on like the Strix Point "Ryzen AI .... series" CPUs.
One big thing that may be announced or released or who knows at CES 2025 (january 7-10 2025 in a few weeks) are the code name new "strix halo" chips from AMD which leaks / rumors / prerelease briefs suggest will have kind of a big powerful IGPU / NPU by previous mobile AI processing standards. The RAM to CPU interface is supposed (rumor) to be 256 bit wide and use fast LPDDR5-something so that would potentially make it into the 100 GBy/s or better range with unknown GBy attached to the motherboard but since it's oriented for high capability laptops one assumes they'd be offering like 32GB, 64GB, 96GB, who knows range at least. So locallama people have been keeping watch for it (Strix Halo / Ryzen AI Max + series) since it should have maybe more than 2x the RAM to CPU bandwidth of a high tier consumer gaming desktop PC and a decent NPU/IGPU so it might be a cost effective and "decent" way to inference ML models up to 70B-120B+ range without needing to have multiple GPUs with 24GBy VRAM just to have enough fast-enough RAM.
https://videocardz.com/newz/amd-confirms-ryzen-ai-max-300-naming-for-strix-halo
https://mikeshouts.com/minisforum-elitemini-ai370-next-gen-amd-strix-point-ai-mini-pc/
So watching out for the next AI model miniPCs / laptops and their benchmarks for LLM would be good in about 3-4 weeks.
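For what it's worth, the rumored bus width and memory speed above imply roughly this theoretical peak (a back-of-the-envelope estimate from the leaked figures, not a confirmed spec):

```python
# Theoretical peak bandwidth from the rumored Strix Halo memory configuration
bus_width_bits = 256        # rumored 256-bit bus
transfer_rate_gt_s = 8.0    # LPDDR5X-8000 -> 8 GT/s
peak_gb_s = bus_width_bits / 8 * transfer_rate_gt_s
print(peak_gb_s)            # -> 256.0 GB/s theoretical peak
```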
As for having a "small pc" that controls stuff -- why not make the PC somehow communicate with an SBC (either MCU or application processor SOC based), and the MCU can do whatever is needed with GPIO pins, SPI, UART, I2C, CAN, etc. Pretty much any $5-$10 ESP32 MCU board can talk to your PC / network over WiFi or Bluetooth, and the PC can send commands to have it control LED boards or whatever you want (see the sketch below). You can program it using arduino tools IIRC. Or same concept with a raspberry pi 4 / 5 board with networking talking to your PC / server / cloud that runs bigger LLMs.
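For the LED-array question specifically, here's a minimal MicroPython sketch of that split (PC does the LLM work, a cheap ESP32 just flips a pin on command over WiFi). The SSID, password, GPIO pin, and port are placeholders, not a tested setup:

```python
# MicroPython on an ESP32: listen on WiFi for "on"/"off" commands from the
# PC running the LLM, and switch a GPIO pin driving the LED array.
import network, socket
from machine import Pin

led = Pin(2, Pin.OUT)                      # GPIO2 is a placeholder; use your LED/driver pin

wlan = network.WLAN(network.STA_IF)
wlan.active(True)
wlan.connect("your-ssid", "your-password")  # placeholders
while not wlan.isconnected():
    pass

server = socket.socket()
server.bind(("", 8080))                    # arbitrary port
server.listen(1)
while True:
    conn, _ = server.accept()
    cmd = conn.recv(16).decode().strip()   # the PC sends a plain-text command
    led.value(1 if cmd == "on" else 0)
    conn.close()
```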
Check out the faster TTS options, maybe piper and so on and there should be something that is fast and not that large in its RAM / VRAM requirements.
There are also speech+text input LLMs and speech in / speech out ones IIRC so you might be able to do more with less running models, it'll take some size / speed / trade-off analysis.
https://www.cnx-software.com/
That's a good blog / review site to watch for SBCs / SOCs / MCUs which have AIML relevant capability looking for keywords like "npu", "gpu", "tops" etc. There are a bunch of relevant things for small PCs, embedded SBCs, devkits, etc. and maybe something will be ideal for a given embedded or "server" use case.
Well I definitely would encourage you to brain storm / ideate use cases like you're mentioning and consider "what could it possibly do?", "how could it possibly do it in various ways / approaches / configurations?", "what are the main trade-offs / benefits / limits of various SW/HW etc?".
For learning the SW and just coding / testing some PC and/or cloud based SW development & ML running / serving tools are ideal.
For embedded systems that can do useful things independently or let's say less reliant on connecting to an external server / gateway / cloud device / PC then there are lots of options. That includes these jetson devices, even other things like the raspberry pi boards (which have useful GPUs in the full capability "5" boards and which also recently got some kind of optional add-on NPU ML accelerator board available for some uses IIRC).
"whisper" (ML family of model names which are associated with like a dozen different library/software/app projects to make use of the model technology) based software can do like typically 90-95% accurate automated speech recognition and that's possible to run in real time for live voice to text conversion on a lot of various devices like jetson nano, raspberry pi 5, probably most other linux based + ARM SOC based devkits (orangepi, rockchip IC based, etc. etc.). So then you can take that text and feed it to a really small (if adequate for your needs, or a bigger one if it's suitable for your desires) LLM and have the LLM do something like "tool use", "function calling", or just direct interactive chat back and forth. That's ASR (automatic speech recognition) / voice input type of software including stuff like 'whisper' based SW.
There are TTS (text to speech) models / SW / tools that let it of course speak back to you by converting text to some style of voice you want in various possible languages etc. The text can come out of the response from your assistant LLM.
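As a concrete example of gluing those pieces together, here's a minimal sketch of an ASR -> LLM -> TTS loop using openai-whisper for transcription and the piper CLI for speech output. The model names, audio file path, and the `ask_llm()` helper (which would forward text to a local LLM endpoint like the one sketched earlier) are placeholders, not a tested configuration:

```python
import subprocess
import whisper  # pip install openai-whisper

asr = whisper.load_model("tiny")                     # small model, edge-device friendly

def ask_llm(prompt: str) -> str:
    # Placeholder: send the prompt to your local LLM endpoint and return its reply.
    return f"You said: {prompt}"

def speak(text: str, wav_path: str = "reply.wav") -> None:
    # Pipe text into the piper CLI; the voice model filename is a placeholder.
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", wav_path],
        input=text.encode(),
        check=True,
    )

heard = asr.transcribe("command.wav")["text"]        # recorded utterance (placeholder file)
speak(ask_llm(heard))
```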
Then there are multimodal models that intrinsically include the ability in some cases to input a mix which could include images, video, audio / speech, text and which may be able to respond / generate text, images, video, audio/speech as a response.
And then there's all the SW frameworks that mix / match / make use of various ML models to do assistant / agent / Q&A / chat / whatever scenarios using possibly a variety of SW libraries / apps / ML models / embedded hardware pieces, some functions which can optionally be performed by other devices like a home PC/server, possibly cloud service / server connection etc.
Even kind of "low end" (much much less compute / ML capability than the jetson or a base raspberry pi 5) microcontrollers can be used for basic voice input command & control and voice response. You don't get anything like conversational / complex responses but you could do "wake word" processing, or accept commands like "turn the lights on", "turn on the air conditioner", "what time is it?", "Is the dog in the house?" or whatever and it can deal with it based on its own capabilities and respond or it can send interactive requests to other services / servers to fulfill the task.
The original "alexa", "google nest" etc. voice assistants did not have any "built in" LLM capabilities, just enough speech recognition to detect when you were talking to them and send the rest of the speech / request to the cloud server to process / generate the response.
More and more things are doing local ML / LLM / multimodal processing "on device" though so yeah the sky is the limit wrt. what you can set up for your interactive app / assistance / utility given enough processing power. Multimodal models enable a more integrated and often faster combination of audio/speech input processing to generate a LLM output response directly in one model as opposed to having an ASR model feeding text to a LLM etc.
https://github.com/THUDM/GLM-4-Voice/blob/main/README_en.md
https://old.reddit.com/r/LocalLLaMA/comments/1gbzbnp/glm4voice_zhipu_ais_new_opensource_endtoend/
https://github.com/usefulsensors/moonshine
https://petewarden.com/2024/10/21/introducing-moonshine-the-new-state-of-the-art-for-speech-to-text/
https://nexa.ai/blogs/omniaudio-2.6b
https://huggingface.co/NexaAIDev/OmniAudio-2.6B
https://old.reddit.com/r/LocalLLaMA/comments/1hdoplq/omniaudio26b_worlds_fastest_audiolm_for_edge/
https://huggingface.co/Qwen?search_models=audio
https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct
https://github.com/rhasspy/piper
https://rhasspy.github.io/piper-samples/
https://github.com/snakers4/silero-vad
https://huggingface.co/nvidia/canary-1b
https://huggingface.co/metavoiceio/metavoice-1B-v0.1
https://www.ultravox.ai/
https://huggingface.co/fixie-ai/ultravox-v0_4_1-llama-3_1-8b
https://old.reddit.com/r/LocalLLaMA/comments/1dwi7ko/tongyi_speechteam_funaudiollm_voice_understanding/
https://fun-audio-llm.github.io/
https://github.com/erew123/alltalk_tts/tree/alltalkbeta
https://github.com/Vaibhavs10/open-tts-tracker?tab=readme-ov-file
https://github.com/coqui-ai/TTS
https://github.com/fishaudio/fish-speech
https://huggingface.co/fishaudio/fish-agent-v0.1-3b
https://huggingface.co/docs/api-inference/index
https://speechbrain.github.io/
https://github.com/speechbrain/speechbrain
Adept-Kaleidoscope13@reddit
I didn't think I was coming here for this? But this is what I didn't know I wanted. Great discussion and resources. THANKS!
Calcidiol@reddit
Strix halo benchmarks / info apparently are starting to come out in pre-release articles:
https://www.tomshardware.com/pc-components/cpus/mysterious-amd-ryzen-ai-max-pro-395-strix-halo-apu-emerges-on-geekbench-processor-expected-to-officially-debut-at-ces-2025
https://www.techspot.com/news/106003-amd-ryzen-ai-max-finally-emerges-gaming-2.html
https://www.techpowerup.com/329696/amd-ryzen-ai-max-pro-395-strix-halo-apu-spotted-in-geekbench-leak
https://www.tomsguide.com/computing/amd-ryzen-ai-max-plus-395-benchmark-has-leaked-packed-into-a-new-asus-rog-flow-z13-gaming-2-in-1
https://wccftech.com/amd-strix-point-apus-upgraded-lpddr5x-8000-krackan-point-strix-halo-get-96-gb-memory/
https://www.tweaktown.com/news/102183/amds-new-ryzen-ai-max-395-strix-halo-apu-inside-asus-rog-flow-faster-than-7945hx3d-cpu/index.html
https://www.tomshardware.com/pc-components/cpus/amd-strix-halo-rdna-3-5-igpu-rumored-to-launch-under-the-radeon-8000s-branding-up-to-40-cus-and-support-for-lpddr5x-8000-memory
So it could be interesting for LLM / ML models that are bigger than the GPU VRAM one is going to have, and which current consumer PCs are too slow for (RAM BW bottleneck).
Or for a compact, relatively low-power PC that can run ML models to some degree without needing a large / power-hungry DGPU.
Adept-Kaleidoscope13@reddit
The general argument has been for 12GB GPUs. For anything more intensive than a virtual assistant, they are absolutely correct; however, you mentioned that use case specifically.
Power consumption is a prime consideration for a self-hosted virtual assistant, etc. The initial cost of a 12GB RTX 3060 is more than the Jetson Orin Nano Super, and then you throw in the cost of the rest of the PC, but you must also take into account the running power cost.
The average power draw of the 12GB 3060 is 170W, with typical spikes up to 400W, although that would be rare. However, that doesn't include the rest of the PC. You can't use sleep mode for a virtual assistant and have it still be usable, so you get "full idle," which runs about 40W for the GPU alone, without the PC. Given the cost of electricity, the total becomes prohibitive running 24/7, especially if you use it for playing media, etc.
The Jetson Orin Nano Super, all told, maxes out at 25W. I'll let you do the math.
For projects like a virtual assistant, factoring in both the higher initial cost and ongoing power costs, something like the Nano Super becomes the more sensible option. For other applications, a 12GB GPU definitely becomes the better alternative.
Won3wan32@reddit
Buy a Mac Mini M4. It's the same price and much better.
roshanpr@reddit
Jetson nano is $250. Not the same price.
randomfoo2@reddit
Neither are very good options. For $250 find a used/refurbed 12GB RTX 3060. I see a bunch for that price on eBay right now. It has 112 Ampere Tensor Cores (vs 32 on the Orin) and 360 GB/s of MBW (vs ~100 GB/s on the Orin or Mac Mini). If you don't have a PC to plug it into, go to Goodwill/a local computer recycler or ask anyone you know for a hand-me-down. Basically any box PC made in the past 5-10 years (a single PCIe slot, and a 300W+ power supply) would work.
roshanpr@reddit
but well more electricity consumption, plus all the other parts needed to create the setup; it's like apples and oranges
randomfoo2@reddit
I'm not sure I buy that. "All other parts" = practically any used PC. You could just as easily say that you need to do more work for the Orin since you need to find a case for it (or more realistically, that it'll be a lot more work since you're constrained to Orin-compatible distros and ARM-compiled binaries). If the OP has an LLM cluster at work, they already work tech-adjacent, so I'm going to call it a wash. The Mac Mini I'd agree is like "apples" and oranges - it'd be significantly harder to do AI/ML tinkering since MPS PyTorch support sucks.
For power, a 3060 will idle at about 10W, and the PC at probably 10-15W. Let's be completely pessimistic and say that it idles at 20W more than an Orin or Mac Mini. At the average residential US power rate of $0.17/kWh we're looking at a power cost difference of <$30/year. If the OP is casually considering a Mac Mini as an option (starts at $600, or an additional upfront cost of 12 years worth of idle power), I'm going to go out on a limb and guess that electricity consumption is also a non-factor. Note also that the 3060 has 3X the memory bandwidth and the same Ampere Tensor cores as the Orin, so when doing actual work, it will, if anything, be more energy efficient (consume less power for work done and go back to idle quicker) than the Orin.
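For anyone who wants to sanity-check that figure, a quick sketch using the same assumptions as above (20W extra idle draw, $0.17/kWh):

```python
# Back-of-the-envelope check of the idle-power cost difference cited above.
extra_idle_watts = 20
hours_per_year = 24 * 365                                  # 8760 h
kwh_per_year = extra_idle_watts * hours_per_year / 1000    # ~175 kWh
cost_per_year = kwh_per_year * 0.17                        # ~$29.8, i.e. under $30/year
print(f"{kwh_per_year:.0f} kWh/year, about ${cost_per_year:.2f}/year")
```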
The real question though is the original question: does the Orin Nano Super make sense for tinkering as a voice assistant? Well, considering it has 8GB of total memory (6-7GB max for models?), that a Q4 7-8B model is going to take up 5-6GB by itself, that Whisper Turbo will take up another 2GB+, and most decent voice models another 2-3GB, and that everything will run much slower, I'm going to stick to my conclusion and say that for the money, a 12GB 3060 is the best way to go. Obviously there are more options if the OP doesn't mind spending $600+, but you don't get great bang/buck until a 3090, which is a bit overkill (or if you're willing to spend your time fighting driver/framework issues - at $250 the B580 has sweet raw specs, but questionable software compatibility atm).
roshanpr@reddit
And that may work for you but again a 3060 is not a stand alone portable setup.
randomfoo2@reddit
"don't have anything at home yet" <- the OP doesn't say anything about a portable setup, you're just hallucinating now.
Short-Sandwich-905@reddit
The Nano has significant value for running voice assistant projects, especially when integrated into small robotics applications as a standalone device, something the 3060 can't offer since it's just one component in a larger/more expensive setup, even if using used parts. You're toxic; maybe you have an inclination to block people rather than to engage in meaningful discourse. Most of the specs you shared are factual, but raw power alone is not the only factor to consider for all projects. Focus more on contributing constructively instead of inflating your ego. Happy holidays
smaackle@reddit
let em know king
roshanpr@reddit
Hey OP, so in this thread users like **u/randomfoo2** think they know it all and say the Jetson Nano is useless for AI projects, with no real value compared to just spending more on a more expensive and non-portable setup with a 3060. Here are some receipts.
### Here’s what you *can* do with it:
- **LLaMA + RAG**: This combination allows your assistant to retrieve and process relevant information, giving you intelligent and accurate responses to queries.
- **Piper TTS**: It uses text-to-speech to respond in a natural, human-like voice.
- **Runs locally**: No need for cloud infrastructure. Everything is processed right on the Jetson Orin Nano—no latency, no privacy concerns.
🔗 [Check out the full tutorial here](https://my.cytron.io/tutorial/jetson-orin-nano-google-gemma-2-personal-assistant)
video tutorial for a more in-depth look at the whole setup: 🎥 [Watch the full video here](https://www.youtube.com/watch?v=2VljMU26Zq4&ab_channel=CytronTechnologies)
In the thread many imply the Nano is too underpowered for a full AI assistant, but turns out, **it’s actually quite the powerhouse** for edge AI, and definitely more than capable of handling this kind of setup.
jeremymorgan@reddit
I just got one and did an unboxing video here --> https://www.youtube.com/watch?v=JRhAMHxlo3E
I run Ollama on it and test out a few models and post the speed in there.
It's okay for LLMs, I am going to run some vision stuff with it in the coming weeks.
exponentfrost@reddit
There are some third-party Jetson devices as well (AliExpress, etc.) that are cheaper, which I've been considering due to the price difference and wanting to prototype. If price is a consideration, that might be an option?
Icy-Corgi4757@reddit
I think as a completely self sustained approach it is a very very good option. Sure, a 12gb gpu is the same price but that also factors in needing the rest of the components to run it as opposed to something "plug and play" like this.
For someone looking to get into localllms who doesn't have/want a pc build this can be an enticing option to play around. I am getting one tomorrow and am excited to do some testing on it and see how well it runs models.
TheActualStudy@reddit
You could spend $250 better. If your budget is a base mac mini, a used 3090 would be better than that. A 3090 can run QwQ and the other Qwen2.5-32B series models.
Spirited_Example_341@reddit
you cant get a used 3090 for 250 lol lol
TheActualStudy@reddit
Yeah, it was two thoughts. A Jetson Orin is $250, but a 3090 and a mac mini aren't too far apart in price. I could have phrased it more clearly.
Initial-Image-1015@reddit (OP)
I don't really have a computer to put the 3090 in, so that should be taken into account as well. I believe this makes it more expensive.
ArsNeph@reddit
The thing is, the highest VRAM GPU you can get for like $250 is a RTX 3060 or Arc B580. The B580 is a bit finicky in terms of software support. However, that'd leave you like $350 to build the rest of the PC. I don't think you'd be able to build much with that, unless you live near a Micro center. A Mac mini is incredibly powerful, and you can allocate 12GB of unified memory to LLM use cases. However, a Mac mini is not upgradeable. A PC is. The simple question is, do you see your compute needs going up anytime soon? If so, you may want to consider saving more for the PC option
g-nice4liief@reddit
You can use a GPU in an enclosure. Using Thunderbolt or USB-C should be enough as you're not gaming on it.
MoffKalast@reddit
That's another $150 for the enclosure and probably $50+ for a PSU, so really it's more like $550 extra.
Spirited_Example_341@reddit
well the 3090 itself won't be 250, and if you can find one for that price don't get it. lol
cyberdork@reddit
Just wondering, so you could connect a 4090 in an eGPU housing to a MacBook via Thunderbolt and get decent performance for things like ollama and ComfyUI?
g-nice4liief@reddit
Apple silicon does not support it, but if you use an Intel-based machine it should be possible
cyberdork@reddit
Thanks for the quick reply!
ethertype@reddit
Thunderbolt or USB4. USB C is just a physical interface.
Thunderbolt explicitly supports eGPUs. The USB4 specification allows for implementations which cannot transport PCIe traffic. No clue how common this is.
g-nice4liief@reddit
Sorry you're right i had a brainfart as i was doing multiple things at the same time lol. I meant USB4 indeed !
No_Afternoon_4260@reddit
Honestly any cheap DDR4 system does the trick, as you don't need the CPU to be powerful or its RAM to be fast
SomeOddCodeGuy@reddit
For the same price as that card, Intel just put out a $250 12GB video card that was getting pretty decent speeds. You could probably build a machine around it for another $200 if you got cheaper parts. So personally, I'd go that route before buying one of these for a home LLM machine.
TheDreamWoken@reddit
Nope, it's meant for embodied agents, not for home use, and it's going to take a lot more effort to set up too.
ranoutofusernames__@reddit
That’s one thing about these, it’s a lot of work to get up and running
BusRevolutionary9893@reddit
For your home's autonomous robotic surveillance unit, yes.
siegevjorn@reddit
For a local voice assistant it may be sufficient. Although the Maxwell architecture is pretty old, you may be able to run llama.cpp with 8B LLMs at a reasonable quant. It has 100 GB/s of memory bandwidth, so a 7.5GB model will give you a bit more than 10 tokens/sec, which is usable. You can add whisper.cpp on top of that to do voice recognition.
Radiant_Dog1937@reddit
It depends. It runs on 25 watts, which is the main draw that makes it better for edge computing in certain situations. It's not bad for 8GB of VRAM since it's an all-in-one package, and I've seen it run Llama 3.1 8B at around 30-ish t/s.
I don't know where you get a used 3090 for $250 like the guy above was mentioning and a mac mini is about $600 new.
VanVision@reddit
What quant and what parameters were set to make this possible on 8GB of VRAM?
Radiant_Dog1937@reddit
It was a review on YouTube. The guy simply installed ollama after setting up the Orin, downloaded Llama 3.1 8B, and it worked. I assume Q4, since I get that to work on my laptop GPU, which is also 8GB.
nanobot_1000@reddit
The benchmarks were with MLC, not ollama (which was 65% the performance), but it's just another OpenAI-compliant endpoint you swap out
Making a little tool now that spits out the right docker-compose templates for spinning these up
But I agree, it's primarily for robotics and embedded vision, and there's no need to bother with aarch64+CUDA if you don't need to. Although our container automation is pretty built out by now, so if the tokens/sec/$ is good to you, you could literally scale these out. There's the AGX Orin 64GB too, which you can host up to 70B on.
Radiant_Dog1937@reddit
I'm hoping someone might do a review that compares all of those: MLC, Ollama, and TensorRT-LLM. For a single board computer these seem pretty cool for a lot of applications where before there was only a Raspberry Pi doing 1 t/s.
nanobot_1000@reddit
Basically I am extending the scripts to benchmark different APIs through their OpenAI endpoints. Support in those servers has gotten better. I know from spot-checking llama.cpp/ollama that it is a lot slower, unfortunately. But fortunately the others have gotten easier to use and you can just script their deployments.
nanobot_1000@reddit
FYI here is the benchmark script currently https://github.com/dusty-nv/jetson-containers/tree/master/packages/llm/mlc
Small-Fall-6500@reddit
Probably up to 20T/s is achievable. Here are some hard numbers I found:
This comment/thread suggests 20T/s with a smaller model:
The official Nvidia blog claims running llama 3.1 8b at nearly 20T/s, with INT4 quantization using MLC API.
I'm not sure what INT4 is, but it may just be rounding the numbers, which I think would be equivalent to Q4_0 GGUF, which is a very basic, outdated quantization. There may be better alternatives with similar speeds, but I don't know. 20T/s is pretty close to the theoretical max (with 102GB/s bandwidth) of 25T/s with a 4bit 8b model.
Small-Fall-6500@reddit
It's 8GB of fast RAM, or 8GB of really slow VRAM. It is 102GB/s bandwidth, which is less than 1/3 of a 3060 12GB, for reference. DDR5 dual channel can be close to 100 GB/s (this post has some nice RAM data)
4-bit Llama 8b is maybe 4GB, so the best possible speed is about 25T/s if the memory bandwidth is fully utilized, which is often not the case. Here are some hard numbers I could find:
This comment/thread suggests slower speeds with a smaller model:
The official Nvidia blog claims running llama 3.1 8b at nearly 20T/s, with INT4 quantization using MLC API. I'm not sure what INT4 is, but it may just be rounding the numbers, which I think would be equivalent to Q4_0 GGUF, which is a very basic, outdated quantization. There may be better alternatives with similar speeds, but I don't know. 20T/s is pretty close to the theoretical max of 25T/s with a 4bit 8b model.
I would assume 20T/s is close to the best possible out of this system (for 4bit 8b), which isn't that bad.
Initial-Image-1015@reddit (OP)
That doesn't sound bad at all for a cheapish start! The mac mini is even less, only 500 with the student discount, but I have seen mixed reviews for inference speed.
MoffKalast@reddit
The mac mini has 16 GB vs 8 GB of this one though, and the mac has 120GB/s memory bandwidth compared to this at 102 GB/s too, so they should be very comparable in speed actually.
Remember this is shared memory for both, which includes the OS and anything else you'll be running, which quickly drops that 8 GB down to 7 GB, and an 8B model at 4 bits takes about 6 GB to load depending on context. So you don't have much room to maneuver and might need swap to keep it from locking up on occasion.
Initial-Image-1015@reddit (OP)
Ok, thank you for the explanation. Makes a lot of sense.
Nyghtbynger@reddit
The true question is : jetson orin nano super or RX 7600XT 16G ?
MikePounce@reddit
To have a good time you need CUDA, so nvidia.
Opteron67@reddit
rocm
MoffKalast@reddit
He that is without a working driver among you, let him first cast a rocm at AMD.
Nyghtbynger@reddit
Are you talking about windows or linux ?
Because having used both NVIDIA and now AMD (started recently), on Windows it's very straightforward, and having used Linux for server deployments, it doesn't seem that horrendous either
Tacx79@reddit
No, at best you get the same inference speed as with average speed ddr5 dual channel
powerofnope@reddit
Super depends on what you want to do. Just toy around? Of course. Get productivity out of it? Hell no.
rinaldo23@reddit
I think Jetsons only make sense for battery powered applications.
scottix@reddit
It's an edge device, so you can run small models with relatively low power. Think of something like a car or any control device. Unless you want to develop in that space, I don't see much use outside of that. You're not going to be running any big models, for sure.
a_beautiful_rhind@reddit
It's 8gb so you'll have to use some really small models.