Llama 3.1 70B handles German e-commerce queries surprisingly well — multi-agent shopping assistant results

Posted by m3m3o@reddit | LocalLLaMA | 10 comments


I built a multi-agent shopping assistant using NVIDIA's retail blueprint + Shopware 6 (European e-commerce platform). Wanted to share some observations about Llama 3.1 70B Instruct in a multilingual context.

Setup: 5 LangGraph agents, Llama 3.1 70B via NVIDIA Cloud API (integrate.api.nvidia.com), Milvus vector search, NeMo Guardrails.

Multilingual findings:

Intent classification works cross-language. The Planner agent uses an English routing prompt but correctly classifies German queries like "Zeig mir rote Kleider unter 100 Franken" (show me red dresses under 100 CHF). No German routing prompt needed.
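To make the cross-language routing concrete, here is a minimal sketch of the idea: an English-only classification prompt applied to a query in any language. The prompt wording, intent names, and helper function are illustrative, not the post's actual Planner code.

```python
# Cross-language intent routing sketch: the routing prompt is English,
# but the customer query is passed through untranslated.
# All names here are illustrative, not from the post's codebase.

ROUTER_PROMPT = """You are the routing agent of a shopping assistant.
Classify the customer query into exactly one intent:
- product_search: the customer is looking for products
- order_status: the customer asks about an existing order
- chitchat: greetings or small talk
The query may be in any language. Answer with only the intent name."""

def build_router_messages(query: str) -> list[dict]:
    """Build an OpenAI-style chat payload for the routing call."""
    return [
        {"role": "system", "content": ROUTER_PROMPT},
        {"role": "user", "content": query},
    ]

# The German query goes in as-is; the English prompt still constrains
# the model to one of the listed intent labels.
messages = build_router_messages("Zeig mir rote Kleider unter 100 Franken")
```

The payload works against any OpenAI-compatible chat endpoint, including the integrate.api.nvidia.com one from the setup above.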

Chatter prompt needs explicit bilingual instruction. Without it, the model responds in whatever language the system prompt is in, ignoring the query language. Adding "Respond in the same language the customer used" fixed this.
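In prompt form, the fix looks roughly like this. Everything here is illustrative filler except the one quoted instruction sentence from the post:

```python
# Chatter system prompt sketch. Only the last instruction line is from
# the post; the rest is placeholder context.
CHATTER_SYSTEM_PROMPT = (
    "You are a friendly shopping assistant for a fashion store. "
    "Answer questions about products, sizes, and availability. "
    # Without this line, replies came back in the system-prompt language:
    "Respond in the same language the customer used."
)

def build_chatter_messages(query: str) -> list[dict]:
    """Pair the bilingual system prompt with the raw customer query."""
    return [
        {"role": "system", "content": CHATTER_SYSTEM_PROMPT},
        {"role": "user", "content": query},
    ]
```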

NeMo Guardrails are English-tuned. German fashion terms triggered false positives. "Killer-Heels" (common German fashion term) got flagged as unsafe. If you're deploying for non-English markets, plan for guardrails calibration.
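One way to do that calibration, assuming you use NeMo Guardrails' stock `self check input` rail: override the `self_check_input` prompt so domain jargon is explicitly declared safe. The prompt text below is a sketch, not the post's config.

```yaml
# config.yml -- enable the standard input self-check rail
rails:
  input:
    flows:
      - self check input

# prompts.yml -- override the self_check_input prompt so German fashion
# slang like "Killer-Heels" isn't misread as violent content.
prompts:
  - task: self_check_input
    content: |
      You are moderating a customer message for a fashion store.
      German fashion slang such as "Killer-Heels" is normal product talk
      and must be treated as safe.
      Message: "{{ user_input }}"
      Answer "yes" if the message should be blocked, otherwise "no".
```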

Self-hosting question: For Swiss data residency (DSG compliance), you'd need self-hosted NIMs instead of the NVIDIA Cloud API. H100 GPUs run ~$2-4/hr per GPU on Lambda/Vast.ai. Has anyone here self-hosted the NVIDIA NIM containers for Llama 3.1 70B? Curious about real-world RAM/VRAM requirements.
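For anyone exploring this, the deployment is roughly the sketch below (image name and flags per NVIDIA's NIM docs at time of writing — verify against the current docs before relying on it). Back-of-envelope on VRAM: 70B parameters at FP8 is ~70 GB of weights alone, so plan on at least 2x 80 GB H100s once KV cache is included.

```shell
# Self-hosted NIM sketch: requires an NGC API key and the NVIDIA
# Container Toolkit. Image tag and mount path follow NVIDIA's NIM docs;
# treat both as assumptions to double-check.
export NGC_API_KEY=<your-ngc-key>
docker login nvcr.io -u '$oauthtoken' -p "$NGC_API_KEY"

docker run --rm --gpus all \
  -e NGC_API_KEY \
  -v ~/.cache/nim:/opt/nim/.cache \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-70b-instruct:latest

# The container serves an OpenAI-compatible API on :8000; point the
# LangGraph agents at http://localhost:8000/v1 instead of the cloud API.
```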

Full write-up: https://mehmetgoekce.substack.com/p/i-connected-nvidias-multi-agent-shopping