New embedding model "Qwen3-Embedding-0.6B-GGUF" just dropped.
Posted by Proto_Particle@reddit | LocalLLaMA | View on Reddit | 108 comments
Anyone tested it yet?
davewolfs@reddit
It was released an hour ago. Nobody has tested it yet.
BananaPeaches3@reddit
Then what is this from 2 days ago?: https://ollama.com/ZimaBlueAI/Qwen3-Embedding-0.6B
Chromix_@reddit
Well, it works. I wonder what test OP is looking for aside from the published benchmark results.
llama-embedding -m Qwen3-Embedding-0.6B_f16.gguf -ngl 99 --embd-output-format "json+" --embd-separator "<#sep#>" -p "Llamas eat bananas<#sep#>Llamas in pyjamas<#sep#>A bowl of fruit salad<#sep#>A sleeping dress" --pooling last --embd-normalize -1
You can clearly see that the model considers llamas eating bananas more similar to a bowl of fruit salad than to llamas in pyjamas - which in turn is closer to the sleeping dress.
When testing the same with the less capable snowflake-arctic-embed, it puts the two llamas way closer together, but doesn't yield as strong a distinction between the dissimilar cases as Qwen does.
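If anyone wants to reproduce this without llama.cpp, here's a minimal Python sketch of the same check. It assumes the sentence-transformers package and the Qwen/Qwen3-Embedding-0.6B checkpoint ID; the phrases are the ones from the command above.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed model ID

phrases = [
    "Llamas eat bananas",
    "Llamas in pyjamas",
    "A bowl of fruit salad",
    "A sleeping dress",
]

# Normalized vectors, so the dot product equals cosine similarity.
emb = model.encode(phrases, normalize_embeddings=True)
sim = emb @ emb.T  # 4x4 similarity matrix

for i in range(len(phrases)):
    for j in range(i + 1, len(phrases)):
        print(f"{phrases[i]!r} vs {phrases[j]!r}: {sim[i, j]:.3f}")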
socamerdirmim@reddit
What embedding model do you recommend? I am searching for a good one for SillyTavern RP games; currently I am using snowflake-arctic-embed-l-v2.0.
Chromix_@reddit
Just use the new Qwen3 0.6B as a free upgrade. You'll get even better results with their 8B embedding, but you probably don't have enough similar RP data there for this to make a difference.
socamerdirmim@reddit
Will try it. I have millions of tokens in chat history.
Chromix_@reddit
In that case I'd be interested to hear if you can see a qualitative difference between your current, the 0.6B and the 8B embedding.
FailingUpAllDay@reddit
This is the quality content I come here for. But I'm concerned that "llamas eating bananas" being closer to "fruit salad" than to "llamas in pyjamas" reveals a deeper truth about the model's worldview.
It clearly sees llamas as food-oriented creatures rather than fashion-forward ones. This embedding model has chosen violence against the entire Llamas in Pyjamas franchise.
Time to fine-tune on episodes 1-52 to correct this bias. /s
FourtyMichaelMichael@reddit
OK STOP.
I just want everyone right now, including OP, to think about how these words would have sounded in their own contexts up to, but less than, two years ago.
Historically, this is the ranting of a lunatic.
FailingUpAllDay@reddit
Wait until we're arguing about whether GPT-7 properly understands the socioeconomic implications of alpaca sweater vests.
Chromix_@reddit
Yes, and you know what's even worse? It sees us humans in almost the same way, according to the similarity matrix. Feel free to experiment.
slayyou2@reddit
Hey could you reupload the model somewhere? They took it down
Chromix_@reddit
The link still works for me. Same for the 8B embedding. Maybe it was just briefly gone?
slayyou2@reddit
Yea it's back now thanks anyway
Xamanthas@reddit
He is outsourcing his thinking to you. Thank the DeepSeek effect for bringing in mouthbreathers. Plus, look at the account: never posted EVER before.
JollyJoker3@reddit
Lots of achievements and five year old account. Do bot farms buy or hack used accounts?
vibjelo@reddit
Might as well ask "Did reddit kill 3rd-party clients?"
lighthawk16@reddit
Did they? I use mine every day...
vibjelo@reddit
Is your client still being updated or has it maybe been unmaintained for like 3 years, like most others?
It's great that it still works for you, and I'm guessing you had to patch it yourself just because reddit tried to kill it.
lighthawk16@reddit
Updated on Monday. No patch, just needed an API key.
vibjelo@reddit
For curiosities sake, what client is this?
lighthawk16@reddit
Slide. As mentioned earlier.
vibjelo@reddit
Is not this one? https://github.com/Haptic-Apps/Slide
Last commit was on Nov 25, 2022. There seem to be some more updated forks, but I think it's safe to say that Reddit, with its changes, tried to kill clients like Slide.
lighthawk16@reddit
https://play.google.com/store/apps/details?id=me.edgan.redditslide
MrBIMC@reddit
I'm still on sync for reddit. Had to patch it for it to continue working though.
lighthawk16@reddit
Yup using Slide for Reddit here with my own API key.
dillon-nyc@reddit
I know my account looks like that.
I hit a span of long term unemployment, and it was apparent from one interaction that my reddit comment history had been part of their background check.
This account was always linked to my actual identity, because for a while that was helpful for me professionally (I used to answer Ethereum questions very early in the history of that).
starfries@reddit
How did you know that they looked at your comment history?
dillon-nyc@reddit
They mentioned something about etherdelta.
starfries@reddit
Ahh okay, thanks for satisfying my curiosity
bornfree4ever@reddit
why would you give them your reddit user name
terminoid_@reddit
just a heads-up, the tokenizer was just updated right now on the safetensors release, so old GGUFs are prolly busted
shifty21@reddit
The link 404's for me...
Weird.
Chromix_@reddit
Looks broken for me with latest llama.cpp.
Incorrect result with Qwen3 embedding:
llama-embedding -m Qwen3-Embedding-0.6B_f16.gguf -ngl 99 --embd-output-format "json+" --embd-separator "<#sep#>" -p "Bananas<#sep#>Apples<#sep#>Office"
Correct result with Snowflake-arctic embedding:
llama-embedding -m snowflake-arctic-embed-l.i1-Q5_K_M.gguf -ngl 99 --embd-output-format "json+" --embd-separator "<#sep#>" -p "Bananas<#sep#>Apples<#sep#>Office"
KvAk_AKPlaysYT@reddit
lol
madaradess007@reddit
can anyone give advice on how I should use it?
i got deepseek generating sci-fi video game design documents on repeat (like 180-200 of them overnight), qwen3 then goes and compiles them in batches of 3, then compiles those compilations and saves a final result
maybe i'm dumb and this is not as efficient as it could be, please advise
madaradess007@reddit
damn son, you guys are mean
Echo9Zulu-@reddit
Sounds like a synthetic data pipeline. Just use your own comment in a prompt and mention you saw an embedding model and want to take your setup further by adding a retrieval component.
_long_hair_@reddit
Lightrag qwen3
TristarHeater@reddit
bit off topic, but does anyone know if there's been developments in image+text embedding models? Or is openai clip still best
Remarkable-Law9287@reddit
for document retrieval or image search
TristarHeater@reddit
image search by text query
kareemkobo@reddit
there is the DSE, colpali family, jinai models (clip and reranker-m0) and much more!
TristarHeater@reddit
thx for the recommendations :) i'm looking for photograph embeddings though not documents.
kareemkobo@reddit
try the clip model from jinai
TristarHeater@reddit
oh yea that looks promising! i will try it out
Barry_Jumps@reddit
Love this. Love that it allows user-defined dimensions as well. But pls someone smarter than me explain the advantage of defining the dimension with Qwen rather than just truncating? Have done some of these experiments with sentence transformers and mxbread models but I can't figure out what's actually happening under the hood. See the sketch below for what I mean by truncating.
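For context, here is roughly what "just truncating" looks like in plain NumPy (model ID assumed). My understanding is that the advantage only shows up if the model was trained with a Matryoshka-style objective, so that the leading dimensions carry most of the information; treat this as a hedged sketch, not a recommendation.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed model ID

full = model.encode("Llamas eat bananas")  # full-size vector (1024 dims for the 0.6B)

short = full[:256]                     # keep only the leading components
short = short / np.linalg.norm(short)  # re-normalize so cosine similarity still works

print(full.shape, short.shape)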
foldl-li@reddit
Both embedding and Reranker are supported by chatllm.cpp now.
trusty20@reddit
Can someone shed some light on the real difference between a regular model and an embedding model? I know the intention, but I don't fully grasp why a specialist model is needed for embedding; I thought that generating text vectors etc. was just what any model does in general, and that regular models simply have a final pipeline to convert the vectors back to plain text.
Where my understanding seems to be wrong to me, is that tools like AnythingLLM allow you to use regular models for embedding via Ollama. I don't see any obvious glitches when doing so, not sure they perform well, but it seems to work?
So if a regular model can be used in the role as embedding model in a workflow, what is the reason for using a model specifically intended for embedding? And the million dollar question: HOW can a specialized embedding model generate vectors compatible with different larger models? Like surely an embedding model made in 2023 is not going to work with a model from a different family trained in 2025 with new techniques and datasets? Or are vectors somehow universal / objective?
FailingUpAllDay@reddit
Think of it this way: Regular LLMs are like that friend who won't shut up - you ask them anything and they'll generate a whole essay. Embedding models are like that friend who just points - they don't generate text, they just tell you "this thing is similar to that thing."
The key difference is the output layer. LLMs have a vocabulary-sized output that predicts next tokens. Embedding models output a fixed-size vector (like 1024 dimensions) that represents the meaning of your entire input in mathematical space.
You can use regular models for embeddings (by grabbing their hidden states), but it's like using a Ferrari to deliver pizza - technically works, but you're wasting resources and it wasn't optimized for that job. Embedding models are trained specifically to make similar things have similar vectors, which is why a 0.6B model can outperform much larger ones at this specific task.
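If it helps, here's a rough sketch of that output-layer difference; the model names are just illustrative and it assumes the transformers and sentence-transformers packages.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer

# The "friend who won't shut up": a decoder-only LM whose head predicts tokens.
name = "Qwen/Qwen3-0.6B"  # illustrative placeholder
tok = AutoTokenizer.from_pretrained(name)
lm = AutoModelForCausalLM.from_pretrained(name)
with torch.no_grad():
    logits = lm(**tok("Llamas eat bananas", return_tensors="pt")).logits
print(logits.shape)  # [1, num_tokens, vocab_size] -- next-token predictions

# The "friend who just points": one fixed-size vector per input.
emb_model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed model ID
vec = emb_model.encode("Llamas eat bananas")
print(vec.shape)  # (1024,) -- a single meaning vector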
forgotmyolduserinfo@reddit
Fantastic explanation, you should be at the top
FailingUpAllDay@reddit
Thank you! I try really hard :)
Canucking778@reddit
Thank you.
Logical_Divide_3595@reddit
> specifically designed for text embedding and ranking tasks
Used for RAG systems.
anilozlu@reddit
Regular models (actually all transformer models) output embeddings that correspond to input tokens. So that means one embedding vector for each token, whereas you would want one embedding vector for the whole input (sentence or chunk of document). Embedding models have a text embedding vector layer at the end, that takes in the token embedding vectors and create a single text embedding, instead of the usual token generation layer.
You can use a regular model to create text embeddings by averaging the token embeddings or just taking only the final token embedding, but it shouldn't be nearly as good as a tuned text embedding model.
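A minimal sketch of those two ad-hoc pooling options, assuming the transformers package; the model name is just a placeholder for any decoder-only model.

import torch
from transformers import AutoModel, AutoTokenizer

name = "Qwen/Qwen3-0.6B"  # placeholder decoder-only model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tok("Llamas eat bananas", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # [1, num_tokens, hidden_dim]

mean_pooled = hidden.mean(dim=1)  # average of all token embeddings
last_token = hidden[:, -1, :]     # only the final token's embedding

print(mean_pooled.shape, last_token.shape)  # both [1, hidden_dim]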
ChristopherCreutzig@reddit
Some model architectures (like BERT and its descendants) start with a special token (traditionally `[CLS]` as the first token, but the text version is completely irrelevant) and use the embedding vector of that token in the output as the document embedding.
That tends to work better in encoder models (again, like BERT) that aren't using causal attention (like a “generate next token” transformer).
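A minimal sketch of [CLS] pooling with a stock BERT checkpoint, assuming the transformers package (note this is not how the Qwen3 embedding pools; the command above uses last-token pooling).

import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tok("Llamas eat bananas", return_tensors="pt")  # [CLS] is prepended automatically
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state

cls_embedding = hidden[:, 0, :]  # the vector at position 0 is the [CLS] token
print(cls_embedding.shape)       # [1, 768] for bert-base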
anilozlu@reddit
Surprisingly, ModernBERT uses the [CLS] embedding + the average of all other token embeddings (+ means concatenate in this case).
ChristopherCreutzig@reddit
Sure. One of the options used there for pooling is to return the [CLS] embedding.
1ncehost@reddit
This isn't entirely true, because those token embeddings are used to produce a hidden state, which is equivalent to what the embedding algos do. The final hidden state that is used to create the logits vector represents the latent space of the entire input fed to the LLM, similar to what an embedding algo represents.
anilozlu@reddit
I meant hidden states by "embeddings that correspond to each input token" to try to keep it simple
BogaSchwifty@reddit
I’m not an expert here, but from my understanding a normal LLM is a function f that takes as input a context and a token, then it outputs the next token over and over until a termination condition is met. An embedding model vectorizes text. The main application of this model is document retrieval, where you “RAG” (vectorize) multiple documents, vectorize your search prompt, apply a cosine similarity between your vectorized prompt and the vectorized documents, and sort in desc order the results, the higher the score the more relevant a document (or chunk of text) is to your search prompt. I hope that helps.
WitAndWonder@reddit
Embedding models go through a finetune on a very particular kind of pattern / output (RAG embeddings). Now you could technically do it with larger models, but why would you? It's massive overkill, as the performance really drops off after the 7B mark, and running a larger model to handle it would just be throwing away resources.
1ncehost@reddit
It's as simple as: embedding models have a latent space that is optimized for vector similarity, while the latent space of an LLM is optimized for predicting the next token.
Latent vectors are not universal, as they have different sizes and dimensional meanings in different models, but they have been shown to be universally transformable by a team recently (don't ask me how or why though).
If you want a latent vector compatible with an LLM, just use the latent vectors it produces. You don't need an embedding model for that.
ab2377@reddit
i fed your question to chatgpt & deepseek, got great answers, you should try too.
Kooshi_Govno@reddit
I just copied your comment into claude, cus I didn't know well enough to answer:
Your intuition is correct! Here's what's happening:
Regular vs Embedding Models
Regular models CAN do embeddings - tools like AnythingLLM just extract the internal vectors instead of letting the model finish generating text. This works fine.
Specialized embedding models exist because:
- They're trained specifically to make similar texts have similar vectors (not just predict next words)
- They're smaller, faster, and often produce better semantic representations
- They're optimized for the specific task of understanding meaning
The Compatibility Insight
Embeddings from different models are NOT directly compatible. But they don't need to be!
In RAG systems:
1. The embedding model finds relevant documents using vector similarity
2. The language model receives those documents as plain text
The "compatibility" happens at the text level. A 2023 embedding model can absolutely work with a 2025 language model - the embedding model just finds the right text chunks, then hands that text to whatever generation model you're using.
This is why you can mix and match models in RAG pipelines. The embedding model's job is just retrieval; the language model processes the retrieved text like any other input.
So specialized embedding models aren't required, but they're usually better and more efficient at the retrieval task.
Flashy_Management962@reddit
I get this error when processing big chunks; does anybody know how to fix it? "Out of range float values are not JSON compliant: nan"
Calcidiol@reddit
I'm just guessing, but if you're using the GGUF Q8 / F16 model, the weights potentially have significantly less dynamic range than the native BF16 model.
Maybe that itself is a problem, and/or it may cause the activations / calculation results to also have less precision, accuracy, or range than if BF16 or FP32 were used in the key parts of the calculation.
It's plausible that big chunks accumulate more and more data into a calculation result (proportional to the chunk size you use), and as more data accumulates, the risk of an overflow or underflow producing a NaN grows, particularly if a lower-precision / lower-range data type is used somewhere in the calculations.
Maybe see if the same result occurs whether you use Q8, F16, or BF16 model weights, and also if you do not quantize the activations but keep them BF16 or whatever is relevant for your configuration.
EstebanGee@reddit
Maybe a dumb question, but why is RAG better than, say, an Elasticsearch tool query?
terminoid_@reddit
it's actually not uncommon to combine BM25 with vectors
WitAndWonder@reddit
Semantic search (RAG) is focused on the meaning, rather than any arbitrary keywords, collections of letters, phrases, or whatever else that specifically is present in your fields. So a RAG system will be able to search for 'heat', for instance, and even if you have zero document with the word heat, it will still pull up, with varying degrees of similarity/certainty, "thermal", "sun", "fire", "flame", "oven", "warmth". And it gets even better than that since it will consider more than just the specific word, but the actual meanings of the sentences. So 'not warm' will be significantly lower than 'warm', and mentions of sun-dried raisins would likely have very little similarity with a good embedding model, whereas a 'sunny day' may yield high similarity.
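A hedged sketch of combining the two signals (BM25 keyword scores plus embedding similarity), assuming the rank_bm25 and sentence-transformers packages; the 50/50 weighting and the whitespace tokenization are arbitrary choices.

import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Thermal insulation keeps the oven warm.",
    "Sun-dried raisins contain no added sugar.",
    "The flame heats the pan evenly.",
]
query = "heat"

# Keyword side: classic BM25 over whitespace tokens.
bm25 = BM25Okapi([d.lower().split() for d in docs])
kw_scores = np.array(bm25.get_scores(query.lower().split()))

# Semantic side: embedding similarity.
model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed model ID
doc_vecs = model.encode(docs, normalize_embeddings=True)
q_vec = model.encode([query], normalize_embeddings=True)[0]
sem_scores = doc_vecs @ q_vec

# Rescale both to [0, 1] before mixing so one signal doesn't dominate.
def minmax(x):
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

hybrid = 0.5 * minmax(kw_scores) + 0.5 * minmax(sem_scores)
for score, doc in sorted(zip(hybrid, docs), reverse=True):
    print(f"{score:.3f}  {doc}")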
No_Committee_7655@reddit
An Elasticsearch tool query is RAG.
RAG stands for retrieval-augmented generation. If you are retrieving sources not featured in the training data to give an LLM additional context to answer a query, that is RAG, since you are doing information retrieval.
ThePixelHunter@reddit
So did anybody save it?
ThePixelHunter@reddit
Aaaaannnd it's back.
Competitive_Pass_855@reddit
+1, so weird that they just undid their release
ThePixelHunter@reddit
This is common. Either it was accidentally public, or it was too early.
terminoid_@reddit
nice, this is the #1 thing i wanted when the 0.6B dropped
shibe5@reddit
Aaand it's gone.
ahmetegesel@reddit
Moved to different collection https://huggingface.co/collections/Qwen/qwen3-embedding-6841b2055b99c44d9a4c371f
gcavalcante8808@reddit
Yeah ... I guess my bge-m3 comparison will be postponed.
Key_Medium5886@reddit
There are several embedding models that I can't run in AnythingLLM.
At first, I thought it was a problem on my part, but I've noticed that only the models that LM Studio detects as purely embedding models (not instruct) work.
Therefore, this one doesn't work for me, whether I run it from Ollama, llama.cpp, or LM Studio... at first it seems like it does, but it seems that, at least with AnythingLLM, it doesn't quite work.
Does anyone know where the problem lies?
Asleep-Ratio7535@reddit
How would you test RAG?
tucnak@reddit
PageRank
istinetz_@reddit
another idea is to measure the distances between 3 snippets, 2 from the same document and 1 from a random document. Ideally you want your embedder to have low distance between the 2 snippets from the same document, and high distance between them and the third one. Of course, averaged over a large sample.
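Something like this, as a rough sketch (sentence-transformers assumed; model ID and snippets are just illustrative, and a real run would average over many sampled triplets).

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed model ID

triplets = [
    # (snippet A, snippet B from the same document, snippet C from a random document)
    ("Llamas graze on mountain grasses.",
     "Their wool is sheared roughly every two years.",
     "The compiler lowers the AST to an intermediate representation."),
]

wins = 0
for a, b, c in triplets:
    va, vb, vc = model.encode([a, b, c], normalize_embeddings=True)
    # Cosine distance = 1 - cosine similarity; the same-document pair should be closer.
    if (1 - va @ vb) < (1 - va @ vc):
        wins += 1

print(f"same-document pair was closer in {wins}/{len(triplets)} triplets")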
BogaSchwifty@reddit
Build a vector database consisting of multiple documents, say Wikipedia. Then test it by asking multiple different prompts (you can have an LLM generate the prompts); if it selects the articles most relevant to your search prompt (you can have the LLM judge that), then your model is good.
noiserr@reddit
This is a really cool feature.
Ortho-BenzoPhenone@reddit
it is mentioned that they are also launching the 4b and 8b versions, and also text re-rankers. i am not really sure what these re-rankers are, whether they are embedding-similarity based or transformer based (if that even exists), but still quite cool to see.
they have also beaten gemini embeddings (which was the SOTA until now); both the 4b and 8b models beat it. kudos to the team!!
silenceimpaired@reddit
Is this for RAG… and/or what else?
Ortho-BenzoPhenone@reddit
RAG, text classification, or anything you need to do with embeddings. re-rankers are things that will rank some pieces of text based on a given question/query. like re-ranking search results according to relevance.
silenceimpaired@reddit
Cool thanks for expanding my knowledge.
Proto_Particle@reddit (OP)
Qwen team just published this and all the other embedding and reranking models again, including safetensors.
FailingUpAllDay@reddit
"Qwen3-Embedding-0.6B-GGUF" just dropped... and then embedded itself so deeply it disappeared from our reality.
Guess it works too well. Now we need a retrieval model just to find the embedding model. 🤷♂️
Edit: In all seriousness though, classic Qwen move - drop a banger that dominates benchmarks at 1/10th the size, then yeet it from existence before anyone can test if it actually runs on their 3090. They're just flexing on us at this point.
DeepInEvil@reddit
The main problem with embedding models is that they don't handle negations and the like. Hope that with this class of models the problem is somewhat solved.
m18coppola@reddit
just multiply the query vector by -1
Craftkorb@reddit
Their links to GitHub and the blog post are broken. Looks really interesting though, I would have to do some checks myself. Multilingual embeddings with MLK are actually pretty hard. Looks like they don't support binary representation though.
shifty21@reddit
The link OP posted 404s for me.
Craftkorb@reddit
Interesting, it doesn't work for me either. They must have published it by accident.
gcavalcante8808@reddit
Nice, 1024 dimensions. Time to test it against bge-m3
evnix@reddit
yeah would love to see this, bge-m3 has been my goto so far
10minOfNamingMyAcc@reddit
Tried to load it in KoboldCpp and only got out-of-memory errors (even with 10 GB of free VRAM). Is it compatible?
MushroomGecko@reddit
I spent more time than I'd like to admit yesterday on MTEB trying to find the perfect embedding model for my VDB for a RAG app I'm building for a client. Thanks, Qwen. The search is over. Dominating the competition at a fraction of the size (in typical Qwen fashion)
Loose_Race908@reddit
Those Benchmarks for the 4B and 8B param models 👀
Illustrious-Dot-6888@reddit
Yes, Barry Allen has tested it a few times already.
GortKlaatu_@reddit
Getting some flash attention.
Carrasco_Santo@reddit
From the description, the 0.6B model seems too good to be true; it even seems far superior to TinyLlama.
balerion20@reddit
I was really waiting for new multilingual embedding model so this will be nice to test for our rag project
Agitated-Doughnut994@reddit
Qwen Team! Thank you!
Leflakk@reddit
Qwen teams strike again!
pas_possible@reddit
Can wait to give it a try, I hope it's good