New embedding model "Qwen3-Embedding-0.6B-GGUF" just dropped.

[-]

davewolfs@reddit

It was release an hour ago. Nobody has tested it yet.

[-]

BananaPeaches3@reddit

Then what is this from 2 days ago?: https://ollama.com/ZimaBlueAI/Qwen3-Embedding-0.6B

[-]

Chromix_@reddit

Well, it works. I wonder what test OP is looking for aside from the published benchmark results.

llama-embedding -m Qwen3-Embedding-0.6B_f16.gguf -ngl 99 --embd-output-format "json+" --embd-separator "<#sep#>" -p "Llamas eat bananas<#sep#>Llamas in pyjamas<#sep#>A bowl of fruit salad<#sep#>A sleeping dress" --pooling last --embd-normalize -1

"cosineSimilarity": [
[ 1.00, 0.22, 0.46, 0.15 ], (Llamas eat bananas)
[ 0.22, 1.00, 0.28, 0.59 ], (Llamas in pyjamas)
[ 0.46, 0.28, 1.00, 0.33 ], (A bowl of fruit salad)
[ 0.15, 0.59, 0.33, 1.00 ], (A sleeping dress)
]

You can clearly see that the model considers llamas eating bananas more similar to a bowl of fruit salad, than to llamas in pyjamas - which is closer to the sleeping dress.

When testing the same with the less capable snowflake-arctic-embed it puts the two llamas way closer together, but doesn't yield such a strong distinction between the dissimilar cases like Qwen.

"cosineSimilarity": [
[ 1.00, 0.79, 0.69, 0.66 ],
[ 0.79, 1.00, 0.74, 0.82 ],
[ 0.69, 0.74, 1.00, 0.81 ],
[ 0.66, 0.82, 0.81, 1.00 ]
]

[-]

socamerdirmim@reddit

What Embedding model you recommend? I am searching for a good one for Silly tavern RP games, currently I am using the snowflake-arctic-embed-l-v2.0.

[-]

Chromix_@reddit

Just use the new Qwen3 0.6B as a free upgrade. You'll get even better results with their 8B embedding, but you probably don't have enough similar RP data there for this to make a difference.

[-]

socamerdirmim@reddit

will try it. I have millions of token in chat history.

[-]

Chromix_@reddit

In that case I'd be interested to hear if you can see a qualitative difference between your current, the 0.6B and the 8B embedding.

[-]

FailingUpAllDay@reddit

This is the quality content I come here for. But I'm concerned that "llamas eating bananas" being closer to "fruit salad" than to "llamas in pyjamas" reveals a deeper truth about the model's worldview.

It clearly sees llamas as food-oriented creatures rather than fashion-forward ones. This embedding model has chosen violence against the entire Llamas in Pyjamas franchise.

Time to fine-tune on episodes 1-52 to correct this bias. /s

[-]

FourtyMichaelMichael@reddit

But I'm concerned that "llamas eating bananas" being closer to "fruit salad" than to "llamas in pyjamas" reveals a deeper truth about the model's worldview.

It clearly sees llamas as food-oriented creatures rather than fashion-forward ones. This embedding model has chosen violence against the entire Llamas in Pyjamas franchise.

OK STOP.

I just want everyone right now, including OP here to think about these words in their own contexts up to but less than two years ago.

Historically, this is the ranting of a lunatic.

[-]

FailingUpAllDay@reddit

Wait until we're arguing about whether GPT-7 properly understands the socioeconomic implications of alpaca sweater vests.

[-]

Chromix_@reddit

It clearly sees llamas as food-oriented creatures rather than fashion-forward ones.

Yes, and you know what's even worse? It sees us humans in almost the same way, according to the similarity matrix. Feel free to experiment.

[-]

slayyou2@reddit

Hey could you reupload the model somewhere? They took it down

[-]

Chromix_@reddit

The link still works for me. Same for the 8B embedding. Maybe it was just briefly gone?

[-]

slayyou2@reddit

Yea it's back now thanks anyway

[-]

Xamanthas@reddit

He is outsourcing you thinking for him. Thanks Deepseek effect for bringing in mouthbreathers + look at the account, never posted EVER before.

[-]

JollyJoker3@reddit

Lots of achievements and five year old account. Do bot farms buy or hack used accounts?

[-]

vibjelo@reddit

Do bot farms buy or hack used accounts?

Might as well ask "Did reddit kill 3rd-party clients?"

[-]

lighthawk16@reddit

Did they? I use mine every day...

[-]

vibjelo@reddit

Is your client still being updated or has it maybe been unmaintained for like 3 years, like most others?

It's great that it still works for you, and I'm guessing you had to patch it yourself just because reddit tried to kill it.

[-]

lighthawk16@reddit

Updated on Monday. No patch, just needed an API key.

[-]

vibjelo@reddit

For curiosities sake, what client is this?

[-]

lighthawk16@reddit

Slide. As mentioned earlier.

[-]

vibjelo@reddit

Is not this one? https://github.com/Haptic-Apps/Slide

Last commit was in Nov 25, 2022, seems there are some more updated forks, but I think it's safe to say that Reddit with their changes did/tried to kill clients like Slide

[-]

lighthawk16@reddit

https://play.google.com/store/apps/details?id=me.edgan.redditslide

[-]

MrBIMC@reddit

I'm still on sync for reddit. Had to patch it for it to continue working though.

[-]

lighthawk16@reddit

Yup using Slide for Reddit here with my own API key.

[-]

dillon-nyc@reddit

I know my account looks like that.

I hit a span of long term unemployment, and it was apparent from one interaction that my reddit comment history had been part of their background check.

This account was always linked to my actual identity, because for a while that was helpful for me professionally (I used to answer Ethereum questions very early in the history of that).

[-]

starfries@reddit

How did you know that they looked at your comment history?

[-]

dillon-nyc@reddit

They mentioned something about etherdelta.

[-]

starfries@reddit

Ahh okay, thanks for satisfying my curiosity

[-]

bornfree4ever@reddit

why would you give them your reddit user name

[-]

terminoid_@reddit

just a heads-up, the tokenizer was just updated right now on the safetensors release, so old GGUFs are prolly busted

[-]

shifty21@reddit

The link 404's for me...

Weird.

[-]

Chromix_@reddit

Looks broken for me with latest llama.cpp.

Incorrect result with Qwen3 embedding:

llama-embedding -m Qwen3-Embedding-0.6B_f16.gguf -ngl 99 --embd-output-format "json+" --embd-separator "<#sep#>" -p "Bananas<#sep#>Apples<#sep#>Office"

"cosineSimilarity": [
[ 1.00, 0.27, 0.93, 0.29, 0.92 ],
[ 0.27, 1.00, 0.25, 0.68, 0.28 ],
[ 0.93, 0.25, 1.00, 0.26, 0.93 ],
[ 0.29, 0.68, 0.26, 1.00, 0.28 ],
[ 0.92, 0.28, 0.93, 0.28, 1.00 ]
]

Correct result with Snowflake-arctic embedding:

llama-embedding -m snowflake-arctic-embed-l.i1-Q5_K_M.gguf -ngl 99 --embd-output-format "json+" --embd-separator "<#sep#>" -p "Bananas<#sep#>Apples<#sep#>Office"

"cosineSimilarity": [
[ 1.00, 0.84, 0.77 ],
[ 0.84, 1.00, 0.77 ],
[ 0.77, 0.77, 1.00 ]
]

[-]

KvAk_AKPlaysYT@reddit

lol

[-]

madaradess007@reddit

can anyone give advice on how should i use it?
i got deepseek generating a sci-fi video game design documents on repeat (like 180-200 of them overnight), qwen3 then goes and compiles them in batches of 3, then compiles those compilations and saves a final result
maybe i'm dumb and this is not as efficient as it could be, please advise

[-]

madaradess007@reddit

damn son, you guys are mean

[-]

Echo9Zulu-@reddit

Sounds like a synthetic data pipeline. Just use your own comment in a prompt and mention you saw an embedding model and want to take your setup further by adding a retreival component

[-]

_long_hair_@reddit

Lightrag qwen3

[-]

TristarHeater@reddit

bit off topic, but does anyone know if there's been developments in image+text embedding models? Or is openai clip still best

[-]

Remarkable-Law9287@reddit

for document retrieval or image search

[-]

TristarHeater@reddit

image search by text query

[-]

kareemkobo@reddit

there is the DSE, colpali family, jinai models (clip and reranker-m0) and much more!

[-]

TristarHeater@reddit

thx for the recommendations :) i'm looking for photograph embeddings though not documents.

[-]

kareemkobo@reddit

try the clip model from jinai

[-]

TristarHeater@reddit

oh yea that looks promising! i will try it out

[-]

Barry_Jumps@reddit

Love this. Love that it allows user defined dimensions as well. But pls someone smarter than me explain the advantage of defining dimension with qwen rather than just truncating? Have done some of these experiments with sentence transformers and mxbread models but I can’t figure out what’s actually happening under the hood

[-]

foldl-li@reddit

Both embedding and Reranker are supported by chatllm.cpp now.

[-]

trusty20@reddit

Can someone shed some light on the real difference between a regular model and an embedding model. I know the intention, but I don't fully grasp why a specialist model is needed for embedding; I thought that generating text vectors etc was just what any model does in general, and that regular models simply have a final pipeline to convert the vectors back to plain text.

Where my understanding seems to be wrong to me, is that tools like AnythingLLM allow you to use regular models for embedding via Ollama. I don't see any obvious glitches when doing so, not sure they perform well, but it seems to work?

So if a regular model can be used in the role as embedding model in a workflow, what is the reason for using a model specifically intended for embedding? And the million dollar question: HOW can a specialized embedding model generate vectors compatible with different larger models? Like surely an embedding model made in 2023 is not going to work with a model from a different family trained in 2025 with new techniques and datasets? Or are vectors somehow universal / objective?

[-]

FailingUpAllDay@reddit

Think of it this way: Regular LLMs are like that friend who won't shut up - you ask them anything and they'll generate a whole essay. Embedding models are like that friend who just points - they don't generate text, they just tell you "this thing is similar to that thing."

The key difference is the output layer. LLMs have a vocabulary-sized output that predicts next tokens. Embedding models output a fixed-size vector (like 1024 dimensions) that represents the meaning of your entire input in mathematical space.

You can use regular models for embeddings (by grabbing their hidden states), but it's like using a Ferrari to deliver pizza - technically works, but you're wasting resources and it wasn't optimized for that job. Embedding models are trained specifically to make similar things have similar vectors, which is why a 0.6B model can outperform much larger ones at this specific task.

[-]

forgotmyolduserinfo@reddit

Fantastic explanation, you should be at the top

[-]

FailingUpAllDay@reddit

Thank you! I try really hard :)

[-]

Canucking778@reddit

Thank you.

[-]

Logical_Divide_3595@reddit

> specifically designed for text embedding and ranking tasks

Used for RAG system

[-]

anilozlu@reddit

Regular models (actually all transformer models) output embeddings that correspond to input tokens. So that means one embedding vector for each token, whereas you would want one embedding vector for the whole input (sentence or chunk of document). Embedding models have a text embedding vector layer at the end, that takes in the token embedding vectors and create a single text embedding, instead of the usual token generation layer.

You can use a regular model to create text embeddings by averaging the token embeddings or just taking only the final token embedding, but it shouldn't be nearly as good as a tuned text embedding model.

[-]

ChristopherCreutzig@reddit

Some model architectures (like BERT and its descendants) start with a special token (traditionally `[CLS]` as the first token, but the text version is completely irrelevant) and use the embedding vector of that token in the output as the document embedding.

That tends to work better in encoder models (again, like BERT) that aren't using causal attention (like a “generate next token” transformer).

[-]

anilozlu@reddit

ModernBERT uses [CLS] embedding + average of all other token embeddings (+ means concatenate in this case) surprisingly

[-]

ChristopherCreutzig@reddit

Sure. One of the options used there for pooling is to return the [CLS] embedding.

[-]

1ncehost@reddit

This isn't entirely true because those token embeddings are used to produce a hidden state which is equivalent to what the embedding algos do. The final hidden state that is used to create the logits vector represents the latent space of the entire input fed to the llm similar to what an embedding algo represents.

[-]

anilozlu@reddit

I meant hidden states by "embeddings that correspond to each input token" to try to keep it simple

[-]

BogaSchwifty@reddit

I’m not an expert here, but from my understanding a normal LLM is a function f that takes as input a context and a token, then it outputs the next token over and over until a termination condition is met. An embedding model vectorizes text. The main application of this model is document retrieval, where you “RAG” (vectorize) multiple documents, vectorize your search prompt, apply a cosine similarity between your vectorized prompt and the vectorized documents, and sort in desc order the results, the higher the score the more relevant a document (or chunk of text) is to your search prompt. I hope that helps.

[-]

WitAndWonder@reddit

Embedding models go through a finetune on a very particular kind of pattern / output (RAG embeddings.) Now you could technically do it with larger models, but why would you? It's massive overkill as the performance really drops off after the 7B mark, and running a larger model to handle it would just be throwing away resources.

[-]

1ncehost@reddit

Its as simple as embedding models have a latent space that is optimized for vector similarity while the latent space of an LLM is optimized for predicting the next token.

Latent vectors are not universal, as they have different sizes and dimensional meaning in different models, but have been shown to be universally transformable by a team recently (dont ask me how or why though).

If you want a compatible latent vector to an LLM just use the latent vectors it produces. You don't need an embedding model for that.

[-]

ab2377@reddit

i fed your question to chatgpt & deepseek, got great answers, you should try too.

[-]

Kooshi_Govno@reddit

I just copied your comment into claude, cus I didn't know well enough to answer:

Your intuition is correct! Here's what's happening:

Regular vs Embedding Models

Regular models CAN do embeddings - tools like AnythingLLM just extract the internal vectors instead of letting the model finish generating text. This works fine.

Specialized embedding models exist because: - They're trained specifically to make similar texts have similar vectors (not just predict next words) - They're smaller, faster, and often produce better semantic representations - They're optimized for the specific task of understanding meaning

The Compatibility Insight

Embeddings from different models are NOT directly compatible. But they don't need to be!

In RAG systems: 1. Embedding model finds relevant documents using vector similarity 2. Language model receives those documents as plain text

The "compatibility" happens at the text level. A 2023 embedding model can absolutely work with a 2025 language model - the embedding model just finds the right text chunks, then hands that text to whatever generation model you're using.

This is why you can mix and match models in RAG pipelines. The embedding model's job is just retrieval; the language model processes the retrieved text like any other input.

So specialized embedding models aren't required, but they're usually better and more efficient at the retrieval task.

[-]

Flashy_Management962@reddit

I get this error when processing big chunks, does anybody know how to fix this? "Out of range float values are not JSON compliant: nan"

[-]

Calcidiol@reddit

I'm just guessing, but if you're using the GGUF Q8 / F16 model then potentially the weights have very significantly less dynamic range than the native BF16 data type model.

Maybe that itself can be a problem and / or maybe it can influence the activation / calculation result data type to also have less precision / accuracy / range than if BF16 or FP32 was used in the key parts of the calculation.

It's plausible at first thought that the big chunks literally accumulate more and more data into a calculation result (proportional to the large chunk size you use) and as more data accumulates the risk of overflow or underflow producing a NaN is higher particularly if using a lower precision / accuracy / range data type somewhere in the calculations.

Maybe see if the same result occurs whether you use Q8, F16, BF16 format model weights and also if you do not quantize the activations but keep them BF16 or whatever is relevant for your configuration.

[-]

EstebanGee@reddit

Maybe a dumb question, but why is a rag better than say an elastic search tool query?

[-]

terminoid_@reddit

it's actually not uncommon to combine BM25 with vectors

[-]

WitAndWonder@reddit

Semantic search (RAG) is focused on the meaning, rather than any arbitrary keywords, collections of letters, phrases, or whatever else that specifically is present in your fields. So a RAG system will be able to search for 'heat', for instance, and even if you have zero document with the word heat, it will still pull up, with varying degrees of similarity/certainty, "thermal", "sun", "fire", "flame", "oven", "warmth". And it gets even better than that since it will consider more than just the specific word, but the actual meanings of the sentences. So 'not warm' will be significantly lower than 'warm', and mentions of sun-dried raisins would likely have very little similarity with a good embedding model, whereas a 'sunny day' may yield high similarity.

[-]

No_Committee_7655@reddit

An elastic search tool query is RAG.

RAG stands for retrieval augmented generation. If you are retrieving sources not featured in the training data to give an LLM additional context data to answer a query that is RAG as you are doing information retrieval.

[-]

ThePixelHunter@reddit

So did anybody save it?

[-]

ThePixelHunter@reddit

Aaaaannnd it's back.

[-]

Competitive_Pass_855@reddit

+1, so weird that they just undid their release

[-]

ThePixelHunter@reddit

This is common. Either it was accidentally public, or it was too early.

[-]

terminoid_@reddit

nice, this is the #1 thing i wanted when the 0.6B dropped

[-]

shibe5@reddit

Aaand it's gone.

[-]

ahmetegesel@reddit

Moved to different collection https://huggingface.co/collections/Qwen/qwen3-embedding-6841b2055b99c44d9a4c371f

[-]

gcavalcante8808@reddit

Yeah ... I guess my bge-m3 comparison will be postponed.

[-]

Key_Medium5886@reddit

There are several embedding models that I can't run in AnythingLLM.

At first, I thought it was a problem on my part, but I've noticed that only the models that LM Studio detects as purely embedding models (not instruct) work.

Therefore, this one doesn't work for me, whether I run it from Ollama, Llama.cpp, or LM Studio... at first, it seems like it does, but it seems that, at least with AnythingLLM, they don't quite work.

Does anyone know where the problem lies?

[-]

Asleep-Ratio7535@reddit

How would you test RAG?

[-]

tucnak@reddit

PageRank

[-]

istinetz_@reddit

another idea is to measure the distances between 3 snippets, 2 from the same document and 1 from a random document. Ideally you want your embedder to have low distance between the 2 snippets from the same document, and high distance between them and the third one. Of course, averaged over a large sample.

[-]

BogaSchwifty@reddit

Build a vectorbase consisting of multiple documents, say Wikipedia. Then, test the vectorbase by asking multiple different prompts (you can have an LLM generate the prompts), if the vectorbase selects the most relevant articles to your search prompt (you can have the LLM decide that), then your model is good.

[-]

noiserr@reddit

Embedding Dimension: Up to 1024, supports user-defined output dimensions ranging from 32 to 1024

This is a really cool feature.

[-]

Ortho-BenzoPhenone@reddit

it is mentioned that they are also launching the 4b and 8b versions. and also text re-rankers. i am not really sure about what these re-rankers are. whether these are embedding similarity based or transformer based (if that even exists), but still quite cool to see.

they have also defeated gemini embeddings (which was the SOTA) till now, and both the 4b and 8b models beat it. kudos to the team!!

[-]

silenceimpaired@reddit

Is this for RAG… and/or what else?

[-]

Ortho-BenzoPhenone@reddit

RAG, text classification, or anything you need to do with embeddings. re-rankers are things that will rank some pieces of text based on a given question/query. like re-ranking search results according to relevance.

[-]

silenceimpaired@reddit

Cool thanks for expanding my knowledge.

[-]

Proto_Particle@reddit (OP)

Qwen team just published back this and all the other embedding and reranking models including safetensors.

[-]

FailingUpAllDay@reddit

"Qwen3-Embedding-0.6B-GGUF" just dropped... and then embedded itself so deeply it disappeared from our reality.

Guess it works too well. Now we need a retrieval model just to find the embedding model. 🤷‍♂️

Edit: In all seriousness though, classic Qwen move - drop a banger that dominates benchmarks at 1/10th the size, then yeet it from existence before anyone can test if it actually runs on their 3090. They're just flexing on us at this point.

[-]

DeepInEvil@reddit

The main problem with embedding models is it doesn't support negations and stuff. Hope with these class of models that problem is somewhat solved

[-]

m18coppola@reddit

just multiply the query vector by -1

[-]

Craftkorb@reddit

Their links to GitHub and blog post are broken. Looks really interesting though, would have to do some checks myself. Multilingual embeddings with MLK is actually pretty hard. Looks like they don't support binary representation though.

[-]

shifty21@reddit

The link OP posted 404s for me.

[-]

Craftkorb@reddit

Interesting, it does not fit me too. They must have published it by accident.

[-]

gcavalcante8808@reddit

Nice, 1024 dimensions. Time to test it against bge-m3

[-]

evnix@reddit

yeah would love to see this, bge-m3 has been my goto so far

[-]

10minOfNamingMyAcc@reddit

Tried to load it in Koboldcpp and only got out of memory errors (even with 10GB free VRAM.) Is it compatible?

[-]

MushroomGecko@reddit

I spent more time than I'd like to admit yesterday on MTEB trying to find the perfect embedding model for my VDB for a RAG app I'm building for a client. Thanks, Qwen. The search is over. Dominating the competition at a fraction of the size (in typical Qwen fashion)

[-]