OLMo 2 Models Released!

[-]

innominato5090@reddit

OLMo core member here! lmk if you have any questions about the release. We’re hosting a demo of the 13B instruct at [playground.allenai.org](https://playground.allenai.org)

Reply

[-]

Significant_Focus134@reddit

Nice! Could you share some details why num\_attention\_heads equals num\_hidden\_layers?

Reply

[-]

Does it really? Just coincidence then. The number of layers is determined by the target size we want, and some trade-off between depth and width of the model. The number of attention heads depends on the hidden size and the size of each attention head we want. Unfortunately we can't properly experiment at the top of the scale, so we have to use rules of thumb and save our experimental budget for things we think might have a bigger impact.
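A rough sketch of the rule of thumb described above: the head count falls out of the hidden size divided by the per-head size, independently of the layer count. The hidden sizes, layer counts, and the head size of 128 below are assumptions for illustration, not values stated in this thread.

```python
def num_attention_heads(hidden_size: int, head_dim: int) -> int:
    """Number of heads implied by a hidden size and a chosen per-head size."""
    if hidden_size % head_dim != 0:
        raise ValueError("hidden_size must be divisible by head_dim")
    return hidden_size // head_dim


# Assumed, illustrative shapes (not confirmed in this thread).
assumed_shapes = {
    "7B-like": {"hidden_size": 4096, "num_hidden_layers": 32},
    "13B-like": {"hidden_size": 5120, "num_hidden_layers": 40},
}

for name, shape in assumed_shapes.items():
    heads = num_attention_heads(shape["hidden_size"], head_dim=128)
    same = heads == shape["num_hidden_layers"]
    print(f"{name}: heads={heads}, layers={shape['num_hidden_layers']}, equal={same}")
```

With a fixed head size, the two numbers only match when hidden size and depth happen to scale together, which is consistent with the "coincidence" answer.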

Reply

[-]

Significant_Focus134@reddit

Ok, thanks. I'm just interested in what the optimal ratio between hidden size and number of layers would be. In my observations, simply adding additional layers is not optimal without also increasing at least a little bit the number of attention heads.
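To make the depth versus width trade-off concrete, here is a back-of-the-envelope sketch. It uses the common approximation of roughly 12 * layers * hidden_size^2 parameters for the transformer blocks (about 4*d^2 for attention plus 8*d^2 for a 4x-wide MLP). That formula is an assumption for illustration; it ignores embeddings and does not match any specific OLMo configuration.

```python
def approx_block_params(num_layers: int, hidden_size: int) -> int:
    """Approximate non-embedding parameter count: 12 * L * d^2."""
    return 12 * num_layers * hidden_size ** 2


# Three very different shapes that land on the same approximate budget.
for layers, width in [(128, 2048), (32, 4096), (8, 8192)]:
    total = approx_block_params(layers, width)
    print(f"layers={layers:4d} hidden={width:5d} -> ~{total / 1e9:.2f}B params")
```

The same budget can be spent deep-and-narrow or shallow-and-wide, which is why the optimal ratio has to be found experimentally rather than derived from the parameter count alone.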

Reply

[-]

innominato5090@reddit

There's some work studying that at smaller scale, e.g. [Petty et al (2023)](https://arxiv.org/abs/2310.19956) and [Tang et al (2024)](https://www.semanticscholar.org/paper/Rethinking-Optimization-and-Architecture-for-Tiny-Tang-Liu/d0ac639d6ed814eac74b6c39eb5ad46854d8fcc4). We haven't investigated much yet!

Reply

[-]

Significant_Focus134@reddit

Thanks for the links!

Reply

[-]

Willing_Landscape_61@reddit

My main interest in LLMs is grounded RAG, as I don't want to rely on overfitting for actual knowledge. What is the grounded RAG situation for this model? Can I have chunks with IDs in the context and have the model reference the chunks used for various points in the generated result? (Command R and Nous Hermes have specific prompt formats for that, and it would be great to standardize this so that LLMs could be easily swapped in a grounded RAG setup.) Thx! (Also, I am eager for a larger context size, obviously.) Thank you very much for your gift to the community with this truly Open Source LLM!
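For readers unfamiliar with the idea, a minimal sketch of "chunks with IDs in the context" might look like the following. The prompt layout and the `[chunk_id]` citation convention are made up for illustration; this is not the Command R or Nous Hermes format, and nothing here is specific to OLMo 2.

```python
import re


def build_grounded_prompt(chunks: dict[str, str], question: str) -> str:
    """Lay out ID-tagged chunks followed by the question (hypothetical format)."""
    lines = ["Answer using only the sources below. Cite chunk IDs in square brackets.", ""]
    for chunk_id, text in chunks.items():
        lines.append(f"[{chunk_id}] {text}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


def extract_citations(answer: str, known_ids) -> list[str]:
    """Return cited chunk IDs in order of first appearance, ignoring unknown IDs."""
    cited = []
    for chunk_id in re.findall(r"\[([^\[\]]+)\]", answer):
        if chunk_id in known_ids and chunk_id not in cited:
            cited.append(chunk_id)
    return cited


chunks = {"c1": "OLMo 2 comes in 7B and 13B sizes.", "c2": "The context length is 4096 tokens."}
print(build_grounded_prompt(chunks, "How long is the context?"))
print(extract_citations("It is 4096 tokens [c2], for both sizes [c1][c2].", chunks))
```

Dropping unknown IDs in `extract_citations` is a deliberate choice: a model that was not fine-tuned for this format may invent citations, so validating them against the supplied chunks is cheap insurance.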

Reply

[-]

innominato5090@reddit

we have a couple different RAG projects, like the [OpenScholar](https://allenai.org/blog/openscholar) demo we just released. Definitely curious to finetune OLMo 2 for that use case!

Reply

[-]

Billy462@reddit

Thanks a lot to you + team, I really enjoy reading the papers you guys publish!

Reply

[-]

innominato5090@reddit

thank you!

Reply

[-]

diaperrunner@reddit

I just checked it out. I talked in Latin. It responded really well in Latin.

Reply

[-]

innominato5090@reddit

woah that’s fun!!

Reply

[-]

Corporate_Drone31@reddit

No questions from me, just a huge thank you. You guys are one of the few truly open source model producers, and I can respect that. Also, I really liked the output style of the first OLMo series, very unique compared to anything else I tested at the time.

Reply

[-]

innominato5090@reddit

means a lot—thanks!

Reply

[-]

jp_digital_2@reddit

Thanks to you and team for this. Definitely hope to learn from / use the source code and architecture in future. From a usage standpoint, can you briefly describe the kind of tasks where this would be on par with state of the art LLMs? I guess there would be some niches where this equals or even exceeds state of the art?

Reply

[-]

innominato5090@reddit

It's very solid at math, less so at code (big focus for next iteration). I’ve been asking it trivia questions and it’s pretty good there too!

Reply

[-]

clduab11@reddit

Thank you all for your awesome work and contributions to open-sourcing! I can’t wait to play with the new releases!!

Reply

[-]

innominato5090@reddit

yay! thank you

Reply

[-]

mpasila@reddit

Is it currently supported by Huggingface Transformers? I had the latest version installed, yet it showed an error saying it didn't recognize the architecture.

Reply

[-]

innominato5090@reddit

It is merged in Transformers and should be natively supported in the next version.

Reply

[-]

Amgadoz@reddit

Thanks for the hard work. How multilingual are these models? Can we increase the context length beyond 4k?

Reply

[-]

innominato5090@reddit

they are just English for now; I tried in my native language, and output is intelligible, but really not usable. We want to improve multilingual performance for OLMo 3 for sure. For context extension, hopefully we can do that sooner :)

Reply

[-]

zoontechnicon@reddit

Is there support for system prompts?

Reply

[-]

Many_SuchCases@reddit (OP)

llama.cpp support has been merged: https://github.com/ggerganov/llama.cpp/pull/10394

Reply

[-]

noneabove1182@reddit

Something is still off with the instruct models: can't convert, tokenizer seems different from the base. I opened a PR but might still be missing something: https://github.com/ggerganov/llama.cpp/pull/10535

Reply

[-]

innominato5090@reddit

we are aware and are on it! should be able to fix this quickly.

Reply

[-]

noneabove1182@reddit

commented on my PR: looks like the pre_tokenizer is missing from the instruct model, but I also don't see any tokens associated with `<|user|>` or `<|system|>` etc, so it's hard to be positive the tokenizer is fine since it'll never tokenize those correctly... but I assume it's working as intended after fixing that?

Reply

[-]

fairydreaming@reddit

It was the same in the recent Tulu 3 model, but the model worked just fine. There is a discussion open: [https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B/discussions/2](https://huggingface.co/allenai/Llama-3.1-Tulu-3-70B/discussions/2) about this, but no answers so far.

Reply

[-]

noneabove1182@reddit

oh *weird*, good find.. I suppose in THEORY it doesn't need them to be special tokens, but it sure is nicer when they are !
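A toy illustration of the difference being discussed: when a chat marker is registered as a special token it stays one unit, otherwise it gets split like ordinary text. The splitting rule below is a deliberately crude stand-in, not the real OLMo or Tulu tokenizer, and `toy_tokenize` is a hypothetical helper.

```python
import re


def toy_tokenize(text: str, special_tokens: tuple[str, ...] = ()) -> list[str]:
    """Split text into word/punctuation pieces, keeping special tokens atomic."""
    if special_tokens:
        pattern = "(" + "|".join(re.escape(tok) for tok in special_tokens) + ")"
        segments = re.split(pattern, text)
    else:
        segments = [text]
    pieces = []
    for segment in segments:
        if segment in special_tokens:
            pieces.append(segment)  # kept whole, would map to a single ID
        else:
            pieces.extend(re.findall(r"\w+|[^\w\s]", segment))
    return pieces


print(toy_tokenize("<|user|>\nhi"))                 # marker split into several pieces
print(toy_tokenize("<|user|>\nhi", ("<|user|>",)))  # marker kept as one piece
```

Either way the model sees a consistent sequence for the marker, which is why it can work without special tokens; the special-token version is simply shorter and cannot be produced by accident from user text that is tokenized as plain text.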

Reply

[-]

innominato5090@reddit

we found the bug in our conversion scripts—just doing all checks to make sure nothing is out of order before pushing an update. we are all US-based and tomorrow/friday is a holiday, so it might take till next week to close the loop. apologies about that!

Reply

[-]

DirectorOpen851@reddit

Any update on this? Hope to host this with Ollama soon!

Reply

[-]

robotphilanthropist@reddit

hmmm let us look! Sorry

Reply

[-]

ab2377@reddit

ah, the return of the 13B! i hope we see more of this size from others as well.

Reply

[-]

innominato5090@reddit

precisely our thinking lol. not enough 26B models either… mmmmh

Reply

[-]

mitsu89@reddit

We don't need several different models for every B, just use different sized quants lol.

Reply

[-]

innominato5090@reddit

well two things:

1. we need *bigger* models to quantize, so scaling up would be good
2. there are [limits](https://arxiv.org/pdf/2411.04330) to quantization. At some point, it's better to train smaller, less quantized models than to try to run larger models at lower precision.

Reply

[-]

mitsu89@reddit

obviously. 1 bit quants only produce garbage, 2-3 bit quants make mistakes too often, 4 bit quants are starting to be good. This is why i think companies released 3B, 7B, 14B and 30B models, so everyone can find an ideal sized quant.
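The size versus precision trade-off in this sub-thread comes down to simple arithmetic: weight memory is roughly parameter count times bits per weight. A small sketch, assuming idealized bit widths and ignoring the extra scales/metadata real quantization formats carry, as well as KV cache and activations:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


for n_params, label in [(7e9, "7B"), (13e9, "13B"), (30e9, "30B")]:
    row = ", ".join(
        f"{bits}-bit: {weight_memory_gb(n_params, bits):5.1f} GB" for bits in (16, 8, 4, 2)
    )
    print(f"{label:>3}: {row}")
```

On these idealized numbers a 13B model at 4 bits and a 7B model at 8 bits land in a similar memory range, which is exactly where the "bigger model, lower precision" versus "smaller model, higher precision" question gets interesting.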

Reply

[-]

Tiny_Thing5607@reddit

I tried it with llama.cpp, wow, it's really singular. Love it

Reply

[-]

AdAppropriate8772@reddit

Does this run with ooba? I can't seem to get it to load.

Reply

[-]

JacketHistorical2321@reddit

What is the significance of these models? Haven't come across them before

Reply

[-]

clduab11@reddit

They're one of the better-known producers of MoE (Mixture of Experts) models. The new releases are trained on 4 trillion tokens (for 7B) and 5 trillion tokens (for 13B). Their training set, Dolma, has a big mix of general Internet content, academic publications (Nature, etc.), code libraries, books, etc. It is also fully open source (available on HF and GitHub).

That strategy apparently paid off for these new releases: OLMo-2-7B performs within \~5 points of Gemma2-9B on the overall average, and doing that with 2B fewer parameters is pretty decent. Not earth-shattering by any means, but unlike Gemma2 (whose *weights* are open source), OLMo-2 is a fully open model, so I think that's pretty significant for the community. We get to see the sausage making and apply the various training and finetune methods for ourselves, along with one of the datasets (Dolma).

Reply

[-]

punkpeye@reddit

Can you explain what's the difference between the 'model' being open source and the weights being open-source? I thought the latter allows you to re-create the model.

Reply

[-]

LinuxSpinach@reddit

They provide all of the training data so it in theory can be analyzed and you could retrain it from scratch if you wanted to.

Reply

[-]

JawsOfALion@reddit

So that means you can't include copyrighted books or other materials without getting caught

Reply

[-]

marvinalone@reddit

Yep, and that's why we didn't. In fact, we did some last-minute culls from the training data to make sure we respect the licenses that apply.

Reply

[-]

clduab11@reddit

Not quite, but on the right track! Yes, weights are an important part in determining how the model inferences, but it isn’t the whole picture. It’s like trying to say a car is able to vroom because it has the engine in it. It does, but if you don’t have a way of taking the power the engine produces and transferring it into the wheels, you just gonna vroom vroom and go nowhere. Same premise here. Except unlike Google, who will let you see the engine (but not the manufacturing process), AllenAI will give you a whole day seminar on a walk through their plant and how they put the suspension and the transmission in and how that connects to the engine and what the engine specs are, and all that, while all of us here are furiously testing the model and taking notes lmao. It’s not a *perfect* analogy, but I hope that helps enhance your perspective.

Reply

[-]

ninjasaid13@reddit

> AllenAI will give you a whole day seminar on a walk through their plant and how they put the suspension and the transmission in and how that connects to the engine and what the engine specs are

Even with the dataset, deep learning will still be confusing.

Reply

[-]

clduab11@reddit

I mean, yes, technically true, but I feel as if that’s splitting hairs. There’s still very few companies out there who follow AllenAI’s mentality, and releases like this should hopefully spur more development on this front.

Reply

[-]

TheTerrasque@reddit

> Can you explain what's the difference between the 'model' being open source and the weighs being open-source?

Weights being "open source" is not really open source. It's more like freeware. You get the resulting "product", but not the source code behind it.

Reply

[-]

Status_Size_6412@reddit

No one except Google can make Gemma-2-9B, but everyone who has the money for it can make an OLMo-2. For leeches like us that means little to nothing, but for people making models from scratch, this "checkpoint" can save them years of time.

Reply

[-]

punkpeye@reddit

Interesting. This is contrary to my previous understanding. So what makes Gemma open-source then?

Reply

[-]

Status_Size_6412@reddit

Gemma is just open-weights. How Google got the weights is anyone's guess, including the data they used in the training, the splits, the methods they used for training, etc. Of course in practice it's leaps and bounds better than what ClosedAI is doing since open weights is more than enough for most people running local models, but for the peeps doing the cool shit, the actual models, this kind of work is super duper useful.

Reply

[-]

whats-a-monad@reddit

How is the data open though? Won't that have copyright issues? Do they just provide urls?

Reply

[-]

clduab11@reddit

That’s not exactly how it works. It’s *really* complicated.

There are burgeoning areas of copyright law where fair use litigation can be approached on a case-by-case basis for those that really want to stake a claim, but that kind of litigation is **expensive** to pursue right now. Not to mention licensing, where the license a model is released under (and its accompanying training *methods*, though not necessarily *the substance*) matters for companies who produced certain data if they WANT to make that claim. But it isn’t as easy as “it’s a copyright issue”.

The reason it’s so complicated is that words are taken by the model and “tokenized” and “vectorized”, which essentially means they’re broken down into strings of mathematical data and assigned a place on a dimensional graph of sorts, and the mathematical probabilities and combinations are what get you your info. It’s not that ablated models know how to break into Fort Knox. They just know, based on how you prompt the model, what words are most associated with “robbery” and “Fort Knox”, and start to run the math on which terms are most associated with the words of the prompt you submitted.

https://preview.redd.it/ym7lgclr1j3e1.png?width=1251&format=png&auto=webp&s=ade9c21bb88368f4289e52a1ea5375b17e2b156e

Here’s a *very* simplified overview of everything that goes into asking a model a question and getting back an answer.
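As a toy picture of "tokenized" and "vectorized": words become integer IDs, IDs index into a table of vectors, and "association" is measured as closeness between vectors. The vocabulary and the 2-dimensional vectors below are invented by hand for illustration; real models learn vectors with thousands of dimensions over subword tokens.

```python
import math

# Hand-made toy vocabulary and embedding table (purely illustrative values).
vocab = {"fort": 0, "knox": 1, "robbery": 2, "picnic": 3}
embeddings = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.1, 0.9]]


def tokenize(text: str) -> list[int]:
    """Map known lowercase words to integer IDs, skipping unknown words."""
    return [vocab[word] for word in text.lower().split() if word in vocab]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: close to 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


ids = tokenize("Fort Knox robbery")
print(ids)
print(round(cosine(embeddings[vocab["fort"]], embeddings[vocab["robbery"]]), 3))
print(round(cosine(embeddings[vocab["fort"]], embeddings[vocab["picnic"]]), 3))
```

In this made-up table "fort" sits much closer to "robbery" than to "picnic"; the model stores numbers like these, not the original documents.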

Reply

[-]

notgreat@reddit

The image you gave is how RAG/context extension works. The actual internal AI part is only the green boxes, and how the AI works internally is a big giant question mark beyond the raw math level.

Reply

[-]

MoffKalast@reddit

\> AllenAI

\> Ai2, founded by Paul Allen

https://preview.redd.it/q8134qdc0g3e1.jpeg?width=600&format=pjpg&auto=webp&s=0ab4e5d2de8a6b0cc779acc676112c1307bd65a7

Reply

[-]

innominato5090@reddit

ty for the nice explainer, couldn’t have said it better myself

Reply

[-]

kyleboddy@reddit

We use their vision models (Molmo) for basic CV work. They're quite good IME. https://huggingface.co/collections/allenai/molmo-66f379e6fe3b8ef090a8ca19

Reply

[-]

innominato5090@reddit

yay happy to hear!

Reply

[-]

hp1337@reddit

They are fully open. Meaning training data is also made available.

Reply

[-]

JacketHistorical2321@reddit

That I know. I meant more: what do they excel at?

Reply

[-]

TrustGraph@reddit

OLMo was the only model, period, that actually meets the Open Source Initiative's definition for Open Source AI. Not sure if that still holds for OLMo2, will have to check it out. I always find it shocking that people call Llama open source when Meta's license agreements explicitly say it is proprietary. Llama's license is also incredibly restrictive, especially for Llama 3.2. Just because it's "free" to "use" (sorta), doesn't make something open source.

Reply

[-]

innominato5090@reddit

It does, but it's actually not the only one! DCLM, MAP-Neo, LLM360 Amber, Zamba 1 & Zamba 2, just to name a few.

Reply

[-]

Feztopia@reddit

They are fully open-source and therefore important for the development of better models. The models are just one part of the story; they also share data and insight.

Reply

[-]

mintyalert@reddit

Can I find the dataset for the pretraining?

Reply

[-]

fairydreaming@reddit

[https://huggingface.co/datasets/allenai/olmo-mix-1124](https://huggingface.co/datasets/allenai/olmo-mix-1124)

Reply

[-]

hugo_choss@reddit

To be super crystal clear: This OLMo-mix-1124 was used for Stage 1 training (regular pretraining). This mix is mostly DCLM-Baseline + some other stuff. For stage 2, we did 3-4 seeds with the [DOLMinos mix](https://huggingface.co/datasets/allenai/dolmino-mix-1124), driving the LR linearly down to near-zero and model-souping before handing it off to post-training. [source: I uploaded these datasets to HF]
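For anyone unfamiliar with the two Stage 2 ingredients mentioned here, a minimal sketch: a learning rate that decays linearly toward zero, and a "soup" that averages the weights of several runs element-wise. The function names, step counts, and plain-list "checkpoints" are hypothetical stand-ins; this is not the actual OLMo training code.

```python
def linear_decay_lr(step: int, total_steps: int, peak_lr: float, final_lr: float = 0.0) -> float:
    """Learning rate that moves linearly from peak_lr to final_lr over total_steps."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return peak_lr + (final_lr - peak_lr) * fraction


def soup(checkpoints: list[dict[str, list[float]]]) -> dict[str, list[float]]:
    """Uniform model soup: element-wise mean of matching parameters."""
    n = len(checkpoints)
    return {
        name: [sum(values) / n for values in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in checkpoints[0]
    }


# Three hypothetical seeds that ended up with slightly different weights.
seeds = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}, {"w": [2.0, 6.0]}]
print([linear_decay_lr(s, 100, 1e-3) for s in (0, 50, 100)])
print(soup(seeds))
```

Averaging only makes sense when the runs start from the same Stage 1 checkpoint, so their weights stay close enough for the mean to be a sensible model.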

Reply

[-]

innominato5090@reddit

thanks for posting this!

Reply

[-]

No-Mountain3817@reddit

https://preview.redd.it/noq0cf6isd3e1.png?width=1512&format=png&auto=webp&s=8111a174901c7a78cad1b82c0e6de60e12520bc4

Reply

[-]

innominato5090@reddit

ok this got me a good chuckle

Reply

[-]

townofsalemfangay@reddit

stop picking on the poor AI lol

Reply

[-]

Healthy-Nebula-3603@reddit

Looks interesting ... from benchmarks Olmo 2 7b instruct looks quite similar in performance to llama 3.1 8b instruct

Reply

[-]

robotphilanthropist@reddit

Yeah, lead on post-train here, super excited that the 13b is comparable or even BETTER than 3.1 instruct

Reply

[-]

fairydreaming@reddit

I confirm this, but it's also worse than gemma-2-9b in logical reasoning (checked in farel-bench). It looks like distillation from larger models produces better results than training small models from scratch.

Reply

[-]

innominato5090@reddit

reasoning and code we are a bit weaker, yeah. Team is really excited to work on them for next release though!!

Reply

[-]

sedition666@reddit

Even that in itself is good progress. Incremental change is great.

Reply

[-]

Toby_Wan@reddit

Max token on instruct model of 2048?? :(

Reply

[-]

mpasila@reddit

I think they mean it was trained on a dataset that had max context at 2048, since the base model is 4096 and the instruct model's config says this: "max\_position\_embeddings": 4096,

Reply

[-]

MoffKalast@reddit

Ah, so in RULER terms it's 2k in practice and likely to be incoherent past that.

Reply

[-]

mpasila@reddit

Why would that happen? The base model seems to have been trained on 4k context length. Fine-tuning it on instruct datasets that are shorter than the max context length doesn't really make it worse at longer context lengths but it means the max generated responses will be much shorter.

Reply

[-]

MoffKalast@reddit

I guess it might not be as bad as if the base was 2k, but it still hasn't seen any example of an instruct conversation longer than that in its entirety so I would imagine there are problems with adherence to the format beyond it?

Reply

[-]

mpasila@reddit

But I very much don't think it's going to be "severely degraded" just because of shorter instruct examples used. Most datasets have fairly short examples anyways and most models seem fine even on longer context sizes than 2k.

Reply

[-]

innominato5090@reddit

In our testing, it has been performing just fine on longer instructions (IFEval has few >2k). But we hear the feedback loud and clear, and we will try to prioritize context extension with a point release.

Reply

[-]

llama-impersonator@reddit

if you guys could document context extension and trying it at different stages of the training cycle, that would be absolutely amazing. like difference between continuing pretrain at 16k ctx before the anneal and annealing at 16k ctx vs just anneal at 16k ctx. (for base model). none of us gpu poors have the resources for that!

Reply

[-]

innominato5090@reddit

that’s a great suggestion! definitely worth trying, hopefully some interesting results we can share.

Reply

[-]

robotphilanthropist@reddit

Instruct is trained for 4096 tokens. Most of the tokens are in SFT. At DPO we drop the length to 2048, but it doesn't change anything. Preference data is low length.

Reply

[-]

Small-Fall-6500@reddit

This is incorrect. The base models were trained on a max of 4096 tokens while different stages of the instruction tuning used different context lengths.

SFT stage shows "Max. Sequence Length: 4096"

DPO stage shows "Max. Sequence Length: 2048"

> "max_position_embeddings": 4096,

The config.json for both 7b and 13b (base, sft, instruct, etc.) shows 4k ctx. The readme for the base models also clearly says the pretrained context length is 4096.

This is still not great, but it's much better than only 2k tokens.

Reply

[-]

sammcj@reddit

4096! That isn't really useful for much short of a basic Q&A conversation as you can't provide it much context at all.

Reply

[-]

Small-Fall-6500@reddit

I agree, but the models are mainly intended for researchers. They're competing for the most capable *fully open model*, not just the most capable model. 4096 context length is likely plenty for almost all research that these models will be used for.

Reply

[-]

MoffKalast@reddit

Right and totally not for looking good on benchmarks and nothing else.

Reply

[-]

Small-Fall-6500@reddit

> Right and totally not for looking good on benchmarks and nothing else

I'm not entirely sure what you are referring to here. If you are referring to AllenAI showing in their blogpost how well their models perform on various benchmarks, I would assume that is because a garbage model would attract little attention and thus no researchers looking at or using it. It seems obvious that AllenAI would want their models to "look good on benchmarks" because of this.

> There's been virtually no open model with less than 8k context for the past year, because it's useless.

There have been zero fully open models released with 8k or more context that have been useful, unless I missed any? Map Neo 7b has 8k context but is almost certainly virtually useless for any practical applications. DCLM 7b and Amber 7b both have 2k context length. K2 65b has 8k context length but is much larger than the Olmo 2 models. OpenCoder 8b has 8k context but is trained mainly on coding and math.

I'm also not sure how less than 8k context makes these models "useless" for performing research involving generalization, contamination, memorization and anything else that requires having full access to the model's training data. (Ideally, they would have followed LLM360's approach and uploaded model and training data checkpoints, but Olmo is still a much more open model than something like Qwen).

Again, these Olmo models are the best *fully open* models, at least for their sizes. If you only care for how well a model can be run as a chatbot or code assistant or whatever, then you might as well ignore the Olmo models. There are obviously much better models to use for almost any use case *except for* ones that require having access to the model's full training data and code.

I would prefer it if Meta, Mistral, Google, and all the other groups who are releasing models could be at least as open as AllenAI, but right now the Olmo models appear to be the best fully open 7b and 13b sized models available.

Reply

[-]

MoffKalast@reddit

Well maybe I'm a bit jaded from corporations *coughs in Microsoft especially* releasing models that are designed to look good on paper and useless in practice. Seeing a model from a company set up by a MS founder makes you consider that the mindset is probably similar. OAI called themselves a non-profit while it suited them. Corporate open source is not open source, it's marketing.

> *fully open* models

In terms of research it's certainly rare to have both the pretraining and fine tuning datasets available. But we do technically have [the pretraining set for llama-3](https://huggingface.co/datasets/HuggingFaceFW/fineweb) which makes at least the base models almost fully open, minus the hyperparameters.

> for performing research involving generalization, contamination, memorization and anything else that requires having full access to the model's training data

Sure for doing last year's research, they're absolutely perfect.

> then you might as well ignore the Olmo models

Exactly. That's why I think it's a bit disingenuous to compare it directly with the likes of Llama, Gemma, Mistral and Qwen in the release when it's not the point.

Reply

[-]

Small-Fall-6500@reddit

I tried to list out every fully open model I know of, but I probably missed some. If anyone knows of any I missed, please let me know.

**Fully Open LLMs**

[**OLMo 2** \- a allenai Collection](https://huggingface.co/collections/allenai/olmo-2-674117b93ab84e98afc72edc)

* 7b and 13b with 4k context
* Base, SFT, DPO, Instruct
* Datasets available (\~200 MB files)

[**OLMo** Suite - a allenai Collection](https://huggingface.co/collections/allenai/olmo-suite-65aeaae8fe5b6b2122b46778)

* 7b, 2k and 4k context versions trained
* Olmo v1 models, several different versions
* Dataset urls uploaded to HF, actual data is on [olmo-data.org](http://olmo-data.org)

[**OLMoE** \- a allenai Collection](https://huggingface.co/collections/allenai/olmoe-66cf678c047657a30c8cd3da)

* 7b MoE with 1b active, 4k context
* 1.5B active and 7.2B total parameters
* Datasets available (\~4 GB files)

[**K2** \- a LLM360 Collection](https://huggingface.co/collections/LLM360/k2-6622ae6911e3eb6219690039)

* 65b with 8k context
* Datasets available (\~20-40 GB files)
* 360 model and data checkpoints from training

Reply

[-]

Small-Fall-6500@reddit

[**Amber** \- a LLM360 Collection](https://huggingface.co/collections/LLM360/amber-65e7333ff73c7bbb014f2f2f)

* 7b, 2k context
* Datasets available
* 360 model and data checkpoints from training

[**OpenCoder** \- a infly Collection](https://huggingface.co/collections/infly/opencoder-672cec44bbb86c39910fb55e)

* 8b and 1.5b, 8k and 4k context
* Base and Instruct
* Datasets available (300 MB files)

[**DCLM** \- a apple Collection](https://huggingface.co/collections/apple/dclm-66960ebf2400d314ff19018f)

* 7b, 2k context with an extended 8k context version
* Datasets available (\~300 MB files)

[**Neo**\-Models - a m-a-p Collection](https://huggingface.co/collections/m-a-p/neo-models-66395a5c9662bb58d5d70f04)

* 7b, 8k context
* Datasets available: [Neo Datasets - Collection](https://huggingface.co/collections/m-a-p/neo-datasets-66395dc55cbebc0a7767bbd5) (\~40 GB files, separated by category)

[**Zamba2**\-7B by Zyphra - Hugging Face](https://huggingface.co/Zyphra/Zamba2-7B)

* 1.2b, 2.7b, and 7b with 4k context
* Hybrid mamba transformer
* Datasets available: [Zyphra/Zyda-2 · Datasets at Hugging Face](https://huggingface.co/datasets/Zyphra/Zyda-2)
* Combined DCLM and Zyda (\~150 MB files)

Almost all of these are 7b or smaller, except for K2 65b and Olmo 2 13b. Every one of these has 8k or less context length.

Reply

[-]

Small-Fall-6500@reddit

[RedPajama-INCITE-7B by togethercomputer - Hugging Face](https://huggingface.co/togethercomputer/RedPajama-INCITE-7B-Base)

* 7b and 3b, 2k context
* Dataset urls uploaded to HF: [togethercomputer/RedPajama-Data-1T · Datasets at Hugging Face](https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T), actual dataset on [data.together.xyz](http://data.together.xyz)

Reply

[-]

innominato5090@reddit

responded somewhere else, but context extension should be fairly easy to do without retraining from scratch. Feedback here is important, we will try to prioritize.

Reply

[-]

SiEgE-F1@reddit

True. But we'll get there, eventually. Even Llama wasn't that smart at the beginning of its life, and it took it half a year to get a breakthrough... and the people who created it were actually paid regularly.

Reply

[-]

innominato5090@reddit

both models support up to 4k context!

Reply

[-]

extopico@reddit

That’s still terrible as that includes prompt and generation.

Reply

[-]

MoffKalast@reddit

Yeah like, you gotta allocate at least 512-1k for generation, maybe a few hundred for the system prompt, so realistically something over 2k for the actual conversation which is like llama-1 tier.
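The budget arithmetic in this comment, written out as a small sketch. The 4096 window comes from the thread; the specific reserves (512 to 1024 tokens for generation, 300 for the system prompt) are illustrative picks within the ranges mentioned, not measured values.

```python
def conversation_budget(context_window: int, generation_reserve: int, system_prompt_tokens: int) -> int:
    """Tokens left for the conversation after reserving output and system prompt."""
    remaining = context_window - generation_reserve - system_prompt_tokens
    if remaining < 0:
        raise ValueError("reserves exceed the context window")
    return remaining


for reserve in (512, 1024):
    left = conversation_budget(4096, generation_reserve=reserve, system_prompt_tokens=300)
    print(f"generation reserve {reserve:4d} -> {left} tokens left for the conversation")
```

With these picks the conversation gets somewhere between roughly 2.7k and 3.3k tokens, in line with the "something over 2k" estimate.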

Reply

[-]

innominato5090@reddit

hearing y'all loud and clear! we have plans to explore context extension. with the two stage pretraining we have been using, we can pack all long context in Stage 2, so should be fairly economical.

Reply

[-]

extopico@reddit

Thank you. Now LLMs are no longer a novelty, or sexbots. I use them for comprehension, in batch jobs where I cannot and do not want to control the prompt length. There is zero chance I will ever try a model with a small context size since beyond all the headache of setting up the pipeline the last thing I want to see is a model API returning an error or truncated/malformed response due to running out of context

Reply

[-]

Healthy-Nebula-3603@reddit

what? LOL

Reply

[-]

TrustGraph@reddit

The repo says 4096? [https://github.com/allenai/OLMo?tab=readme-ov-file#overview](https://github.com/allenai/OLMo?tab=readme-ov-file#overview)

Reply

[-]

Billy462@reddit

This release is extremely significant. For those that don't know, Allen AI is a research institute releasing **completely** open models. That means that all of their results can be reproduced (and improved upon) from scratch.

Maybe you knew that already, so why did I say "extremely significant"? This release includes a model, OLMo 2 13b, which according to benchmarks matches or exceeds Qwen 2.5 7b, Llama 3.1 8b, and Gemma2 9b, and is only slightly behind Qwen 2.5 14b. And this is with only 5T tokens...

Reply

[-]