ByteDance-Seed/Cola-DLM · Hugging Face
Posted by pmttyji@reddit | LocalLLaMA | View on Reddit | 9 comments
Cola DLM (Continuous Latent Diffusion Language Model) is a hierarchical continuous latent-space diffusion language model. It combines a Text VAE with a block-causal Diffusion Transformer (DiT) prior: the VAE maps text into continuous latent sequences and decodes latents back to tokens, while the DiT performs latent prior transport through Flow Matching.
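Below is a minimal sketch (not the official implementation) of what the Flow Matching objective over VAE latents could look like; the `vae.encode` and `dit` callables, their shapes, and the linear (rectified-flow-style) interpolation path are assumptions for illustration only.

```python
import torch

def flow_matching_loss(vae, dit, token_ids):
    # Encode tokens into continuous latents z1 (the data endpoint of the flow).
    z1 = vae.encode(token_ids)                            # [batch, seq, latent_dim] (assumed shape)
    z0 = torch.randn_like(z1)                             # noise endpoint
    t = torch.rand(z1.size(0), 1, 1, device=z1.device)    # per-sample time in [0, 1]

    # Linear interpolation path between noise and data.
    zt = (1 - t) * z0 + t * z1
    target_velocity = z1 - z0                              # time derivative of the linear path

    # The block-causal DiT prior predicts the velocity field at (zt, t).
    pred_velocity = dit(zt, t)
    return torch.mean((pred_velocity - target_velocity) ** 2)
```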
This model repository contains the HuggingFace-format checkpoint for the paper Continuous Latent Diffusion Language Model.
Links
- Model repository: https://huggingface.co/ByteDance-Seed/Cola-DLM
- GitHub repository: https://github.com/ByteDance-Seed/Cola-DLM
- Paper: https://arxiv.org/abs/2605.06548
- HuggingFace Daily Paper: https://huggingface.co/papers/2605.06548
- Project page: https://hongcanguo.github.io/Cola-DLM/
- Blog post: https://hongcanguo.github.io/posts/2026-cola-dlm.html
- Zhihu article: https://zhuanlan.zhihu.com/p/2038324180920313704
Model Details
- Architecture: Text VAE + block-causal DiT latent prior.
- Training objective: two-stage training with Text VAE pretraining followed by joint Text VAE + DiT training using Flow Matching.
- Training-compute checkpoint: the released weights correspond to the 2000 EFLOPs checkpoint reported in the paper's RQ4 scaling curve.
- Tokenizer: OLMo 2 tokenizer with a 100,278-entry vocabulary.
- Special token ids: pad_token_id=100277, eos_token_id=100257, im_end_token_id=100265.
- Framework: PyTorch 2.1+ and HuggingFace Transformers 4.40+ (see the hedged loading sketch after this list).
- License: Apache License 2.0.
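A minimal loading sketch, assuming the HuggingFace-format checkpoint ships custom modeling code (hence `trust_remote_code=True`); the actual model class and generation API may differ, so treat this as a starting point rather than the documented usage.

```python
from transformers import AutoTokenizer, AutoModel

repo = "ByteDance-Seed/Cola-DLM"

# Tokenizer is stated to be the OLMo 2 tokenizer with a 100,278-entry vocabulary.
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModel.from_pretrained(repo, trust_remote_code=True, torch_dtype="auto")

inputs = tokenizer("Continuous latent diffusion", return_tensors="pt")
print(inputs.input_ids.shape)
print(tokenizer.pad_token_id)  # expected to be 100277 per the model card above
```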
Dolsis@reddit
Yet another CUDA-or-CPU-only model.
Still waiting for diffusion models I can run with Vulkan.
Maybe I missed something, but I don't think I can use my (AMD) 7900 RT GPU to run it (ROCm support with this card is meh on Fedora; maybe I should use Ubuntu only for these use cases?).
I have the same disappointment with the qwen-image model.
This_Maintenance_834@reddit
CUDA is the de facto first-class citizen. If playing with the newest models becomes important, I guess people just have to sell the AMD card and buy an NVIDIA one. I know that costs significantly more money.
a_slay_nub@reddit
MMLU of 19? I thought random guessing was 25?
xeeff@reddit
Pulled from HF.
Silver-Champion-4846@reddit
How many params exactly?
pmttyji@reddit (OP)
From Blog post:
13. Scaling Experiments
At ~2B parameters, ~2000 EFLOPs, and under strictly matched comparisons, Cola DLM's hierarchical continuous latent prior modeling demonstrates a meaningful scaling trend.
Silver-Champion-4846@reddit
Hmmm, 2B... GGUF probably needs llama.cpp support for this arch first.
kevinlch@reddit
If I understood correctly, the model is now aimed at delivering "meaning" instead of the "most probable next word". So the model has to resolve the overall meaning first and then populate the token sequence later, meaning the user can pick any length or level of detail for the output, and hallucination will be very low. Correct?
j_osb@reddit
Wow, this is very exciting. Hope to see some more support for this.