TheaterFire

‘chain of draft’ could cut AI costs by 90%

Posted by Zyj@reddit | LocalLLaMA | View on Reddit | 18 comments


18 Comments

FrostyContribution35@reddit

I thought the whole point of CoT was to give models more time to think, rather than resorting to curt zero shot answers.

Fit-Run5017@reddit

All reasoning is poo anyway. See "Order Doesn’t Matter, But Reasoning Does: Training LLMs with Order-Centric Augmentation": https://arxiv.org/html/2502.19907v1

Chromix_@reddit

Yes, it can cut AI cost while also cutting result quality. In [my tests](https://www.reddit.com/r/LocalLLaMA/comments/1j0uoht/comment/mgkdqg3/?context=3) CoD decreased the [SuperGPQA](https://www.reddit.com/r/LocalLLaMA/comments/1j3byj5/bytedance_unveils_supergpqa_a_new_benchmark_for/) score, which probably has more weight than a few hand-picked benchmarks. Also see other comments in that thread for more information. Keep in mind that the results are also not accurately reproducible because the authors didn't publish their full few-shot prompt.

AppearanceHeavy6724@reddit

Confirmed, it sucks. Useless.

Cergorach@reddit

Depends. Chromix_ used a tiny LLM (7B); do they get the same results with the large models? Testing on one small model isn't representative either.

Chromix_@reddit

It's one small model they also tested in the paper though. Well, almost - they used Qwen 3B.

AppearanceHeavy6724@reddit

I tried with 14b (Qwen) and 12b (Nemo) and was not impressed either.

Cergorach@reddit

I would also call those tiny LLMs. Can you reproduce it with Claude 3.5 (like they did in the paper)? Or try it with 405b or 671b?

AppearanceHeavy6724@reddit

No, as my hardware is too weak for that.

MizantropaMiskretulo@reddit

> it can cut AI cost while also cutting result quality.
> ...
> there was no improvement when testing with Qwen 2.5 7B

To be fair, it could also be that smaller, weaker models just need more scaffolding. For a model like 3.5 Sonnet, the extra tokens might be mostly redundant while Qwen 2.5 7B might need all the help it can get.

It may just be this technique is more applicable to models in the 32B, 70B, or 400B parameter range where decreasing token counts is even more important? A model like GPT 4.5 may especially benefit from fewer random, divergent "thoughts" and someone's wallet definitely will when it's being billed at $150/Mtok.

Chromix_@reddit

> It may just be this technique is more applicable to models in the 32B, 70B, or 400B parameter range where decreasing token counts is even more important?

It certainly saves more when applied to more expensive models. Yet we're in /LocalLlama here and the authors explicitly included smaller models and claimed a significant benefit for them in their paper:

> Qwen2.5 1.5B/3B instruct [...] While CoD effectively reduces the number of tokens required per response and improves accuracy over direct answer, its performance gap compared to CoT is more pronounced in these models.

MizantropaMiskretulo@reddit

> Yet we're in /LocalLLaMA here

Yes and the 405B llamas and R1 are expensive to run.

> explicitly included smaller models

Yeah, I admittedly only skimmed the paper and stopped prior to the small models section, but they do also say the full CoT does better than their method.

There's also another issue at play which needs to be considered... They didn't demonstrate any examples with multiple choice questions, so that's certainly a confounding factor.

Also it seems you didn't really follow their format.

```text
Question: A microwave oven is connected to an outlet, 120 V, and draws a current of 2 amps. At what rate is energy being used by the microwave oven?
A) 240 W
B) 120 W
C) 10 W
D) 480 W
E) 360 W
F) 200 W
G) 30 W
H) 150 W
I) 60 W
J) 300 W
Answer: voltage times current. 120 V * 2 A = 240 W. Answer: A.
```

You have two `Answer` fields and your chain of draft could be better. E.g.:

```text
Answer: energy: watts; W = V * A; 120V * 2A = 240W; #### A
```

I'm just saying invalidating their results requires a bit more rigor.
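The two formats above end differently (`Answer: A.` versus `#### A`), so any scoring script has to accept both before the prompts can be compared fairly. A minimal sketch of such an extractor follows; `extract_choice` and both regular expressions are hypothetical helpers written for this illustration, not code from the paper or from SuperGPQA's evaluation harness.

```python
import re
from typing import Optional

# Hypothetical patterns for illustration only.
# '#### A' style: a letter A-J right after the '####' separator.
_HASH_PATTERN = re.compile(r"####\s*\(?([A-J])\)?")
# 'Answer: A.' style: a lone letter A-J that ends the line.
_ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-J])\)?\s*\.?\s*$", re.MULTILINE)


def extract_choice(response: str) -> Optional[str]:
    """Return the final answer letter (A-J), or None if none is found."""
    # Prefer the '####' separator; use the last occurrence in the response.
    hash_matches = _HASH_PATTERN.findall(response)
    if hash_matches:
        return hash_matches[-1]
    # Fall back to the last line-ending 'Answer: <letter>' field.
    answer_matches = _ANSWER_PATTERN.findall(response)
    if answer_matches:
        return answer_matches[-1]
    return None


print(extract_choice("Answer: voltage times current. 120 V * 2 A = 240 W. Answer: A."))
print(extract_choice("Answer: energy: watts; W = V * A; 120V * 2A = 240W; #### A"))
```

Both calls should yield the same letter, which means the doubled `Answer` field is harmless for scoring as long as the extractor takes the last match. Whether it affects the model's behavior is a separate question.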

Chromix_@reddit

> They didn't demonstrate any examples with multiple choice questions

Well, they had yes/no questions, which are the smallest multiple-choice questions. They also have calculated results. If the LLM can calculate the correct number then it should be capable of also finding and writing the letter next to that number.

> You have two `Answer` fields and your chain of draft could be better.

Yes, I asked Mistral to transfer the existing CoT from SuperGPQA [five-shot](https://github.com/SuperGPQA/SuperGPQA/blob/main/config/prompt/five-shot.yaml) (which has two answers) to the CoD format and I think it did reasonably well. If the proposed method requires a closer adaptation to the query content, i.e. if the model cannot reasonably generalize the process on its own, then it becomes less relevant in practice since there'll be no one to adapt the few-shot examples for each user query.

> I'm just saying invalidating their results requires a bit more rigor.

Oh, I'm not invalidating the published results at all, as the paper didn't contain everything needed to accurately reproduce them (no appendix). I tried different variations on different benchmarks. All I did was to show that the approach described in the paper does not generalize, at least not for the small Qwen 3B and 7B models that I've tested. Generalization would be the most important property for others to switch to CoD.

MizantropaMiskretulo@reddit

> Well, they had yes/no questions, which are the smallest multiple-choice questions.

Lol. No. There's a fundamental difference between true/false questions and multiple choice.

> They also have calculated results. If the LLM can calculate the correct number then it should be capable of also finding and writing the letter next to that number.

Again, fundamentally different.

It seems as though you just didn't understand the paper and don't understand how LLMs actually work.

Chromix_@reddit

Adapted five-shot.yaml from SuperGPQA in case someone wants to reproduce this:

```
prompt_format:
  - |
    Answer the following multiple choice question. There is only one correct answer. Think step by step, but only keep minimum draft for each thinking step, with 5 words at most. The last line of your response should be in the format 'Answer: $LETTER' (without quotes), where LETTER is one of A, B, C, D, E, F, G, H, I, or J.

    Question: A refracting telescope consists of two converging lenses separated by 100 cm. The eye-piece lens has a focal length of 20 cm. The angular magnification of the telescope is
    A) 10
    B) 40
    C) 6
    D) 25
    E) 15
    F) 50
    G) 30
    H) 4
    I) 5
    J) 20
    Answer: Telescope: two converging lenses; Separation: 100 cm; Eye-piece focal length: 20 cm. Other lens focal length: 80 cm. Magnification: 80/20 = 4. Answer: H.

    Question: Say the pupil of your eye has a diameter of 5 mm and you have a telescope with an aperture of 50 cm. How much more light can the telescope gather than your eye?
    A) 1000 times more
    B) 50 times more
    C) 5000 times more
    D) 500 times more
    E) 10000 times more
    F) 20000 times more
    G) 2000 times more
    H) 100 times more
    I) 10 times more
    J) N/A
    Answer: Light gathering: proportional to area. Area: $\pi \left(\frac{{D}}{{2}}\right)^2$. Relative light-gathering power: $\frac{{\left(\frac{{50 \text{{ cm}}}}{{2}}\right)^2}}{{\left(\frac{{5 \text{{ mm}}}}{{2}}\right)^2}} = 10000$. Answer: E.

    Question: Where do most short-period comets come from and how do we know?
    A) The Kuiper belt; short period comets tend to be in the plane of the solar system like the Kuiper belt.
    B) The asteroid belt; short period comets tend to come from random directions indicating a spherical distribution of comets called the asteroid belt.
    C) The asteroid belt; short period comets tend to be in the plane of the solar system just like the asteroid belt.
    D) The Oort cloud; short period comets have orbital periods similar to asteroids like Vesta and are found in the plane of the solar system just like the Oort cloud.
    E) The Oort Cloud; short period comets tend to come from random directions indicating a spherical distribution of comets called the Oort Cloud.
    F) The Oort cloud; short period comets tend to be in the plane of the solar system just like the Oort cloud.
    G) The asteroid belt; short period comets have orbital periods similar to asteroids like Vesta and are found in the plane of the solar system just like the asteroid belt.
    Answer: Short-period comets: Kuiper belt; Orbits: plane of solar system. Answer: A.

    Question: Colors in a soap bubble result from light
    A) dispersion
    B) deflection
    C) refraction
    D) reflection
    E) interference
    F) converted to a different frequency
    G) polarization
    H) absorption
    I) diffraction
    J) transmission
    Answer: Soap bubble colors: light interference. Answer: E.

    Question: A microwave oven is connected to an outlet, 120 V, and draws a current of 2 amps. At what rate is energy being used by the microwave oven?
    A) 240 W
    B) 120 W
    C) 10 W
    D) 480 W
    E) 360 W
    F) 200 W
    G) 30 W
    H) 150 W
    I) 60 W
    J) 300 W
    Answer: voltage times current. 120 V * 2 A = 240 W. Answer: A.

    Question: {}
    Answer:
```
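The doubled braces in the LaTeX parts (`\frac{{D}}{{2}}`) and the bare `{}` after the last `Question:` suggest the template is filled with Python's `str.format`, where `{{` and `}}` are escapes for literal braces. A minimal sketch under that assumption: `TEMPLATE` is a shortened stand-in written for this example rather than the full prompt, and `build_prompt` is a hypothetical helper, not part of SuperGPQA.

```python
from typing import Sequence

# Shortened stand-in for the few-shot template (assumption: the real harness
# fills the template with str.format). Only the brace handling matters here.
TEMPLATE = (
    "Think step by step, but only keep minimum draft for each thinking step, "
    "with 5 words at most.\n\n"
    "Question: What is the area of a circle with diameter D?\n"
    "Answer: Area: $\\pi \\left(\\frac{{D}}{{2}}\\right)^2$. Answer: A.\n\n"
    "Question: {}\n"
    "Answer: "
)

LETTERS = "ABCDEFGHIJ"


def build_prompt(question: str, options: Sequence[str]) -> str:
    """Insert a question and its lettered options into the template."""
    if len(options) > len(LETTERS):
        raise ValueError("at most 10 options (A-J) are supported")
    lines = [question]
    lines += [f"{letter}) {option}" for letter, option in zip(LETTERS, options)]
    # str.format turns '{{' and '}}' into single braces and fills the bare '{}'.
    return TEMPLATE.format("\n".join(lines))


prompt = build_prompt(
    "At what rate does a 120 V, 2 A microwave oven use energy?",
    ["240 W", "120 W"],
)
print(prompt)
```

If the braces in the few-shot examples were left single, `str.format` would raise an error or substitute into the wrong slot, so the escaping has to survive any manual edits to the examples.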

frivolousfidget@reddit

For me it also generated much worse code. QwQ went from ~15k tokens to only ~3k but the quality suffered a lot.
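For scale, the drop from ~15k to ~3k tokens can be turned into a rough cost figure. The sketch below is illustrative arithmetic only: the token counts are the approximate numbers from this comment, the $150/Mtok price is the GPT 4.5 figure quoted elsewhere in this thread (not a QwQ price), and both function names are made up for the example.

```python
def output_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of generating `tokens` output tokens at a per-million-token price."""
    return tokens * usd_per_million_tokens / 1_000_000


def relative_saving(tokens_before: int, tokens_after: int) -> float:
    """Fraction of output tokens saved by the shorter reasoning trace."""
    return (tokens_before - tokens_after) / tokens_before


# Approximate token counts from the comment; the price is an assumption.
TOKENS_COT = 15_000
TOKENS_COD = 3_000
PRICE_USD_PER_MTOK = 150.0

cost_before = output_cost_usd(TOKENS_COT, PRICE_USD_PER_MTOK)
cost_after = output_cost_usd(TOKENS_COD, PRICE_USD_PER_MTOK)
saving = relative_saving(TOKENS_COT, TOKENS_COD)

print(f"before: ${cost_before:.2f}, after: ${cost_after:.2f}, saved: {saving:.0%}")
```

Under these assumptions the output cost falls from about $2.25 to about $0.45 per response, a saving of roughly 80%. That number says nothing about the quality loss described above.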

Feztopia@reddit

Have you tried chain of draft with non-reasoning models, and how did it affect them?

BlipOnNobodysRadar@reddit

tl;dr it's just a prompt change to get reasoning models to be concise in their chain of thought