Llama3.3:70b vs GPT-OSS:20b for PHP Code Generation
Posted by AppledogHu@reddit | LocalLLaMA | View on Reddit | 13 comments
Hi! I like PHP, JavaScript, and so forth, and I'm just getting into Ollama and trying to figure out which models I should use. So I ran some tests and wrote some long, windy blog posts. I don't want to bore you with those, so here's a gpt-oss:120b-generated rewrite of what I came up with, for freshness and readability. I did check it and edit a few things. Welcome to the future!
Title: Llama 3.3 70B vs GPT‑OSS 20B – PHP code‑generation showdown (Ollama + Open‑WebUI)
TL;DR
| Feature | Llama 3.3 70B | GPT‑OSS 20B |
|---|---|---|
| First‑token latency | 10–30 s | ~15 s |
| Total generation time | 1 – 1.5 min | ~40 s |
| Lines of code (average) | 95 ± 15 | 165 ± 20 |
| JSON correctness | ✅ 3/4 runs, 1 run wrong filename | ✅ 3/4 runs, 1 run wrong filename (story.json.json) |
| File‑reconstruction | ✅ 3/4 runs, 1 run added stray newlines | ✅ 3/4 runs, 1 run wrong “‑2” suffix |
| Comment style | Sparse, occasional boiler‑plate | Detailed, numbered sections, helpful tips |
| Overall vibe | Good, but inconsistent (variable names, refactoring, whitespace handling) | Very readable, well‑commented, slightly larger but easier to understand |
Below is a single, cohesive post that walks through the experiment, the numbers, the code differences, and the final verdict.
1. Why I ran the test
I wanted a quick, repeatable way to see how Ollama‑served LLMs handle a real‑world PHP task:
Read a text file, tokenise it, build an array of objects, write a JSON summary, and re‑create the original file.
The prompt was deliberately detailed (file‑name handling, whitespace handling, analytics, etc.) and I fed exactly the same prompt to each model in a fresh chat (no prior context).
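For readers who want to picture the task, here is a minimal sketch of my own (not either model's output) of the tokenisation step the prompt describes, assuming preg_split() with captured delimiters is an acceptable way to keep the whitespace tokens:

```php
<?php
// Hypothetical sketch of the tokenisation the prompt asks for;
// neither model produced exactly this code.
$text = file_get_contents('story.txt');

// Split on runs of whitespace but keep them as tokens so the
// original file can later be reconstructed byte-for-byte.
$tokens = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

$objects = [];
foreach ($tokens as $i => $t) {
    $objects[] = (object) [
        'id'         => $i,
        't'          => $t,
        'whitespace' => (bool) preg_match('/^\s+$/', $t),
        // assumed cleaning rule: keep letters, digits, dashes, apostrophes
        'w'          => preg_replace("/[^\\p{L}\\p{N}'-]/u", '', $t),
    ];
}
```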
2. Test harness
| Step | What I did |
|---|---|
| Prompt | Same multi‑paragraph description for both models. |
| Runs per model | 4 independent generations (to catch variability). |
| Environment | Ollama + Open‑WebUI (context persists only within a single chat). |
| Metrics collected | • First‑token latency (time to the first visible token) • Total generation time • Lines of code (excluding blank lines) • JSON file correctness • Re‑generated text file correctness • Subjective readability of the code/comments. |
3. Speed & latency
| Model | First‑token latency | Total time (average) |
|---|---|---|
| Llama 3.3 70B | 10–30 s (often ~20 s) | 1 – 1.5 min |
| GPT‑OSS 20B | ~15 s | ~40 s |
Even though Llama 3.3 felt “slow to start”, it still finished within a minute and a half. GPT‑OSS was noticeably snappier.
4. Code size & structure
| Model | Avg. SLOC | Notable structural quirks |
|---|---|---|
| Llama 3.3 70B | 95 ± 15 | • Variable names changed between runs (e.g., $outputFilename vs $outBase). • Some runs used file_put_contents(), others used fopen()/fwrite(). • Inconsistent handling of whitespace tokens in the JSON (sometimes a boolean, sometimes omitted). |
| GPT‑OSS 20B | 165 ± 20 | • Heavier commenting (numbered sections, “what‑this‑does” bullet points). • Consistent use of file_put_contents() for both JSON and text output. • More explicit error handling. |
Overall, I much prefer the kind of commentary and code structure produced by GPT-OSS.
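For anyone less familiar with PHP, the two writing styles the table refers to are functionally equivalent; a quick illustration (mine, not taken from either model's output):

```php
<?php
// Tiny example of the two output styles noted in the table above.
$data = ['num_words' => 4651, 'tokens' => []];   // illustrative payload

// Style A: single call (what GPT-OSS tended to use)
file_put_contents('story.json', json_encode($data, JSON_PRETTY_PRINT));

// Style B: explicit handle (what some Llama runs used)
$fh = fopen('story.json', 'w');
fwrite($fh, json_encode($data, JSON_PRETTY_PRINT));
fclose($fh);
```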
4.1. Sample header comparison
GPT‑OSS 20B (first ~12 lines)
```php
<?php
/**
 * a1.php
 *
 * 1. Reads a text file (via GET or POST: file=)
 * 2. Tokenises it into words & whitespace
 * 3. Builds an array of objects:
 *      - id         : the token number in the file
 *      - t          : the exact token as it appears
 *      - whitespace : true for whitespace tokens, false otherwise
 *      - w          : processed word (keeps dash/apostrophe, removes punctuation)
 * 4. Stores the array, word‑count and elapsed time in JSON
 * 5. Writes a plain‑text file that contains all obj->t in order,
 *    with “‑2” inserted before the extension.
 *
 * The script is deliberately written step‑by‑step for teaching.
 */
```
Llama 3.3 70B (first ~12 lines)
```php
<?php
// Combine GET and POST requests
$request = array_merge($_GET, $_POST);
// Check if file is provided
if (!isset($request['file'])) {
    echo "Error: No file provided.";
    exit;
}
$filename = $request['file'];
// Set output filename
if (isset($request['out']) && $request['out'] != '') {
    $output_filename = $request['out'];
} else {
    $parts = explode('.', $filename);
    $output_filename = $parts[0] . '.json';
}
```
The GPT‑OSS header reads like a short design document, while Llama’s header is essentially non-existent. GPT‑OSS wins hands down on structure and commenting.
5. JSON output quality
Both models produced human‑readable JSON in the majority of runs. The main hiccups:
| Model | Issue | Frequency |
|---|---|---|
| Llama 3.3 70B | Wrong filename handling (filename.json.json) – run 4 | 1/4 |
| GPT‑OSS 20B | Same filename bug (story.json.json) – run 2 | 1/4 |
| Both | Off‑by‑one word count in one run (4650 vs. 4651) | 1/4 each |
All other runs generated a complete JSON object with num_words, processing_time, and the full token array. However, some runs of Llama3.3:70b-instruct produced JSON that was correct but not human‑readable.
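A hedged sketch (mine, not from either model) of filename handling that avoids the double-extension bug, plus the flag that keeps the JSON human-readable:

```php
<?php
// Derive the output name from the basename without its extension, then
// append .json exactly once, even when the input already ends in .json.
$filename = 'story.json';                            // worst-case input
$base     = pathinfo($filename, PATHINFO_FILENAME);  // "story"
$jsonName = $base . '.json';                         // never "story.json.json"

$summary = [
    'num_words'       => 4651,   // illustrative values only
    'processing_time' => 0.042,
    'tokens'          => [],     // full token array would go here
];

// JSON_PRETTY_PRINT is the difference between human-readable output and a
// single very long (but still valid) line of JSON.
file_put_contents($jsonName, json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE));
```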
6. Re‑creating the original text file
| Model | Mistake(s) | How obvious was it? |
|---|---|---|
| Llama 3.3 70B | In run 4 the function added a newline after every token (fwrite($file, $token->t . "\n");). This produced a file with extra blank lines. | Visible immediately when diff‑ing with the source. |
| GPT‑OSS 20B | Run 2 wrote the secondary file as story.json-2.txt (wrong placement of the “‑2” suffix). | Minor, but broke the naming convention. |
| Both | All other runs reproduced the file correctly. | — |
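Both mistakes are easy to avoid; here is a hedged sketch of my own (function and variable names are illustrative) that writes the tokens verbatim and places the “‑2” before the extension:

```php
<?php
// Illustrative reconstruction step - not either model's code.
function rebuild_original(array $tokens, string $sourceName): void
{
    // Insert "-2" before the extension: "story.txt" -> "story-2.txt",
    // not "story.txt-2" and not "story.json-2.txt".
    $info    = pathinfo($sourceName);
    $outName = $info['filename'] . '-2'
             . (isset($info['extension']) ? '.' . $info['extension'] : '');

    $fh = fopen($outName, 'w');
    foreach ($tokens as $tok) {
        // Whitespace tokens already carry the original spaces/newlines,
        // so nothing extra is appended - that is what avoids stray blank lines.
        fwrite($fh, $tok->t);
    }
    fclose($fh);
}
```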
7. Readability & developer experience
7.1. Llama 3.3 70B
Pros
- Generates usable code quickly once the first token appears.
- Handles most of the prompt correctly (JSON, tokenisation, analytics).
Cons
- Inconsistent naming and variable choices across runs.
- Sparse comments – often just a single line like “// Calculate analytics”.
- Occasionally introduces subtle bugs (extra newlines, wrong filename).
- The commentary after the code is mostly useless – it's conversational rather than technical.
7.2. GPT‑OSS 20B
Pros
- Very thorough comments, broken into numbered sections that match the original spec.
- Helpful “tips” mapped to numbered sections in the code (e.g., regex explanation for word cleaning).
- Helpful after-code overview that references the numbered sections in the code. This alone is almost a game changer.
- Consistent logic and naming across runs (reliable!)
- Consistent and sane levels of error handling (die() with clear messages – see the sketch at the end of this section).
Cons
- None worth mentioning
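A rough illustration of that error-handling style (my own paraphrase, not verbatim from any run):

```php
<?php
// Hypothetical example of the die()-with-a-clear-message pattern
// described above; parameter handling is simplified.
$filename = $_GET['file'] ?? $_POST['file'] ?? null;

if ($filename === null) {
    die("Error: no 'file' parameter supplied (use ?file=story.txt).");
}
if (!is_readable($filename)) {
    die("Error: cannot read '$filename' - check the path and permissions.");
}

$text = file_get_contents($filename);
if ($text === false) {
    die("Error: reading '$filename' failed.");
}
```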
8. “Instruct” variant of Llama 3.3 (quick note)
I also tried llama3.3:70b‑instruct‑q8_0 (4 runs).
- Latency: the slowest of the models tested – 30 s to 1 min to first token, ~2–3 min total.
- Code length similar to the regular 70 B model.
- Two runs omitted newlines in the regenerated text (making it unreadable).
- None of the runs correctly handled the output filename (all clobbered story-2.txt).
Conclusion: the plain llama3.3 70B remains the better choice of the two Llama variants for this task.
9. Verdict – which model should you pick?
| Decision factor | Llama 3.3 70B | GPT‑OSS 20B |
|---|---|---|
| Speed | Slower start, still < 2 min total. | Faster start, sub‑minute total. |
| Code size | Compact, but sometimes cryptic. | Verbose, but self‑documenting. |
| Reliability | 75 % correct JSON / filenames. | 75 % correct JSON / filenames. |
| Readability | Minimal comments, more post‑generation tinkering. | Rich comments, easier to hand‑off. |
| Overall “plug‑and‑play” | Good if you tolerate a bit of cleanup. | Better if you value clear documentation out‑of‑the‑box. |
My personal take: I’ll keep Llama 3.3 70B in my toolbox for quick one‑offs, but for any serious PHP scaffolding I’ll reach for GPT‑OSS 20B (or the 120B variant if I can spare a few extra seconds).
10. Bonus round – GPT‑OSS 120B
TL;DR – The 120‑billion‑parameter variant behaves like the 20 B model but is a bit slower and produces more, and better, code and commentary. Accuracy goes up (≈ 100 % correct JSON / filenames).
| Metric | GPT‑OSS 20B | GPT‑OSS 120B |
|---|---|---|
| First‑token latency | ~15 s | ≈ 30 s (roughly double) |
| Total generation time | ~40 s | ≈ 1 min 15 s |
| Average SLOC | 165 ± 20 | 190 ± 25 (≈ 15 % larger) |
| JSON‑filename bug | 1/4 runs | 0/4 runs |
| Extra‑newline bug | 0/4 runs | 0/4 runs |
| Comment depth | Detailed, numbered sections | Very detailed – includes extra “performance‑notes” sections and inline type hints |
| Readability | Good | Excellent – the code seems clearer and the extra comments really help |
10.1. What changed compared with the 20 B version?
- Latency: The larger model needs roughly twice the time to emit the first token. Once it starts, the per‑token speed is similar, so the overall time is only 10-30 s longer.
- Code size: The 120 B model adds a few more helper functions (e.g., sanitize_word(), format_elapsed_time()) and extra inline documentation. The extra lines are mostly comments, not logic.
- Bug pattern: The filename bugs from the 20 B runs did not re‑appear (no *.json.json filename, correct “‑2” suffix). However, run 3 introduced an unwanted newline when rebuilding the original file:
```php
function rebuild_text(array $tokens, string $filename): void {
    $fh = fopen($filename, 'w');
    foreach ($tokens as $tok) {
        // The 120B model mistakenly adds "\n" here in one of the runs
        fwrite($fh, $tok->t . "\n");
    }
    fclose($fh);
}
```
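For reference, the helper names above come from the 120 B runs, but the bodies below are only my guess at their shape – treat them as hypothetical:

```php
<?php
// Hypothetical reconstructions: the function names appeared in the 120B
// output, the bodies here are illustrative approximations.
function sanitize_word(string $token): string
{
    // Keep letters, digits, dashes and apostrophes; drop other punctuation.
    return preg_replace("/[^\\p{L}\\p{N}'-]/u", '', $token);
}

function format_elapsed_time(float $seconds): string
{
    return number_format($seconds, 3) . ' s';
}
```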
11. Bottom line
Both Llama 3.3 70B and GPT‑OSS 20B can solve the same PHP coding problem, but they do it with different trade‑offs:
- Llama 3.3 70B – smaller, faster to read once generated, but inconsistent and occasionally buggy.
- GPT‑OSS 20B – larger, more verbose, but gives you a ready‑to‑read design document in the code itself.
If you need quick scaffolding and are comfortable fixing a few quirks, Llama 3.3 70B is still a solid free option.
If you prefer well‑commented, “just‑drop‑in” code and can tolerate a slightly larger file, GPT‑OSS 20B (or its 120B sibling) is the way to go.
Happy coding, and may your prompts be clear!
MrMisterShin@reddit
What quantisation were you running for Llama 3.3 70B?
Did you apply any KV cache quantisation?
Did you adjust the temperature at all?
What reasoning effort was used for GPT OSS 20b, (low, medium or high)?
Long_comment_san@reddit
Nice, it's cool to read about real-world tasks like that. I don't code as a job, but with the quality of recent models I just might. I hope people poking at it won't discourage you.
AppledogHu@reddit (OP)
No, it's great! It seems like LM Studio is better than Ollama for some tasks. I'm actually surprised, though, since I read Ollama is preferred by developers. I don't find any issues with it. I'll try LM Studio later, of course. The thing is, I discovered that both of the models I reviewed are *good enough* for what I need. I'm sure there are better models, but it is interesting to me that it probably doesn't matter as much as some people think. I have a sneaking suspicion that using a model aligned with your task, and using it in the right workflow, is more important than having the best and latest model. That's why I compared llama3.3:70b and gpt-oss:20b. As it turns out, gpt-oss is way more aligned for coding than llama. I find that very interesting.
No_Gold_8001@reddit
It makes no sense to say that Ollama is better for some tasks than LM Studio… why do you think it is better for “tooling and developers”?
Both will be running the same llama.cpp backend. Behind the curtains they are literally the same thing.
AppledogHu@reddit (OP)
If it's running the same backend then why do some people say it's better or worse? Also, I looked around and couldn't find any real comparisons, reviews or even benchmarks featuring these LLM models. There are a few, but unless you know where I can find one I'm left to perform such work myself. As an example, I found gpt-oss produces much better code than qwen3-coder. I was surprised. I wonder if it's different on a language-by-language basis?
No_Gold_8001@reddit
You can configure llama.cpp in many ways. LM Studio lets you select those options more easily; Ollama requires you to set them via env vars, which is not as obvious, and imho it has some bad naming conventions and defaults that make things very confusing for newcomers. That is the reason I recommend LM Studio over Ollama. On non-Mac machines they are running the same software, but LM Studio is easier to manage.
vLLM is in a different class of software. You normally use llama.cpp because it is easy and lets you partially offload the model to RAM and CPU. vLLM is kind of the opposite: it is great at batching and at running models from the GPU, and the GPU alone.
So if you need to support hundreds of users or you are batching thousands of requests, vLLM will be much faster.
When I am running a model on my Mac, I use LM Studio with MLX. When I am running on a server for multiple users, I use vLLM.
In a very oversimplified way this is the difference.
jacek2023@reddit
Try Qwen Coder 30B or Devstral 24B
simracerman@reddit
Llama.cpp+llama-swap+Openwebui
This combo is excellent.
lumos675@reddit
With Ollama you get the worst tps possible. Not to mention it's not user-friendly at all. Use LM Studio. Also, Qwen3 Coder and Seed-OSS 36B are the best models for coding in my opinion; they mostly finish my tasks. Yesterday I needed to do a task and only Qwen Coder managed to do it. I even tried Sonnet 4.5 for that task and it could not do it. The day before that, I got another task Sonnet did not do, and I did it with the new MiniMax release. So overall I really don't get the hype behind Sonnet at all, and those benchmarks are really only good for themselves. Every task needs a different model.
muxxington@reddit
With LM Studio you get the second worst tps possible. Once again, loud and clear for everyone: use FOSS!
DinoAmino@reddit
I didn't see mention of the reasoning effort used for GPT-OSS. Should we assume you used the default "medium"?
muxxington@reddit
I stopped reading at "ollama".
AccordingRespect3599@reddit
Not sure why you compare these two models. It doesn't make any sense.