Llama3.3:70b vs GPT-OSS:20b for PHP Code Generation
Posted by AppledogHu@reddit | LocalLLaMA | View on Reddit | 13 comments
Hi! I like PHP, JavaScript, and so forth, and I'm just getting into Ollama and trying to figure out which models I should use. So I ran some tests and wrote some long, windy blog posts. I don't want to bore you with those, so here's a gpt-oss:120b-generated rewrite of what I came up with, for freshness and readability. I did check it and edit a few things. Welcome to the future!
Title: Llama 3.3 70B vs GPT‑OSS 20B – PHP code‑generation showdown (Ollama + Open‑WebUI)
TL;DR
| Feature | Llama 3.3 70B | GPT‑OSS 20B |
|---|---|---|
| First‑token latency | 10–30 s | ~15 s |
| Total generation time | 1 – 1.5 min | ~40 s |
| Lines of code (average) | 95 ± 15 | 165 ± 20 |
| JSON correctness | ✅ 3/4 runs, 1 run wrong filename | ✅ 3/4 runs, 1 run wrong filename (story.json.json) |
| File‑reconstruction | ✅ 3/4 runs, 1 run added stray newlines | ✅ 3/4 runs, 1 run wrong “‑2” suffix |
| Comment style | Sparse, occasional boiler‑plate | Detailed, numbered sections, helpful tips |
| Overall vibe | Good, but inconsistent (variable names, refactoring, whitespace handling) | Very readable, well‑commented, slightly larger but easier to understand |
Below is a single, cohesive post that walks through the experiment, the numbers, the code differences, and the final verdict.
1. Why I ran the test
I wanted a quick, repeatable way to see how Ollama‑served LLMs handle a real‑world PHP task:
Read a text file, tokenise it, build an array of objects, write a JSON summary, and re‑create the original file.
The prompt was deliberately detailed (file‑name handling, whitespace handling, analytics, etc.) and I fed exactly the same prompt to each model in a fresh chat (no prior context).
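For readers who want to picture the task, here is a minimal sketch of my own (not either model's output) of the tokenisation step the prompt describes, assuming preg_split() with captured delimiters is an acceptable way to keep the whitespace tokens:

```php
<?php
// Hypothetical sketch of the tokenisation the prompt asks for;
// neither model produced exactly this code.
$text = file_get_contents('story.txt');

// Split on runs of whitespace but keep them as tokens so the
// original file can later be reconstructed byte-for-byte.
$tokens = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

$objects = [];
foreach ($tokens as $i => $t) {
    $objects[] = (object) [
        'id'         => $i,
        't'          => $t,
        'whitespace' => (bool) preg_match('/^\s+$/', $t),
        // assumed cleaning rule: keep letters, digits, dashes, apostrophes
        'w'          => preg_replace("/[^\\p{L}\\p{N}'-]/u", '', $t),
    ];
}
```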
2. Test harness
| Step | What I did |
|---|---|
| Prompt | Same multi‑paragraph description for both models. |
| Runs per model | 4 independent generations (to catch variability). |
| Environment | Ollama + Open‑WebUI (context persists only within a single chat). |
| Metrics collected | • First‑token latency (time to the first visible token) • Total generation time • Lines of code (excluding blank lines) • JSON file correctness • Re‑generated text file correctness • Subjective readability of the code/comments. |
3. Speed & latency
| Model | First‑token latency | Total time (average) |
|---|---|---|
| Llama 3.3 70B | 10–30 s (often ~20 s) | 1 – 1.5 min |
| GPT‑OSS 20B | ~15 s | ~40 s |
Even though Llama 3.3 felt “slow to start”, it still finished within a minute and a half. GPT‑OSS was noticeably snappier.
4. Code size & structure
| Model | Avg. SLOC | Notable structural quirks |
|---|---|---|
| Llama 3.3 70B | 95 ± 15 | • Variable names changed between runs (e.g., $outputFilename vs $outBase). • Some runs used file_put_contents(), others used fopen()/fwrite(). • Inconsistent handling of whitespace tokens in the JSON (sometimes a boolean, sometimes omitted). |
| GPT‑OSS 20B | 165 ± 20 | • Heavier commenting (numbered sections, “what‑this‑does” bullet points). • Consistent use of file_put_contents() for both JSON and text output. • More explicit error handling. |
Overall, I much prefer the kind of commentary and code structure produced by GPT-OSS.
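For anyone less familiar with PHP, the two writing styles the table refers to are functionally equivalent; a quick illustration (mine, not taken from either model's output):

```php
<?php
// Tiny example of the two output styles noted in the table above.
$data = ['num_words' => 4651, 'tokens' => []];   // illustrative payload

// Style A: single call (what GPT-OSS tended to use)
file_put_contents('story.json', json_encode($data, JSON_PRETTY_PRINT));

// Style B: explicit handle (what some Llama runs used)
$fh = fopen('story.json', 'w');
fwrite($fh, json_encode($data, JSON_PRETTY_PRINT));
fclose($fh);
```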
4.1. Sample header comparison
GPT‑OSS 20B (first ~12 lines)
```php
<?php
/**
 * a1.php
 *
 * 1. Reads a text file (via GET or POST: file=)
 * 2. Tokenises it into words & whitespace
 * 3. Builds an array of objects:
 *      - id         : the token number in the file
 *      - t          : the exact token as it appears
 *      - whitespace : true for whitespace tokens, false otherwise
 *      - w          : processed word (keeps dash/apostrophe, removes punctuation)
 * 4. Stores the array, word‑count and elapsed time in JSON
 * 5. Writes a plain‑text file that contains all obj->t in order,
 *    with “‑2” inserted before the extension.
 *
 * The script is deliberately written step‑by‑step for teaching.
 */
```
Llama 3.3 70B (first ~12 lines)
```php
<?php
// Combine GET and POST requests
$request = array_merge($_GET, $_POST);
// Check if file is provided
if (!isset($request['file'])) {
    echo "Error: No file provided.";
    exit;
}
$filename = $request['file'];
// Set output filename
if (isset($request['out']) && $request['out'] != '') {
    $output_filename = $request['out'];
} else {
    $parts = explode('.', $filename);
    $output_filename = $parts[0] . '.json';
}
```
The GPT‑OSS header reads like a short design document, while Llama’s header is essentially non-existent. GPT‑OSS wins hands down on structure and commenting.
5. JSON output quality
Both models produced human‑readable JSON in the majority of runs. The main hiccups:
| Model | Issue | Frequency |
|---|---|---|
| Llama 3.3 70B | Wrong filename handling (filename.json.json) – run 4 | 1/4 |
| GPT‑OSS 20B | Same filename bug (story.json.json) – run 2 | 1/4 |
| Both | Off‑by‑one word count in one run (4650 vs. 4651) | 1/4 each |
All other runs generated a complete JSON object with num_words, processing_time, and the full token array. However, some runs of Llama3.3:70b-instruct produced JSON that was correct but not human‑readable.
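A hedged sketch (mine, not from either model) of filename handling that avoids the double-extension bug, plus the flag that keeps the JSON human-readable:

```php
<?php
// Derive the output name from the basename without its extension, then
// append .json exactly once, even when the input already ends in .json.
$filename = 'story.json';                            // worst-case input
$base     = pathinfo($filename, PATHINFO_FILENAME);  // "story"
$jsonName = $base . '.json';                         // never "story.json.json"

$summary = [
    'num_words'       => 4651,   // illustrative values only
    'processing_time' => 0.042,
    'tokens'          => [],     // full token array would go here
];

// JSON_PRETTY_PRINT is the difference between human-readable output and a
// single very long (but still valid) line of JSON.
file_put_contents($jsonName, json_encode($summary, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE));
```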
6. Re‑creating the original text file
| Model | Mistake(s) | How obvious was it? |
|---|---|---|
| Llama 3.3 70B | In run 4 the function added a newline after every token (fwrite($file, $token->t . "\n");). This produced a file with extra blank lines. | Visible immediately when diff‑ing with the source. |
| GPT‑OSS 20B | Run 2 wrote the secondary file as story.json-2.txt (wrong placement of the “‑2” suffix). | Minor, but broke the naming convention. |
| Both | All other runs reproduced the file correctly. | — |
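Both mistakes are easy to avoid; here is a hedged sketch of my own (function and variable names are illustrative) that writes the tokens verbatim and places the “‑2” before the extension:

```php
<?php
// Illustrative reconstruction step - not either model's code.
function rebuild_original(array $tokens, string $sourceName): void
{
    // Insert "-2" before the extension: "story.txt" -> "story-2.txt",
    // not "story.txt-2" and not "story.json-2.txt".
    $info    = pathinfo($sourceName);
    $outName = $info['filename'] . '-2'
             . (isset($info['extension']) ? '.' . $info['extension'] : '');

    $fh = fopen($outName, 'w');
    foreach ($tokens as $tok) {
        // Whitespace tokens already carry the original spaces/newlines,
        // so nothing extra is appended - that is what avoids stray blank lines.
        fwrite($fh, $tok->t);
    }
    fclose($fh);
}
```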
7. Readability & developer experience
7.1. Llama 3.3 70B
Pros
- Generates usable code quickly once the first token appears.
- Handles most of the prompt correctly (JSON, tokenisation, analytics).
Cons
- Inconsistent naming and variable choices across runs.
- Sparse comments – often just a single line like “// Calculate analytics”.
- Occasionally introduces subtle bugs (extra newlines, wrong filename).
- The commentary after the code is mostly useless – it's conversational rather than technical.
7.2. GPT‑OSS 20B
Pros
- Very thorough comments, broken into numbered sections that match the original spec.
- Helpful “tips” mapped to numbered sections in the code (e.g., regex explanation for word cleaning).
- Helpful after-code overview that references the numbered sections in the code. This alone is almost a game changer.
- Consistent logic and naming across runs (reliable!)
- Consistent and sane levels of error handling (die() with clear messages – see the sketch at the end of this section).
Cons
- None worth mentioning
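A rough illustration of that error-handling style (my own paraphrase, not verbatim from any run):

```php
<?php
// Hypothetical example of the die()-with-a-clear-message pattern
// described above; parameter handling is simplified.
$filename = $_GET['file'] ?? $_POST['file'] ?? null;

if ($filename === null) {
    die("Error: no 'file' parameter supplied (use ?file=story.txt).");
}
if (!is_readable($filename)) {
    die("Error: cannot read '$filename' - check the path and permissions.");
}

$text = file_get_contents($filename);
if ($text === false) {
    die("Error: reading '$filename' failed.");
}
```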
8. “Instruct” variant of Llama 3.3 (quick note)
I also tried llama3.3:70b‑instruct‑q8_0 (4 runs).
- Latency: the slowest of the models tested – 30 s to 1 min to first token, ~2–3 min total.
- Code length similar to the regular 70 B model.
- Two runs omitted newlines in the regenerated text (making it unreadable).
- None of the runs correctly handled the output filename (all clobbered story-2.txt).
Conclusion: the plain llama3.3 70B remains the better choice of the two Llama variants for this task.
9. Verdict – which model should you pick?
| Decision factor | Llama 3.3 70B | GPT‑OSS 20B |
|---|---|---|
| Speed | Slower start, still < 2 min total. | Faster start, sub‑minute total. |
| Code size | Compact, but sometimes cryptic. | Verbose, but self‑documenting. |
| Reliability | 75 % correct JSON / filenames. | 75 % correct JSON / filenames. |
| Readability | Minimal comments, more post‑generation tinkering. | Rich comments, easier to hand‑off. |
| Overall “plug‑and‑play” | Good if you tolerate a bit of cleanup. | Better if you value clear documentation out‑of‑the‑box. |
My personal take: I’ll keep Llama 3.3 70B in my toolbox for quick one‑offs, but for any serious PHP scaffolding I’ll reach for GPT‑OSS 20B (or the 120B variant if I can spare a few extra seconds).
10. Bonus round – GPT‑OSS 120B
TL;DR – The 120‑billion‑parameter variant behaves like the 20 B model but is a bit slower and produces more, and better, code and commentary. Accuracy goes up (≈ 100 % correct JSON / filenames).
| Metric | GPT‑OSS 20B | GPT‑OSS 120B |
|---|---|---|
| First‑token latency | ~15 s | ≈ 30 s (roughly double) |
| Total generation time | ~40 s | ≈ 1 min 15 s |
| Average SLOC | 165 ± 20 | 190 ± 25 (≈ 15 % larger) |
| JSON‑filename bug | 1/4 runs | 0/4 runs |
| Extra‑newline bug | 0/4 runs | 0/4 runs |
| Comment depth | Detailed, numbered sections | Very detailed – includes extra “performance‑notes” sections and inline type hints |
| Readability | Good | Excellent – the code seems clearer and the extra comments really help |
10.1. What changed compared with the 20 B version?
- Latency: The larger model needs roughly twice the time to emit the first token. Once it starts, the per‑token speed is similar, so the overall time is only 10-30 s longer.
- Code size: The 120 B model adds a few more helper functions (e.g., sanitize_word(), format_elapsed_time()) and extra inline documentation. The extra lines are mostly comments, not logic.
- Bug pattern: The filename bugs from the 20 B runs did not re‑appear (no *.json.json filename, correct “‑2” suffix). However, run 3 introduced an unwanted newline when rebuilding the original file:
```php
function rebuild_text(array $tokens, string $filename): void {
    $fh = fopen($filename, 'w');
    foreach ($tokens as $tok) {
        // The 120B model mistakenly adds "\n" here in one of the runs
        fwrite($fh, $tok->t . "\n");
    }
    fclose($fh);
}
```
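For reference, the helper names above come from the 120 B runs, but the bodies below are only my guess at their shape – treat them as hypothetical:

```php
<?php
// Hypothetical reconstructions: the function names appeared in the 120B
// output, the bodies here are illustrative approximations.
function sanitize_word(string $token): string
{
    // Keep letters, digits, dashes and apostrophes; drop other punctuation.
    return preg_replace("/[^\\p{L}\\p{N}'-]/u", '', $token);
}

function format_elapsed_time(float $seconds): string
{
    return number_format($seconds, 3) . ' s';
}
```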
11. Bottom line
Both Llama 3.3 70B and GPT‑OSS 20B can solve the same PHP coding problem, but they do it with different trade‑offs:
- Llama 3.3 70B – smaller, faster to read once generated, but inconsistent and occasionally buggy.
- GPT‑OSS 20B – larger, more verbose, but gives you a ready‑to‑read design document in the code itself.
If you need quick scaffolding and are comfortable fixing a few quirks, Llama 3.3 70B is still a solid free option.
If you prefer well‑commented, “just‑drop‑in” code and can tolerate a slightly larger file, GPT‑OSS 20B (or its 120B sibling) is the way to go.
Happy coding, and may your prompts be clear!
MrMisterShin@reddit
What quantisation were you running for Llama 3.3 70B?
Did you apply any KV cache quantisation?
Did you adjust the temperature at all?
What reasoning effort was used for GPT OSS 20b, (low, medium or high)?
Long_comment_san@reddit
Nice, it's cool to read about real-world tasks like that. I don't code as a job, but with the quality of recent models I just might. I hope people poking at it won't discourage you.
AppledogHu@reddit (OP)
No, it's great! It seems like LM Studio is better than Ollama for some tasks. I'm actually surprised, though, since I read Ollama is preferred by developers. I don't find any issues with it. I'll try LM Studio later, of course. The thing is, I discovered that both of the models I reviewed are *good enough* for what I need. I'm sure there are better models, but it is interesting to me that it probably doesn't matter as much as some people think. I have a sneaking suspicion that using a model aligned with your task, and using it in the right workflow, is more important than having the best and latest model. That's why I compared llama3.3:70b and gpt-oss:20b. As it turns out, gpt-oss is way more aligned for coding than llama. I find that very interesting.
No_Gold_8001@reddit
It makes no sense to say that Ollama is better for some tasks than LM Studio… why do you think it is better for “tooling and developers”?
Both will be running the same llama.cpp backend. Behind the curtains they are literally the same thing.
AppledogHu@reddit (OP)
If it's running the same backend then why do some people say it's better or worse? Also, I looked around and couldn't find any real comparisons, reviews or even benchmarks featuring these LLM models. There are a few, but unless you know where I can find one I'm left to perform such work myself. As an example, I found gpt-oss produces much better code than qwen3-coder. I was surprised. I wonder if it's different on a language-by-language basis?
No_Gold_8001@reddit
You can configure llama.cpp in many ways. LM Studio lets you select those options more easily; Ollama requires you to set them via env vars, which is not as obvious, and imho it has some bad naming conventions and defaults that make things very confusing for newcomers. That is the reason I recommend LM Studio over Ollama. On non-Mac machines they are running the same software, but LM Studio is easier to manage.
vLLM is in a different class of software. You normally use llama.cpp because it is easy and lets you partially offload the model to RAM and CPU. vLLM is kind of the opposite: it is great at batching and at running models from the GPU, and the GPU alone.
So if you need to support hundreds of users or you are batching thousands of requests, vLLM will be much faster.
When I am running a model on my Mac, I use LM Studio with MLX. When I am running on a server for multiple users, I use vLLM.
In a very oversimplified way this is the difference.
jacek2023@reddit
Try Qwen Coder 30B or Devstral 24B
simracerman@reddit
Llama.cpp+llama-swap+Openwebui
This combo is excellent.
lumos675@reddit
With Ollama you get the worst tps possible. Not to mention it's not user-friendly at all. Use LM Studio. Also, Qwen3 Coder and Seed-OSS 36B are the best models for coding in my opinion; they mostly finish my tasks. Yesterday I needed to do a task and only Qwen Coder managed to do it. I even tried Sonnet 4.5 for that task and it could not do it. The day before that, I got another task Sonnet did not do, and I did it with the new MiniMax release. So overall I really don't get the hype behind Sonnet at all, and those benchmarks are really only good for themselves. Every task needs a different model.
muxxington@reddit
With LM Studio you get the second worst tps possible. Once again, loud and clear for everyone: use FOSS!
DinoAmino@reddit
I didn't see mention of the reasoning effort used for GPT-OSS. Should we assume you used the default "medium"?
muxxington@reddit
I stopped reading at "ollama".
AccordingRespect3599@reddit
Not sure why you compare these two models. It doesn't make any sense.