Llama3.3:70b vs GPT-OSS:20b for PHP Code Generation

Posted by AppledogHu@reddit | LocalLLaMA

Hi! I like PHP, JavaScript, and so forth, and I'm just getting into Ollama and trying to figure out which models I should use. So I ran some tests and wrote some long, windy blog posts. I don't want to bore you with those, so here's a gpt-oss:120b-generated rewrite of what I came up with, for freshness and readability. I did check it and edit a few things, though. Welcome to the future!

Title: Llama 3.3 70B vs GPT‑OSS 20B – PHP code‑generation showdown (Ollama + Open‑WebUI)


TL;DR

| Feature | Llama 3.3 70B | GPT‑OSS 20B |
|---|---|---|
| First‑token latency | 10–30 s | ~15 s |
| Total generation time | 1–1.5 min | ~40 s |
| Lines of code (average) | 95 ± 15 | 165 ± 20 |
| JSON correctness | ✅ 3/4 runs, 1 run wrong filename | ✅ 3/4 runs, 1 run wrong filename (story.json.json) |
| File reconstruction | ✅ 3/4 runs, 1 run added stray newlines | ✅ 3/4 runs, 1 run wrong "‑2" suffix |
| Comment style | Sparse, occasional boilerplate | Detailed, numbered sections, helpful tips |
| Overall vibe | Good, but inconsistent (variable names, refactoring, whitespace handling) | Very readable, well‑commented; slightly larger but easier to understand |

Below is a single, cohesive post that walks through the experiment, the numbers, the code differences, and the final verdict.


1. Why I ran the test

I wanted a quick, repeatable way to see how Ollama‑served LLMs handle a real‑world PHP task:

Read a text file, tokenise it, build an array of objects, write a JSON summary, and re‑create the original file.

The prompt was deliberately detailed (file‑name handling, whitespace handling, analytics, etc.) and I fed exactly the same prompt to each model in a fresh chat (no prior context).
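
For concreteness, here is a minimal reference sketch of what the prompt asks for. This is my own illustrative version, not either model's output; the field names (id, t, whitespace, w, num_words, processing_time) follow the prompt described above, and everything else is an assumption:

```php
<?php
// Illustrative reference implementation of the test prompt.
$start   = microtime(true);
$request = array_merge($_GET, $_POST);
$in      = $request['file'] ?? exit("Error: No file provided.");
$text    = file_get_contents($in);

// Split into alternating word/whitespace tokens, keeping the whitespace.
$raw    = preg_split('/(\s+)/u', $text, -1,
                     PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
$tokens = [];
$words  = 0;
foreach ($raw as $i => $t) {
    $isWs = (bool) preg_match('/^\s+$/u', $t);
    if (!$isWs) {
        $words++;
    }
    $tokens[] = [
        'id'         => $i,     // token number in the file
        't'          => $t,     // the exact token as it appears
        'whitespace' => $isWs,
        // keep dash/apostrophe, drop other punctuation
        'w'          => $isWs ? '' : preg_replace("/[^\\p{L}\\p{N}'-]/u", '', $t),
    ];
}

// JSON summary next to the input file: <name>.json
$info = pathinfo($in);
file_put_contents(
    $info['dirname'] . '/' . $info['filename'] . '.json',
    json_encode([
        'num_words'       => $words,
        'processing_time' => microtime(true) - $start,
        'tokens'          => $tokens,
    ], JSON_PRETTY_PRINT)
);

// Rebuild the original text as <name>-2.<ext> from the exact tokens.
$out = $info['dirname'] . '/' . $info['filename'] . '-2.'
     . ($info['extension'] ?? 'txt');
file_put_contents($out, implode('', array_column($tokens, 't')));
```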


2. Test harness

| Step | What I did |
|---|---|
| Prompt | Same multi‑paragraph description for both models. |
| Runs per model | 4 independent generations (to catch variability). |
| Environment | Ollama + Open‑WebUI (context persists only within a single chat). |

Metrics collected:

- First‑token latency (time to the first visible token)
- Total generation time
- Lines of code (excluding blank lines)
- JSON file correctness
- Re‑generated text file correctness
- Subjective readability of the code/comments
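
For anyone who wants to script these measurements instead of stopwatching Open‑WebUI, a rough PHP sketch against Ollama's streaming /api/generate endpoint could look like this. The endpoint and request fields follow Ollama's documented API; the model name and prompt file are placeholders:

```php
<?php
// Sketch: measure first-token latency and total time for one generation
// via Ollama's streaming /api/generate endpoint.
$payload = json_encode([
    'model'  => 'gpt-oss:20b',                   // placeholder model
    'prompt' => file_get_contents('prompt.txt'), // the PHP-task prompt
    'stream' => true,
]);

$start      = microtime(true);
$firstToken = null;

$ch = curl_init('http://localhost:11434/api/generate');
curl_setopt_array($ch, [
    CURLOPT_POST          => true,
    CURLOPT_HTTPHEADER    => ['Content-Type: application/json'],
    CURLOPT_POSTFIELDS    => $payload,
    // Called for every streamed chunk; record when the first one arrives.
    CURLOPT_WRITEFUNCTION => function ($ch, $chunk) use ($start, &$firstToken) {
        if ($firstToken === null && trim($chunk) !== '') {
            $firstToken = microtime(true) - $start;
        }
        return strlen($chunk); // tell curl the chunk was consumed
    },
]);
curl_exec($ch);
curl_close($ch);

printf("first token: %.1f s, total: %.1f s\n",
       $firstToken ?? NAN, microtime(true) - $start);
```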

3. Speed & latency

| Model | First‑token latency | Total time (average) |
|---|---|---|
| Llama 3.3 70B | 10–30 s (often ~20 s) | 1–1.5 min |
| GPT‑OSS 20B | ~15 s | ~40 s |

Even though Llama 3.3 felt “slow to start”, it still finished within a minute and a half. GPT‑OSS was noticeably snappier.


4. Code size & structure

| Model | Avg. SLOC | Notable structural quirks |
|---|---|---|
| Llama 3.3 70B | 95 ± 15 | Variable names changed between runs (e.g., $outputFilename vs $outBase); some runs used file_put_contents(), others fopen()/fwrite(); inconsistent handling of whitespace tokens in the JSON (sometimes a boolean, sometimes omitted). |
| GPT‑OSS 20B | 165 ± 20 | Heavier commenting (numbered sections, "what‑this‑does" bullet points); consistent use of file_put_contents() for both JSON and text output; more explicit error handling. |

Overall, I much prefer the kind of commentary and code structure produced by GPT-OSS.
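
For anyone who doesn't write PHP daily, the two output idioms the Llama runs alternated between are functionally equivalent here (filename is a placeholder):

```php
// One-shot write (GPT-OSS used this consistently)
file_put_contents('story.json', $json);

// Explicit handle (some Llama runs chose this instead)
$fh = fopen('story.json', 'w');
fwrite($fh, $json);
fclose($fh);
```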

4.1. Sample header comparison

GPT‑OSS 20B (first ~12 lines)

```php
<?php
/**
 * a1.php
 *
 * 1. Reads a text file (via GET or POST: file=)
 * 2. Tokenises it into words & whitespace
 * 3. Builds an array of objects:
 *      - id          : the token number in the file
 *      - t           : the exact token as it appears
 *      - whitespace  : true for whitespace tokens, false otherwise
 *      - w           : processed word (keeps dash/apostrophe, removes punctuation)
 * 4. Stores the array, word‑count and elapsed time in JSON
 * 5. Writes a plain‑text file that contains all obj->t in order,
 *    with "‑2" inserted before the extension.
 *
 * The script is deliberately written step‑by‑step for teaching.
 */
```

Llama 3.3 70B (first ~12 lines)

```php
<?php
// Combine GET and POST requests
$request = array_merge($_GET, $_POST);
// Check if file is provided
if (!isset($request['file'])) {
    echo "Error: No file provided.";
    exit;
}
$filename = $request['file'];
// Set output filename
if (isset($request['out']) && $request['out'] != '') {
    $output_filename = $request['out'];
} else {
    $parts = explode('.', $filename);
    $output_filename = $parts[0] . '.json';
}
```

The GPT‑OSS header reads like a short design document, while Llama's header is essentially nonexistent – it jumps straight into code with a few inline comments. GPT‑OSS wins hands down on structure and commenting.
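
One side note on the Llama snippet: deriving the output name with explode('.', $filename) keeps only the text before the first dot, so my.story.txt becomes my.json. A pathinfo()‑based version (my suggestion, not from either model's output) is more robust and also sidesteps the story.json.json failure mode discussed below:

```php
// Derive <name>.json from the input filename, whatever its extension.
$info = pathinfo($filename);
$output_filename = ($request['out'] ?? '') !== ''
    ? $request['out']
    : $info['filename'] . '.json';   // 'filename' = basename minus extension
```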


5. JSON output quality

Both models produced human‑readable JSON in the majority of runs. The main hiccups:

| Model | Issue | Frequency |
|---|---|---|
| Llama 3.3 70B | Wrong filename handling (filename.json.json) – run 4 | 1/4 |
| GPT‑OSS 20B | Same filename bug (story.json.json) – run 2 | 1/4 |
| Both | Off‑by‑one word count in one run (4650 vs. 4651) | 1/4 each |

All other runs generated a complete JSON object with num_words, processing_time, and the full token array. However, some runs of Llama3.3:70b-instruct produced correct but human‑unreadable JSON.
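
If the unreadable output was json_encode()'s default compact single‑line form – the usual culprit – the fix is a single flag (a sketch, assuming the data structure described above):

```php
// Default: everything on one line, painful for humans.
$json = json_encode($data);

// Pretty-printed, with non-ASCII characters left readable.
$json = json_encode($data, JSON_PRETTY_PRINT | JSON_UNESCAPED_UNICODE);
file_put_contents($output_filename, $json);
```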


6. Re‑creating the original text file

| Model | Mistake(s) | How obvious was it? |
|---|---|---|
| Llama 3.3 70B | In run 4 the script added a newline after every token (fwrite($file, $token->t . "\n");), producing a file with extra blank lines. | Visible immediately when diff‑ing with the source. |
| GPT‑OSS 20B | Run 2 wrote the secondary file as story.json-2.txt (the "‑2" wasn't inserted before the extension). | Minor, but broke the naming convention. |
| Both | All other runs reproduced the file correctly. | – |
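
Both failure modes in that table are cheap to guard against: derive the "‑2" name with pathinfo() and compare the rebuilt file byte‑for‑byte. A sketch (not from either model's output; $source is a placeholder):

```php
// Build <name>-2.<ext> correctly, e.g. story.txt -> story-2.txt.
$info = pathinfo($source);
$copy = $info['filename'] . '-2.' . ($info['extension'] ?? 'txt');

// Byte-for-byte comparison – stray newlines show up immediately.
if (hash_file('sha256', $source) !== hash_file('sha256', $copy)) {
    echo "Rebuilt file differs from the original!\n";
}
```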

7. Readability & developer experience

7.1. Llama 3.3 70B

Pros

- Compact output (~95 SLOC on average) that gets straight to the point.
- Every run finished in under two minutes despite the slow start.

Cons

- Sparse comments, so the code needs more post‑generation tinkering.
- Inconsistent between runs: variable names, output idioms (file_put_contents() vs fopen()/fwrite()) and whitespace handling all varied.

7.2. GPT‑OSS 20B

Pros

- Detailed, numbered comments that read like a short design document.
- Consistent structure and more explicit error handling across runs.
- Faster to first token and finished in well under a minute.

Cons

- Noticeably larger output (~165 SLOC) for the same task.
- Hit the same wrong‑filename bug once in four runs.


8. “Instruct” variant of Llama 3.3 (quick note)

I also tried llama3.3:70b‑instruct‑q8_0 (4 runs). The standout issue was the one already noted in section 5: some runs produced correct but human‑unreadable JSON.

Conclusion: the plain llama3.3 70B remains the better choice of the two Llama variants for this task.


9. Verdict – which model should you pick?

| Decision factor | Llama 3.3 70B | GPT‑OSS 20B |
|---|---|---|
| Speed | Slower start, still < 2 min total. | Faster start, sub‑minute total. |
| Code size | Compact, but sometimes cryptic. | Verbose, but self‑documenting. |
| Reliability | 75% correct JSON/filenames. | 75% correct JSON/filenames. |
| Readability | Minimal comments, more post‑generation tinkering. | Rich comments, easier to hand off. |
| Overall "plug‑and‑play" | Good if you tolerate a bit of cleanup. | Better if you value clear documentation out of the box. |

My personal take: I’ll keep Llama 3.3 70B in my toolbox for quick one‑offs, but for any serious PHP scaffolding I’ll reach for GPT‑OSS 20B (or the 120B variant if I can spare a few extra seconds).


10. Bonus round – GPT‑OSS 120B

TL;DR – The 120‑billion‑parameter variant behaves like the 20B model but is a bit slower, and it produces more (and better) code and commentary. Accuracy goes up (≈100% correct JSON/filenames).

| Metric | GPT‑OSS 20B | GPT‑OSS 120B |
|---|---|---|
| First‑token latency | ~15 s | ≈30 s (roughly double) |
| Total generation time | ~40 s | ≈1 min 15 s |
| Average SLOC | 165 ± 20 | 190 ± 25 (≈15% larger) |
| JSON‑filename bug | 1/4 runs | 0/4 runs |
| Extra‑newline bug | 0/4 runs | 0/4 runs |
| Comment depth | Detailed, numbered sections | Very detailed – includes extra "performance notes" sections and inline type hints |
| Readability | Good | Excellent – the code seems clearer and the extra comments really help |

10.1. What changed compared with the 20B version?

Mostly the commentary – the core logic keeps the same shape. The rebuild function, for example:

```php
function rebuild_text(array $tokens, string $filename): void
{
    $fh = fopen($filename, 'w');
    foreach ($tokens as $tok) {
        // Tokens are written verbatim – appending "\n" here is exactly
        // the extra-newline bug that one Llama 3.3 run produced.
        fwrite($fh, $tok->t);
    }
    fclose($fh);
}
```


11. Bottom line

Both Llama 3.3 70B and GPT‑OSS 20B can solve the same PHP coding problem, but they do it with different trade‑offs:

- If you need quick scaffolding and are comfortable fixing a few quirks, Llama 3.3 70B is still a solid free option.
- If you prefer well‑commented, "just‑drop‑in" code and can tolerate a slightly larger file, GPT‑OSS 20B (or its 120B sibling) is the way to go.

Happy coding, and may your prompts be clear!