Inference-time drift reduces repetition collapse in frozen Llama-3.1-8B (repo + reproducible script)

Posted by chazc2@reddit | LocalLLaMA | View on Reddit | 2 comments

I stumbled onto an odd behavior while experimenting with inference-only modifications:

By adding a small Gaussian drift term to an untrained fast-weight memory module and feeding it into a frozen Llama-3.1-8B model, long-form repetition collapse was significantly delayed.

No training, no LoRA, no fine-tuning, no KV cache edits. Model weights stay frozen.
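For readers who want the gist before opening the repo: the mechanism described above can be sketched as a leaky Hebbian fast-weight matrix that receives a small Gaussian perturbation at every decoding step, with its read-out added back onto the frozen model's hidden state. This is a minimal illustration under assumptions; the class name, update rule, and hyperparameters here are mine, not necessarily what the repo implements.

```python
import numpy as np

class DriftingFastWeightMemory:
    """Hypothetical sketch: an untrained fast-weight memory whose state
    gets a small Gaussian drift each decoding step. The base LM stays
    frozen; only this side-state evolves."""

    def __init__(self, d_model: int, sigma: float = 1e-3,
                 decay: float = 0.99, seed: int = 0):
        self.W = np.zeros((d_model, d_model))  # fast weights, never trained
        self.sigma = sigma   # scale of the Gaussian drift term
        self.decay = decay   # leaky decay keeps the state bounded
        self.rng = np.random.default_rng(seed)

    def step(self, h: np.ndarray) -> np.ndarray:
        # Hebbian-style fast-weight update from the current hidden state...
        self.W = self.decay * self.W + np.outer(h, h) / len(h)
        # ...plus the small Gaussian drift.
        self.W += self.rng.normal(0.0, self.sigma, size=self.W.shape)
        # Read-out mixed back into the hidden state; the frozen model's
        # own weights are untouched.
        return h + self.W @ h
```

In actual use, `h` would be a hidden state pulled from a frozen Llama-3.1-8B layer (e.g. via a forward hook) at each generation step; nothing here requires gradients or weight updates.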

This repo includes:

- A minimal reproducible experiment (single .py)
- A simple wrapper for inference-only usage
- A replication thread for logs/results

Not claiming a breakthrough — just sharing something interesting that didn't behave the way theory predicts.

Repo:
https://github.com/chazciii/rd-net

If you try it on other model families (Qwen, Mistral, Phi) or quantized formats (GPTQ, GGUF, etc.), please share your results.