Built a full cognitive architecture on a CPU-only mini tower (no GPU): 50-thread stress test passed, here’s the stack
Posted by Interesting_Time6301@reddit | hardware | 3 comments
Hardware first because that’s what matters here:
Primary dev machine: 2012 Dell Inspiron. Migrated to a CPU-only OmniSlim mini tower mid-build. No GPU. No cloud compute during development. No institutional resources.
LLM routing stack:
• Gemini 2.0 Flash — primary
• Groq — first fallback
• Ollama qwen3:4b — offline fallback, CPU-bound
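The routing order above can be sketched as a try-in-order loop. This is only an illustration of the fallback idea, not the code from the repo: `route`, the provider tuples, and the three stub functions are hypothetical names, and the stubs stand in for real Gemini, Groq, and Ollama clients.

```python
from typing import Callable, Sequence

# A provider is (name, callable). The callable takes (prompt, timeout_seconds)
# and returns the completion text, or raises on failure. Hypothetical shape.
Provider = tuple[str, Callable[[str, float], str]]


def route(prompt: str, providers: Sequence[Provider], timeout: float) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, reply) from the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt, timeout)
        except Exception as exc:  # any failure falls through to the next provider
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


# Stub providers for illustration only (assumed behaviour, not real API calls).
def fake_gemini(prompt: str, timeout: float) -> str:
    raise TimeoutError("simulated rate limit")


def fake_groq(prompt: str, timeout: float) -> str:
    raise ConnectionError("simulated outage")


def fake_ollama(prompt: str, timeout: float) -> str:
    return f"local reply to: {prompt}"


chain = [("gemini", fake_gemini), ("groq", fake_groq), ("ollama", fake_ollama)]
used, reply = route("hello", chain, timeout=60.0)
```

With both cloud stubs failing, `used` ends up as `"ollama"`, which mirrors the offline-fallback case.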
Ollama hits 500%+ CPU under load on the OmniSlim. The dynamic timeout formula scales with payload: max(60, 20 + (chars/1000 × 25)) seconds. Ablation mode locks at 150s fixed.
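For anyone who wants to plug numbers into the timeout formula, here is a minimal sketch. The function name `dynamic_timeout` and the `ablation` flag are my own labels for illustration; only the formula and the 150s fixed value come from the post.

```python
def dynamic_timeout(chars: int, ablation: bool = False) -> float:
    """Timeout in seconds for a payload of `chars` characters."""
    if ablation:
        return 150.0  # ablation mode: fixed timeout
    # 20s base plus 25s per 1000 characters, never below 60s
    return max(60.0, 20.0 + (chars / 1000) * 25.0)


short = dynamic_timeout(500)     # 20 + 12.5 = 32.5, so the 60s floor applies
long = dynamic_timeout(4000)     # 20 + 100 = 120
fixed = dynamic_timeout(4000, ablation=True)
```

The 60s floor stops mattering at 1,600 characters (20 + 1.6 × 25 = 60); past that point the timeout grows linearly with payload size.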
Stress test results on CPU-only hardware:
• 5 threads / 20 requests: 100% success, avg 21.10s
• 50 threads / 50 requests: 100% success, avg 107.94s, P95 143.13s
• Bimodal distribution reveals API batch boundary at ~39 concurrent requests
• Sub-linear scaling: 10x concurrency → 5.1x latency increase
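The sub-linear scaling figure follows directly from the two averages listed above. A quick check, using only the numbers from the post (the variable names are mine, and note the two runs used different request counts, 20 vs 50):

```python
# Averages reported in the stress test results
low_threads, low_avg_s = 5, 21.10
high_threads, high_avg_s = 50, 107.94

concurrency_ratio = high_threads / low_threads  # how much more concurrency
latency_ratio = high_avg_s / low_avg_s          # how much slower on average

# Sub-linear means latency grew by less than concurrency did
is_sublinear = latency_ratio < concurrency_ratio
print(f"{concurrency_ratio:.0f}x concurrency -> {latency_ratio:.1f}x latency")
```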
Memory retrieval uses salience-weighted time-decay instead of cosine similarity: MPS = exp(-t/τ) × reinforcement × contextual × extra. Ablation confirmed 14.8% more context per prompt than cosine-only RAG. On CPU hardware that also means 45.4% lower latency; cosine RAG was slower because it bloated the prompt.
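A rough sketch of how the MPS score could be computed and used for ranking. Treat everything beyond the formula itself as an assumption: the post does not give units for t, a value for τ, or the ranges of the three multipliers, so the seconds-based age, `tau=3600`, and the example memories below are made up for illustration.

```python
import math


def mps(age: float, tau: float, reinforcement: float = 1.0,
        contextual: float = 1.0, extra: float = 1.0) -> float:
    """MPS = exp(-t/tau) * reinforcement * contextual * extra."""
    return math.exp(-age / tau) * reinforcement * contextual * extra


# Hypothetical memories: (label, age in seconds, reinforcement weight)
memories = [
    ("old but reinforced", 7200.0, 3.0),
    ("fresh", 60.0, 1.0),
]

TAU = 3600.0  # assumed decay constant (one hour), not from the post

# Highest score first
ranked = sorted(memories, key=lambda m: mps(m[1], TAU, reinforcement=m[2]), reverse=True)
```

With these made-up values the fresh memory still outranks the older one despite the 3x reinforcement, since exp(-2) × 3 is about 0.41 while exp(-60/3600) is about 0.98. A larger τ would shift that balance toward reinforced memories.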
18,471 lines of Python, 55 modules, 199/202 tests passing. ChromaDB for hybrid retrieval, SQLite for state persistence, FastAPI for the API layer.
Full paper: https://zenodo.org/records/20350249
Code: https://github.com/timeless-hayoka/infj-bot
Happy to talk Ollama optimization, the fallback chain architecture, or CPU-bound inference strategies.
hardware-ModTeam@reddit
Thank you for your submission! Unfortunately, your submission has been removed for the following reason:
100GHz@reddit
"fully cognitive"
Do people even look at AI output lately, or is it like "use big words and then git push for the publish"?
Interesting_Time6301@reddit (OP)
Check it out brudah