Replaced $40/month in AI API subscriptions with self-hosted Ollama + n8n

quickbitesdev@discuss.tchncs.de · 2 months ago

Replaced $40/month in AI API subscriptions with self-hosted Ollama + n8n

ℍ𝕂-𝟞𝟝@sopuli.xyz · 2 months ago

I actually did an experiment on doing just that. For context, I’m an experienced software engineer, whose company buys him a tom of Claude usage so I had time to test out what it can actually do and I feel like I’m capable of judging where it’s good and where it falls short at.

How Claude Code works is that there are actually multiple models involved, one for doign the coding, one “reasoning” model to keep the chain of thought and the context going, and a bunch of small specialized ones for odd jobs around the thing.

The thing that doesn’t work yet is that the big reasoning model has to still be big, otherwise it will hallucinate frequently enough to break the workflow. If you could get one of the big models to run locally, you’d be there. However, with recent advances in quantization and MoE models, it’s actually getting nearer fast enough that I would expect it to be generally available in a year or two.

Today the best I could do was a tool that could take 150 gigs of RAM, 24 gigs of VRAM and AMD’s top of the line card to take 30 minutes what takes Claude Code 1-2. But surprisingly, the output of the model was not bad at all.

sobchak@programming.dev · 2 months ago

You really only need a little more RAM than your GPU’s VRAM (unless you’re doing CPU offloading, which is extremely slow). Otherwise, I did the same thing recently too, and was surprised I was able to get it a Qwen 9B to fix a bug in a script I had. I think Sonnet would’ve fixed in a lot fewer tries, but the 9B model was eventually able to fix it. I could’ve fixed it myself quicker and cleaner than both, but it was an interesting test.