"Apertus: a fully open, transparent, multilingual language model

On 2 September, EPFL, ETH Zurich and the Swiss National Supercomputing Centre (CSCS) released Apertus, Switzerland’s first large-scale, open, multilingual language model – a milestone for transparency and diversity in generative AI.

Researchers from EPFL, ETH Zurich and CSCS have developed the large language model Apertus – one of the largest open LLMs and a foundational technology on which others can build.

In brief: Researchers at EPFL, ETH Zurich and CSCS have developed Apertus, a fully open large language model (LLM) – one of the largest of its kind. As a foundational technology, Apertus enables innovation and strengthens AI expertise across research, society and industry by allowing others to build upon it. Apertus is currently available through strategic partner Swisscom, the AI platform Hugging Face, and the Public AI network. …

The model is named Apertus – Latin for “open” – highlighting its distinctive feature: the entire development process, including its architecture, model weights, and training data and recipes, is openly accessible and fully documented.

AI researchers, professionals, and experienced enthusiasts can either access the model through the strategic partner Swisscom or download it from Hugging Face – a platform for AI models and applications – and deploy it for their own projects. Apertus is freely available in two sizes – with 8 billion and 70 billion parameters – the smaller model being better suited to individual use. Both models are released under a permissive open-source license, allowing use in education and research as well as broad societal and commercial applications. …
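
As a rough illustration of what “deploy it for their own projects” can look like in practice, the sketch below loads the smaller model with the Hugging Face transformers library. The repository name is an assumption, not a detail from the article – check the model card on Hugging Face for the actual identifier and hardware requirements.

```python
# Minimal sketch: loading the 8B Apertus model via Hugging Face transformers.
# The repository name below is an assumption; check the swiss-ai organisation
# on huggingface.co for the actual model card and identifier.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "swiss-ai/Apertus-8B-Instruct-2509"  # assumed identifier

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
    device_map="auto",           # requires `accelerate`; places layers on GPU/CPU
)

prompt = "Explain in one sentence what Apertus is."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```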

Trained on 15 trillion tokens across more than 1,000 languages – 40% of the data is non-English – Apertus includes many languages that have so far been underrepresented in LLMs, such as Swiss German, Romansh, and many others. …

Furthermore, for people outside of Switzerland, the Public AI Inference Utility will make Apertus accessible as part of a global movement for public AI. “Currently, Apertus is the leading public AI model: a model built by public institutions, for the public interest. It is our best proof yet that AI can be a form of public infrastructure like highways, water, or electricity,” says Joshua Tan, Lead Maintainer of the Public AI Inference Utility."

  • partofthevoice@lemmy.zip · 14 hours ago

    That’s news to me, unless you’re only referring to the smaller models. Any chance you can run a model that exceeds your RAM capacity yet?

    • Cethin@lemmy.zip · 10 hours ago

      This is probably the easiest tool I’ve used to run them: https://lmstudio.ai/

      There are tons of models available here, some of them fairly large: https://huggingface.co/

      No, I’m pretty sure there’s no way to run anything larger than your RAM/VRAM, at least not automatically. You can use storage as virtual memory (swap), but that’s probably not a good idea – it’s orders of magnitude slower. You’re better off running a smaller model.
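
      If you go the command-line route instead of LM Studio, the usual compromise is to split a quantized model between VRAM and system RAM. A minimal sketch with llama-cpp-python (the GGUF file name is a placeholder, and n_gpu_layers depends on how much VRAM you have):

      ```python
      # Sketch: splitting a quantized GGUF model between GPU VRAM and system RAM
      # with llama-cpp-python (pip install llama-cpp-python).
      # "apertus-8b-q4_k_m.gguf" is a placeholder file name.
      from llama_cpp import Llama

      llm = Llama(
          model_path="apertus-8b-q4_k_m.gguf",  # quantized model file on disk
          n_gpu_layers=20,  # layers kept in VRAM; the rest stay in system RAM
          n_ctx=4096,       # context window size
      )

      out = llm("Q: What is an open-weight model?\nA:", max_tokens=64)
      print(out["choices"][0]["text"])
      ```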

      • partofthevoice@lemmy.zip · 10 hours ago

        I’m not knowledgeable in this area, but I wish there were a way to partition the model and stream the partitions through memory, allowing some kind of serial processing of models that exceed it. Like if I could allocate 32 GB of RAM and process a 500 GB model, but at a roughly 15x slower rate (500/32 ≈ 15.6).
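
        Something close to this already exists, as far as I know: Hugging Face’s accelerate library can spill layers that don’t fit in RAM/VRAM to disk and stream them back in during the forward pass (“disk offload”), at exactly the kind of heavy speed penalty discussed in the replies. A sketch (the model ID and offload directory are assumptions):

        ```python
        # Sketch: disk offload with transformers + accelerate. Layers that do not
        # fit in GPU/CPU memory are kept on disk and streamed in during generation.
        # The model ID and offload directory are assumptions.
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        model_id = "swiss-ai/Apertus-70B-Instruct-2509"  # assumed identifier
        model = AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.bfloat16,
            device_map="auto",          # fill GPU first, then CPU RAM, then spill to disk
            offload_folder="offload",   # where offloaded layer weights are stored
        )
        tokenizer = AutoTokenizer.from_pretrained(model_id)

        inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
        print(tokenizer.decode(model.generate(**inputs, max_new_tokens=16)[0]))
        ```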

        • m532@lemmygrad.ml · 5 hours ago

          It would need to load every part of the model from disk into RAM for every token it generates. This would take ages.

          What you can do, however, is quantize the model. If you quantize a 16-bit model into 4-bit, for example, its storage and RAM requirements drop to about 1/4. The calculations can still be done in 16-bit, but the weights lose some accuracy.
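
          A quick arithmetic check on that: a 70B-parameter model at 16 bits is roughly 140 GB of weights, and at 4 bits roughly 35 GB (plus some overhead). A minimal sketch of 4-bit loading with transformers and bitsandbytes, keeping 16-bit compute as described above (the model ID is an assumption):

          ```python
          # Sketch: 4-bit weight quantization with bitsandbytes, 16-bit compute.
          # Weights take roughly 1/4 of their 16-bit size; activations stay in fp16.
          # The model ID is an assumption.
          import torch
          from transformers import AutoModelForCausalLM, BitsAndBytesConfig

          bnb_config = BitsAndBytesConfig(
              load_in_4bit=True,                     # store weights in 4-bit
              bnb_4bit_compute_dtype=torch.float16,  # do the matmuls in 16-bit
          )

          model = AutoModelForCausalLM.from_pretrained(
              "swiss-ai/Apertus-8B-Instruct-2509",   # assumed identifier
              quantization_config=bnb_config,
              device_map="auto",
          )
          ```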

        • Cethin@lemmy.zip · 6 hours ago

          The way that could be done would be significantly worse than 15x slower. That’s the issue. Even with the fastest storage, moving things between RAM and storage creates massive bottlenecks.

          There are ways to reduce this overhead by intelligently timing when pieces move between storage and RAM, but storage is slow. I don’t know the models well enough to say whether you can predict what will be needed soon, so you can start moving it into RAM before it’s needed. If that can be done it wouldn’t be impossibly bad, but if it can’t then we’re talking something like 100x slower, maybe. Most of these models are already pretty slow on consumer hardware, so that’d be effectively unusable – you’d be waiting hours for responses.
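
          A rough back-of-envelope calculation (with assumed round numbers) shows why: a dense model has to read essentially all of its weights once per generated token, so the token rate is capped by storage bandwidth.

          ```python
          # Back-of-envelope: time to read all weights once per generated token.
          # Bandwidth figures are assumed round numbers, best-case sequential reads.
          weights_gb = 40        # e.g. a ~70B model quantized to ~4 bits
          ram_bw_gb_s = 80       # dual-channel DDR5 system RAM
          nvme_bw_gb_s = 5       # fast consumer NVMe SSD

          print(f"from RAM: ~{weights_gb / ram_bw_gb_s:.2f} s per token")
          print(f"from SSD: ~{weights_gb / nvme_bw_gb_s:.1f} s per token")
          # ~0.5 s/token from RAM vs ~8 s/token from SSD even in this best case;
          # real access patterns and software overhead make the gap much larger.
          ```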