How much RAM do you have?
You can run GLM 4.5 Air pretty easily and quickly with 10GB of VRAM. You can squeeze in full GLM 4.5 (~350B parameters), just barely, if you have 128GB of DDR5, which puts you very close to cloud LLMs (albeit at like 6 tokens/sec).
You also need to hook them up to a ‘research’ front end so they can actually access information. It’s all really finicky to set up TBH, and a simple ‘ollama run’ will get you absolutely terrible results.
64 GB DDR4 3600.
Haven’t tried GLM, have only tried various Llama versions. Results were OK for some use cases, atrocious for others.
6 tokens/second sounds pretty slow.
6-7 tokens/s streamed is fine. That’s basically reading speed on newer tokenizers: at roughly 3/4 of a word per token, it works out to around 270-300 words per minute.
That’s for the full 350B squeeze-in though. GLM 4.5 Air is still way better than Llama, especially the Llama 8B you were probably running. Try following this guide with the IQ3_KS or IQ4_KSS quant: https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
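If it helps, grabbing the quant looks roughly like this (the --include pattern is my guess at how the IQ4_KSS files are named on that repo, so double-check the file list on the model card first):

    # needs the Hugging Face CLI: pip install -U "huggingface_hub[cli]"
    huggingface-cli download ubergarm/GLM-4.5-Air-GGUF \
        --include "*IQ4_KSS*" --local-dir ./GLM-4.5-Air-GGUF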
Thanks! Will give it a go.
I initially went with the local approach, but I sort of gave up on it due to subpar results and because I have no plans to upgrade my 3080 10GB in the next ~2 years.
Oh a 3080, perfect.
The way it works with ik_llama.cpp is that the ‘always run’ parts of the LLM live and run on your 3080, while the sparse, compute-light parts (the FFN MoE expert weights) stay in RAM and run on the CPU. For preprocessing big prompts it utilizes your 3080 fully even though the model is like 45-55GB, and for the actual token generation they ‘alternate’: your 3080 and CPU bounce between being loaded. Hence having a compute-heavy 3080 is good, and 10GB is the perfect size to hold the dense parts and KV cache for GLM 4.5 Air.
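Concretely, that split is done with a tensor override: push all layers to the GPU, then pin the MoE expert tensors back to CPU. Here’s a rough sketch of the sort of launch command the guide walks you through; the model path is a placeholder and the context/thread numbers are my guesses for a 3080 + 64GB box, so treat it as a starting point rather than the exact recipe:

    # from ik_llama.cpp's build/bin after building with CUDA
    # -ngl 99 offloads every layer to the GPU; -ot "exps=CPU" then overrides the
    # MoE expert tensors (names containing "exps") back to system RAM
    ./llama-server -m /path/to/GLM-4.5-Air-IQ4_KSS.gguf \
        -ngl 99 -ot "exps=CPU" \
        -fa -c 32768 -t 8 \
        --host 127.0.0.1 --port 8080
    # then point any OpenAI-compatible front end at http://127.0.0.1:8080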
ik_llama.cpp also uses special quantization formats (like the IQ4_KSS above) that are better compressed than your typical GGUFs.
What I’m getting at is that running a new 110-billion-parameter model vs an older 8B one is like night and day, but you really do need a specialized software framework to pull it off.
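For reference, cooking one of those quants yourself looks something like this. I’m assuming the usual llama-quantize tool name and a pre-made imatrix here, and the exact invocation can differ between forks, so check --help before trusting it:

    # hypothetical paths: you need a full-precision GGUF of the model plus an
    # importance matrix (imatrix) computed on some calibration text
    ./llama-quantize --imatrix imatrix.dat \
        GLM-4.5-Air-BF16.gguf GLM-4.5-Air-IQ4_KSS.gguf IQ4_KSS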
Also, you can test GLM 4.5 Air here for free to kinda assess what you’re shooting for: https://chat.z.ai/
The quantized version you run will be slightly dumber, but not by much.
Pretty sure this sort of hybrid approach wasn’t widely available the last time I was testing local LLMs (8-12 months ago).
Will need to do a test run. Cheers!
It was, it’s just improved massively, and this specific library/fork is very obscure, heh. Good luck!
It’s also very poorly documented, so feel free to poke me if you run into something. I can even make and upload a more optimal quantization for you since I’ve already set that up for myself, anyway.