NPUs are basically useless for LLMs because hardly any software supports them. They also can't allocate much memory, and they don't handle the exotic quantization schemes modern runtimes use very well.
And speed-wise, they're limited by the slow memory buses they're attached to.
Even on Apple, where there is a little support for running LLMs on the NPU, everyone just does the compute on the GPU anyway because it's so much faster and more flexible.
This MIGHT change if BitNet LLMs take off, or if Intel/AMD start regularly shipping quad-channel designs.
Yes, I was reading through the documentation of some of the tools I use and I noticed minimal info about NPU support.
Will take a look at BitNet (if my tools support it), curious how it would compare to Llama, which seems decent for my use cases.
BitNet is mostly theoretical for now, and unsupported by NPUs anyway.
Basically they are useless for large models :P
The iGPUs on the newest AMD/Intel chips are OK for hosting models up to around 14B, though. Maybe 32B with the right BIOS settings, if you don't mind very slow output.
If I were you, on a 3080, as long as you keep desktop VRAM usage VERY minimal, I would run TabbyAPI with a 4bpw exl2 quantization of Qwen 2.5 14B (coder, instruct, or an RP finetune… pick your flavor). I'd recommend this one in particular:
https://huggingface.co/bartowski/SuperNova-Medius-exl2/tree/4_25
Run it with Q6 cache and set the context to around 16K, or whatever you can fit in your VRAM.
I guarantee this will blow away whatever Llama (8B) setup you have.
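Once TabbyAPI is up with that quant, it exposes an OpenAI-compatible endpoint, so a quick smoke test from Python looks roughly like this. The port, API key, and model name below are just assumptions; use whatever your TabbyAPI config actually says.

```python
# Minimal smoke test against TabbyAPI's OpenAI-compatible endpoint.
# Assumes the server is on localhost:5000 (the usual default) and that an
# API key was set in the TabbyAPI config; adjust both to your setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:5000/v1", api_key="your-tabby-api-key")

resp = client.chat.completions.create(
    model="SuperNova-Medius-exl2",  # whatever name the server reports for the loaded model
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python one-liner to reverse a string."},
    ],
    max_tokens=256,
    temperature=0.7,
)
print(resp.choices[0].message.content)
```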
Cheers, will give it a go. I want to move away from cloud LLMs.
Pick up a 3090 if you can!
Then you can combine it with your 3080 and squeeze Qwen 72B in, and straight up beat GPT-4 in some use cases.
Also, TabbyAPI can be tricky to set up; ping me if you need help.
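If you ever do go the two-card route, here's a minimal sketch of what the split looks like in Python with exllamav2 (the engine TabbyAPI runs on). The model path is made up and the calls are from memory of the repo's example scripts, so double-check against those before copying anything.

```python
# Rough sketch: loading a big exl2 quant across two GPUs with exllamav2.
# Model path is a hypothetical placeholder; autosplit spreads the weights
# across whatever GPUs are visible (e.g. a 3090 + 3080).
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

config = ExLlamaV2Config("/models/Qwen2.5-72B-Instruct-exl2-4bpw")  # placeholder path
config.max_seq_len = 16384

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)   # lazy cache so autosplit can place layers
model.load_autosplit(cache)                # spreads weights across both cards
tokenizer = ExLlamaV2Tokenizer(config)

generator = ExLlamaV2DynamicGenerator(model=model, cache=cache, tokenizer=tokenizer)
print(generator.generate(prompt="Summarize what exllamav2 does.", max_new_tokens=64))
```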
Not planning on getting a new/additional GPU at this point. My local LLM project is more of a curiosity; I'm more knowledgeable on the AI upscaling side. :)
Thanks for the offer, will consider it!
Last time bothering you! I used to be really into the GAN space myself, but the newer diffusion models really blow them away. Check this out: https://github.com/mit-han-lab/nunchaku
This can squeeze Flux 1D onto your 3080, and (with the right pipeline/settings) it should blow anything else away at “enhancing” a low res image with img2img. It should also work with batching and torch.compile so you can get quite a lot of throughput from your 3080. Of course, there’s no temporal consistency yet (or it may be, it’s hard to keep up with all the adapter releases), but I’m sure its coming… And you can kinda hack some in with 2D models anyway.
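If you want to try the img2img "enhance" trick, here's a rough sketch using diffusers' Flux img2img pipeline. The nunchaku 4-bit transformer is the part you'd lift from that repo's README (I've left it as a commented placeholder rather than guess their class names), and the strength/step values are just starting points, not tuned settings.

```python
# Rough img2img "enhance" sketch with diffusers' Flux img2img pipeline.
# On a 3080 you would swap in the nunchaku 4-bit transformer (see the repo
# above for the exact class/checkpoint names); this plain bf16 load just
# shows the pipeline shape and won't fit in 10 GB on its own.
import torch
from diffusers import FluxImg2ImgPipeline
from diffusers.utils import load_image

pipe = FluxImg2ImgPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")
# pipe.transformer = <nunchaku 4-bit Flux transformer>   # placeholder, see repo README
# pipe.transformer = torch.compile(pipe.transformer)     # optional, helps batched throughput

low_res = load_image("input_lowres.png").resize((1024, 1024))

result = pipe(
    prompt="a sharp, highly detailed photograph",  # describe the content to guide added detail
    image=low_res,
    strength=0.3,              # low strength preserves the original composition
    guidance_scale=3.5,
    num_inference_steps=28,
).images[0]
result.save("enhanced.png")
```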