Hey guys,
What’s currently the best LLM for a low-VRAM machine with only 6 GB of VRAM? I’ve got 32 GB of RAM as well.
I’m experimenting a little with SillyTavern and I’m curious which model gets the most out of my setup. It should be multilingual and suitable for “casual chatting”.
I know I will probably not get very far with this, but I’m still interested in how far we’ve already come.
(Using KoboldCPP if that matters).
~sp3ctre


There have been a few videos on YouTube lately about using a particular Qwen model that lets you load only some of the expert sections onto the GPU at a time and keep the rest in RAM. This one was the first I watched (https://www.youtube.com/watch?v=8F_5pdcD3HY). I haven’t tried it, but it makes sense why it would work.
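
To get a rough feel for whether that kind of split could fit in 6 GB VRAM plus 32 GB RAM, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: the function name, the parameter counts (30B total, 28B of it in expert weights), and the 4.5 bits per weight are made-up example values, not figures from the video or from any specific Qwen release. It only counts weights and ignores the KV cache, context buffers, and runtime overhead, so real usage will be higher.

```python
def split_memory_gb(total_params_b, expert_params_b, bits_per_weight,
                    expert_fraction_on_gpu=0.0):
    """Estimate weight memory (in GB, 1 GB = 1e9 bytes) on GPU and in RAM.

    total_params_b         -- total parameters, in billions (hypothetical)
    expert_params_b        -- parameters inside expert layers, in billions
    bits_per_weight        -- average bits per weight after quantization
    expert_fraction_on_gpu -- share of expert weights kept on the GPU (0..1)
    """
    bytes_per_param = bits_per_weight / 8
    shared_params_b = total_params_b - expert_params_b
    # Non-expert (shared) weights go to the GPU, plus the chosen share of experts.
    gpu_params_b = shared_params_b + expert_params_b * expert_fraction_on_gpu
    # The remaining expert weights stay in system RAM.
    cpu_params_b = expert_params_b * (1.0 - expert_fraction_on_gpu)
    return gpu_params_b * bytes_per_param, cpu_params_b * bytes_per_param


if __name__ == "__main__":
    # Hypothetical example values, not measurements.
    gpu_gb, cpu_gb = split_memory_gb(30.0, 28.0, 4.5)
    print(f"GPU weights: {gpu_gb:.2f} GB, RAM weights: {cpu_gb:.2f} GB")
    print("fits 6 GB VRAM:", gpu_gb <= 6.0, "| fits 32 GB RAM:", cpu_gb <= 32.0)
```

Raising `expert_fraction_on_gpu` moves more expert weights into VRAM, which is the knob you would tune until the GPU side stops fitting.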