Your best local LLM for low-VRAM (6GB)?

sp3ctre@feddit.org · 28 days ago

PetteriPano@lemmy.world · 28 days ago

I’m running gemma-4-e4b on my 8GB machine. I’ll drop down to e2b on CPU. It’s probably the best you’ll get. 140 languages, vision, decent at agentic work. Not great at code.

kata1yst@sh.itjust.works · 28 days ago

I mean, do you need it to be fast? You could probably run a pretty decent 20b model if you are okay with the speed of offloading.

sp3ctre@feddit.org · 28 days ago

Doesn’t necessarily need to be very fast, but I don’t plan to wait a minute for one simple sentence either :)

Is that possible without tinkering too much?

Multiplexer@discuss.tchncs.de · edit-2 28 days ago

I have a Qwen3.6-35b-a3b model running on a dated desktop machine with 4GB VRAM.
I use 8-bit-quant, but also have 48GB normal RAM.
Delivers ~7tk/s, which is already totally usable for most things.
Tried it on my recent Core-i7 company laptop with 8GB VRAM and got 20tk/s.
Oh, and I am also using KoboldCPP (on a Linux foundation).
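
A rough back-of-the-envelope check of why this setup works: weight memory is roughly parameter count times bits per weight. The sketch below is only an estimate. It ignores KV cache and runtime overhead, and `weight_gb` is a made-up helper name, not part of any tool mentioned here.

```python
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9


total = weight_gb(35, 8)   # all 35B parameters at 8 bits
active = weight_gb(3, 8)   # the ~3B parameters active per token (the "a3b" part)

print(f"all weights:      ~{total:.0f} GB")
print(f"active per token: ~{active:.0f} GB")
print("fits in 4 GB VRAM:", total <= 4)
print("fits in 48 GB RAM:", total <= 48)
```

The full model is far too big for the GPU but fits in system RAM, and only a small slice of it is touched per token, which is why the speed stays usable.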

sp3ctre@feddit.org · edit-2 28 days ago

I’ll try my luck and download Qwen3.6-35B-A3B-GGUF. Thanks!

Rhaedas@fedia.io · 28 days ago

There have been a few videos on YouTube lately discussing a particular Qwen model that lets you load only particular expert sections at a time onto the GPU and keep the rest in RAM. This one was the first I watched (https://www.youtube.com/watch?v=8F_5pdcD3HY). I haven’t tried it, but it makes sense why it would work.
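
For reference, one way this kind of split is exposed is llama.cpp's `--n-cpu-moe` option, which keeps the expert weights of the first N layers on the CPU while everything else goes to the GPU. This may not be the exact method from the video. The sketch below only builds the command line; the model path and the value 30 are placeholders to tune for your own hardware.

```python
import shlex


def llama_server_cmd(model_path: str, n_cpu_moe: int, ctx: int = 8192) -> list[str]:
    """Build a llama-server command line that keeps some expert tensors in RAM."""
    return [
        "llama-server",
        "-m", model_path,
        "-ngl", "99",                   # offload all layers to the GPU ...
        "--n-cpu-moe", str(n_cpu_moe),  # ... except the experts of the first N layers
        "-c", str(ctx),                 # context size in tokens
    ]


# "model.gguf" and 30 are hypothetical values, not a recommendation.
cmd = llama_server_cmd("model.gguf", n_cpu_moe=30)
print(shlex.join(cmd))
```

Lowering the `--n-cpu-moe` value moves more experts onto the GPU, so the usual approach is to reduce it until VRAM is nearly full.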

lime!@feddit.nu · 28 days ago

with a 20b model on weak hardware you’ll be waiting more like 10 minutes, unless the os clobbers your process first for using too much memory.

Denixen@feddit.nu · edit-2 28 days ago

My setup is a laptop with 8 GB VRAM and 16 GB RAM.

I have been using ministral 3b (fast) and 14b (slower, but somewhat smarter and more capable) via Ollama. They work remarkably well considering how small they are.

I have been using them for text translation, summarization, and as an assistant for discussing more basic things, including as a coding assistant in PyCharm via the ollama assist plugin.

For autocomplete in PyCharm I have to use llama 3.1 8b, since ministral cannot do autocomplete (?).

I can recommend ministral. Mistral are really great at creating small distilled models that pack a lot of bang for their parameter count.
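
If you want to script the summarizer use case instead of going through a plugin, Ollama serves a local HTTP API (by default on port 11434). A minimal sketch using only the standard library follows. The function names and the prompt wording are made up for illustration, and the model tag you pass in has to match whatever `ollama list` shows on your machine.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_request(model: str, text: str) -> urllib.request.Request:
    """Prepare a non-streaming summarization request for a local Ollama server."""
    payload = {
        "model": model,
        "prompt": f"Summarize the following text in two sentences:\n\n{text}",
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def summarize(model: str, text: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_request(model, text)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `summarize("<your model tag>", some_text)` returns the summary as a string, provided the Ollama server is running and the model has been pulled.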

biggerbogboy@sh.itjust.works · 28 days ago

On my MacBook Air M2, I’m currently using Qwen 3.5 4B with 8-bit quantisation. Even at its maximum context length, with multiple web search RAGs, and with the model being built for vision and reasoning, it only ever hits 4.3 GB of memory tops.

I run it through LM Studio, so paired with the fact it’s a Mac, your mileage may vary in terms of how much memory it uses. In my experience its output quality is a bit above ChatGPT 4o, and it is really solid for research purposes if that’s what you’re looking for.

bigbangdangler@reddthat.com · 28 days ago

I have been using qwen-3.5-9b as a general purpose LLM that I can still load up while gaming (on a 16GB card). I never have issues if I’m under 10ish gigs of VRAM usage for the game, so I imagine it should work for your use case.

I’ve been generally happy with the results on everyday reasoning tasks and programming questions.

SuspiciousCarrot78@aussie.zone · edit-2 27 days ago

There are many excellent options - far too many to list. So I will briefly say - there are some really nice 4B models (like Qwen3-4B HIVEMIND, Nanbeige, IBM Granite 3B) which you should be able to run at higher quants (Q6 and up) quite nicely. Of course, there are always newer models (Gemma, Qwen3.6 - soon 3.7) etc.

Best bet is to poke around Hugging Face, in TheBloke’s, Unsloth’s or DavidAU’s archives, and see what they have in the 3-7B range that tickles your fancy. Don’t immediately jump for the newest releases - the old ones are still good. Qwen3-4B 2507 instruct is still a favourite of mine and more recently Qwen3.5-2B shows promise.

venusaur@lemmy.world · 28 days ago

I’m running Qwen 3.5 4B Q4 on 16GB RAM. Yup, no VRAM haha. 5-6tk/s. Llama.cpp
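
For a sense of what CPU-only speed to expect: token generation is usually limited by memory bandwidth, since roughly every weight has to be read once per generated token. The sketch below gives a ceiling under that assumption. The 25 GB/s bandwidth and 4.5 bits per weight are assumed numbers, not measurements from this machine, and `max_tokens_per_second` is a made-up helper.

```python
def max_tokens_per_second(params_billion: float, bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if every weight is read once per token."""
    model_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / model_gb


# 4B parameters at ~4.5 bits/weight; 25 GB/s is an assumed RAM bandwidth.
ceiling = max_tokens_per_second(4, 4.5, 25.0)
print(f"theoretical ceiling: ~{ceiling:.1f} tk/s")
```

Real-world numbers land well below the ceiling because of compute limits, context processing, and other overhead, so 5-6 tk/s is in a plausible range for this kind of setup.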

locuester@lemmy.zip · 28 days ago

Check out Unsloth Studio. It runs Qwen models locally, exposing an endpoint, and I’ve had great success with it.

robber@lemmy.ml · 21 days ago

Late to the party, but this was just released: LiquidAI/LFM2.5-8B-A1B-GGUF

I guess you could fully fit it at Q4 with a little context if you need all the speed you can get, or offload the experts to RAM if you prefer higher quality and/or more context.
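
To put rough numbers on that trade-off for a 6 GB card, here is a small sketch comparing quant levels against the VRAM budget. The bits-per-weight figures are approximate values for common GGUF quant types, the 8B size is just the nominal number from the model name, and `headroom_gb` is a hypothetical helper, so treat the output as a ballpark only.

```python
VRAM_GB = 6.0
PARAMS_BILLION = 8.0  # nominal size taken from the model name

# Approximate bits per weight for common GGUF quant types (assumed values).
QUANTS = {"Q4_K_M": 4.85, "Q6_K": 6.56, "Q8_0": 8.5}


def headroom_gb(bits_per_weight: float) -> float:
    """VRAM left for context and buffers after loading all weights."""
    return VRAM_GB - PARAMS_BILLION * bits_per_weight / 8


for name, bpw in QUANTS.items():
    left = headroom_gb(bpw)
    verdict = "fits" if left > 0 else "needs offloading"
    print(f"{name}: {left:+.2f} GB headroom -> {verdict}")
```

Under these assumptions only the Q4 variant fits entirely, with about a gigabyte left for context. The higher quants would need the experts moved to RAM.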