

Thank you so much!! I have been putting it off because what I have works, but a time will soon come when I’ll want to test new models.
I’m looking for a server, but not many parallel calls, because I would like to use as much context as I can. When making space for e.g. 4 parallel slots, the context gets split between them and each slot only sees a quarter of it. With Llama 3.1 8B I managed to fit a 47104-token context on the 16GB card (though actually using that much is pretty slow). That’s with the KV cache quantized to 8-bit too. But sometimes I just need that much.
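For anyone curious, a llama.cpp `llama-server` launch along these lines should match that setup (the model filename is just a placeholder, and flag spellings are from recent llama.cpp builds, so double-check against your version):

```shell
# Sketch of a single-slot llama-server launch with 8-bit KV cache.
# -np 1 gives the whole context to one request; with -np 4, each
# slot would only get 47104/4 ≈ 11776 tokens of context.
./llama-server \
  -m ./llama-3.1-8b-instruct-q4_k_m.gguf \
  -c 47104 \
  -np 1 \
  -fa \
  -ctk q8_0 -ctv q8_0 \
  -ngl 99
```

Note the `-fa` (flash attention) flag: quantizing the V cache (`-ctv`) generally requires it to be enabled, while `-ngl 99` just offloads all layers to the GPU.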
I’ve never tried llama.cpp directly, thanks for the tip!
Kobold sounds good too, but I have some scripts talking to my current server directly. I’ll read up on that to see if it can do that. I don’t have time now but I’ll do it in the coming days. Thank you!