Hey, so first off, this is my first time dabbling with LLMs, and most of the information I have I found myself by rummaging through GitHub repos.

I have a fairly modest setup: an older gaming laptop with an RTX 3060 video card with 6 GB of VRAM. I run everything inside WSL2.

I have had some success running fastchat with the vicuna 7B model, but it’s extremely slow: roughly 1 word every 2-3 seconds of output, and that is with --load-8bit, without which I get a CUDA OOM error. It starts faster, at 1-2 words per second, but slows to a crawl later on (I suspect it’s because it also uses a bit of the ‘Shared video RAM’, according to the task manager).

So I heard about quantization, which is supposed to compress models at the cost of some accuracy. I tried ready-quantized models (compatible with the fastchat implementation) from huggingface.co, but I ran into an issue - whenever I ask something, the output gets repeated over and over. Say I type ‘hello’, and I get 200 ‘Hello!’ in response.

I then tried quantizing a model myself with exllamav2 (using some .parquet wikitext files, also from huggingface.co, for calibration) and running it through fastchat, but the problem persists: endless repeated output. The actual generation is faster, though, so at least that part is going well.
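To show why I suspect the 8-bit model spills over into shared memory, here is a rough back-of-the-envelope sketch. It is my own estimate, not anything from fastchat or exllamav2: it assumes a nominal 7 billion parameters, counts the weights only (no KV cache, activations or CUDA overhead), and the function name is just something I made up for illustration.

```python
def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory needed for the model weights alone, in GiB.

    Assumption: every parameter is stored at the same bit width, and
    nothing else (KV cache, activations, framework overhead) is counted.
    """
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 1024**3


if __name__ == "__main__":
    nominal_params = 7e9  # assumed "7B" parameter count, not measured
    for bits in (16, 8, 4):
        gib = weight_memory_gib(nominal_params, bits)
        print(f"{bits}-bit weights: about {gib:.1f} GiB")
```

By this estimate the 8-bit weights alone come to about 6.5 GiB, which is already more than my 6 GB of VRAM, while 4-bit weights would need roughly half of that. So if I understand it right, a 4-bit quantized model should fit entirely on the card, which would match the faster generation I see with it.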

Any ideas on what I’m doing wrong?