I think it's a good idea to share experiences with LLMs here, since benchmarks can only give a very rough overview of how well a model performs.
So please share how much you're using LLMs, what you use them for, and how well they perform at those tasks. For example, here are my answers to these questions:
Usage
I use LLMs daily for work and for random questions that I would previously use web search for.
I mainly use LLMs for reasoning-heavy tasks, such as assisting with math or programming. Other frequent tasks include proofreading, helping with bureaucracy, or assisting with writing when it matters.
Models
The one I find most impressive at the moment is TheBloke/airoboros-l2-70B-gpt4-1.4.1-GGML/airoboros-l2-70b-gpt4-1.4.1.ggmlv3.q2_K.bin. It often manages to reason correctly on questions where most other models I tried fail, even though most humans wouldn't. I was surprised that something using only 2.5 bits per weight on average could produce anything but garbage. The downside is that loading times are rather long (time to first token is almost 50s!), so I wouldn't ask it a question if I didn't want to wait. I'd love to hear how bigger quantizations or the unquantized versions perform.
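For a rough sense of what 2.5 bits per weight buys you, here is a back-of-the-envelope sketch of the weight size at different precisions. It assumes a nominal 70e9 parameters and takes the 2.5 bits-per-weight figure above at face value. It also ignores the KV cache, activations, and file-format overhead, so actual file sizes and memory use will differ.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: "70B" is treated as exactly 70e9 parameters.
N_PARAMS = 70e9

# Assumption: the bit widths below are nominal averages, not exact
# figures for any particular quantization format.
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4), ("~2.5 bpw", 2.5)]:
    print(f"{label:>9}: {weight_size_gb(N_PARAMS, bits):6.1f} GB")
```

Under these assumptions the 2.5-bit weights come to roughly 22 GB versus 140 GB for fp16, which is the difference between fitting on a single consumer machine and not fitting at all.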
Another one that made a good impression on me is Qwen-7B-Chat (demo). It manages to correctly answer some questions where even some llama2-70b finetunes fail, ~~but so far I'm getting memory leaks when running it on my M1 Mac in fp16 mode, so I didn't use it a lot.~~ (this seems to have been fixed!)
All the other models I briefly tried were not too useful. It's nice to be able to run them locally, but they were so much worse than ChatGPT that it's often not even worth considering them.
By MPS I mean "Metal Performance Shaders", the backend that lets PyTorch use Apple's Metal API for Apple-silicon-specific optimizations. I actually think it's not unlikely that the issue is with PyTorch. The MPS support is still in beta, and there was a bug that caused a lot of models to output gibberish when I used it. This bug was an open issue for a year and they only just fixed it in a recent nightly release, which is why I even bothered to give this model a try.
That being said, I think one should generally be cautious about what they run on their computers, so I appreciate that you started this discussion.
Ah, I see. Wouldn't it be pretty easy to determine if MPS is actually the issue by trying to run the model with the non-MPS PyTorch version? Since it's a 7B model, CPU inference should be reasonably fast. If you still get the memory leak, then you'll know it's not MPS at fault.
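To make that comparison a bit more concrete, here is a minimal sketch of how one might check for a leak: run the same generation several times on each backend and see whether peak memory keeps climbing. The names `watch`, `keeps_growing`, and `run_once` are hypothetical, and `run_once` stands in for whatever code performs one generation. One caveat: peak RSS may not capture everything the Metal driver allocates, so treat the result as a rough indicator only.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of this process in MiB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak / (1024 * 1024) if sys.platform == "darwin" else peak / 1024


def watch(run_once, rounds: int = 5) -> list:
    """Call run_once() repeatedly and record peak memory after each call."""
    samples = []
    for _ in range(rounds):
        run_once()
        samples.append(peak_rss_mb())
    return samples


def keeps_growing(samples, min_step_mb: float = 1.0) -> bool:
    """True if every sample exceeds the previous one by at least min_step_mb."""
    if len(samples) < 2:
        return False
    return all(b - a >= min_step_mb for a, b in zip(samples, samples[1:]))
```

If `keeps_growing(watch(run_once))` is true on MPS but false on CPU for the same prompt, that would point at the backend. If it is true on both, the model code is the more likely suspect.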
Without MPS it uses a lot more memory, because fp16 is not supported on the CPU backend. However, I tried it and noticed that an update had been pushed to the repository that split the model into several parts. It seems like I'm not getting any memory leaks now, even with MPS as the backend. Not sure why, but maybe it needs less RAM if the weights can be converted part by part. Time to test this model more, I guess!