• 1 Post
  • 115 Comments
Joined 8 months ago
Cake day: March 22nd, 2024



  • My level of worry hasn’t lowered in years…

    But honestly? Low on the totem pole. Even with Trumpy governments.

    Things like engagement optimized social media warping people’s minds for profit, the internet outside of apps dying before our eyes, Sam Altman/OpenAI trying to squelch open source generative models so we’re dependent on their Earth burning plans, blatant, open collusion with the govt, everything turning into echo chambers… There are just too many disasters for me to even worry about the government spying on me.

    If I lived in China or Russia, the story would be different. I know, I know. But even now, I’m confident I can give the U.S. president the middle finger in my own country, whereas I’d be genuinely scared for my life in a more authoritarian strongman regime.








  • It’s unbelievably time-inefficient for… anything.

    And it’s incredibly engaging. I burnt through so much time shooting the breeze in hopes of actually finding something interesting, wading through notification spam, checking channels… It’s why I deleted it from everywhere. And that left a gaping hole in my life, because it’s the only place some niche communities exist now.


  • It turns out the popular alternative is “force you to sign up (with a phone number) via critical mass/FOMO, track the snot out of you, then slide ads in later.” Oh, and the stuff you want is siloed away until you join, and buried under a mountain of rambling and engagement-optimization junk.

    Note that I’m largely talking about Discord, which is unfortunately where many of my interests have been shunted off to. People talk about Facebook, Google and OpenAI eating the internet, but I feel like Discord is the quiet trojan horse.





  • Basically the only thing that matters for LLM hosting is VRAM capacity. Hence AMD GPUs can be OK for LLM running, especially if a used 3090/P40 isn’t an option for you. It works fine, and the 7900/6700 are like the only sanely priced 24GB/16GB cards out there.

    I have a 3090, and it’s still a giant pain with Wayland, so much so that I use my AMD iGPU for display output, and Nvidia still somehow breaks things. Hence I just do all my gaming in Windows, TBH.

    CPU doesn’t matter for LLM running; cheap out with a 12600K, 5600, 5700X3D or whatever. And the single-CCD X3D chips are still king for gaming, AFAIK.
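
    To put rough numbers on “VRAM capacity is what matters”: a minimal sketch, using approximate bits-per-weight for common quant formats (ballpark figures, ignoring context and runtime overhead):

    ```python
    # Rough VRAM needed just to hold the weights of a quantized LLM.
    # Bits-per-weight values are approximate averages, not exact.
    QUANT_BPW = {"fp16": 16.0, "q8_0": 8.5, "q5_k_m": 5.7, "q4_k_m": 4.8}

    def weight_vram_gb(params_billion: float, quant: str) -> float:
        """Approximate GB required for the weights alone."""
        bits = params_billion * 1e9 * QUANT_BPW[quant]
        return bits / 8 / 1e9

    for quant in ("fp16", "q8_0", "q4_k_m"):
        print(f"34B @ {quant}: ~{weight_vram_gb(34, quant):.1f} GB")
    # ~20 GB at ~4.8 bits/weight: a 34B model only fits a 24 GB card
    # (3090 / 7900 XTX class), with a little room left over for context.
    ```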


  • To go into more detail:

    • Exllama is faster than llama.cpp, all other things being equal.

    • exllama’s quantized KV cache implementation is also far superior: nearly lossless at Q4, while llama.cpp’s is close to unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality)

    • With ollama specifically, you get locked out of a lot of knobs: the enhanced llama.cpp KV cache quantization above, more advanced quantization (like iMatrix IQ quantizations or the ARM/AVX-optimized Q4_0_4_4/Q4_0_8_8 quantizations), advanced sampling like DRY, batched inference, and so on (there’s a concrete example of the KV cache flags at the end of this comment).

    It’s not evidence or options… it’s missing features; that’s my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than out of tabbyAPI/EXUI on the same hardware, and there’s no way around it.

    Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how Llama 3.1 (for instance) was bugged past 8K context at launch because llama.cpp didn’t properly support its RoPE scaling. Ollama inherits all these quirks.

    I don’t want to go into the issues I have with the ollama devs’ behavior, though, as that’s way more subjective.
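
    To make the “locked out of knobs” point concrete: if you run llama.cpp’s own server instead of ollama, the K/V cache types mentioned above are plain command-line flags. A minimal sketch, assuming a recent llama.cpp build with llama-server on your PATH and a placeholder GGUF path:

    ```python
    import subprocess

    # Launch llama.cpp's OpenAI-compatible server with an asymmetric quantized
    # KV cache (Q8_0 keys / Q4_1 values, one of the combos mentioned above).
    # A quantized V cache needs flash attention, hence -fa.
    subprocess.run([
        "llama-server",
        "-m", "models/my-model.gguf",   # placeholder model path
        "-c", "32768",                  # context length
        "-ngl", "99",                   # offload all layers to the GPU
        "-fa",                          # flash attention
        "--cache-type-k", "q8_0",
        "--cache-type-v", "q4_1",
    ])
    ```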


  • It’s less optimal.

    On a 3090, I simply can’t run Command-R or Qwen 2.5 34B well at 64K-80K context with ollama. It’s slow even at lower context, and the lack of DRY sampling and some other things majorly hit quality.

    Ollama is meant to be turnkey, and that’s fine, but LLMs are extremely resource intensive. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

    Even on CPU-only setups, you’re missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, the more advanced sampling kobold.cpp offers, more fine-grained tuning of flash attention configs, and batched inference, just to start.

    And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.
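
    As a concrete example of the kind of tuning ollama hides: tabbyAPI, llama.cpp’s server and kobold.cpp all expose an OpenAI-style HTTP API, and the backend-specific sampler knobs ride along as extra JSON fields. A minimal sketch; the port, model name and DRY parameter names are assumptions (they vary by backend, so check your server’s docs):

    ```python
    import json
    import urllib.request

    URL = "http://127.0.0.1:5000/v1/completions"  # placeholder local endpoint

    payload = {
        "model": "local-model",  # placeholder name
        "prompt": "Write a long story about a lighthouse keeper.",
        "max_tokens": 512,
        "temperature": 0.8,
        # Backend-specific sampler fields (names assumed, not guaranteed):
        # DRY penalizes verbatim repetition in long-form generation.
        "dry_multiplier": 0.8,
        "dry_base": 1.75,
        "dry_allowed_length": 2,
    }

    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["text"])
    ```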




  • Your post is suggesting that the same models with the same parameters generate different results when run on different backends.

    Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

    There are even more exotic weight quantization schemes (AQLM, VPTQ) that are much more VRAM-efficient than llama.cpp or exllama quants, but I skipped mentioning them (unless someone asked) because they’re so clunky to set up.

    Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation, or grammar-constrained sampling.
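
    Since the KV cache is where a lot of that backend-to-backend difference lives, here’s a rough sketch of how its size scales with context length and cache precision. It’s just standard attention bookkeeping; the layer/head numbers are placeholders, not any specific model’s config:

    ```python
    def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                    context_len: int, bits_per_value: float) -> float:
        """Approximate KV cache size in GB (keys + values, all layers)."""
        elements = 2 * n_layers * n_kv_heads * head_dim * context_len  # K and V
        return elements * bits_per_value / 8 / 1e9

    # Placeholder 30B-class config with grouped-query attention.
    layers, kv_heads, head_dim = 60, 8, 128
    for ctx in (8_192, 65_536):
        fp16 = kv_cache_gb(layers, kv_heads, head_dim, ctx, 16)
        q4 = kv_cache_gb(layers, kv_heads, head_dim, ctx, 4.5)
        print(f"{ctx:>6} ctx: fp16 cache ~{fp16:.1f} GB, ~4-bit cache ~{q4:.1f} GB")
    ```

    With those placeholder dimensions, the fp16 cache alone eats a double-digit chunk of a 24 GB card at 64K context, which is why cache quantization quality matters so much for long-context use.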