• sunzu2@thebrainbin.org · 18 points · 17 hours ago

    I use LLMs, locally.

    They are kinda useful but I don’t understand how they are driving all this DC/hardware investments.

    How are they planning on recouping the money?

    LLMs appear to be stalling. Sure, they keep beating benchmarks, but from an end-user perspective they do the same thing, with the same major flaws: you need to know the subject matter, or if you are learning you still need source material. Otherwise they will tell you anything you want to hear, in a confident manner.

    • brucethemoose@lemmy.world · 7 points · 17 hours ago (edited)

      Yeah.

      What on Earth are all these companies doing with racks and racks of hardware? Are they somehow constantly finetuning models for their workflows… for what?


      The Chinese are pretraining, finetuning, and publishing with (comparative) peanuts for hardware. Some random company’s local pretrains are not going to beat the leading edge of open source.

      And finetuning should usually be a one-off cloud run, not something companies hoard hardware for.

      • ByteOnBikes@discuss.online · 3 points · 17 hours ago

        I thought it was just my company that made some major investments in server hardware.

        They put it under the guise of LLMs. But I’m also wondering if it’s because of the US government’s overreach in the past year. CEOs who own Cloud infrastructure have continued to kiss Trump’s ass, and on-prem is now the only thing we can trust.

        • brucethemoose@lemmy.world · 3 points · 17 hours ago (edited)

          LLM inference requires very specific servers that aren’t good for much else (in terms of what companies usually do), though. And they go ‘obsolete’ even more quickly.

          I guess what I’m saying is the premise would be pretty flimsy for a more general upgrade.

    • Alphane Moon@lemmy.world (OP) · 4 points · 17 hours ago

      I’ve played around with local LLMs, but for both work tasks and casual tasks (movies, games, random quick calculations), I find cloud LLMs to be significantly better (I have a 10GB VRAM GPU).

      It does seem like they’ve hit a point of diminishing returns, but that doesn’t seem to have impacted companies’ ability to spend.

      • brucethemoose@lemmy.world · 1 point · 17 hours ago (edited)

        How much RAM do you have?

        You can run GLM 4.5 Air pretty easily and quickly with 10GB. You can squeeze in full GLM 4.5 (like 350B parameters), just barely, if you have 128GB of DDR5, which puts you very close to cloud LLMs (albeit at like 6 tokens/sec).
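
        For a rough sense of the memory math behind “just barely”, here is a back-of-envelope sketch (the parameter counts and bits-per-weight figures are my own rough assumptions, not numbers from this thread):

        ```python
        # Back-of-envelope size estimate for quantized model weights.
        # All numbers are illustrative assumptions, not measurements.

        def quantized_size_gb(params_billions: float, bits_per_weight: float) -> float:
            """Approximate in-memory size of the weights alone, in decimal GB."""
            return params_billions * 1e9 * bits_per_weight / 8 / 1e9

        for name, params, bpw in [
            ("GLM 4.5 Air (~106B) at ~4 bits/weight", 106, 4.0),
            ("Full GLM 4.5 (~350B) at ~3 bits/weight", 350, 3.0),
            ("Full GLM 4.5 (~350B) at ~4 bits/weight", 350, 4.0),
        ]:
            print(f"{name}: ~{quantized_size_gb(params, bpw):.0f} GB of weights")

        # At ~3 bits/weight the full model is roughly 130 GB of weights before
        # KV cache and overhead; with ~10 GB of that offloaded to the GPU, the
        # rest only just fits in 128 GB of system RAM.
        ```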

        You also need to hook them up to a ‘research’ front end so they can actually access information. It’s all really finicky to set up, TBH, and a simple ‘ollama run’ will get you absolutely terrible results.
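
        For what the front-end hookup looks like in practice, here is a minimal sketch. It assumes a llama.cpp-style server (ik_llama.cpp included) is already running on localhost:8080 with its OpenAI-compatible API enabled; the model name, port, and pasted-in notes are placeholders, and the actual search/retrieval tooling is the part you still have to wire up yourself:

        ```python
        # Minimal client for a locally hosted, OpenAI-compatible LLM server.
        # Assumes `pip install openai` and a llama-server-style endpoint on port 8080.
        from openai import OpenAI

        client = OpenAI(
            base_url="http://localhost:8080/v1",  # local server, not a cloud API
            api_key="not-needed-locally",         # local servers generally ignore the key
        )

        # The 'research' part: paste material fetched by your own search/RAG tooling
        # into the context instead of trusting the model's memory.
        retrieved_notes = "...text gathered by your search tooling goes here..."

        response = client.chat.completions.create(
            model="glm-4.5-air",  # whatever model name the local server exposes
            messages=[
                {"role": "system",
                 "content": "Answer only from the provided notes; say so if they do not cover the question."},
                {"role": "user",
                 "content": f"Notes:\n{retrieved_notes}\n\nQuestion: summarize the key claims."},
            ],
        )
        print(response.choices[0].message.content)
        ```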

        • Alphane Moon@lemmy.world (OP) · 2 points · 17 hours ago

          64 GB DDR4 3600.

          Haven’t tried GLM, have only tried various Llama versions. Results were OK for some use cases, atrocious for others.

          6 tokens/second sounds pretty slow.

            • Alphane Moon@lemmy.world (OP) · 2 points · 16 hours ago

              Thanks! Will give it a go.

              I initially went with the local approach, but I sort of gave up on it due to subpar results and because I have no plans to upgrade my 3080 10GB in the next ~2 years.

              • brucethemoose@lemmy.world · 1 point · 16 hours ago (edited)

                Oh a 3080, perfect.

                The way it works with ik_llama.cpp is that the ‘always run’ parts of the LLM live and run on your 3080, while the sparse, compute-light parts (the FFN MoE weights) stay in RAM and run on the CPU. For preprocessing big prompts it utilizes your 3080 fully, even though the model is like 45-55GB, and for the actual token generation the two ‘alternate’, with your 3080 and CPU bouncing between being loaded. Hence having a compute-heavy card like the 3080 is good, and 10GB is the perfect size to hold the dense parts and KV cache for GLM 4.5 Air.

                ik_llama.cpp also uses a special quantization format that’s better compressed than your typical GGUFs.

                What I’m getting at is that running a new 110-billion-parameter model vs an older 8B one is like night and day, but you really do need a specialized software framework to pull it off.
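
                To make that split concrete, here is a rough sketch of what launching such a hybrid setup can look like. The binary name, model filename, and exact flag spellings are assumptions that vary between llama.cpp and ik_llama.cpp builds, so treat it as the shape of the command rather than a copy-paste recipe:

                ```python
                # Sketch: launch a llama.cpp-family server with MoE expert weights kept in RAM.
                # Paths, flag spellings, and values are illustrative and version-dependent.
                import subprocess

                cmd = [
                    "./llama-server",                # or the equivalent ik_llama.cpp binary
                    "-m", "glm-4.5-air-quant.gguf",  # hypothetical quantized model file
                    "-ngl", "99",                    # offload all layers to the GPU...
                    "-ot", "exps=CPU",               # ...but override MoE expert tensors to stay in RAM
                    "-c", "16384",                   # context length (the KV cache sits in VRAM)
                    "--threads", "16",               # CPU threads that churn through the expert weights
                    "--port", "8080",
                ]
                subprocess.run(cmd, check=True)
                ```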


                Also, you can test GLM 4.5 Air here for free to kinda assess what you’re shooting for: https://chat.z.ai/

                The quantized version you run will be slightly dumber, but not by much.

                • Alphane Moon@lemmy.world (OP) · 2 points · 16 hours ago

                  Pretty sure this sort of hybrid approach wasn’t widely available the last time I was testing local LLMs (8-12 months ago).

                  Will need to do a test run. Cheers!

  • Kyrgizion@lemmy.world · 8 points · 18 hours ago (edited)

    I upgraded my PC this spring, for the first time in nearly a decade, because I felt in my bones it would be the last time most components would be “affordable”. It may very well actually be.

  • Onno (VK6FLAB)@lemmy.radio · 8 points · 19 hours ago

    TrendForce says lead times for high-capacity “nearline” hard drives have ballooned to over 52 weeks — more than a full year.

    • riquisimo@lemmy.dbzer0.com · 5 points · 18 hours ago

      If you’re unfamiliar with the term, “nearline” refers to storage that is not quite online, yet also not quite offline. It’s “warm” data, information that needs to be available for ready access, but doesn’t have to be as quick or responsive as the SSDs that serve as primary online storage for essentially all systems now. Because it isn’t constantly being accessed, hard drives can fill this role in an economical fashion.
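
      Purely as an illustration of that hot/warm/cold split (the tier names and thresholds below are made-up examples, not anything from the article):

      ```python
      # Toy age-based tiering rule: hot data stays on online SSDs, warm data moves
      # to nearline hard drives, cold data goes offline. Thresholds are arbitrary.
      from datetime import datetime, timedelta

      def pick_tier(last_access: datetime, now: datetime | None = None) -> str:
          age = (now or datetime.now()) - last_access
          if age < timedelta(days=7):
              return "online (SSD)"         # accessed constantly, latency matters
          if age < timedelta(days=365):
              return "nearline (HDD)"       # must stay readily accessible, cheap per TB
          return "offline (tape/archive)"   # retrieval delay is acceptable

      print(pick_tier(datetime.now() - timedelta(days=90)))  # -> nearline (HDD)
      ```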