• mindbleach@sh.itjust.works

    There are other ways it might work, like if a compression method is discovered that cuts the necessary RAM and compute by 2-3 orders of magnitude. Models considered very large today (100-300 billion params at full quality) might then run effectively on a single 32GB GPU that costs a few thousand dollars.
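    For a rough sense of the gap, here's a back-of-the-envelope sketch. The parameter counts and bit widths are just illustrative assumptions, and it only counts weight storage (no KV cache, activations, or runtime overhead):

    ```python
    # Back-of-the-envelope: VRAM needed just to hold the weights.
    # Parameter counts and bit widths below are illustrative assumptions.

    def weights_vram_gb(params_billion: float, bits_per_param: float) -> float:
        """Decimal GB to store the weights alone (ignores KV cache,
        activations, and framework overhead)."""
        total_bytes = params_billion * 1e9 * bits_per_param / 8
        return total_bytes / 1e9

    for params in (100, 300):
        for bits in (16, 8, 4, 1):
            gb = weights_vram_gb(params, bits)
            print(f"{params}B params @ {bits}-bit: {gb:,.1f} GB")
    ```

    Even at 1 bit per parameter, a 300B model's weights alone are ~37.5 GB, so fitting one on a 32GB card really would take compression gains well beyond today's quantization, which is roughly the multi-order-of-magnitude gap described above.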

    You might want to check in on how well distilled/quantized models are doing compared to gigundo datacenter versions.