

It's a link to my blog post:
https://agamsingh9.codeberg.page/posts/ai-web-scrapers/
Is the link not working?




No way they put PragerU propaganda videos convincing people that Israel is not an apartheid state.


I was almost done writing my blog post addressing the same issue! I’ll give this a read to make sure I don’t re-say anything you did. The static site generator I use (Hugo) also looks very similar to yours lol.


Wish they’d move to the Fediverse instead of people uninstalling TikTok to install some other app that could potentially change its algorithm or funding at any time.


The EU should follow economic non-cooperation against the US and start moving towards making China a partner. They should get used to the NWO.
Gave them money as in a donation or paying for the app? Haven’t used Grayjay, but I’ve paid for Immich and FUTO keyboard at least. They were pretty clear in the apps that it costs money and isn’t just a donation.
Regardless of whether you like a company or not, I don’t think you should steal from them. Either pay for the software because you think it’s good software or move to another app.
Just my 2¢.


MAGA: Trump kidnapping Maduro was justified because he was acting like a dictator and was a net negative to the people of Venezuela!
MAGA after another world leader kidnaps Trump for doing the same thing: 😮


I agree from a moral standpoint, but unfortunately this does not hold up legally. Even for licenses like RAIL that specifically try to make AI outputs count as derivative works, I couldn’t find any case of one holding up in a US court. The best course of action might just be to add bot filtering to whatever Git instance you host your copylefted works on until this issue has a legal solution. I’m curious about the FSF’s stance on AI output counting as derived works and whether they’d ever consider a GPLv4 or a new license that explicitly targets AI. Couldn’t find anything online.


God, this post makes me so mad.
I understand that not everyone has the privilege to distribute knowledge for social good. I’m in a privileged position–my day job provides more than enough money for a dignified life, so my own code I release is almost always strong-copylefted and for genuine social good rather than survival.
Seeing so many posts thinking a proper “solution” to web scraping for AI training is closing off knowledge by default worries me. Gatekeeping code/art/knowledge shrinks the commons that made all of this possible. Nobody owes us attention, brand recognition, or monetization. Free Open Source Software exists to protect society’s freedom to study, modify, and share the tools it depends on for social good, not for monetization or attention.
I noticed OP used Micro$oft’s GitHub, notorious for mass AI crawling. You can’t rely on THE worst platform for scraping and then complain about it. Host using Forgejo or similar, and use solutions that don’t restrict user freedoms: bot filtering, rate limits, pay-per-crawl, etc.
I think the root problem is that in capitalism, markets often don’t sustainably fund public goods–but that’s a political problem–not something individual maintainers should solve by privatizing knowledge. Continue to vote for and spread leftist ideas of restructuring society to encourage funding of public goods like Free Open Source Software rather than giving up and abandoning your FOSS values.


Technically the act of incorporating code into a model’s weights does not trigger GPL’s redistribution clause, so they are legally in the right even though morally you shouldn’t scrape copylefted code into a model that can be used to create non-copylefted code.
A few years? I thought alpha would be out this year


I like the little badge I get for being a donor on Signal. Other than that, my default is $5/mo or whatever the site’s recommended amount is.


I tried DeepSeek R1 7B on my MacBook M3 Pro but it is shit compared to ChatGPT unfortunately
Lemmy’s algorithm is based as hell though, since it purely shows you what is popular and doesn’t come from a profit-driven source. Big tech’s closed source algorithms are designed around profit incentives because of capitalism. When people talk about “The algorithm”, they usually mean the latter. There’s nothing wrong with getting your news from Lemmy.


Well, Reddit is useful. I used it before I was 16 for sure. There are useful subreddits like r/SAT.


iot bad c/StallmanWasRight


Is this hosted somewhere? Maybe distributed? I would love a privacy respecting distributed LLM chatbot.
I am seeing from these comments that my proposed solution was pretty naïve. I intended this blog post to be a sort of thought experiment to challenge assumptions made about the web pre-AI, rather than for the technical solution I thought up to be the main focus.
I might go back and re-write some of this post to shift the focus more towards my main points: the social contract between bot and human shifting (especially with copyleft/ShareAlike), the web becoming less “open”, how this is not a new idea since the DMCA already considers standards for automated access, etc.