AI’s Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

yoasif@fedia.io · 2 months ago

BackgrndNoize@lemmy.world · 2 months ago

So the issue is that AI strips the provenance of open source contributors, and the output it produces from that data isn’t subject to the open source licensing that applies to the original projects. The AI companies profit from this, but the open source contributors don’t see a dime. Well, that’s kind of always been the case, though: so many amazing open source projects get co-opted by tech giants like Microsoft and repackaged as proprietary software for a profit (embrace, extend, extinguish). Back then they needed a team of developers to do that; now it’s more automated with AI, I guess.

yoasif@fedia.io · 2 months ago

Copyleft software isn’t supposed to just be repackaged as proprietary, though. Permissive licenses, sure, but people (presumably) knew what they were signing up for there.

rizzothesmall@sh.itjust.works · 2 months ago

These guys: AI bad! It takes jobs!
Also these guys: Check out this thumbnail tho!

yoasif@fedia.io · 2 months ago

That’s the TIME magazine cover, buddy.

atzanteol@sh.itjust.works · 2 months ago

“destroy the bargain that made free software spread like wildfire”

If you don’t want your code to be used by others, then don’t make it open source.

yoasif@fedia.io · 2 months ago

Do you understand how free software works? Did you read the post? I’d love to clarify, but I’m not going to rewrite the article.

atzanteol@sh.itjust.works · 2 months ago

Also - this conclusion is ridiculous:

“By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.”

That is absolutely not true. It doesn’t remove the copyright from the original work and no court has ruled as such.

If I wrote a “random code generator” that just happened to produce the source code for Microsoft Windows in its entirety, it wouldn’t strip Microsoft of its copyright.

yoasif@fedia.io · 1 month ago

“That is absolutely not true. It doesn’t remove the copyright from the original work and no court has ruled as such.”

Sorry, I just got around to this message. That is the idea of the provenance – clearly, the canonical work is copyrighted. It is the version that has been stripped of its provenance via the LLM that no longer retains its copyright (because, as I pointed out, LLM outputs cannot be copyrighted).

atzanteol@sh.itjust.works · 1 month ago

That doesn’t make it “no longer copyrighted”, though. The original copyright holder retains their copyright on it. I can’t see any court ruling otherwise.

yoasif@fedia.io · 1 month ago

The output of the LLM can be incorporated into copyrighted material and is itself copyright-free. I never claimed that the copyright on the original work was lost.

atzanteol@sh.itjust.works · 2 months ago

Yes. And this is kinda hand-wavy bullshit.

“By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.”

That’s not how it works. Your code is not “incorporated” into the model in any recognizable form. Training produces a model made of vectors; there isn’t a file in there with your for loop in it.

I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license. So can an LLM.

yoasif@fedia.io · 2 months ago

“I can read your code, learn from it, and create my own code with the knowledge gained from your code without violating an OSS license.”

Why is clean-room design a thing, then?