The New York Times is suing OpenAI and Microsoft for copyright infringement

btp@kbin.social · 10 months ago

CJOtheReal@ani.social · 10 months ago

It’s not piracy to just web-scrape everything for data…

There isn’t a person sitting around and pirating shit; it’s an algorithm that takes everything from the internet it can reach.

HarkMahlberg@kbin.social · edited 10 months ago

Yeah… That’s not a good defense if you think about it. If someone made a Reddit comment with the entire contents of Discworld (idk, just an example), and OpenAI scraped all of Reddit to train their model, well now they’ve used copyrighted material without paying for a commercial license, and now they’re on the hook. By being unscrupulous about their scraping, they actually open themselves up to more liability than if they were more careful about what they scrape and where.

This is all to say nothing of the fact that several other major companies were caught with their pants down, training on databases explicitly created by torrenting a ton of books.

https://torrentfreak.com/authors-accuse-openai-of-using-pirate-sites-to-train-chatgpt-230630/

There is no direct evidence that OpenAI used pirate sites to train ChatGPT. That said, it is no secret that some AI projects have trained on pirated material in the past, as an excellent summary from Search Engine Journal highlights.

The mainstream media has picked up this issue too. The Washington Post previously reported that the “C4 data set,” which Google and Facebook used to train their AI models, included Z-Library and various other pirate sites.

PlasterAnalyst@kbin.social · 10 months ago

If I read an article and then I reference it or summarize it myself, that isn’t copyright infringement. There’s no difference if I have a computer do the work for me. It’s fair use.

CJOtheReal@ani.social · 10 months ago

Everyone accuses OpenAI of everything. In the end, most of what they do will not be illegal. There are loads of reasons, mainly the technical issues involved. You would need a database of every copyrighted work to check anything. The computing power required for this would be absurdly high.

The demands are idiotic and ridiculous.

And as I said, they didn’t “train ChatGPT on a piracy site”; the scraping algorithm put some stuff from there in the training data. There is no person doing that.

HarkMahlberg@kbin.social · 10 months ago

There is no person doing that.

“No one’s responsible, the DAO did it. No humans are liable, just this amorphous, sentient carbon cloud.”

I’ve heard many defenses of AI, some of which I agree with, but “strip mining content off the internet is fine because it’s automated” is easily one of the weakest. It doesn’t pass the sniff test.

If you write a script that downloads every single image from every single website, no questions asked, and then reuploads them to various websites at random, do you suppose the police shouldn’t charge you with (inevitably) possessing and distributing CSAM? “Oh no officer, your true culprit is the Dell in my living room! Arrest that box!”

Everyone is, on some level, responsible for the things they create.

EvilMonkeySlayer@kbin.social · 10 months ago

And as I said, they didn’t “train ChatGPT on a piracy site”; the scraping algorithm put some stuff from there in the training data. There is no person doing that.

“Your honour, the program that I created to slurp up data from the internet, using my paid-for internet connection, into the AI model that I trained, own and control happened to slurp up copyrighted data… I um, it’s not my fault it slurped up copyrighted data, even though I put no checks in place for what it was slurping up or from where.”

That is the argument you are putting forth.

Do you think any judge/court of law would view that favourably?

lemonflavoured@kbin.social · 10 months ago

It’s not piracy to just web-scrape everything for data…

Yes it is.

CJOtheReal@ani.social · 10 months ago

No. It’s publicly available; piracy would be using stuff that isn’t publicly available.

lemonflavoured@kbin.social · 10 months ago

Publicly available =/= public domain.