Training "AI" On Public Data Is Totally Fine And Not Stealing.

31337@sh.itjust.works · 3 months ago

Melllvar@startrek.website · 3 months ago

The output of an LLM is analogous to re-saving an image as a low-res JPEG. Data is being processed and altered using statistics, but nothing “new” is being created, only lower-quality derivatives. That’s why you can’t train an LLM on the output of an LLM.

31337@sh.itjust.works · 3 months ago

This is actually a decent argument, but there has to be a threshold. For instance, if I take the average of all the RGB values in an image and distribute a single pixel of that average color, is that breaking copyright or somehow immoral?
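To make the thought experiment concrete, here's a rough sketch of that "one average pixel" reduction. The function name `average_color` and the tiny four-pixel image are made up for illustration; a real image would be loaded with an imaging library.

```python
def average_color(pixels):
    """Return the mean (R, G, B) over an iterable of RGB tuples."""
    pixels = list(pixels)
    if not pixels:
        raise ValueError("image has no pixels")
    n = len(pixels)
    # Average each channel separately and round to the nearest integer.
    return tuple(round(sum(p[c] for p in pixels) / n) for c in range(3))


# Hypothetical 2x2 "image", flattened into a list of RGB tuples.
image = [(200, 0, 0), (0, 200, 0), (0, 0, 200), (200, 200, 200)]

# The entire image collapses to a single derived pixel.
one_pixel = average_color(image)
print(one_pixel)  # (100, 100, 100)
```

Many very different images map to the same average, so nothing about the original can be recovered from that one pixel. That's the extreme end of the "lossy derivative" spectrum.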

I recently looked into the speculated model size and speculated training-set size of GPT and Stable Diffusion, and it does appear that if you thought of them as compression algorithms, they’d only be doing something like 1:7 compression. Ratios like that aren’t outlandish for lossy compression.

Compression and redistribution aren’t the (stated) goal of these models. Hypothetically, these models are learning patterns and associations, like styles and how humans write text. And they appear to do things a little beyond just copying and pasting. So, hypothetically, most of the model size could consist of learned styles and human preferences, rather than just a compressed database of the images it was trained on. I guess the real test is trying to prompt the models to reproduce an item from their training set, and evaluating how similar the output is.
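For text, one crude way to run that test might look like the sketch below. It uses Python's standard `difflib` to score how much of a known training item shows up in a model's output. The function name `reproduction_score` and the example strings are hypothetical, and character-level matching is just one possible similarity measure, not an established memorization benchmark.

```python
from difflib import SequenceMatcher


def reproduction_score(training_item: str, model_output: str) -> dict:
    """Compare a model output against a known training item."""
    matcher = SequenceMatcher(None, training_item, model_output, autojunk=False)
    # Longest run of characters that appears verbatim in both strings.
    longest = matcher.find_longest_match(
        0, len(training_item), 0, len(model_output)
    )
    return {
        # 0.0 means nothing in common, 1.0 means identical strings.
        "similarity": matcher.ratio(),
        "longest_copied_span": training_item[longest.a:longest.a + longest.size],
    }


# Hypothetical strings standing in for a training item and a model's output.
result = reproduction_score("the quick brown fox", "a quick brown dog")
print(result["similarity"], repr(result["longest_copied_span"]))
```

A high similarity or a long copied span would point toward memorization; low values would fit the "learned patterns" reading better. For images you'd need a different metric entirely.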