Technology


Reddit Will License Its Data to Train LLMs, So We Made a Firefox Extension That Lets You Replace Your Comments With Any (Non-Copyrighted) Text (theluddite.org)

submitted 8 months ago by [email protected] to c/[email protected]

66 comments

I know there are other ways of accomplishing that, but this might be a convenient way of doing it. I'm wondering, though, whether Reddit is still reverting these changes.


[–] [email protected] 39 points 8 months ago (5 children)

Reddit is almost certainly going to hand over your original comments even if you edit them. We're pretty fucked. And if you think Lemmy is any different, guess again. We agreed to send our comments to everyone else in the fediverse, and between plenty of bad actors and a legal minefield, LLM companies can essentially do what they want. The good news is that LLMs are all crap, and people are slowly realising this.

[–] [email protected] 51 points 8 months ago (1 children)

And if you think Lemmy is any different, guess again

Lemmy is different, in that the data is not being sold to anyone. Instead, the data is available to anyone.

It's kind of like open source software. Nobody can buy it, because it's open and free to be used by anyone. Nobody profits off of it more than anyone else, and nobody has an advantage over anyone else.

Open source levels the playing field by making useful code available to everyone. You can think of comments and posts on the Fediverse in the same way - nobody can buy that data, because it's open and free to be used by anyone. Nobody profits off of it more than anyone else and nobody has an advantage over anyone else (after all, everyone has access to the same data).

The only question is whether you're okay with your data being out there and available in this way... but if you're not, you probably shouldn't be on the internet at all.

[–] [email protected] 4 points 8 months ago (1 children)

If the post is creative then it's automatically copyrighted in many countries. That doesn't stop people collecting it and using it to train ML (yet).

[–] asret 2 points 8 months ago

[–] [email protected] 8 points 8 months ago* (last edited 8 months ago)

LLMs are all crap, and people are slowly realising this

LLMs have already changed the tech space more than anything else in at least the last 10 years. I get what you're trying to say, but that opinion will age like milk.

Edit: made wording clearer

[–] [email protected] 6 points 8 months ago

LLMs are great for anything you’d trust to an 8-year-old savant.

They’re great for getting quick snippets of code in languages and methods that have great documentation. I don’t think I’d trust them for real work, though.

[–] [email protected] 6 points 8 months ago

I've been harping on about this for a while on the fediverse ... private/closed/non-open spaces really ought to be thought about more. Fortunately, the Lemmy core devs are implementing local-only and private communities (local-only is already done, IIRC).

Yes, they introduce their own problems with discovery, gating, etc. But now that the internet's "you're the product" stakes have gone beyond what could have been construed as a reasonable transaction, "my attention on an ad ... for a service", to "my mind's products to be aggregated into an energy-sucking, job-replacing AI ... for a service" ... well, it's time to normalise closing that door on opportunistic tech capitalists.

[–] [email protected] 2 points 8 months ago

They'll use old comments either way: using an up-to-date dataset means using a dataset already tainted by LLM-generated content. Training a model on its own output is not great.

Incidentally, this also makes Lemmy data less valuable: most of Lemmy's popularity came after the rise of LLMs, so there's no significant untainted data from before LLMs.