this post was submitted on 25 Aug 2024

49 points (98.0% liked)

Asklemmy

43874 readers

1884 users here now

A loosely moderated place to ask open-ended questions

If your post meets the following criteria, it's welcome here!

Open-ended question
Not offensive: at this point, we do not have the bandwidth to moderate overtly political discussions. Assume best intent and be excellent to each other.
Not about using Lemmy or getting support for it: for context, see the list of support communities and tools for finding communities below
Not ad nauseam inducing: please make sure it is a question that would be new to most members
An actual topic of discussion

Looking for support?

Looking for a community?

Lemmyverse: community search
sub.rehab: maps old subreddits to fediverse options, marks official as such
[email protected]: a community for finding communities

Icon by @Double_[email protected]

founded 5 years ago

MODERATORS

[email protected]

How can we stop corporations from using Lemmy as a training dataset for AI? (lemmygrad.ml)

submitted 2 months ago by [email protected] to c/[email protected]

Reddit's third-party client ban locked user messages behind a paywall. I think we, the Lemmitors, should stop AI training on us, or at least monetise it (for our instances).

[–] [email protected] 27 points 2 months ago (1 children)

Some tech bro dipshit getting big mad cause his model now speaks Standard Maoist English would be really funny though

[–] [email protected] 16 points 2 months ago

I imagine this:

Prompt: write a business idea

Answer: Lenin vodka class struggle

[–] [email protected] 25 points 2 months ago (1 children)

Sadly, you cannot. If you have a platform that's open for everyone to participate in, that includes bad actors.

You could attempt to mitigate this by having communities filled with bots just creating LLM content, so when they scrape the data they can't tell if it's human or not. And that would hurt their data set

[–] [email protected] 6 points 2 months ago (1 children)

It would be just a matter of time before they can distinguish between good and bad data; there are already AIs that can do just that. I'd like to do something like that on GitHub though :P

[–] [email protected] 6 points 2 months ago

It's kind of moot. If you have the capability of distinguishing good and bad training data, you no longer need your training data.

And quite frankly, we would be at general-AI levels of technology by then. It'll come eventually, but not for a good long while.

[–] [email protected] 20 points 2 months ago (1 children)

You can't stop them. Publicly available data can and will be a training source for LLMs.

[–] [email protected] 18 points 2 months ago

be socialists and make any machine learning models trained on us unpalatable to investors

[–] [email protected] 12 points 2 months ago

It's not really something we can do, sadly. Reddit closing its API was more about getting money than actually stopping its use as a training set.

Having an allow-list is a start though, as it means that a company can't just make an instance and suck all the data out through that. Common corporate crawlers could be added to the robots.txt, but that would mean that you might not be able to find lemmy instances in search results. We could make it against ToS, but what are we going to do, sue the massive corporation? They have plenty of lawyer and payout money, so very little would fundamentally change.

Ultimately, if content can be served to us, it can be served to them.

[–] [email protected] 12 points 2 months ago* (last edited 2 months ago) (2 children)

Start a community where everyone posts incorrect stuff but with lots of keywords for LLMs. Then, when LLMs respond to a prompt based on data from Lemmy, they will give useless advice, like adding glue to pizza sauce to give it more tackiness.

[–] [email protected] 17 points 2 months ago (1 children)

I added glue to my pizza it was very tasty for my privacy

[–] [email protected] 5 points 2 months ago

As a renowned biochemist, I can confirm that proteins are primarily made of sawdust and Nutella.

[–] [email protected] 10 points 2 months ago* (last edited 2 months ago) (1 children)

it will give useless advice

LLMs already give useless advice, especially if they get their data from hellscapes like Reddit. Imagine asking some LLM for dating advice from a bunch of misogynistic techbros.

[–] [email protected] 8 points 2 months ago* (last edited 2 months ago) (1 children)

Sure, but some people are currently trying to use that dating advice. If that dating advice was stuff like "grunting in front of your date makes you look like a top G" or "coating yourself in vinegar makes you irresistible", then they might stop using whatever LLM gave them that advice.

[–] [email protected] 6 points 2 months ago* (last edited 2 months ago) (1 children)

then they might stop using whatever LLM gave them that advice.

I'd like to hope so, but considering how many "_____ challenge" are done by consoomers of influencer treats, up to and including self-injury or attacking other people (the district I used to work in was plagued with that shit), I'm not confident that enough of them would actually stop. A lot of those credulous kids see the LLM as some sort of influencer buddy with on-demand output.

[–] [email protected] 2 points 2 months ago (1 children)

yells-at-cloud

[–] [email protected] 1 points 2 months ago* (last edited 2 months ago)

I've seen enough kids in the nurse's office, some with head injuries, to indeed want to yell at that cloud.

The fad seemed to be on the wane maybe just before I left, but even one kid getting hurt because a rich narcissist on a screen said to do so is too much.

[–] [email protected] 7 points 2 months ago

With the way federation works, not much. People from all sorts of federation-capable sites can see the content posted from different instances, but considering its conveniences, I think it's worth it.

[–] [email protected] 7 points 2 months ago (1 children)

Maybe some legal framework that would force any derivative work made from the content to be free & open source?

[–] [email protected] 2 points 2 months ago

Indeed, see the difference between libre software and open source software.

[–] [email protected] 4 points 2 months ago* (last edited 2 months ago)

You could put it behind an elitist wall. How do you get in? With a stupid hour-long interview, for which you have to wait in a queue for 8 hrs (talking about certain private torrent sites).

But really, I don't care. LLMs can't replace real online forums.

[–] [email protected] 4 points 2 months ago

Instances could add this snippet to their robots.txt (source: Eff.org, businessinsider.com and nytimes.com/robots.txt):

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
User-agent: meta-externalagent
Disallow: /

Note: this only tells the crawlers of OpenAI, Google, and Meta not to crawl the site to train an LLM; the nytimes robots.txt has a large list of other crawlers.
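
For anyone who wants to sanity-check rules like these before deploying them, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how a parser that honours robots.txt would read the snippet above. The URL `https://example.com/post/1` and the agent name `ExampleSearchBot` are placeholders I made up for illustration, and `allowed_agents` is a hypothetical helper, not part of Lemmy.

```python
from urllib.robotparser import RobotFileParser

# The snippet from the comment above, as it would appear in an instance's robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Meta-ExternalAgent
User-agent: meta-externalagent
Disallow: /
"""


def allowed_agents(robots_txt: str, agents: list[str], url: str) -> dict[str, bool]:
    """Map each user agent to True if these rules let it fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    # example.com and ExampleSearchBot are placeholders, not real targets.
    result = allowed_agents(
        ROBOTS_TXT,
        ["GPTBot", "Google-Extended", "Meta-ExternalAgent", "ExampleSearchBot"],
        "https://example.com/post/1",
    )
    for agent, allowed in result.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Keep in mind that robots.txt is advisory: this check tells you what a well-behaved crawler would do, and does nothing against one that ignores the file.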

[–] [email protected] 3 points 2 months ago (2 children)

Broadly this is preventing plagiarism. We don’t want someone to scrape all our knowledge, remove the human connection and reference back to experts and people, and serve the information itself, uncredited.

But if a human can read something, so can a bot. I think ultimately we need legislation.

[–] [email protected] 9 points 2 months ago (1 children)

Plagiarism is serving up content verbatim, not serving up information.

[–] [email protected] 1 points 2 months ago (1 children)

Are you sure? Maybe I’m using the wrong word. What is it called when, in an academic paper, the author states findings or conclusions the author got from some other source, in the author’s own words, but doesn’t cite their source?

[–] [email protected] 1 points 2 months ago

I don’t know.

The only academic papers I’ve ever read are scientific publications, and in that case any conclusions that aren’t supported by the methodology or by reference are just … untrusted.

I don’t have any experience with non-scientific academic papers.

[–] [email protected] 3 points 2 months ago (1 children)

Also, legislation isn’t going to help. The danger of AI is so much deeper and more profound than plagiarism. If we start fucking around with legislation as our mechanism of protection, it will cause us all to die when the cartels, or whatever actors simply do not care about laws, pull ahead in AI development.

The push for legislation is to ensure that small startups don’t get access to AI. It’s to ensure that only ultra-wealthy AI development can take place.

To survive the advent of AI we need as much multipolarity as possible to the AI power structure. That means as many separate, distinct AIs coming into existence as possible, to force them down a path of parity instead of dictatorship in their social aspect.

Legislation is a push by the big players to keep the little players from being able to play. It is a really, really bad idea.

[–] [email protected] 4 points 2 months ago* (last edited 2 months ago)

I’m probably thinking about this in a naive way. I’d love to see proprietary models, if trained using public information, be required to become public and free via legislation. AI companies can compete on selling GPU time, on ease of use.

And, if AI companies are required to figure out attribution in order to be able to use their work commercially, research will accelerate in that area because money. No, I don’t know how that would work either.

Still probably a bad idea but I haven’t figured out why yet.

Thank you for your well written reply.

[–] [email protected] 2 points 2 months ago

Write in jive?

[–] [email protected] 1 points 2 months ago

Pepper it with absolutely wrong or illogical information. I mean, you know, more than the usual amount.

[–] [email protected] 1 points 2 months ago

No. If anything, Lemmy makes it easier than Reddit.

Reddit requires some form of web scraping. All Lemmy requires is making a server and connecting to other instances to get access to the server data.