this post was submitted on 17 Feb 2024
267 points (100.0% liked)

[–] [email protected] 2 points 10 months ago (1 children)

Creating a new instance only gets you access to content from communities that users of your instance have subscribed to, and then mostly only content that arrives after the subscription (I believe Lemmy primes the pump a bit on community subs, pulling in a handful of posts at the time of discovery, but discovery is done by users). So there's a limit on what you can scrape with your own private instance, and you're betting on which communities will yield what you're looking for in the future.

It'd be easier and more reliable to just crawl the network and scrape it the old-fashioned way.

[–] [email protected] 1 points 10 months ago

"If you search for a community for the first time, 20 posts are fetched initially. Only if at least one user on your instance subscribes to the remote community will the community send updates to your instance. Updates include:

New posts, comments
Votes
Post, comment edits and deletions
Mod actions"

So you create a single user and subscribe to all communities of interest.
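
For what it's worth, here's a rough sketch of how that single-user subscription could be scripted. It assumes the Lemmy v3 HTTP API as I remember it (`/user/login`, `/resolve_object`, `/community/follow`, with a bearer token as in 0.19+; older versions took an `auth` field instead), so check the paths and field names against your instance's version. `subscribe_all`, `urllib_transport`, and the `send` parameter are names made up for this example.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterable, List, Optional

# Assumed Lemmy HTTP API prefix (v3); verify against your instance's version.
API = "/api/v3"

# A transport takes (method, url, headers, json_body) and returns parsed JSON.
Transport = Callable[[str, str, dict, Optional[dict]], dict]


def urllib_transport(method: str, url: str, headers: dict,
                     body: Optional[dict]) -> dict:
    """Default transport built on the standard library."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json", **headers},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def subscribe_all(base_url: str, username: str, password: str,
                  communities: Iterable[str],
                  send: Transport = urllib_transport) -> List[int]:
    """Log in once, then follow each community given as '!name@host'."""
    # Assumed login endpoint and field names.
    login = send("POST", f"{base_url}{API}/user/login", {},
                 {"username_or_email": username, "password": password})
    headers = {"Authorization": f"Bearer {login['jwt']}"}

    followed = []
    for handle in communities:
        # Resolving the handle is what makes the instance discover the
        # remote community (the "search for the first time" step).
        q = urllib.parse.quote(handle, safe="")
        found = send("GET", f"{base_url}{API}/resolve_object?q={q}",
                     headers, None)
        community_id = found["community"]["community"]["id"]
        # Following is what makes the remote community send updates.
        send("POST", f"{base_url}{API}/community/follow", headers,
             {"community_id": community_id, "follow": True})
        followed.append(community_id)
    return followed
```

You'd call it with something like `subscribe_all("https://your.instance", "scraper", "password", ["!technology@beehaw.org"])`. The transport is a parameter so the logic can be exercised without touching a real server.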

I probably downplayed how difficult setting up a Lemmy instance gets if you do something out of order or don't have the host set up quite right. I do still think it's easier than pigging about with web crawlers, though.