this post was submitted on 18 Nov 2024

56 points (95.2% liked)

Piracy: ꜱᴀɪʟ ᴛʜᴇ ʜɪɢʜ ꜱᴇᴀꜱ

⚓ Dedicated to the discussion of digital piracy, including ethical problems and legal advancements.


How do I use Open Source scrapers? (Selenium, Scrapy, etc.) (lemmy.dbzer0.com)

submitted 1 month ago* (last edited 1 month ago) by [email protected] to c/[email protected]


I have been trying for hours to figure this out. From a building tutorial to just trying to find prebuilt ones, I can't seem to make it click.

For context I am trying to scrape books myself that I can't seem to find elsewhere so I can use and post them for others.

The scraper tutorial

Hackernoon tutorial by Ethan Jarrell

I initially tried to follow this, but I kept getting a "couldn't find module" error. Since I have never touched Python before, I don't know how to fix it, and the help links are not exactly helpful. If someone could guide me through this tutorial, that would be great.

Selenium

Selenium Homepage

I don't really get what this is, but I think it's some sort of Python package. It tells me to install it using the pip command, but that doesn't seem to work (syntax error). I don't know how to add it manually because, again, I have little idea of what I'm doing.

Scrapy

Scrapy Homepage

This one seemed like it'd be an out-of-box deal but not only does it need the pip command to download but it has like 5 other dependencies it needs to function which complicates it more for me.

I am not criticizing these wares, I am just asking for help and if someone could help with the simplification of it all or maybe even point me to an easier method that would be amazing!

Updates

Figured out that I am supposed to run the pip command in the Command Prompt on my computer, not in the Python interpreter: py -m pip followed by the rest of the request (for example, py -m pip install selenium)

Got the Ethan Jarrell tutorial to work and managed to add in selenium, which made me realize that selenium isn't really helpful with the project. rip xP
Spent a bunch of time trying to workshop the basic scraper to work with dynamic sites, unsuccessful
Online self-help doesn't go into as much depth as I would like, probably due to the legal grey area
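A "couldn't find module" error usually means the package was never installed for the interpreter that is running the script. As a rough sketch (the list of module names is only an assumption about what such a scraper imports), the following checks which modules Python can see and builds the matching pip command to type into the command prompt. Note that BeautifulSoup is imported as bs4 but installed as beautifulsoup4.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the names from `names` that this Python cannot find for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def pip_command(package):
    """Build a pip command tied to the interpreter that is running this script."""
    return f'"{sys.executable}" -m pip install {package}'


# Import name -> pip package name. This list is an assumption for illustration.
WANTED = {"requests": "requests", "bs4": "beautifulsoup4", "selenium": "selenium"}

for module in missing_modules(WANTED):
    # Print the command to run in the command prompt, not inside Python.
    print(f"{module} is missing; run: {pip_command(WANTED[module])}")
```

Running this from IDLE and from the command prompt can also reveal whether the two are using different Python installations, since the printed interpreter path would differ.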

top 39 comments


[–] [email protected] 12 points 1 month ago (1 children)

I have quite an extensive history of scraping websites for various data over the years. I'd be happy to help you out, but I can't really know how to help without knowing what website you're trying to scrape. Different sites have their own challenges (maybe the content is behind a login, or it uses JavaScript to load content, in which case a plain HTTP response won't give you what you're after, or any number of other things).

If you give me a link to a book you want to download as an example I can take a look and help guide you through it

[–] [email protected] 6 points 1 month ago

100% this. Every website is different, though after doing this kind of thing for long enough, there are often common patterns and frameworks/libraries. Even general obfuscation can be reasonably reverse engineered with enough time and effort.

[–] [email protected] 11 points 1 month ago (1 children)

Selenium needs a driver and a web browser, which can be run in headless mode. For Chrome, the driver is ChromeDriver. You can get it here.

To make a script for it, I recommend talking to an LLM. I have asked one to build scrapers before, and it does the job.

If you want a practical use of Selenium being demonstrated, you can see it in LucidWebSearch plugin for Oobabooga.

[–] [email protected] 6 points 1 month ago (2 children)

I recommend talking to a LLM

Any recommendations? Not chat-GPT

Also thanks for the help so far!

[–] [email protected] 5 points 1 month ago

Gemini, Perplexity, Poe. Creating a Selenium script isn't that hard for them. You can try running your own model, but it's less likely to produce good results. The best coder LLM I've seen for self-hosting is Yi Coder 9B.

[–] [email protected] 2 points 1 month ago

Claude is really good: https://claude.ai/

[–] [email protected] 9 points 1 month ago (1 children)

The simplification you're looking for doesn't exist. It seems you don't have a programming background. If you really need to scrape something, you need to learn a programming language, HTTP, HTML, and maybe JavaScript. AFAIK, there is no easy way or point-and-click scraper-building tool. You will need to invest time and learn. Don't worry, you should be able to get it done in 2-3 months if you invest the time.

[–] [email protected] 4 points 1 month ago (1 children)

I don't want a point and click scraper, just a guide that isn't assuming I have background + simple mans terms for easier reading. Thanks for believing in me to be able to build the basic skills necessary! Much appreciated :3

[–] [email protected] 5 points 1 month ago* (last edited 1 month ago) (1 children)

I don't have a single guide for you, but I can lay out a road map.

A programming language. I prefer Python.
Basic HTML syntax and CSS selectors
HTTP, specifically methods, status code (no need to memorize all cuz you can go look it up), and cookies
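On looking up status codes rather than memorizing them: Python's standard library already carries the table. A minimal sketch (the helper name describe_status is made up for illustration):

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code with its standard phrase, or mark it as unknown."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        # Not one of the standard codes known to the http module.
        return f"{code} Unknown"
    return f"{status.value} {status.phrase}"


# A few codes that tend to come up when scraping.
for code in (200, 403, 404, 429):
    print(describe_status(code))
```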

Once you have those foundations ready, you can go on and try to build a web scraper. I advise against using Scrapy, not because it is bad but because it is too overwhelming and abstracted for a beginner. I would instead advise you to use requests for HTTP and BeautifulSoup4 for HTML parsing. You will build a more solid foundation and can transition to Scrapy later, when you need its advanced functions.
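To make the fetch-then-parse split concrete, here is a small sketch of the parsing half. It deliberately swaps in the standard library's html.parser so it runs with nothing installed; with the recommended tools, requests.get(url).text would supply the HTML and BeautifulSoup's select("a[href]") would replace the hand-written class. The sample HTML and the example.org URLs are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect (absolute URL, link text) pairs from every <a href="..."> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # href of the <a> tag currently open, if any
        self._text = []    # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links such as /page/1 into full URLs.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Parse an HTML string and return the links found in it."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Stand-in for the text of a downloaded page.
SAMPLE = (
    '<ul><li><a href="/page/1">First</a></li>'
    '<li><a href="https://example.org/page/2">Second <b>page</b></a></li></ul>'
)
for url, text in extract_links(SAMPLE, "https://example.org/"):
    print(url, text)
```

Keeping the download and the parsing in separate functions makes it possible to test the parser on a saved copy of a page before sending any requests.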

When you get stuck, don't be afraid to pause your attempt and read the tutorials again. Head to the Python Community on Discord to get interactive help. We welcome noobs, as we were once noobs too. Just don't ever mention scraping there, as they can't help if they suspect you're trying to do something inappropriate, malicious, or illegal. They are notoriously against yt-dlp, which frustrates me a bit. Phrase it nicely and in a generic way. I will be there occasionally offering help.

[–] [email protected] 2 points 1 month ago

The discord thing is a no-go since I don't really know how to make my issue palatable. That's why I used lemmy. Thanks again!

[–] [email protected] 4 points 1 month ago (2 children)

Selenium is really more of a testing framework for frontend developers, and could theoretically be used for scraping, but that would be somewhat like buying a car based on the paint and not looking in detail under the hood.

I can't say I've ever worked with Scrapy, but the tool I would use for web scraping with Python is BeautifulSoup. This tutorial seems decent enough, but you will need to understand basic web concepts like IDs, classes, tags, and tag attributes to get the most out of it: https://geekpython.medium.com/web-scraping-in-python-using-beautifulsoup-3207c038723b

W3Schools will also be your friend if you have questions about HTML/CSS selectors in general: https://www.w3schools.com/html/default.asp

Understanding regular expressions and/or xpath would also be very helpful, but are probably best considered to be extra credit in most cases.
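As a small taste of where regular expressions help, here is a sketch that pulls a page number out of a URL so links can be put in order. The /page-N.html URL pattern is an assumption made up for this example; a real site will use its own scheme.

```python
import re

# Matches URLs ending in /page-<digits>, with an optional .html suffix.
PAGE_RE = re.compile(r"/page-(\d+)(?:\.html)?$")


def page_number(url):
    """Return the page number embedded in the URL, or None if there is none."""
    match = PAGE_RE.search(url)
    return int(match.group(1)) if match else None


urls = [
    "https://example.org/docs/page-10.html",
    "https://example.org/docs/page-2.html",
    "https://example.org/about",
]

# Keep only the numbered pages and sort them numerically, not alphabetically.
numbered = sorted((u for u in urls if page_number(u) is not None), key=page_number)
print(numbered)
```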

I'll try to respond if you have any issues or questions, but hopefully that gives you enough to get started.

[–] [email protected] 6 points 1 month ago (2 children)

The reason to use Selenium is if the website you want to scrape uses javascript in a way that inhibits getting content without a full browser environment. BeautifulSoup is just a parser, it can't solve that problem.

[–] [email protected] 3 points 1 month ago (1 children)

In my experience, this scenario typically means that there is some sort of API (very likely undocumented) that is being used on the backend. That requires a bit more investigation and testing with browser developer tools, the JS Console, and often trial and error. But once you overcome that (admittedly very complex and technical) hurdle, you can almost always get away with just using the requests library at that point.
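A rough sketch of that approach, assuming the browser's network tab has revealed a JSON endpoint: the URL, the "items" key, and the "title" field below are all hypothetical, and the real response shape has to be read off the site in question. The sketch uses the standard library's urllib so it runs without installs; with the requests library, requests.get(url).json() does the same job as fetch_json here.

```python
import json
import urllib.request


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    request = urllib.request.Request(
        url,
        headers={"Accept": "application/json", "User-Agent": "example-script/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def item_titles(payload):
    """Pull the titles out of a payload shaped like {"items": [{"title": ...}]}."""
    return [item["title"] for item in payload.get("items", []) if "title" in item]


# Stand-in for what a hypothetical endpoint might return, so the parsing
# can be tried without touching the network.
SAMPLE_RESPONSE = '{"items": [{"id": 1, "title": "Alpha"}, {"id": 2, "title": "Beta"}, {"id": 3}]}'
print(item_titles(json.loads(SAMPLE_RESPONSE)))

# Against a real endpoint it would be something like:
# titles = item_titles(fetch_json("https://example.org/api/items?page=1"))
```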

I've had to do that kind of thing more times than I'd like to admit, but the juice is almost always worth the squeeze.

[–] [email protected] 1 points 1 month ago (1 children)

Well if I was doing it I probably would be trying to focus on browser emulation to avoid having to dig into those sorts of details. It sounds like OP is a beginner and needs a simple method.

[–] [email protected] 2 points 1 month ago

I agree that OP sounds like a beginner, and what you've suggested is likely the best approach for someone who is familiar with frontend tools and frameworks. Selenium (and admittedly BeautifulSoup) is probably too low level for this particular user, but that doesn't mean they can't still learn some fundamentals while solving this problem without resorting to something as heavy and complicated as background browser emulation and rendering. I could be wrong though.

[–] [email protected] 1 points 1 month ago (1 children)

This was the original plan but it doesn't work as well for this on 'dynamic' websites

[–] [email protected] 1 points 1 month ago (1 children)

IIRC it should be able to be made to work since it does everything a browser does, found this search result, though it has been a while since I used it myself at all. Another thing you might try that has worked for me is iMacros, that's a little simpler and more basic than Selenium but should work for what you say you want to do.

[–] [email protected] 1 points 1 month ago

I test with IDLE for Python and use Selenium for the driver directory (geckodriver)

[–] [email protected] 1 points 1 month ago (1 children)

My current script uses the bs4 and requests imports. It also has the selenium import for geckodriver, but I am considering just removing that feature lol

[–] [email protected] 1 points 1 month ago (1 children)

I would love to see your code, but I understand if this forum isn't the most ideal place to share.

[–] [email protected] 1 points 1 month ago (1 children)

I could send it to you privately if you let me know ur discord or something

[–] [email protected] 1 points 1 month ago

I'm not currently on Discord, could you upload the code to pastebin or something similar?

https://pastebin.com/

[–] SpaceBishop 3 points 1 month ago* (last edited 1 month ago) (1 children)

I am no expert, but I have used Python in a professional environment, and helped on board a Python newbie to build out his first project.

It would be helpful to know what your environment looks like (what OS you are running, your Python version, and your terminal interface: cmd, PowerShell, or Terminal) and which steps prompt the reported error messages.

Starting from the first time running Python using a Windows computer, the first steps should be

Launch Powershell as admin and type in the following commands:

set-executionpolicy remotesigned

winget install python

mkdir python

cd python

python -m venv scraper

.\scraper\Scripts\activate

Following that, you should be able to use pip to install more modules or packages. I have Visual Studio Code as my IDE, which means I can also run the command code from there to open the text editor and write whatever code I intend to run. Be sure to save it to C:\Users\youruseraccount\python. If your scripts are saved to that folder, you can run them from PowerShell by typing python followed by their filename. Any time you want to run scripts, open PowerShell, type cd python and then .\scraper\Scripts\activate, and hit enter. Then type python followed by the name of the script you want to run.
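If it is unclear whether the activate step took effect, a script can check for itself. This is only a sketch (the function name in_virtualenv is made up), but the comparison of sys.prefix and sys.base_prefix is the standard way to tell whether Python is running from a venv:

```python
import sys


def in_virtualenv():
    """True when this interpreter is running from a virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If this prints False after activation, packages installed with pip inside the venv will not be visible to the script, which is one common cause of "module not found" errors.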

This information dump is not the most detailed, but it should get you to the point that you can run your scripts.

[–] [email protected] 1 points 1 month ago

I am having to frankinscript because resources don't really give out the code for my needs. I am using command prompt from win powershell and testing with python IDLE

[–] [email protected] 3 points 1 month ago* (last edited 1 month ago)

Depending on what you want to scrape, that's a lot of overkill and overcomplication. Full website testing frameworks may not be necessary to scrape. Python with its tooling and package management may not be necessary.

I've recently extracted and downloaded stuff via Nushell.

Requirement: Knowledge of CSS Selectors
Inspect Website DOM in Webbrowser web developer tools
1. Identify structure
2. Identify adequate selectors; testable via browser dev tools console document.querySelectorAll()
Get and query data

For me, my command line terminal and scripting language of choice is Nushell:

let $html = http get 'https://example.org/'
let $meta = $html | query web --query '#infobox .title, #infobox .tags' |  | { title: $in.0.0 tags: $in.1.0 }
let $content = $html | query web --query 'main img' --attribute data-src
$meta | save meta.json

1..30 | each {|x| http get $'https://example.org/img/($x).jpg' | save $'($x).jpg'; sleep 100ms }
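For comparison, the numbered download loop on the last line might look like this in Python, using only the standard library. It is a sketch: the example.org URL template and the "downloads" folder name are placeholders, and the actual download call is left commented out so that nothing is fetched by accident.

```python
import time
import urllib.request
from pathlib import Path


def numbered_urls(template, first, last):
    """Fill {n} in the template with every number from first to last, inclusive."""
    return [template.format(n=n) for n in range(first, last + 1)]


def download_all(urls, out_dir, delay=0.1):
    """Save each URL under out_dir, pausing between requests to go easy on the server."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for url in urls:
        target = out / url.rsplit("/", 1)[-1]  # file name = last part of the URL
        with urllib.request.urlopen(url, timeout=30) as response:
            target.write_bytes(response.read())
        time.sleep(delay)


urls = numbered_urls("https://example.org/img/{n}.jpg", 1, 30)
print(len(urls), urls[0], urls[-1])

# download_all(urls, "downloads")  # uncomment to actually fetch the files
```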

Depending on the tools you use, it'll be quite similar or very different.

Selenium is an entire web-browser driver meaning it does a lot more and has a more extensive interface because of it; and you can talk to it through different interfaces and languages.

[–] [email protected] 3 points 1 month ago (1 children)

Selenium is a “driver” that controls browsers, you would need some type of software to actually drive it. If you have programming experience it’s pretty easy to get going.

Personally, I use it in Ruby on Rails development for unit testing but I also use it to log in to websites and perform some actions on behalf of a user (where the websites don’t offer an API).

I don’t have experience with the others, but thought my comment may or may not be useful.

[–] [email protected] 1 points 1 month ago (2 children)

I don't have programming experience and what sorts of software can "drive" the driver?

[–] [email protected] 1 points 1 month ago (1 children)

I probably can’t be of much help yet unless for some reason you want to take up programming. I’m just not familiar with web scraping outside programming.

[–] [email protected] 1 points 1 month ago

I wouldn't mind taking it up if I could just focus on what i'm interested in working on. Python seems simple enough after spending 9 hours trying to get this to work lol. I don't want to "reinvent the wheel" as much as I just want to be able to understand and work with tools that already exist.

[–] [email protected] 1 points 1 month ago

You're going to want to do a lot more reading ahead of time then. It's not hard, but you really need to know some basics about javascript before you start.

[–] [email protected] 2 points 1 month ago (1 children)

We use Node.js with Puppeteer for some of our web crawling at work. It's pretty straightforward once you have a basic script to launch it. If you haven't already, I'd highly suggest installing VS Code. You install Node.js, then use npm (Node Package Manager) to install Puppeteer and whatever other dependencies you might have. Someone out there probably has a basic JS file that will open Chrome, or just ask an LLM (I just use ChatGPT, they're all the same shit). From there you just need to navigate to your pages, then use a querySelector and .click() to click on your elements. It's all JavaScript from there.

Pro tip: write your queryselectors in your browser using the inspect element/console tab, then put it in your JS file. Nothing is worse than being 10 minutes into a crawl and you've got a queerselector.

[–] [email protected] 1 points 1 month ago

I don't like to touch JS, so I've been going Python only (besides basic HTML & CSS), but I found Puppeteer and didn't really get it.

[–] [email protected] 2 points 1 month ago (1 children)

I'm an automation engineer. For scripting with Python and Selenium, what's the issue?

[–] [email protected] 1 points 1 month ago (1 children)

I'm attempting to make a webscraper that can grab online books that are stored within the site or stored with a direct link to the storage site. I don't want to reinvent this but finding one that I can work and/or build off of is hard due to my lack of experience and vague resources.

[–] [email protected] 1 points 1 month ago (1 children)

Sure, ok but this should be pretty straightforward. How technical are you? There should be record and playback tools but I haven't used one of those in years.

[–] [email protected] 2 points 1 month ago (1 children)

Very little. I know basic html + css but I am trying to work with python

[–] [email protected] 1 points 1 month ago (1 children)

Most selectors are going to be based off html/css so you've got some experience there, do you know how to use developer tools in Chrome?

[–] [email protected] 1 points 1 month ago

Yeah I was using it before I realized I might need a scraper.

[–] [email protected] -1 points 1 month ago

Ask dr gpt