this post was submitted on 20 Aug 2024

Self Hosted - Self-hosting your services.

Say I have several large TXT or CSV files with data I want to search.

What is the best way to index this data and make it searchable? I've been using grep, but it is not ideal.

Is there a self-hostable Docker container for indexing and searching this? Or should I use SQL?

[–] [email protected] 2 points 3 weeks ago (3 children)

Files won't change and are hundreds of GBs

[–] [email protected] 1 points 3 weeks ago

ok, database it is then
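For a rough idea of what the database route could look like, here is a minimal sketch using SQLite's FTS5 full-text index from Python's standard library. It assumes your Python build ships SQLite with FTS5 enabled (most do, but not all), and the table name `docs` and the helpers `build_index` and `search` are made up for illustration. Each CSV row becomes one searchable document.

```python
import csv
import sqlite3
from pathlib import Path


def build_index(db_path, csv_paths):
    """Load every row of each CSV into one FTS5 table, one document per row."""
    conn = sqlite3.connect(db_path)
    # source and line_no are stored but not tokenized; only body is searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS docs "
        "USING fts5(source UNINDEXED, line_no UNINDEXED, body)"
    )
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as fh:
            reader = csv.reader(fh)
            # Generator: rows are streamed, so the file is never fully in memory.
            rows = (
                (Path(path).name, n, " ".join(fields))
                for n, fields in enumerate(reader, start=1)
            )
            with conn:  # one transaction per file
                conn.executemany("INSERT INTO docs VALUES (?, ?, ?)", rows)
    return conn


def search(conn, query, limit=10):
    """Run an FTS5 MATCH query and return the best-ranked rows first."""
    return conn.execute(
        "SELECT source, line_no, body FROM docs "
        "WHERE docs MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()
```

Since the files won't change, the index only has to be built once; after that, something like `search(conn, "berlin")` queries the index instead of rescanning the files the way grep does. Expect the initial load of hundreds of GBs to take a while.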

[–] [email protected] 1 points 3 weeks ago (2 children)

Are they roughly 55GB compressed?

[–] [email protected] 1 points 3 weeks ago

Spill the beans!

[–] [email protected] 1 points 3 weeks ago* (last edited 3 weeks ago)

Could use Polars; afaik it supports streaming from CSVs too, and frankly the syntax is so much nicer than pandas if you're coming from Spark land.

Do you need to persist? What are you doing with them? A really common pattern for analytics is landing them in something like Parquet or Delta (less frequently Avro or ORC) and then working right off that. If they don't change, it's an option. 100 gigs of CSVs will take some time to write to a database, depending on resources, tools and DB flavour. Tbf, writing into a compressed format takes time too, but it saves you managing databases (unless you want to; just presenting some alternatives).

Could look at a document DB. Again, it will take time to ingest and index, but it's definitely another tool. I've touched Elastic and stood up Mongo before, and Solr is around too. Solr is built on top of Lucene, which I knew Elastic was; Mongo's core database isn't, though its Atlas Search feature apparently is.

Edit: searchable? I'd look into a document DB; it's quite literally what they're meant for. All of the ones I mentioned are used for enterprise search.
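To show why these engines beat grep on big static files, here is a toy sketch of an inverted index, the core structure that Lucene-based search engines build. It is only an illustration of the idea, not how any of those products is implemented, and the names `tokenize`, `build_inverted_index` and `lookup` are made up for the example.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [t.lower() for t in TOKEN.findall(text)]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def lookup(index, query):
    """Return ids of documents containing every term in the query (AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

The expensive pass over the data happens once at indexing time; each query is then a few dictionary lookups and set intersections rather than a scan of every line.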