this post was submitted on 11 Feb 2024
19 points (100.0% liked)

Programming

16769 readers
81 users here now

Welcome to the main community in programming.dev! Feel free to post anything relating to programming here!

Cross posting is strongly encouraged in the instance. If you feel your post or another person's post makes sense in another community cross post into it.

Hope you enjoy the instance!

Rules

Rules

  • Follow the programming.dev instance rules
  • Keep content related to programming in some way
  • If you're posting long videos try to add in some form of tldr for those who don't want to watch videos

Wormhole

Follow the wormhole through a path of communities [email protected]



founded 1 year ago
MODERATORS
 

I'm working on a query engine, essentially a tool to scan/filter/annotate by lookups/group by/aggregate a large dataset, tens-of-terabytes range. The compute part seems to be a bottleneck for me (I'll be doing around 80-300 GB/s of reads, and yes, I will have hardware capable of providing that kind of throughput). My hypothesis is that by encoding query in form of template arguments I can make the compiler generate code optimized for a specific type of query (like, the filtering or aggregation keys). But I do not know what queries will users send, so I need a way to instantiate templates at runtime.

Sounds simple: for a new type of query invoke a compiler at runtime to build a dynamic library with a new instantiation, then dynload it and off we go. Some prior work is here, though I'm pretty sure any JIT compiler also can counts here. But there's enough technical details to worry about, and at the same time this idea isn't novel, so I wonder—are there any packaged solutions for this kind of approach?

you are viewing a single comment's thread
view the rest of the comments
[–] [email protected] 1 points 6 months ago

Personally I think child processes are the right approach for this. Launch a new process* for each query and it can (if you choose to go that route) dynamically load in compiled code. Exit when you’re done, and the dynamically loaded code is gone. A side benefit of that is memory leaks are contained, since all memory you allocate is about to be removed anyway.

I'd probably be fine with hundreds or thousands of these hanging in memory. I suspect the generated code for a single query would be in hundreds of kilobytes, maybe a megabyte. But yeah, this is one of those technical details I'd worry about.

Honestly, I wonder if you could just use an actual HTTP server for this? They can handle hundreds or even thousands of simultaneous requests. They can handle requests that complete in a fraction of a millisecond or ones that run for several hours. And they have good tools to catch/deal with code that segfaults, hits an endless loop, attempts to allocate terabytes of swap, etc. HTTP also has wonderful tools to load balance across multiple servers if you do need to scale to massive numbers of requests.

Not sure how a HTTP server would solve the CPU bottleneck of scanning terabytes of data per query?