My question is: should I go for a cloud service or could I avoid using GCP/AWS/Azure, etc., and set up a virtual machine with open-source software only (which software do you recommend)?
Tutorials, blogs, etc. on the matter are welcome!
Note that this advice depends a bit on the exact type of data, but as generic advice it's probably the simplest place to start.
ClickHouse installs as a single binary, reads practically every common data format, interacts with Postgres, MySQL, and even SQL Server, and in my experience has the most efficient and versatile SQL engine.
At this moment in time, nothing beats it.
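To give a feel for the single-binary workflow, here is a minimal sketch that queries a local file through `clickhouse local` and the `file()` table function, with no server running. It assumes the `clickhouse` binary is already on your PATH. The file name `events.csv` and the helper names `build_local_query` and `preview` are made up for illustration.

```python
import shutil
import subprocess


def build_local_query(path: str, fmt: str = "CSVWithNames", limit: int = 5) -> list[str]:
    """Build a `clickhouse local` command that previews the first rows of a file.

    `path` and `fmt` are inserted into the SQL text as-is, so only pass trusted values.
    """
    sql = f"SELECT * FROM file('{path}', '{fmt}') LIMIT {limit}"
    return ["clickhouse", "local", "--query", sql]


def preview(path: str, fmt: str = "CSVWithNames", limit: int = 5):
    """Run the preview query; return its text output, or None if ClickHouse is not installed."""
    if shutil.which("clickhouse") is None:
        return None
    cmd = build_local_query(path, fmt, limit)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # `events.csv` is a hypothetical file with a header row.
    print(preview("events.csv"))
```

If this covers what you need on a single VM, you may not need a managed cloud service at all. Whether it scales to your data is something you would have to test.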