HACKER Q&A
📣 hilti

Lightweight data analytics using SQLite, Bash and DuckDB – too simple?


Over the last 12 months, in parallel to using Google BigQuery, I have built my own processing pipeline using SQLite and DuckDB.

What amazes me is that it works surprisingly well and costs much less than using BigQuery.

Roughly speaking, here's what I do: A SQLite database receives IoT sensor data via a very simple PHP function (I currently use the FlightPHP framework for this). The data is written to a table within the SQLite database (WAL mode activated), and each machine's current state is kept up to date via triggers.
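
Roughly, the two tables involved look like this (a simplified sketch - the column names come from the trigger below, the types are just illustrative):

    PRAGMA journal_mode = WAL;  -- WAL mode, set once per database file

    -- Raw sensor messages as they arrive from the machines
    CREATE TABLE messages (
        id              TEXT,
        status          TEXT,
        time_stamp      INTEGER,
        current_present REAL,
        voltage_present REAL
    );

    -- One row per machine, holding its latest known state
    CREATE TABLE states (
        id              TEXT PRIMARY KEY,  -- primary key so INSERT OR REPLACE acts as an upsert
        status          TEXT,
        time_stamp      INTEGER,
        current_present REAL,
        voltage_present REAL
    );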

Example of a trigger

    CREATE TRIGGER message_added AFTER INSERT ON messages
    BEGIN
        INSERT OR REPLACE INTO states
        VALUES (new.id, new.status, new.time_stamp, new.current_present, new.voltage_present);
    END;

This allows me to query the current status of a machine in real time. To do this, I again use a simple PHP function that provides the data via SSE. In the frontend, a simple JavaScript handler (plain vanilla JS) receives the JSON data and updates the HTML in real time.

    const source_realtime = new EventSource("https://myapi/sse_realtime_json");
    source_realtime.onmessage = function(event) {
        var json = JSON.parse(event.data);
        // ... update the HTML elements with the new values ...
    };
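
Behind that endpoint there is no magic: the PHP function essentially just reads the states table and sends the rows out as JSON, something like this (simplified):

    -- Simplified version of what the SSE endpoint queries on each update
    SELECT id, status, time_stamp, current_present, voltage_present
    FROM states;
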
For a historical analysis - for example over 24 months - I create a CSV export from the SQLite database and convert the CSV files into Parquet format.

I use a simple Bash script that I run regularly via a cron job.

Here is an excerpt

# Loop through the arrays and export each table to a CSV, then convert it to a
# Parquet file and load it into the DuckDB database
for (( i=0; i<${arrayLength}; i++ )); do
  db=${databases[$i]}
  table=${tables[$i]}

  echo "Processing $db - $table"

  # Export the SQLite table to a CSV file
  sqlite3 -header -csv "$db" "SELECT * FROM $table;" > "parquet/$table.csv"

  # Convert the CSV file to a Parquet file using DuckDB and register it as a table
  $duckdb_executable "$duckdb_database" <<EOF
COPY (SELECT * FROM read_csv_auto('parquet/$table.csv')) TO 'parquet/$table.parquet' (FORMAT PARQUET);
CREATE OR REPLACE TABLE $table AS SELECT * FROM 'parquet/$table.parquet';
EOF
done
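
With the Parquet files in place, a 24-month analysis is just a DuckDB query. For example (illustrative only - file and column names as above, and assuming time_stamp is a Unix epoch in seconds):

    -- Monthly event counts per machine across the exported history
    SELECT id,
           date_trunc('month', to_timestamp(time_stamp)) AS month,
           count(*) AS events
    FROM 'parquet/messages.parquet'
    GROUP BY id, month
    ORDER BY month, id;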

Now, finally, my question: Am I overlooking something? This little system currently handles 15 million events per month and works well. No outages, nothing like that. I read so much about fancy data pipelines, reactive frontend dashboards, lambda functions ...

Somehow my system feels "too simple". So I'm sharing it with you in the hope of getting feedback.


  👤 perrygeo Accepted Answer ✓
You've discovered a little secret of the industry - most ultra-scalable big data solutions are complete overkill for many scenarios. I've used a technique similar to yours for 20 years and it's only fallen out of fashion in the last 5 years. Not because the workloads are too big but because the industry has willingly chosen to make it more complex (for reasons). Personal computers with local disk are big enough and powerful enough to handle analytics on even medium-sized workloads (into the tens of billions of rows / 100s of GB scale). There's nothing stopping you from doing the simple approach except for dogma and fashion.

The problem is team dynamics. Who executes that bash script and when? What happens if something goes wrong? How do you run it when you're offline, or traveling, or need to use your laptop for something else? How do you and your team track the progress and view the results? And since this is all running in the cloud, how do you track access, security, and costs? And the bus factor - what if you leave and your knowledge leaves with you? What about all the junior developers who want to use FancyNewThing for their resume - can we incorporate that somehow? You need a complex system to support the dozens of people who want to stick their hands in the pipeline; a perfect reflection of Conway's law. These are the organizational problems that "fancy" cloud data pipelines deal with. If you don't need such things, you can (and should) reduce the complexity by orders of magnitude.