What amazes me is that it works surprisingly well and costs much less than using BigQuery.
Roughly speaking, here's what I do: a SQLite database receives IoT sensor data via a very simple PHP function; I currently use the FlightPHP framework for this. The data is written to a table within the SQLite database (WAL mode activated), and the machines' current states are updated via triggers.
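For context, a minimal sketch of the states table as the trigger uses it - the types here are illustrative, and the primary key on the machine id is what makes INSERT OR REPLACE behave like an upsert:

PRAGMA journal_mode = WAL;  -- enable write-ahead logging, set once per database

CREATE TABLE IF NOT EXISTS states (
  id INTEGER PRIMARY KEY,   -- one row per machine; the primary key turns INSERT OR REPLACE into an upsert
  status TEXT,
  time_stamp TEXT,
  current_present REAL,
  voltage_present REAL
);
-- messages has at least the same columns, with one row per incoming event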
Example of a trigger:

CREATE TRIGGER message_added AFTER INSERT ON messages
BEGIN
  INSERT OR REPLACE INTO states
  VALUES (new.id, new.status, new.time_stamp, new.current_present, new.voltage_present);
END;
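Since states ends up with exactly one row per machine, reading the current status is a single lookup (machine id 42 is just a placeholder):

SELECT status, time_stamp, current_present, voltage_present
FROM states
WHERE id = 42;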
This allows me to query the current status of a machine in real time. To deliver it to the browser, I again use a simple PHP function that streams the data via SSE (Server-Sent Events). In the frontend, a bit of plain vanilla JavaScript receives the JSON and updates the HTML in real time.
const source_realtime = new EventSource("https://myapi/sse_realtime_json");
source_realtime.onmessage = function (event) {
  const state = JSON.parse(event.data);
  // ...then update the relevant DOM nodes, e.g. (element id is illustrative):
  document.getElementById("machine_status").textContent = state.status;
};
For historical analysis - over 24 months, for example - I create a CSV export from the SQLite database and convert the CSV files into Parquet format. For this I use a simple Bash script that I execute regularly via a cron job.

Here is an excerpt:
# Loop through the arrays and export each table to a CSV, then convert it
# to a Parquet file and load it into the DuckDB database
for (( i=0; i<${arrayLength}; i++ )); do
  db=${databases[$i]}
  table=${tables[$i]}

  echo "Processing $db - $table"

  # Export the SQLite table to a CSV file
  sqlite3 -header -csv "$db" "SELECT * FROM $table;" > "parquet/$table.csv"

  # Convert the CSV file to a Parquet file using DuckDB
  $duckdb_executable $duckdb_database <<EOF
  -- (DuckDB SQL omitted in this excerpt)
EOF
done
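The DuckDB statements in the heredoc aren't shown in this excerpt; roughly, the conversion boils down to something like the following ($table is expanded by the shell before DuckDB sees it, and the names are placeholders):

-- load the CSV into a table in the DuckDB database, then write it out as Parquet
CREATE OR REPLACE TABLE $table AS
SELECT * FROM read_csv_auto('parquet/$table.csv', header = true);
COPY $table TO 'parquet/$table.parquet' (FORMAT PARQUET);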
Now, finally, my question: am I overlooking something? This little system currently works well for 15 million events per month. No outages, nothing like that. I read so much about fancy data pipelines, reactive frontend dashboards, lambda functions ...
Somehow my system feels "too simple". So I'm sharing it with you in the hope of getting feedback.
The problem is team dynamics. Who executes that bash script, and when? What happens if something goes wrong? How do you run it when you're offline, or traveling, or need to use your laptop for something else? How do you and your team track the progress and view the results? And since this is all running in the cloud, how do you track access, security, and costs? And the bus factor - what if you leave and your knowledge leaves with you? What about all the junior developers who want to use FancyNewThing for their resumes - can we incorporate that somehow? You need a complex system to support the dozens of people who want to stick their hands in the pipeline; a perfect reflection of Conway's law. These are the organizational problems that "fancy" cloud data pipelines deal with. If you don't need such things, you can (and should) reduce the complexity by orders of magnitude.