If you use DuckDB to run the tests, you can reference those files as if they were tables (select * from 'in.parquet'), and the tests will run extremely fast
One challenge if you're using Spark is that tests can be frustratingly slow to run. One possible solution (that I use myself) is to run most tests using DuckDB, and only e.g. the overall test using Spark SQL.
I've used the above strategy with PyTest, but I don't think it's conceptually sensitive to the programming language/test runner you use.
Also I have no idea whether this is good practice - it's just something that seemed to work well for me.
The approach with csvs can be nice because your customers can review these files for correctness (they may be the owners of the metric), without them needing to be coders. They just need to confirm in.csv should result in expected_out.csv.
If it makes it more readable you can also inline the 'in' and 'expected_out' data e.g. as a list of dicts and pass into DuckDB as a pandas dataframe
One gotcha is that SQL does not guarantee row order, so you need to sort or otherwise ensure your tests are robust to this.
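A minimal sketch of what this can look like with pytest and DuckDB (table and column names are made up for illustration):

# Inline the 'in' and 'expected_out' data as lists of dicts, register them
# with DuckDB as a pandas DataFrame, run the transformation SQL, and compare
# with an explicit ORDER BY so the test is robust to row ordering.
import duckdb
import pandas as pd

def test_orders_rollup():
    in_df = pd.DataFrame([
        {"customer_id": 1, "amount": 10.0},
        {"customer_id": 1, "amount": 5.0},
        {"customer_id": 2, "amount": 7.5},
    ])
    expected_out = pd.DataFrame([
        {"customer_id": 1, "total_amount": 15.0},
        {"customer_id": 2, "total_amount": 7.5},
    ])

    con = duckdb.connect()
    con.register("orders", in_df)  # the DataFrame is now queryable as a table

    actual = con.execute(
        """
        SELECT customer_id, SUM(amount) AS total_amount
        FROM orders
        GROUP BY customer_id
        ORDER BY customer_id   -- explicit sort: SQL does not guarantee order
        """
    ).df()

    pd.testing.assert_frame_equal(actual, expected_out)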
Here's [2] a slide deck by David Wheeler giving an introduction into how it works.
[2] https://www.slideshare.net/justatheory/unit-test-your-databa...
Plenty of constraints, uniques and foreign keys and not nulls. Enum types.
Visuals, dump to csv and plot some graphs. Much easier to find gaps and strange distributions visually.
Asserts in DO blocks, mostly counts being equal.
Build tables in a _next suffix schema and swap when done (sketched below, together with the DO-block asserts).
Never mutating the source data.
Using psql's ON_ERROR_STOP setting.
Avoid all but the most trivial CTEs, preferring intermediate tables that can be inspected. Constraints and assertions on the intermediate tables.
“Wasting” machine resources and always rebuilding from scratch when feasible. CREATE TABLE foo AS SELECT is much simpler than figuring out which row to UPDATE. Also ensures reproducibility, if you’re always reproducing from scratch it’s always easy. State is hard.
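For illustration, a rough sketch of the DO-block assert and the _next schema swap, assuming PostgreSQL and psycopg2 (schema and table names are made up):

# Build into reporting_next, assert counts in a DO block, and only then swap
# the schema into place. DSN and names are assumptions for illustration.
import psycopg2

BUILD = """
DROP SCHEMA IF EXISTS reporting_next CASCADE;
CREATE SCHEMA reporting_next;
CREATE TABLE reporting_next.daily_totals AS
SELECT order_date, SUM(amount) AS total
FROM source.orders
GROUP BY order_date;
"""

# Assertion in a DO block: counts must line up between source and output.
CHECK = """
DO $$
DECLARE
    src_days bigint;
    out_days bigint;
BEGIN
    SELECT COUNT(DISTINCT order_date) INTO src_days FROM source.orders;
    SELECT COUNT(*) INTO out_days FROM reporting_next.daily_totals;
    IF src_days <> out_days THEN
        RAISE EXCEPTION 'daily_totals has % rows, expected %', out_days, src_days;
    END IF;
END $$;
"""

SWAP = """
DROP SCHEMA IF EXISTS reporting_old CASCADE;
ALTER SCHEMA reporting RENAME TO reporting_old;
ALTER SCHEMA reporting_next RENAME TO reporting;
"""

with psycopg2.connect("dbname=analytics") as conn, conn.cursor() as cur:
    cur.execute(BUILD)
    cur.execute(CHECK)   # raises (and rolls back) if the assertion fails
    cur.execute(SWAP)    # the swap only happens if everything above succeeded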
Overall I’m quite happy with the workflow and very rarely do we make mistakes that unit tests would have caught. Our source data is complex and not always well understood (10+ years of changing business logic), so writing good tests would be very hard. Because we never touch the raw source data, any errors we inevitably make are recoverable.
This talk by Dr Martin Loetzsch helped a lot: https://youtu.be/whwNi21jAm4
You can hook up dbt tests to your CI and Git(hub|lab) for data PRs.
Depending on your needs, you can also look into data observability tools such as Datafold (paid) or re_data (free)
On the back of that professional use I wrote a blog post [2] explaining why you might choose to go down this route, as it wasn't the way database code was developed back then (SQL wasn't developed in the same way as the other front-end and back-end code).
A few years later I gave a short 20-minute talk (videoed) to show what writing SQL using TDD looked like for me. It's hard to show all the kinds of tests we wrote in practice at the bank but the talk is intended to show how rapid the feedback loop can be using a standard DB query tool and two code windows - production code and tests.
Be kind, it was a long time ago and I'm sure the state of the art has improved a lot in the intervening years :o).
Chris Oldwood
---
[1] SQL Server Unit: https://github.com/chrisoldwood/SS-Unit
[2] You Write Your SQL Unit Tests in SQL?: https://chrisoldwood.blogspot.com/2011/04/you-write-your-sql...
[3] Test-Driven SQL: https://www.youtube.com/watch?v=5-MWYKLM3r0
We use Microsoft SQL's docker image and spin it up in the background on our laptop/CI server so port 1433 has a database.
Then, whenever we add a new SQL migration, our homegrown migration-file runner computes a hash of the migrations, makes a database template_a5757f7e, and runs the hundreds of migrations on it (todo: make one template build on the previous).
Then we use the BACKUP command to dump the db to disk (within the docker image)
Finally, each test function is able to make a new database and restore that backup from file in less than a second. Populate with some relevant test data, run code, inspect results, drop database.
So our test suite uses hundreds of fresh databases and it still runs in a reasonable time.
(And..our test suite is written in Go, with a lot of embedded SQL strings, even if a lot of our business logic is in SQL)
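Roughly what the restore-per-test step can look like (shown here in Python rather than Go; the connection string, backup path and logical file names are assumptions that depend on how the template database was created):

# Restore the template backup into a brand-new database for each test, then
# drop it afterwards. BACKUP/RESTORE cannot run inside a transaction, hence
# autocommit=True. The 'template'/'template_log' logical names are assumptions.
import uuid
import pyodbc

MASTER = ("DRIVER={ODBC Driver 17 for SQL Server};"
          "SERVER=localhost,1433;UID=sa;PWD=<password>;DATABASE=master")

def fresh_database() -> str:
    name = "test_" + uuid.uuid4().hex[:8]
    conn = pyodbc.connect(MASTER, autocommit=True)
    conn.cursor().execute(f"""
        RESTORE DATABASE [{name}]
        FROM DISK = '/var/opt/mssql/backup/template.bak'
        WITH MOVE 'template'     TO '/var/opt/mssql/data/{name}.mdf',
             MOVE 'template_log' TO '/var/opt/mssql/data/{name}_log.ldf'
    """)
    conn.close()
    return name

def drop_database(name: str) -> None:
    conn = pyodbc.connect(MASTER, autocommit=True)
    conn.cursor().execute(
        f"ALTER DATABASE [{name}] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; "
        f"DROP DATABASE [{name}]"
    )
    conn.close()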
This is easier if you have the same input every time the tests run, like a frozen database image, because then you can basically have snapshot tests.
When the tests pass, we can change from DuckDB to Spark. This helps decouple testing Spark pipelines from the SparkSession and infrastructure, which saves a lot of compute resources during the iteration process.
This setup requires an abstraction layer to make the SQL execution agnostic to platforms and to make the data sources mockable. We use the open source Fugue layer to define the business logic once, and have it be compatible with DuckDB and Spark.
It is also worth noting that FugueSQL will support warehouses like BigQuery and Snowflake in the near future as part of their roadmap. So in the future, you can unit test SQL logic, and then bring it to BigQuery/Snowflake when ready.
For more information, there is this talk on PyData NYC (SQL testing part): https://www.youtube.com/watch?v=yQHksEh1GCs&t=1766s
Fugue project repo: https://github.com/fugue-project/fugue/
1) Set up a test db instance with controlled data in it as the basis for your test cases. Ideally this data is taken from real data that has caused pipeline problems in the past, but scrubbed for PII etc. You can also use or write generators to pad this out with realistic-looking fake data. If you do this, the same dataset can be used for demos (once you add data for your demo paths).
2) Write test cases using whatever test framework you use in your main language. Say you code in Python, you write pytest cases; Java -> JUnit, etc. You can help yourself by writing a little scaffolding that takes a SQL query and a predicate, runs the query, and asserts the predicate over the result (a sketch of this follows below). If you don't have a "main language", just write these test cases in a convenient language.
3) Consider resetting the state of the database (probably by reloading a controlled dump before each test batch) so any tests which involve inserts/deletes etc. work. You may actually want to create an entirely new db and load it before each test run so that you can run multiple test batches concurrently against different dbs without contention messing up your results. Depending on your setup you may be able to achieve a similar effect using schemas or (sometimes but not always) transactions. You want each test run to be idempotent and isolated though.
Doing it this way has a number of benefits because it's easy to add your sql test cases into your CI/CD (they just run the same as everything else).
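A rough sketch of the query-plus-predicate scaffolding from step 2, using pytest and psycopg2 (the DSN, table and column names are made up):

# Run a SQL query against the controlled test database and assert a predicate
# over the result rows.
import psycopg2
import pytest

@pytest.fixture
def db():
    conn = psycopg2.connect("dbname=pipeline_test")
    yield conn
    conn.rollback()
    conn.close()

def assert_query(conn, sql, predicate, params=None):
    """Run `sql` and assert that `predicate(rows)` holds over the result."""
    with conn.cursor() as cur:
        cur.execute(sql, params or ())
        rows = cur.fetchall()
    assert predicate(rows), f"predicate failed for query: {sql}"

def test_no_orphaned_order_lines(db):
    assert_query(
        db,
        """
        SELECT COUNT(*) FROM order_lines ol
        LEFT JOIN orders o ON o.id = ol.order_id
        WHERE o.id IS NULL
        """,
        lambda rows: rows[0][0] == 0,
    )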
1) The same way you'd write any other tests. Use your favourite testing framework to write fixtures and tests for the SQL queries:
- connect to the database
- create tables
- load test data
- run the query
- assert you get the results you expect
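A minimal, self-contained illustration of those steps using Python's built-in sqlite3 (in practice you'd point the same test at your real database and query):

import sqlite3

def test_active_user_count():
    conn = sqlite3.connect(":memory:")          # connect to the database
    conn.execute("CREATE TABLE users (id INTEGER, active INTEGER)")  # create tables
    conn.executemany(                            # load test data
        "INSERT INTO users VALUES (?, ?)",
        [(1, 1), (2, 0), (3, 1)],
    )
    rows = conn.execute(                         # run the query
        "SELECT COUNT(*) FROM users WHERE active = 1"
    ).fetchall()
    assert rows == [(2,)]                        # assert you get the expected results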
For insert or update queries, that assertion step might involve running another query.
2) DBT has support for testing! It's quite good. See https://docs.getdbt.com/docs/build/tests
First, by “testing SQL pipelines”, I assume you mean testing changes to SQL code as part of the development workflow? (vs. monitoring pipelines in production for failures / anomalies).
If so:
1 – assertions. dbt comes with a solid built-in testing framework [1] for expressing assertions such as “this column should have values in the list [A,B,C]” as well as checking referential integrity, uniqueness, nulls, etc. There are more advanced packages on top of dbt tests [2]. The problem with assertion testing in general though is that for a moderately complex data pipeline, it’s infeasible to achieve test coverage that would cover most possible failure scenarios.
2 – data diff: for every change to SQL, know exactly how the code change affects the output data by comparing the data in dev/staging (built off the dev branch code) with the data in production (built off the main branch). We built an open-source tool for that: https://github.com/datafold/data-diff, and we are adding an integration with dbt soon which will make diffing as part of dbt development workflow one command away [2]
We make money by selling a Cloud solution for teams that integrates data diff into Github/Gitlab CI and automatically diffs every pull request to tell you how a change to SQL affects the target table you changed, downstream tables and dependent BI tools (video demo: [3])
I’ve also written about why reliable change management is so important for data engineering and what the key best practices to implement are [4]
[1] https://docs.getdbt.com/docs/build/tests [2] https://github.com/calogica/dbt-expectations [3] https://github.com/datafold/data-diff/pull/364 [4] https://www.datafold.com/dbt [5] https://www.datafold.com/blog/the-day-you-stopped-breaking-y...
Basically, treat the query and database as a black-box for testing like you would another third party API call.
I would strongly suggest having a layer of code in your application that is exclusively your data access, and keeping any logic you can out of it. Data-level tests are pretty onerous to write in the best of circumstances, and the more complexity you allow to grow around the raw SQL, the worse a time you'll have. Swapping out where clauses and the like dynamically is a cost you'll need to eat, and sometimes having a semi-generic chunk that you reuse with some different joins can be more efficient than writing ten completely different access functions with completely different internal logic, so judgement is required.
At the end of the day a database is like any other third party software component - data goes in, data comes out... the nice thing is that SQL is well defined and you've got all the definitions so it's easier to find the conditional cases you need to really closely tests... but databases are complex beasties and it'll never be easy.
Best ideas IMO (no particular order):
- make SQL dumber, move logic that needs testing out of SQL
- use an ORM that allows composing, disconnect composition & test (ie EF for .NET groups, test the LINQ for correct filtering etc, instead of testing for expected data from a db) (I see this has already been recommended elsewhere)
Be wary of too many techniques that are supposed to make it easier to test, but also make it hard for you to leave a query pipeline. In particular, SQL should be very easy to test in the style of "with these as our base inputs, we expect these as our base outputs." Trying to test individual parts of the queries is almost certainly doomed to massive bloat of the system and will cause grief later.
A quick web search confirms suspicions: it is not easy.
We've written our own tool to compare different data sources against each other. This allows you, for example, to test for invariants (or expected variations) before and after a transformation.
The tool is open source: https://github.com/QuantCo/datajudge
We've also written a blog post trying to illustrate a use case: https://tech.quantco.com/2022/06/20/datajudge.html
DBT, specifically DBT-2, is a suite of tests designed to benchmark a database system. These tests aren't interested in, e.g., the correctness of an application that is using the database. They are meant to test the system as a whole by modeling some sort of "average business", defining some sort of "average business operation", and estimating how many such operations a particular deployment of the system can perform.
Such tests are rarely of any interest to application developers, and are more geared towards DBAs who execute them to estimate the efficiency of a system they deploy or the amount of hardware necessary to support a business.
MySQL DBT2 suite: https://dev.mysql.com/downloads/benchmarks.html
PostgreSQL DBT2 suite: https://wiki.postgresql.org/wiki/DBT-2
Those tools are typically modeled on TPC-B... And, it would require a separate discussion to describe why these tests are obsolete and why there isn't really any replacement.
----
However, from the rest of your question it seems that you may use DBT acronym in some other way... So, what exactly are you testing? Are you interested in performance? A benchmark? Schema correctness? Are you perhaps trying to simply test the application that is using a SQL database and you want to avoid dealing with the database setup as much as possible?
As others have mentioned, you want to compare the results of your queries against a previously known 'good' state of the data. So, as you're making data model changes, you can regularly check your development environment against production to see how your changes affect the data.
Data profiling is the perfect tool for this, especially when your pipeline reaches a certain size, or you're dealing with very large datasets.
I work on the team creating PipeRider.io, which uses data profiling comparisons as a method of "code review for data".
It becomes particularly useful when you automate generating data profiles of development and production environments in CI, and attach the data profile comparison to the pull request comment. It makes seeing the impact of changes so much easier.
Here's an article that discusses the benefits of this: https://blog.infuseai.io/why-you-lack-confidence-merging-dbt...
We have two flavours of test: one that rolls back the transaction each time, ensuring a clean, known state. And one that doesn’t, allowing your tests to avoid lots of overhead by “walking through a series of incremental states”.
Yes, some might call the latter heresy. But it works great.
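The first flavour can look roughly like this as a pytest fixture (psycopg2, the DSN and the accounts table are assumptions); the second flavour simply skips the rollback so later tests walk through the state left by earlier ones:

# Every test runs inside a transaction that is rolled back afterwards, so each
# test sees the same clean, known state.
import psycopg2
import pytest

@pytest.fixture
def tx():
    conn = psycopg2.connect("dbname=app_test")
    try:
        with conn.cursor() as cur:
            yield cur
    finally:
        conn.rollback()   # throw away everything the test did
        conn.close()

def test_insert_is_visible_inside_the_transaction(tx):
    tx.execute("INSERT INTO accounts (name) VALUES ('alice')")
    tx.execute("SELECT COUNT(*) FROM accounts WHERE name = 'alice'")
    assert tx.fetchone()[0] == 1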
During the test:
- At the start of the test (fixture), run a new DB instance.
- Apply the DB schema.
- Possibly: remove constraints that would disturb your tests (e.g. unimportant foreign keys).
- Possibly: add default values for columns that are not important for your test (but do this with caution).
- Run your test.
- Assert results (maybe directly via access to the database, or via a dump of tables).
- Tear down the database, possibly removing all data (except error logs).
I used this pattern to test software that uses MySQL or MariaDB server. For Microsoft SQL server it may be enough to create a new database instead of running a new instance (possible but not as easy as for MySQL/MariaDB).
On CI server this can be used to run tests against all required DB server types and versions.
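A sketch of that fixture using the docker CLI and pymysql (image tag, port and credentials are made up for illustration):

# Start a throwaway MariaDB container, wait for it, apply the schema, run the
# test, then tear everything down.
import subprocess
import time
import pymysql
import pytest

@pytest.fixture(scope="function")
def mariadb():
    cid = subprocess.check_output([
        "docker", "run", "-d", "--rm",
        "-e", "MARIADB_ROOT_PASSWORD=test",
        "-e", "MARIADB_DATABASE=app",
        "-p", "33060:3306",
        "mariadb:10.11",
    ]).decode().strip()
    try:
        conn = _wait_for_server()
        conn.cursor().execute("CREATE TABLE items (id INT PRIMARY KEY, name TEXT)")
        yield conn                      # run the test against a fresh instance
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=False)  # tear down

def _wait_for_server(timeout=60):
    deadline = time.time() + timeout
    while True:
        try:
            return pymysql.connect(host="127.0.0.1", port=33060,
                                   user="root", password="test", database="app")
        except pymysql.err.OperationalError:
            if time.time() > deadline:
                raise
            time.sleep(1)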
Look at this JetBrains survey: https://www.jetbrains.com/lp/devecosystem-2021/databases/
Around half of the people never debug stored procedures. Three quarters of people don't have tests in their databases. Only half of the people version their database scripts.
Personally: the answer is containers. Spin up a database in a container (manually or on CI server) and do whatever you need with it. Seed it with some test data, connect an app to it, check that app tests pass when writing to and reading from a live database (as opposed to mock data stores or something like H2), then discard the container.
Even if you don't have a traditional app, throwaway instances of the real type of DB that you'll be using are great, both for development and testing.
I would love testing to work.
Have set up and maintained several unit test suites in Jest.
Wrote several large e2e test suites in Cypress.
I don't think anyone saved time compared to simply having a manual checklist and testing manually.
Maybe me and my former teammates are doing it wrong. Talking 8+ teams, from corporate to startup.
But I'd love to be proven wrong. E2E tests definitely caught the most issues.
It's a lot faster and easier than dealing with containers and the like.
Example:
db = Fake().expect_query("SELECT * FROM users", result=[(1, 'Bob'), (2, 'Joe')])
Then you do:
db.query("SELECT * FROM users")
and get back the result.
In Python if you do this in a context manager, you can ensure that all expected queries actually were issued, because the Fake object can track which ones it already saw and throw an exception on exit.
The upside of this is, you don't need any database server running for your tests.
update: This pattern is usually called db-mock or something like this. There are some packages out there. I built it a few times for companies I worked for.
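A bare-bones sketch of such a fake (method names mirror the example above; real db-mock packages are more thorough):

# The fake records which queries were expected, returns canned results, and
# complains on exit if any expected query was never issued.
class Fake:
    def __init__(self):
        self._expected = {}   # sql -> canned result
        self._seen = set()

    def expect_query(self, sql, result):
        self._expected[sql] = result
        return self

    def query(self, sql):
        if sql not in self._expected:
            raise AssertionError(f"unexpected query: {sql}")
        self._seen.add(sql)
        return self._expected[sql]

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        missing = set(self._expected) - self._seen
        if exc_type is None and missing:
            raise AssertionError(f"expected queries never issued: {missing}")
        return False

# Usage, matching the example above:
with Fake().expect_query("SELECT * FROM users",
                         result=[(1, 'Bob'), (2, 'Joe')]) as db:
    assert db.query("SELECT * FROM users") == [(1, 'Bob'), (2, 'Joe')]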
* It has language-level module support, similar to other languages, so SQL functions are reusable across multiple codebases without depending on code-generation tricks. One of the major blockers for SQL adoption has been complex domain-specific business logic, and now the situation is better.
* It has official unit-test support. Google uses Blaze (known externally as Bazel), so adding a unit test for SQL code is as simple as adding a SQL module (and its test input) dependency to the SQL test target, then writing a test query and its expected output in an approval-testing format. Setting up the DB environment is all handled by the testing framework.
* It has official SQL binary support. That's just a fancy name for handling lots of the tedious stuff involved in running a SQL query (e.g. putting everything needed into a single package, performing type checks, handling input parameters, managing native code dependencies for FFI, etc.).
None of these is technically sophisticated, at least in theory, but combined they become pretty handy. Now I can write a simple SQL module that mostly depends on another team's SQL module, write a simple unit test for it, then run a SQL binary just as in other languages. I haven't worried a single time about how to set up a DB instance. This loop is largely focused on OLAP, so it's a bit different for OLTP, which has its own established testing patterns.
And that can be done by dumping the database (possibly verifying the content of that dump), taking a backup, restoring the backup to a fresh container, then comparing dump of that freshly restored database to the one you took at the start.
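A sketch of the dump-and-compare part, assuming PostgreSQL with pg_dump on the PATH (host and database names are made up; the backup/restore into a fresh container happens between the two dumps):

# Dump the original database, restore the backup elsewhere, dump again, and
# compare the two dumps.
import subprocess

def dump(host: str, dbname: str, outfile: str) -> None:
    subprocess.run(
        ["pg_dump", "--host", host, "--no-owner", "--file", outfile, dbname],
        check=True,
    )

dump("db-original", "app", "before.sql")
# ... take the backup, restore it into a fresh container reachable as db-restored ...
dump("db-restored", "app", "after.sql")

with open("before.sql") as a, open("after.sql") as b:
    assert a.read() == b.read(), "restored database does not match the original dump"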
I don't know the technical details of how to set it up; it was already set up when I worked there.
But basically, we wrote SQL scripts that included statements to:
1. create the DB structure, tables, or views
2. insert test data (you can insert corner cases, etc.)
3. run the function or procedure
4. run an assert to confirm the results meet our expectations
The test scripts were run by the CI/CD process.
As the meme says: App worked before. App works afterwards. Can’t explain that.
[1] https://hexdocs.pm/ecto_sql/Ecto.Adapters.SQL.html#module-sq...
Rails for Ruby comes with some pretty nice setups for testing the database code. There's a test DB by default with the same schema as Production, and the usual test frameworks (FactoryBot and RSpec) make it easy to set up some data in the actual DB for each spec, run model code that makes actual SQL queries, and assert against the results.
I would have hoped most other web hosting frameworks would make as much effort to making it straightforward to test your database code, but it doesn't really seem to be the case.
In Rust, there's a very handy crate called sqlx. What it does is, at compile time, it runs all of the SQL in your codebase against a copy of your database to both validate that it runs without errors and map the input and output types to typecheck the Rust code.
When it comes to stuff like validating that your queries are performant against production datasets or that there isn't any unexpected data in production that breaks your queries, well I pretty much got nothing. Maybe try a read replica to execute against?
More trivial example:
{%
call dbt_unit_testing.test(
'REDACTED',
'Should replace nullish values with NULL'
)
%}
{% call dbt_unit_testing.mock_source('REDACTED', 'REDACTED', opts) %}
"id" | "industry"
1 | 'A'
2 | 'B'
3 | ''
4 | 'Other'
5 | 'C'
6 | NULL
{% endcall %}
{% call dbt_unit_testing.expect(opts) %}
"history_id" | "REDACTED"
1 | 'A'
2 | 'B'
3 | NULL
4 | NULL
5 | 'C'
6 | NULL
{% endcall %}
{% endcall %}
Stored procedures are a different beast though. Having significantly struggled to debug stored procedures running in MSSQL on a Macbook (on Windows SQL Management Studio lets you set breakpoints, on Mac you're SOL), if I was building an application based on them I'd definitely try to spin up some kind of testing framework around them. I guess what I'd probably do is have a temporary database and some regular testing framework that nukes the db, then calls the stored proc(s) with different inputs and checks what's in the tables after each run. Sounds slow and clunky?
Starting with a framework that is programming-language-first (i.e. Spark) can help you build your own tooling to actually build unit tests. It's frustrating, though, that this isn't common across other ETL tooling.
I ought to produce unit tests that prove that the tuples from each join operation produce the correct dataset. I've only ever tested with 3 join operations in one query.
From a user's perspective, I guess you could write some tooling that loads example data into a database and does an incremental join with each part of the join statement added.
For my own use-cases, I usually test this at the application level and not the DB level. This is admittedly not unit-testing my SQL (or stored procs or triggers) but integration-testing it.
You can achieve the same thing with "docker commit"-ing data into docker images of your DB engine of choice, and firing your queries at them, but that only really works with smaller datasets.
not a unique problem with sql, btw.
A list of best practices: https://docs.getdbt.com/guides/legacy/best-practices
And shameless plug but there's a chapter on modeling in my book: https://theinformedcompany.com
A ref() concept like dbt's is sufficient. When testing, have ref output a different (test-x) name for all your references.
The backbone for this is that we spin up a DB per unit test, so we don't have to worry about shared state.
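A tiny illustration of the idea: a ref() helper that resolves to test-prefixed names when a flag is set, so the same query text runs against either the test fixtures or the real tables (the env var and table names are made up):

import os

def ref(name: str) -> str:
    # In test mode, point every reference at a test-prefixed table instead.
    prefix = "test_" if os.environ.get("SQL_TEST_MODE") == "1" else ""
    return prefix + name

query = f"""
SELECT user_id, SUM(amount) AS total
FROM {ref('payments')}
GROUP BY user_id
"""
# With SQL_TEST_MODE=1 this reads from test_payments instead of payments.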
https://news.ycombinator.com/item?id=34580675
I am always baffled by why this isn't a more popular way of writing SQL.
Most of the time there is a layer around your SQL (a repository, a bash script, or whatever) that you can use for integration testing.
Anyone have any recommendations on testing SSIS ?
One of the premises that we have is the ability to instantly create a test environment by creating a branch. I'd love to hear what you think about it.
This way I don't waste time on unit tests that quickly get stale and that no one wants to maintain and run.
Excellent for checking delete queries before running them.
In PostgreSQL, a cool tool for performance is "hypothetical indexing", which predicts how the optimizer will use indexes in any SQL query. I could see an automated testing tool written around "hypothetical indexing".
Also, I believe MS SQL Server supports hypothetical indexes.
for real though I love tools like SequelPro or TablePlus that let me work out a query before I bake logic or stuff into my apps. Also sometimes I use it to work out the data needed for reports. I am working with salesforce for the first time in my life and apparently there are tools that let me treat it like I'm used to SequelPro.
But my app is for six users at one site, it’s not mission critical, and the sqlite DB is backed up hourly.
Life’s too short for (unnecessary) testing.
Learn to use IMPORT TABLESPACE in MySQL or just dump and import SQL.
Every time you run a test you set up the mock databases again.
There are a few types of tests one would like from a SQL pipeline, each with a different value add:
- Quality assurance tests: these are things like DBT tests; they mainly test the accuracy of the result after the tables are produced. Examples would be tests like "this column should not contain any `null` values" or "it should have only X, Y and Z values". They are valuable checks, but the fact of the matter is that there are many cases where running these sorts of tests after the data is produced is a bit too late.
- Integration tests: specify an input table and your expected output, and run your queries against it; the end result must match the expectations at all times. These are useful to run regularly, serving as "integration tests" for your SQL assets. They allow validating the logic inside the query, provided that the input covers the cases that need to be covered, and they can be executed in CI/CD pipelines. We are exploring a new way of doing this with Blast CLI, effectively running a BigQuery-compatible database in-memory and running tests against every asset in the pipeline locally.
- Validation tests: these tests aim to ensure that the query is syntactically correct on the production DWH, usually using tricks like `EXPLAIN` or dry-run in BigQuery. These sorts of tests would ensure that the tables/fields referenced actually exist, the types are valid, the query has no syntax errors, etc.. These are very useful for running in CI after every change, effectively allowing catching many classes of bugs.
- Sanity checks: these are similar to the quality assurance tests described above, but with a bigger focus on making sense out of the data. They range from "this table has no more rows than this other table" to business-level checks such as "the conversion rate for this week cannot be more than 20% lower compared to last week". They are executed after the data is produced as well, and they would serve as an alerting layer.
There is no silver bullet when it comes to testing SQL, because in the end what is being tested is not just the SQL query but the data asset itself, which makes things more complicated. The fact that SQL has no standardized way of testing things and the language has a lot of dialects make this harder than it could have been. In my experience, I have found the combination of the strategies above to have a very good coverage when it comes to approximating how accurate the queries are and how trustworthy the end result is, provided that a healthy mix of them is being used throughout the whole development lifecycle.
I follow these steps in my pipeline.
Every time I commit changes, the CI/CD pipeline runs them in this order:
- I use sqitch for the database migration (my DB is postgresql).
- Run the migration script `sqitch deploy`. It runs only the items that haven't been migrated yet.
- Run the `revert all` feature of sqitch to check if the revert action works well too.
- I run `sqitch deploy` again to test if the migration works well from scratch.
- After the schema migration has been applied, I run integration tests with Typescript and a test runner, which includes a mix of application tests and database tests too.
- If everything goes well, then it runs the migration script to the staging environment, and eventually it runs on the production database after a series of other steps on the pipeline.
I test my database queries from Typescript in this way:
- In practice I'm not strict about separating the tests for the database queries from the application code; instead, I test the layers as they are being developed, starting from simple inserts on the database, where I test the application CRUD functions being developed, plus the fixture generators (the code that generates synthetic data for my tests) and the deletion and test-cleanup capabilities.
- Having that boilerplate code, I then start testing the complex queries, and if a query is large enough (and assuming there are no performance penalties from using CTEs in those cases), I write my large queries in small chunks as CTEs, like this (replace SELECT 1 with your queries):
export const sql_start = `
WITH dummy_start AS (
SELECT 1
)
`;
export const step_2 = `${sql_start},
step_2 AS (
SELECT 1
)
`;
export const step_3 = `${step_2},
step_3 AS (
SELECT 1
)
`;
export const final_sql_query_to_use_in_app = ` ${step_3},
final_sql_query_to_use_in_app AS(
SELECT 1
)
SELECT * FROM final_sql_query_to_use_in_app
`;
Then in my tests I can quickly pick any step of the CTE and test it:
import {step_2, step_3, final_sql_query_to_use_in_app} from './my-query';
test('my test', async t => {
//
// here goes the code that load the fixtures (testing data) to the database
//
//this is one test, repeat for each step of your sql query
const sql = `${step_3}
SELECT * FROM step_3 WHERE .....
`;
const {rows: myResult} = await db.query(sql, [myParam]);
t.is(myResult.length, 3);
//
// here goes the code that cleanup the testing data created for this test
//
});
And in my application, I just use the final query:
import {final_sql_query_to_use_in_app} from './my-query';
db.query(final_sql_query_to_use_in_app)
The tests start with an empty database (sqitch deploy just ran on it), then each test creates its own data fixtures (this is the most time-consuming part of the test process) with UUIDs as synthetic data, so I don't have conflicts between each test's data. That makes it possible to run the tests concurrently, which is important for detecting bugs in the queries too. Also, I include a cleanup process after each test, so after the tests finish the database is empty of data again.
For SQL queries that are critical pieces, I was able to develop thousands of automated tests with this approach, in addition to combinatorial approaches: in cases where a column of a view is basically an operation over states, if you write the logic in SQL directly, you can test the combinations of states from a spreadsheet (each column is a state), fill in the expectations directly in the spreadsheet, and give the CSV version of it to the test suite to run the scenarios and check the expectations.
If you are interested on more details just ping me, I'll be happy to share more about my approach.
For testing:
Run your query/pipeline against synthetic/manual data that you can easily verify the correctness of. This is like a unit test.
Run your query/pipeline on sampled actual data (eg 0.1% of the furthest upstream data you care about). This is like an integration test or a canary. Instead of taking 0.1% of all records you might instead want to sample 0.1% of all USERID so that things like aggregate values can be sanity checked.
Compare the results of the new query to the results from the old query/pipeline. This is like a regression test. You may think this wouldn’t help for many changes because the output is expected to change, but you could run this only on e.g. a subset of columns (a sketch follows after this list).
Take the output of the new query (or sampled query, or the manual query) and feed it to whatever is downstream. This is like a conformance test.
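A rough sketch of that regression-style comparison, written against a generic DB-API connection (the key columns are whatever subset makes sense for your pipeline; names here are made up):

# Run the old and new query against the same connection and compare a subset
# of columns, sorted so the check is independent of row order.
def compare_queries(conn, old_sql, new_sql, key_columns="user_id, day"):
    def fetch(sql):
        cur = conn.cursor()
        cur.execute(f"SELECT {key_columns} FROM ({sql}) q ORDER BY {key_columns}")
        return cur.fetchall()

    old_rows, new_rows = fetch(old_sql), fetch(new_sql)
    assert old_rows == new_rows, (
        f"regression: {len(old_rows)} rows from old query vs {len(new_rows)} from new"
    )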
For reliability:
If the cost is not prohibitive, consider persisting temporary query results (eg between stages of your pipeline) for 1-2 weeks. This way if you catch a bug from a recent change you only need to rerun the part of your pipeline after the breakage. May not make sense to do if your pipeline is not big
If the cost is not prohibitive you could also run both the new and old versions of the pipeline for ~a week so that you can quickly “rollback”. Ofc whether this is viable depends on what you’re doing.
The big failure modes with SQL pipelines IME are
1. unexpected edge cases and bad data causing queries to fail (eg you manually test the new query and it works fine, but in production it fails when handling Unicode)
2. not having a plan for what to do when a bug gets caught after the fact
3. barely ever noticing bugs or lost data because nobody is validating the output (for example, if you have a pipeline that aggregates a user’s records over a day, any USERID that’s in the input data for that day should also be in the output data for that day).
4. This can be very hard to solve depending on your circumstances, but upstream changes in data are the most annoying and intractable to solve. The best case here is you either spec out the input data closely OR have some kind of testing in place that the upstream folks run before shipping changes.
To address these, you need to take the approach of expecting things to fail, rather than hoping they don’t. This is common practice in many SWE shops these days but the culture in the data world hasn’t quite caught up. I think part of the problem is that automating this testing usually requires at least some scripting/programming which is outside the comfort zone for many people who “just write SQL.”
Other languages are too complicated. :(“
Everyone today: “tries using sql
Oh wow, the tooling is quite basic, and you can’t express complex data structures and imperative code. :(“
What did you expect?
Look, I spent 4 years in this rabbit hole, and here’s my advice:
Don’t try to put the square peg in the round hole.
You want easy to write, simple code and pipelines? Just use sql.
Have a dev environment and run everything against that to verify it.
Do not bother with unit testing your CTEs, it’s hard to do, there are no good tools to do it.
If you want Strong Engineering TM, use python and spark and all the python libraries that exist to do all that stuff.
It won’t be as quick to write, or make changes to, but it will be easier to write more verifiably robust code.
If you treat either as something it is not (eg. Writing complex data structures and frameworks in sql) you’re using the wrong tool for the outcome you’re trying to achieve.
It’ll take longer and “feel bad”, not because the tool is bad, but because you’re using it in a bad way.