HACKER Q&A
📣 samuell

Tips for software engineering sanity with Databricks notebooks?


Folks, what are your best tips and practices for getting some software engineering sanity into developing Databricks notebooks? In particular for larger data pipelines orchestrated from something like Azure Data Factory.

It is driving me nuts to work with auto-saved versions instead of clearly and explicitly defined commits that can be proof-read, to lose my favourite tools from my local environment (like vim), and to have trouble testing things properly.

I have already found some general hints, like modularizing the code, but I wanted to hear from people in the trenches: have you found a practice or set of practices that actually made a big difference in the development experience?


  👤 alexott Accepted Answer ✓
You can use Databricks Repos (https://docs.databricks.com/repos/index.html), specifically the files-in-repos functionality (https://docs.databricks.com/repos/work-with-notebooks-other-...), which allows you to use Python files (not notebooks!) as Python modules.
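
A minimal sketch of what that looks like, assuming a repo whose root contains a plain transforms.py (the file, function, and column names here are hypothetical):

    # transforms.py -- a plain Python file at the repo root, not a notebook
    from pyspark.sql import DataFrame
    from pyspark.sql import functions as F

    def add_revenue(df: DataFrame) -> DataFrame:
        # a pure transformation: easy to unit-test outside the notebook UI
        return df.withColumn("revenue", F.col("price") * F.col("quantity"))

    # In a notebook in the same repo, the repo root is on sys.path,
    # so the file imports like any other module:
    #   from transforms import add_revenue
    #   df = add_revenue(spark.table("sales"))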

An alternative is to split notebooks into “library notebooks” that just define transformations, and “orchestration notebooks” that use the library notebooks to execute the business logic.
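
A sketch of that split, with hypothetical notebook names; the Databricks %run command (which must sit alone in its own cell) pulls the library notebook's definitions into the orchestrator's scope:

    # library notebook "transform_lib": only defines transformations
    def clean_events(df):
        # drop duplicate events and rows missing a timestamp
        return df.dropDuplicates(["event_id"]).na.drop(subset=["timestamp"])

    # orchestration notebook, first cell (alone in its cell):
    #   %run ./transform_lib
    # a later cell then wires up the business logic:
    #   df = clean_events(spark.table("raw_events"))
    #   df.write.mode("overwrite").saveAsTable("clean_events")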

Both approaches let you test the code properly.

P.S. I have a demo of both approaches here: https://github.com/alexott/databricks-nutter-repos-demo


👤 localbolu
Quite timely post, as I'm actively experimenting with this. In my organization we're trying out Databricks Repos. TBD how that goes; as an IDE it's incredibly clunky, but it gives our data scientists access to the larger codebase. You might also consider nbdime to manage notebook diffs.

👤 mrpinklestube
If you are willing to try alternative options, try RATH: https://github.com/Kanaries/Rath. It is an open-source augmented-analytics BI tool that works well as a Databricks alternative.

👤 PaulHoule
With Jupyter you can check things into git.

You probably want to have a pre-commit script that deletes all the data and just leaves the code. Some people really hate that because they like having notebooks with results in them in the git repository to read, but if you have data mixed with your code you will have the worst time merging.
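
A minimal version of such a pre-commit script, assuming the nbformat package is available (wiring it into the actual git hook is left out):

    # strip_outputs.py: delete outputs and execution counts, keep only the code
    import sys
    import nbformat

    def strip_outputs(path):
        nb = nbformat.read(path, as_version=4)
        for cell in nb.cells:
            if cell.cell_type == "code":
                cell.outputs = []
                cell.execution_count = None
        nbformat.write(nb, path)

    if __name__ == "__main__":
        for p in sys.argv[1:]:
            strip_outputs(p)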

Tools like Databricks lumber on with inadequate version control because people are used to everything being screwed up all the time when it comes to "data science".


👤 samuell
Just learned that you can debug via pdb in notebooks: https://docs.databricks.com/_static/notebooks/python-debugge...
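
For reference, the usual IPython-style workflow should apply, since recent Databricks runtimes use an IPython kernel (the function here is hypothetical):

    # stop proactively inside driver-side code:
    import pdb

    def risky_transform(x):
        pdb.set_trace()  # pauses here; step with n, inspect with p, resume with c
        return x * 2

    risky_transform(21)

    # after a cell raises, running %debug in the next cell opens a
    # post-mortem pdb session at the failing frame:
    #   %debug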

👤 lucasterra
Databricks has Git integration; it might be worth checking out: https://docs.databricks.com/repos/index.html