HACKER Q&A
📣 TheAlchemist

Python – Pandas usage in production, best practices


I'm wondering if and how people are using Pandas (in Python) in production, and if so, what kind of tools / best practices you follow?

I use Pandas all the time for data exploration and quick dashboarding, but as soon as I start writing more 'production ready' code I find it quite hard.

Some drawbacks:

- type hinting is nearly useless - every method transforming dataframes ends up looking like: def foo(df_input: pd.DataFrame) -> pd.DataFrame

- column types - there is no type enforcement, which can lead to subtle or not-so-subtle bugs (e.g. date vs datetime)

- naming consistency - since column names are just strings, it's easy to end up with inconsistent names across the codebase

- to use a function you basically have to read its body first to understand what it will do with a dataframe (which columns it uses and how); the sketch below illustrates this
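
To make the last two points concrete, here is a hypothetical transformation (the column names are made up) whose signature says nothing about its real contract:

```python
import pandas as pd

def add_age_days(df_input: pd.DataFrame) -> pd.DataFrame:
    # The signature promises only "DataFrame in, DataFrame out". The real
    # contract is hidden in the body: the input must have a 'signup_date'
    # column of datetime dtype, and a new 'age_days' column is added.
    out = df_input.copy()
    out["age_days"] = (pd.Timestamp.now() - out["signup_date"]).dt.days
    return out
```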

I know there are packages for specifying dataframe schemas (pandera, typedframe); one can also wrap dataframes in custom classes and enforce some checks that way.
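
For reference, a minimal pandera sketch (column names made up; the exact API varies a bit between pandera versions, but DataFrameSchema / Column / Check are the core pieces):

```python
import pandera as pa

# Hypothetical schema: validation happens at run time, so a wrong dtype
# or a negative amount fails loudly instead of propagating silently.
events_schema = pa.DataFrameSchema({
    "user_id": pa.Column(int),
    "signup_date": pa.Column("datetime64[ns]"),
    "amount": pa.Column(float, pa.Check.ge(0)),
})

# raw_df is a placeholder for whatever frame you loaded upstream.
validated = events_schema.validate(raw_df)  # raises SchemaError on mismatch
```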

Any pointers or experiences on best practices here would be highly appreciated.


  👤 bradwood Accepted Answer ✓
You seem to be hung up on typing here. But, with care, you can make code solid without necessarily relying on types, which in Python's case are, AFAIK, a static-analysis safety net rather than run-time protection.

I think a lot of this comes down to how you structure your code, which is often some kind of data pipeline. Wherever possible, I try to write my pandas code in a functional style -- lots of pure functions that take a dataframe and return a derivative dataframe with only a single simple transformation applied (ideally immutably, if that is not too expensive).
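
For instance (a sketch, with made-up column names), each step copies and returns rather than mutating its argument:

```python
import pandas as pd

def drop_incomplete(df: pd.DataFrame) -> pd.DataFrame:
    # Pure: returns a new frame, leaves the input untouched.
    return df.dropna(subset=["user_id", "amount"]).copy()

def add_revenue_eur(df: pd.DataFrame, fx_rate: float) -> pd.DataFrame:
    # One simple transformation per function.
    out = df.copy()
    out["revenue_eur"] = out["amount"] * fx_rate
    return out
```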

Chaining these together can be done with a functional library, or a simple implementation of `pipe()` or `pipeeither()`[1].
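
pandas itself ships a built-in `DataFrame.pipe` that covers the simple case; chaining the two sketched functions above might look like this (the error-handling `pipeeither` variant from the linked thread is not shown):

```python
# raw_df is again a placeholder for the frame loaded upstream.
result = (
    raw_df
    .pipe(drop_incomplete)
    .pipe(add_revenue_eur, fx_rate=0.92)
)
```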

The main advantage here is that testing all the functions that go into the pipeline becomes much easier. No side effects, no mocking, and everything is very easy to test with `pytest` or similar.

Spending time writing the tests might seem laborious, but it's worth it: you can make them robustly exercise all kinds of conditions in the inputs, including empty dataframes and weird data (NaNs, etc.).
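
A sketch of what such a test could look like for the hypothetical `add_revenue_eur` above:

```python
import numpy as np
import pandas as pd

def test_add_revenue_eur_handles_nans_and_empty():
    df = pd.DataFrame({"amount": [10.0, np.nan]})
    out = add_revenue_eur(df, fx_rate=2.0)
    assert out["revenue_eur"].iloc[0] == 20.0
    assert np.isnan(out["revenue_eur"].iloc[1])  # NaN propagates, no crash

    # An empty frame should pass straight through without raising.
    empty = pd.DataFrame({"amount": pd.Series([], dtype=float)})
    assert add_revenue_eur(empty, fx_rate=2.0).empty
```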

Finally, keep all charting completely away from the data-manipulation functions and treat it separately. Charts are a PITA to write automated tests for and probably just need manual eyeballing, so make the plotting code as simple as possible, with no data manipulation inside it.
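
e.g. a plotting helper (hypothetical column names again) that accepts an already-prepared frame and does nothing but render:

```python
import matplotlib.pyplot as plt

def plot_revenue(df):
    # No data manipulation here: the frame arrives fully prepared.
    ax = df.plot(x="signup_date", y="revenue_eur", kind="line")
    ax.set_ylabel("Revenue (EUR)")
    plt.show()
```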

Not a silver bullet, I'm afraid, but it does work, provided you are thorough and structured about it.

[1]: https://news.ycombinator.com/item?id=34406420