HACKER Q&A
📣 Graffur

Where did Jupyter notebooks come from?


Jupyter notebooks are widely used in the AI/ML space, by data scientists as well as programmers.

I don't have much experience with them beyond some ML tutorials. Where did they come from, and how did they come about? The "SDLC" seems unlike any other programming I have done.

EDIT: Another way to phrase my question: why do it that way?


  👤 jimmyvalmer Accepted Answer ✓
Jupyter is just IPython, Python's command loop or REPL (read-eval-print loop), repackaged and remarketed, with the important addition of the JSON notebook format, which saves the session and redisplays it all pretty. Wolfram's Mathematica has had something like this since the '90s.
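
To make that concrete: an .ipynb file is just a JSON document you can open with anything. A rough sketch of poking at one from plain Python (the file name here is made up):

    import json

    # A notebook is a JSON document: some metadata plus a list of cells,
    # each carrying its source text and any outputs that were displayed.
    with open("analysis.ipynb") as f:
        nb = json.load(f)

    print(nb["nbformat"])                # format version, e.g. 4
    for cell in nb["cells"]:
        print(cell["cell_type"])         # "code" or "markdown"
        print("".join(cell["source"]))   # what was typed into the cell
        # code cells also carry an "outputs" list with whatever was rendered

That saved-outputs part is what lets a notebook redisplay the whole session without re-running anything.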

Data scientists, née statisticians, primarily need to play with data, so a fast feedback loop is much preferred (as opposed to having to compile). Their end goal is a bunch of graphs, which they publish to academia or present to management. Robust, modular code, which is the programmer's goal, is not really something they care about, since once the paper is published, it's on to the next one.
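
Roughly what that workflow looks like, with made-up file and column names; each line would typically be its own cell, re-run and tweaked until the graph looks right:

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("sales.csv")      # cell 1: load the data
    df.describe()                      # cell 2: eyeball the distributions (displayed inline in a notebook)
    df.groupby("region")["revenue"].sum().plot(kind="bar")   # cell 3: the graph
    plt.show()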


👤 PaulHoule
I remember using Mathematica notebooks in the 1990s:

https://www.wolfram.com/technologies/nb/

These are pretty much the same as Jupyter notebooks.

Sometimes I say that Jupyter is the worst thing except for Excel from a software engineering perspective.

Entangling code and data makes a wonderful output product, but it is a nightmare to check that kind of thing into version control. Some particular problems are:

* Many kinds of analysis and modelling should be repeatable. For instance, there might be a "weekly sales report" or a process to build a text analysis model that gets run periodically. Jupyter notebooks don't really give you the tools to do that easily (see the sketch after this list).

* The model of starting at the top and working down runs into many problems, such as:

  - Some of the steps might take a long time (e.g. spend 24 hours training a deep network)

  - Some of the steps connect to an external database that you don't always have access to, or need a lot of internet bandwidth to process

  - Some of the steps connect to data which has privacy concerns
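
On the repeatability point from the first bullet: the usual workaround is to execute the notebook headlessly on a schedule. A rough sketch, assuming nbconvert is installed and with made-up file names:

    import subprocess

    # Run the notebook top to bottom without opening it, and keep the
    # executed copy as this week's report artifact.
    subprocess.run(
        [
            "jupyter", "nbconvert",
            "--to", "notebook",
            "--execute",
            "--output", "weekly_sales_report_run.ipynb",
            "weekly_sales_report.ipynb",
        ],
        check=True,
    )

It works, but at that point you're scripting around the notebook rather than being helped by it.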

Commercial data processing jobs are well modeled as a directed acyclic graph of steps. In some cases you want to compute those steps on demand, and in other cases you want to freeze a result but keep its provenance documented. A Jupyter notebook is not a good place to run a 24-hour training job, but it's a great place to thaw out the resulting model and put it through its paces.
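
A rough sketch of that thaw-and-probe step; the file name and the model's predict() interface are assumptions here (something pickled from scikit-learn, say):

    import pickle

    # The 24-hour training job ran elsewhere and froze its result to disk;
    # the notebook just loads the artifact and pokes at it interactively.
    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    sample = [[5.1, 3.5, 1.4, 0.2]]    # a hand-picked input to probe with
    print(model.predict(sample))       # put the model through its paces
    print(getattr(model, "feature_importances_", "not available"))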