HACKER Q&A
📣 mardiyah

Why is Perl so dwarfed in data science by Python?


Quite baffling. Naturally far more efficient, how come Perl lags far behind or gets dwarfed in data science by Python?


  👤 thraxil Accepted Answer ✓
As someone who wrote a lot of Perl in the late 90's and early 2000's and moved to Python for scientific work (simulations, etc.) as well as other tasks (web apps, etc), what I saw was:

- Perl actually was really popular for a while back then, especially in the Bioinformatics/Genomics space. It was all over that field, I think partially because it's really easy to think of a genome as just a text string of ATCGs and Perl was really convenient for manipulating text.

- I originally switched to Python for some projects because I had to do simple GUIs and visualization. Pygame and Tkinter were much nicer to deal with than Perl's options. If you were just reading data in from one file and writing it out to another, Perl was fine, but the GUI toolkits were miserable.

- Numpy ("Numeric" at the time) was what really sealed the deal and probably pushed a lot of others to switch. Perl had PDL and similar but they weren't as fast or as easy to understand and use. Even if Perl is "Naturally far more efficient" than Python, it's still slow enough that for non-trivial calculations you would not want to use it directly. You have to use one of these other libraries where the actual number crunching is happening in highly tuned FORTRAN or C libraries. Python's options back at that time just leapfrogged Perl's and Numpy had really nice integration with Pygame, which made visualization in your app really smooth. Perl might've caught back up, but that was around when Perl6 was announced and sucked all the air out of the room.


👤 civilized
Data scientist here. Readable code is important in data science. People rely on our products to be based on solid numbers and logic, sometimes without any form of external validation. We can't just scribble line noise in a REPL until we get some output that looks vaguely reasonable. We need to be able to actually read the code and know that what it's doing makes sense.

So readability matters. And Python is one of the most readable languages out there. It remains relatively readable even as the code and data structure get very complex. Its syntax resembles math and pseudo-code more closely than other languages. It feels more like a tool of thought, not just a bunch of alien hieroglyphics you have to write to crudely and inefficiently express how you really think about the problem.

Perl, on the other hand, is one of the least readable languages for a general audience. It is not a tool of thought for non-experts. Its syntax is ugly, obscure, overly symbol-laden, and beloved only by gurus. The supposed "feature" of TIMTOWTDI just makes it more obscure.

The thing I can never get over with Perl is that it didn't even have a sane, idiomatic notation for functions until recently (and even today I suspect it's not widely adopted). There was no such thing as just defining a function f(a,b) of multiple named formal arguments. You had to use some special magic variable and people start talking about shift operators or some such nonsense.

This is the point at which I start saying "yes, all the language choices are Turing complete, but that doesn't mean they're all equally effective choices".


👤 tyingq
I do love Perl.

However, for data science, the biggest drawback vs Python is how you work with complex data structures. Because everything in Python is an object, (de)referencing things is generally straightforward, like:

  somelist[12]['whatever']="abc"
  mylen=len(somelist)
Where Perl devolves into a mess of sigils, when to use $ vs % vs @, wrapping references with {}, etc.

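  $somelist[12]{'whatever'}="abc";
  # scalar(@somelist) is the element count; length(@somelist) would
  # return the string length of that count instead
  $mylen=scalar(@somelist);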
This gets more complex as the nesting deepens, or as you add things like objects, want to iterate keys of a dict, and so on.

Edit: A real example from the perl data structures (perldsc) man page:

  print "it turns out that $TV{$family}{lead} has ";
  print scalar ( @{ $TV{$family}{kids} } ), " kids named ";
  print join (", ", map { $_->{name} } @{ $TV{$family}{kids} } );
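For comparison, here is a rough Python sketch of the same kind of lookup. The data is made up for illustration (it is not taken from perldsc); only the shape, a dict of families each holding a lead and a list of kid records, follows the snippet above.

```python
# Illustrative data only; the layout mirrors the perldsc snippet
# (a dict of families, each with a "lead" and a list of "kids").
tv = {
    "flintstones": {
        "lead": "fred",
        "kids": [{"name": "pebbles", "age": 4}],
    },
    "simpsons": {
        "lead": "homer",
        "kids": [
            {"name": "bart", "age": 11},
            {"name": "lisa", "age": 8},
            {"name": "maggie", "age": 1},
        ],
    },
}


def describe(family):
    """Build the sentence that the Perl prints assemble piece by piece."""
    record = tv[family]
    kids = record["kids"]
    names = ", ".join(kid["name"] for kid in kids)
    return f"it turns out that {record['lead']} has {len(kids)} kids named {names}"


for family in tv:
    print(describe(family))
```

The nesting is the same depth as in the Perl version; the difference is that indexing uses one bracket syntax throughout and no explicit dereferencing.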

👤 PaulHoule
Python is not a good language for low-level numerics the way FORTRAN is, but it has good facilities for wrapping foreign language libraries (written in C, FORTRAN, CUDA, etc.) that do high-level operations. There is not just the foreign function interface but also the operator overloading that makes it possible to write something like

   dataframe["A"] + dataframe["B"]
and get a Series. You can make things that look like numbers or arrays but actually do something different.
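A minimal sketch of the mechanism, assuming nothing about how pandas is actually implemented: the `Column` class below is a hypothetical toy, and a real library would hand the loop to compiled code rather than run it in Python.

```python
class Column:
    """Toy stand-in for a dataframe column (hypothetical, not the pandas API)."""

    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # Python calls this method for `self + other`.
        if not isinstance(other, Column):
            return NotImplemented
        if len(self.values) != len(other.values):
            raise ValueError("columns must have the same length")
        # Element-wise addition; a real library would do this in C or Fortran.
        return Column(a + b for a, b in zip(self.values, other.values))

    def __repr__(self):
        return f"Column({self.values})"


a = Column([1, 2, 3])
b = Column([10, 20, 30])
total = a + b
print(total)
```

The caller only ever writes `a + b`; what that expression does is decided entirely by the class.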

Both Python and Perl made bold transitions, Python from 2 to 3 and Perl from 5 to 6.

The Python transition was difficult but ultimately successful. On the other hand, Perl rolled the dice and lost. In some alternate universe it might have been the other way around.


👤 creativemonkeys
I'm a software developer who worked with research assistants and scientists for 10 years on large scale multinational projects at an Ivy League university. It's getting better these days, but at the time, those folks didn't have the time to learn proper software development - they had a ton of pressure to deliver results and publish papers.

Here's a bunch of things I was dealing with on a daily basis at the time:

  * No revision control - sometimes files got deleted and research was lost
  * Code that worked by accident, e.g. accessing the first element in an array using @array[$0] instead of @array[0], which worked only because $0 evaluates to "myscript.pl", Perl treats that non-numeric string as 0 when it is used as an index, and so @array["myscript.pl"] returns the first element in the array.
  * Some people just liked their code left-indented (read: no indentation at all anywhere)
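For contrast, a small sketch of how the same slip behaves in Python, where a list rejects a string index instead of coercing it to 0. The variable names are made up, and `script_name` is just a stand-in for Perl's $0.

```python
items = ["first", "second", "third"]
script_name = "myscript.pl"  # hypothetical value standing in for Perl's $0

try:
    value = items[script_name]  # a str is not a valid list index
except TypeError as exc:
    value = None
    message = str(exc)
    print(f"rejected: {message}")
```

The mistake still only shows up at run time, but it shows up as an error rather than as a plausible-looking result.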
Most of the code was Perl and Matlab. Perl was hard to read, but it was good for text processing, not so much numerical processing, and Matlab was good for numerical processing, but it was slow and bloated.

When Python came along, with batteries included and with science-friendly libraries like numpy and scikit (so not just text processing, but proper numerical processing) you could get away with coding all your stuff in Python (though a lot of stuff is still Matlab). And on top of that it was easier to learn and to read than Perl and it was fast enough. It was a no-brainer, so newcomers stopped learning Perl.

That said, Python's biggest contribution to this world, in my opinion, was forcing people to indent their goddamn code, because, as super-smart as these science guys generally were, they were also super-stubborn.

As for me, I moved from academia into the industry and it's so much nicer to work with other software developers. Sometimes I wonder if I should stop making some billionaire richer and go back to contributing to the scientific field, but, unfortunately, I have bills to pay. Maybe when I retire.


👤 teekert
I got into Python because I outgrew Excel (and Origin) and started with Jupyter notebooks and Pandas. The Pandas functions read_excel and DataFrame.to_excel combined with how visual a notebook is made the transition very easy for me.

I do come across some perl every now and then but it looks like Bash on steroids to me (perhaps because of the $variables), and I never see any DataFrames or similar structure that looks familiar to me.

Idk, Perl never occurred to me and nobody ever recommended it. Is Perl good for data science? Do you have any examples? I never ran into "efficiency problems" with Python btw, I'm not using it at that scale, at what scale would I notice this? I have run out of RAM at times, but that's usually when a dataset on disk is already larger than my RAM. But then I usually find a tool to deal with the data anyway (for example pyosmium for super large osm/o5m files).

Edit: I feel that Python also gives me other nice things to get started with, things like Django and Snakemake. This also leads me to recommend Python to other people, it's a broad basis for a lot of stuff. That's why even though some people recommended R when I got started, I chose Python anyway. I have no regrets, unless you are going to blow my mind with some examples...


👤 kasperset
Perl is expressive but the code can be hard to read. In general, Python is readable because it enforces the indentation and other language design choices.

Python also got heavyweight libraries such as Numpy and Pandas which put it in front. Perl does not have such well known libraries as far as I know.


👤 dahart
I love perl for regex scripts, where I need to quickly filter or transform a text file. I never liked it for other kinds of programming projects, for some reason to me it doesn’t feel as well suited to, say, numeric simulations.

> Naturally far more efficient

What does this mean exactly? One big reason for Python’s success in data science is numpy, which is far more efficient (especially on large data) than vanilla Python. I’m unaware of the state of the Perl ecosystem, does it have something similar?

> how come Perl lags far behind or gets dwarfed in data science by Python?

Perl seems to be lagging behind in general, no? Is there anything specific about data science where Perl should be shining?


👤 PaulHoule
Circa 2000 I did a lot of unix scripting with Perl and also wrote cgi-scripts for the web with Perl.

In the cgi-script mode Perl had to start a new process and compile all your code for each request. There was "mod_perl" which was more efficient but frequently you struggled with memory leaks and other reliability problems.

PHP came out, and then Apache Tomcat, web hosting systems for Ruby, etc., all of which had efficiency similar to mod_perl but were easier and more reliable environments to work in. Generally there were many modules in CPAN that were essential to web development (HTML escaping) for which bugs were not getting fixed, and that added to the feeling that Perl was slipping behind.

As for more general scripting I think people found Python was better. Even though it is cross-platform, Perl has a strong UNIX feel to it. Python doesn't feel like it belongs to Windows, UNIX or any other environment, rather it feels comfortable anywhere.


👤 macksd
Python has a very simple and consistent and unsurprising syntax compared to Perl. I think most programmers from other languages can look at Python and feel like they're reading pseudo-code that they actually understand for most operations. So for people coming from a mathematical or scientific background, it unlocks computing abilities without having to learn a bunch more knowledge from another domain. Add to that the success of libraries like NumPy and SciPy that round out the capabilities of the language itself and make many practical tasks very accessible, and it's just a nuke-from-orbit type situation for most other languages.

I mean, even the author of Learning Perl said "sometimes Perl looks like line noise to the uninitiated, but to the seasoned Perl programmer, it looks like checksummed line noise with a mission in life."


👤 forinti
PDL is just awesome, but I feel that Perl is a programmer's programming language. If you come from Linux, shell scripting, C/C++, etc, you'll probably handle Perl well.

However, it might be a bit too much if you are from other fields and just want to get your sums in order.


👤 Scarblac
Python got numpy fairly early on. It is fast and powerful.

Then a whole ecosystem was built on top of numpy, with scipy, pandas, PIL, et cetera. Everything that uses n-dimensional arrays used numpy as their base and as a result all those things can be combined trivially. That's very powerful.

Then later came ipython, a much improved interactive shell, and Web based notebooks that are very useful for data science work.

That the language involved is Python isn't even important, imo. Numpy + ecosystem replaced Matlab. All Python has to be is be a better language than Matlab, and it is.


👤 GuB-42
I love Perl too, but it can very quickly become a "write only" language. Which is fine for what I make of it.

One thing to consider is that scientists are not necessarily programmers, there is an overlap, especially in data science, but from my experience, they tend to write terrible code. It is not to be dismissive, it is just a different skill set, there would be no need for professional programmers if scientists could do better and vice versa.

And Python is very interesting in that it is actually difficult to write terrible code with it. Forced indentation and clean syntax certainly help, and it also heavily promotes the one "pythonic" way. Contrast that with Perl's "there is more than one way to do it" philosophy.

It is not that you can't make a huge mess with Python, but it only tends to happen at the intermediate level, like when you are starting to write libraries but have not yet reached mastery.


👤 hprotagonist
tooling and network effects.

perl doesn’t have the numerical chops to keep up, and if it started to fix that now, it has 20 years of headwind to fight through for probably marginal gains. Good luck.


👤 tokai
Is Perl even close to being on a top ten list of languages that make sense for data science? Not to be nasty to the OP, but the question would maybe have made sense 10+ years ago. What's baffling now is seeing Perl as a real competitor to anything.

👤 greenthrd_farse
The Perl community is horrible, that is the real reason. Criticisms of Perl's readability and the other meme criticisms of the language are easily debunked myths. The Perl language's semantics and logic are fantastic and easily readable after you LEARN it. The Perl interpreter is fantastic, performs great and has very helpful warnings and strictures.

However the community is untalented. They have historically produced very bad code; it all started with Matt's Script Archive. They create horrible sites, and PerlMonks is a good example: it looks horrible and its usability is horrible. The code posted by PerlMonks users is mostly very bad. CPAN, which is often pointed to as a positive, is just a mostly poorly documented repository of very bad code. There are very few usable modules on CPAN.

Perl community has also written some of the worst technical books ever.

Talent attracts talent, that is the reason Perl is dead for any kind of usage and Python is popular.


👤 neom
I'm not a programmer but I did get my start in tech because I installed Mandrake on my computer for fun in the 90s and couldn't figure out how to format the drive back so I could reinstall Windows. I was accidentally forced to learn Linux, and with that things like bash and Perl. I feel like people who customize their bashrcs, tweak their iptables and are generally comfortable scripting are the people who gravitate towards Perl, but for developers and data science folks, Python is much more accessible in terms of finding out how to apply the language to those types of problems. When I thought about learning some more data science a few years ago, the vast majority of the tutorials on YouTube etc. were for people shifting from apps like Excel to things like SQL and Python; I didn't see anything about Perl.

👤 Hizonner
C is more efficient than both of them. For that matter, so are tons of other languages.

For that matter, I'm not sure what you mean when you say Perl is "naturally more efficient" than Python. There's nothing about Perl that makes it easy to run faster, and the ability to write incomprehensible one-liners is not a very satisfying measure of "efficiency".

As for why people choose Python over Perl, Perl is a pain in the ass in a lot of ways, and I say this having written thousands of lines of Perl back in the day. Dollar signs on variable names? Obviously bad code turning out to be syntactically correct but doing weird stuff, because of strange irregular legacy syntax rules? Library code being unreadable because of the aforementioned incomprehensible one-liners?

Python is bad enough about not finding errors until the code blows up in weird ways, but Perl is worse.


👤 darrenf
A lot of this ground was covered when PDL - the Perl Data Language, born 1996 - was submitted to HN in its own right last summer: https://news.ycombinator.com/item?id=27439638

👤 j7ake
Data science involves building many statistical models, visualizing complex data structures in many ways, and summarizing results into "pretty" figures.

In this respect, the statistical packages and plotting software in Python are better than Perl's. I would say R is even better than Python and Perl for certain statistical analyses and quickly plotting complex data in different ways.

Perl might beat other languages in wrangling certain types of data like comma separated values or other 2D arrays in terms of writing expressive one-liners. Perhaps that is what you mean by "naturally efficient"? How much one can do in one line of code?


👤 rosetremiere
Not much more than an anecdote, but I found working with julia quite a lot of fun. It was not data science, but since julia is more or less aimed at data science, I'd predict that my positive experience means julia is a very good choice there too.

There is the "time to first plot" issue, and I found that the "efficient as C, easy as Python" motto really means "efficient as C XOR easy as Python", but all in all, it's very easy to write stuff very cleanly, and the path to efficiency is quite natural if you know where to look, and the metaprogramming makes it that much more powerful.


👤 exploderate
I used Perl for LAMP apps in the 90s. Perl lost webdev to PHP (some Ruby) and science stuff to Python. Fans of OOP/functional style programming went to Ruby.

System administration never fully replaced awk/sed/bash with Perl and the new wave was all configuration management, like chef and puppet.

Python was considered to be a "clean", algol-style language, so universities started teaching it twenty years ago. Only logical after teaching Pascal for decades. Students kept using Python, so now there are lots of data science projects around.


👤 gnujosh
Actually, Perl was a quite natural choice for a while for some computational biology / bioinformatics workloads. In particular, defining scripts where you expect a sequence (like DNA) as input and a filtered or modified sequence as output allowed for processing pipelines that just flowed nicely: script1 | script2 | final_script.
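A rough sketch of that filter-style pipeline, written as Python generators so the stages chain the way the piped scripts do. The function names, the minimum-length rule and the sample input are all invented for illustration; a real script would iterate over `sys.stdin` and print each result.

```python
# Translation table for complementing bases (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def filter_sequences(lines, min_length=4):
    """Stage 1: keep cleaned, upper-cased sequences made only of A, C, G, T."""
    for line in lines:
        seq = line.strip().upper()
        if len(seq) >= min_length and set(seq) <= set("ACGT"):
            yield seq


def reverse_complement(sequences):
    """Stage 2: emit the reverse complement of each incoming sequence."""
    for seq in sequences:
        yield seq.translate(COMPLEMENT)[::-1]


# Made-up input standing in for lines read from a file or stdin.
raw = ["acgt\n", "ACGN\n", "at\n", "GGGTTT\n"]

# Chaining generators plays the role of `script1 | script2`.
result = list(reverse_complement(filter_sequences(raw)))
print(result)
```

Each stage consumes one sequence at a time, so nothing has to be held in memory beyond the current line, which is much of what made the shell pipeline version pleasant.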

It's been a while since I was in that field, but I suspect those kinds of low level operations are now heavily optimized in faster languages as sequences to operate on became longer and operations more complex.


👤 dbi
In my opinion, as someone who worked a short time in perl, perl is so much harder to learn, especially coming from other programming languages (which most data scientists learn if they started from a degree in computer science/engineering).

One of the main issues I have with perl is the `there is more than one way to do it` slogan, which, for me, means that each person who writes perl writes practically a different language. This makes the bar for starting a new project much higher, even if you have used perl before.


👤 musicale
Probably the same reasons Python is more popular than Perl in nearly every other context.

As noted above, Python generally has a less cryptic syntax and tends to be much easier for non-experts to read.


👤 kthejoker2
Data science is primarily about communication, both of your analysis to other people, including laypeople, domain experts, other data scientists, ops folks ... and those same people communicating with you about their needs, important context, expectations of SLAs or constraints ...

The main barrier to good communication in data science is the translation between the real world goals, data, algorithms, assumptions and priors, scientific and statistical methods being applied, testing, the deployed model's performance and ongoing monitoring and management, etc. ... and the actual code and artifacts produced to achieve those ends.

Python wins because there is a wide ecosystem around those translation efforts.

* Visualization and statistical profiling are first class citizens, probably the killer app in terms of communicating difficult mathematical concepts

* Easy to extend, very "framework-friendly", so you can "speak the same language" between data engineering, DS, analysts, MLOps folks

* Needless to say, network effects of a community used to Pythonic idioms


👤 uberman
Is perl more efficient than python? I don't know perl at all but the head to head performance searches I just did seem to suggest it is not.

👤 soueuls
I think efficiency is not that important. When I was doing research in China, we would use python to iterate quickly and visualize what we were doing. Once we were done, either performance was not really an issue, or we would just translate the code to C++ anyway.

👤 bee_rider
I'm ignorant of Perl, so it is possible that this is off base, but:

If I were writing something where I actually cared about low-level performance (a 'let's see what we can get the compiler to inline and unroll' sort of code), I guess I'd

1) start by writing pseudocode

2) code it in C or Fortran.

The fact that there's a runnable version of pseudocode called Python means that often people will stop at the first step, realize computers are incredibly fast, and be happy enough with not writing the Fortran (just sprinkle in some NumPy for the crunchy bits).

Lots of cases can be handled with large calls to heavily tuned libraries anyway, where most programmers won't beat the library in C or Fortran, let alone Perl.


👤 superkuh
Perl had a mis-step with Perl "6" that caused a lot of the userbase to drift off.

👤 afarrell
Because it was used by a newsroom in Lawrence, Kansas and they wrote a web framework in python called Django. Their journalistic culture led Django to have really great documentation, which led to an influx of more casual developers who had an expectation of great documentation to make up for the fact that their main expertise was elsewhere. This made libraries with good docs more successful, making it easier for university courses to choose python as an initial language. This made it easier for university labs to agree on python as a language. This led Travis Oliphant & friends to develop numpy etc.

👤 ProofByAccident
Does Perl have mature equivalents for pandas and sklearn (setting aside what everyone else is saying about numpy)? The python ecosystem has a bunch of killer apps that make the workaday tasks of data science extremely ergonomic. R is similar with the tidyverse imo, but I don’t know of other languages with a comparable package landscape.

Quick addendum: data science != computer science, most data scientists learn coding on top of another skillset, not as their primary area of expertise, so things like under-the-hood efficiency are often second order concerns to learn-ability, ease of use and maintenance.


👤 drbwaa
Readability/accessibility matter. Readability/accessibility matter A LOT when a large portion of the userbase has no formal background in programming.

👤 tonetheman
It is write only and super hard to read for mere mortals. That alone is probably the main reason. Python is just so much easier to parse for humans.

👤 DonHopkins
Why are lead pipes and asbestos not popular any more?

👤 rgavuliak
Perl doesn't have the tooling/library ecosystem Python does for Data Science. Additionally, a lot of Data Science people come from maths/stats and Python is easier to begin with. In my experience most Data Scientists aren't full blown devs since you focus on different things (business aspects of what you do vs scalability).

👤 eternityforest
Python spends most of its time in C extensions for most applications, so performance isn't as big of an issue.

Perl is.... perl. If you read it it's not at all obvious what's happening. They don't believe in the one and only one obvious way principle.


👤 JohnHaugeland
> Naturally far more efficient

What? No it isn't.

> how come Perl lags far behind

Because nobody wants to use it, same reason it lags far behind in literally every other field in programming, too

If you actually want to understand this, learn why CGI doesn't point at Perl anymore. That was its last real stronghold.


👤 brnt
Perl? What year is this?

(Only half joking, everything is dwarfed by Python in the data science community.)


👤 nathias
the bottleneck isn't the machine running code, it's the people reading the code

👤 cozzyd
Before Python, Matlab was the most widely used language for the same use case, but numpy/matplotlib essentially allows you to write Matlab in Python, and Python is much more ergonomic than Matlab for "business logic"

👤 stillbourne
Perl was my first language in 2001. I now know 10 different languages and perl is the worst one to read and my second least favorite language I know right after php. I would rather eat my own vomit than ever use perl again.

👤 zaptheimpaler
Perl is what I am most proficient in, and have already completed various AI projects with, but my colleagues tell me it will be worth it to learn how to program in python, even though I will be set back in the short term.

👤 perth
This doesn’t answer your question, but Python has no equivalent for Perl Pie (perl -pi -e) or the other inline terminal features, so for shell one-liners Perl is still heavily relevant because Python doesn’t offer the same functionality.

👤 wyuenho
Well, one reason could be that for the longest time perl couldn't distinguish an int from a long when doing computation. Perl is just not designed for any kind of math.

👤 vmchale
Numpy probably.

> Naturally far more efficient

Most data science languages are interpreted with a few higher-level routines. Same with J, for instance - it has adequate performance as well.


👤 mcdermott
Because Perl is cryptic and hard to learn, unlike Python, so students are taught Python and take that knowledge with them to work.

👤 Fiahil
Because no one wants to write Perl?

Also because Python is readable by most software engineers and data scientists, unlike R.


👤 GnarfGnarf
Priorities for code:

[1] Correct (do what it's supposed to do). [2] Maintainable (understandable, debuggable). [3] Fast.

In that order.


👤 oceanghost
Does Perl still have a hand-written parser because it won't conform to a BNF?

That's why I won't touch it.


👤 jofer
I started using python for data science-esque tasks back in late undergrad/early grad school before python had really caught on for scientific computing (i.e. ~2003-2005, back before numpy proper, when numeric and/or numarray were the containers of choice).

At the time, I did actually use perl a fair bit for data munging. Perl is a lot nicer for anything that required lots of system calls and involved a lot of pure text processing, but that only goes so far. The standard pattern was pre-process ascii data in perl, write ascii or simple binary formats out to disk, invoke some system executable (often something written in fortran) on the file you've written out, read back in the output. Perl is definitely nicer than python for that workflow. However, that workflow has severe limitations. The roundtrip to disk / stdout / etc is pretty crappy for some things.

There really weren't good numeric data containers in perl, at least that I was aware of at the time. Even before numpy, there was numeric. Numpy/numeric focus on c-like in-memory arrays that can be semi-directly passed into / referenced from low-level libraries. That's huge -- suddenly it's easy to manipulate large numeric datasets in memory and _maintain memory efficiency_. No linked lists, very clear rules about what creates intermediate copies, etc. You then can pass these directly into C / Fortran routines without a copy in many cases. (Okay, that last part is non-trivial, especially at the time, but very possible.)
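To make the "c-like in-memory array" point concrete without pulling in numpy, here is a sketch that uses the standard-library `array` module and `memoryview` as stand-ins. This is an analogy for the idea, not how numpy or numeric are implemented, and the values are made up.

```python
from array import array

# One contiguous block of C doubles, rather than a list of boxed objects.
samples = array("d", [0.5, 1.5, 2.5, 3.5])

# memoryview exposes the underlying buffer without copying it, which is
# the same idea that lets array libraries hand data to C or Fortran code.
view = memoryview(samples)
window = view[1:3]   # a slice over the same memory, not a copy
window[0] = 10.0     # writes through to `samples`

print(samples.itemsize, len(samples), view.nbytes)
print(samples.tolist())
```

Because the slice shares memory with the original, the rule about what does and does not create an intermediate copy is explicit, which is the property being praised above.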

Then there's plotting. Folks forget just how interactive matplotlib is, and was from the very early days. From the perl side, I was using gnuplot/etc (and even more of a domain specific tool called GMT). That meant static figures. Matplotlib meant I got an interactive figure and something that I could easily embed in Tk to make quick GUIs.

I also used Matlab heavily at the time, but it was pretty difficult for the things that needed to interact with everything else (read: old F77 routines and proprietary domain-specific data processing tools). Licensing was also an issue, as there were a limited number of matlab licenses, and you couldn't reliably count on being able to check one out, especially for cron-esque jobs.

Python bridged the two. You had a matlab like environment, decent data munging ability, a good language, and also a good environment for building other tools. This was all possible in Perl, in principle, but the key tools weren't there in Perl, even almost 20 years ago. Basically Perl couldn't replace Matlab easily and Python could.

So why didn't they get built in Perl instead of Python initially? I suspect the short answer is operator overloading. Python is _really_ nice for that, and it's a very nice way of having flexible array manipulation syntax. Second to that is that Python is more readable, and readability matters in the long term.

Also, don't discount how big of a deal having Tk support by default in python is, though. Yeah, sure, these days folks completely ignore desktop GUIs, but at the time web apps were pretty irrelevant. Desktop GUIs were everything. Being able to whip up a quick reusable gui data processing application that a random lab assistant or new grad student could easily use was/is a very big deal, and that was way easier in Python than most other things, especially at the time.


👤 SavantIdiot
What is Perl's version of numpy?

👤 yuppie_scum
Python is easier

👤 oversocialized
Perl was already going the way of the dodo in late 2000s. Perl 6 finished it off.