- Perl actually was really popular for a while back then, especially in the Bioinformatics/Genomics space. It was all over that field, I think partially because it's really easy to think of a genome as just a text string of ATCGs and Perl was really convenient for manipulating text.
- I originally switched to Python for some projects because I had to do simple GUIs and visualization. Pygame and Tkinter were much nicer to deal with than Perl's options. If you were just reading data in from one file and writing it out to another, Perl was fine, but its GUI toolkits were miserable.
- Numpy ("Numeric" at the time) was what really sealed the deal and probably pushed a lot of others to switch. Perl had PDL and similar but they weren't as fast or as easy to understand and use. Even if Perl is "Naturally far more efficient" than Python, it's still slow enough that for non-trivial calculations you would not want to use it directly. You have to use one of these other libraries where the actual number crunching is happening in highly tuned FORTRAN or C libraries. Python's options back at that time just leapfrogged Perl's and Numpy had really nice integration with Pygame, which made visualization in your app really smooth. Perl might've caught back up, but that was around when Perl6 was announced and sucked all the air out of the room.
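A toy sketch of that division of labor (the array contents and operation here are arbitrary): the vectorized expression runs inside NumPy's compiled loops, while the commented-out pure-Python version would spin a million iterations in the interpreter.

```python
import numpy as np

# The crunching happens in NumPy's compiled C loops, not in the interpreter.
x = np.arange(1_000_000, dtype=np.float64)
y = np.sqrt(x) + 2.0 * x  # one vectorized expression, no Python-level loop

# Pure-Python equivalent, typically orders of magnitude slower:
#   y = [math.sqrt(v) + 2.0 * v for v in x]

print(y[1])  # sqrt(1) + 2*1 = 3.0
```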
So readability matters. And Python is one of the most readable languages out there. It remains relatively readable even as the code and data structure get very complex. Its syntax resembles math and pseudo-code more closely than other languages. It feels more like a tool of thought, not just a bunch of alien hieroglyphics you have to write to crudely and inefficiently express how you really think about the problem.
Perl, on the other hand, is one of the least readable languages for a general audience. It is not a tool of thought for non-experts. Its syntax is ugly, obscure, overly symbol-laden, and beloved only by gurus. The supposed "feature" of TIMTOWTDI just makes it more obscure.
The thing I can never get over with Perl is that it didn't even have a sane, idiomatic notation for function signatures until recently (and even today I suspect signatures aren't widely adopted). There was no such thing as just defining a function f(a, b) with multiple named formal parameters. You had to unpack the magic @_ variable by hand, and people would start talking about `shift` or some such nonsense.
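For contrast, a minimal sketch (the function name is just for illustration) of what Python gives you out of the box, with the classic Perl 5 idiom shown as a comment:

```python
# Python: named formal parameters are just part of the def syntax.
def hypotenuse(a, b):
    return (a**2 + b**2) ** 0.5

# The classic Perl 5 idiom, for contrast (no signature; unpack @_ by hand):
#   sub hypotenuse {
#       my ($x, $y) = @_;   # or: my $x = shift; my $y = shift;
#       return sqrt($x**2 + $y**2);
#   }

print(hypotenuse(3, 4))  # → 5.0
```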
This is the point at which I start saying "yes, all the language choices are Turing complete, but that doesn't mean they're all equally effective choices".
However, for data science, the biggest drawback vs Python is how you work with complex data structures. Because everything in Python is an object, (de)referencing things is generally straightforward, like:
somelist[12]['whatever']="abc"
mylen=len(somelist)
Whereas Perl devolves into a mess of sigils: when to use $ vs % vs @, wrapping references with {}, etc.:
$somelist[12]{'whatever'}="abc";
$mylen=scalar(@somelist);
(The tempting length(@somelist) is itself a classic trap: it evaluates the array in scalar context and returns the number of digits in the element count, not the count itself.)
This gets more complex as the nesting deepens, or as you add things like objects, want to iterate the keys of a dict, and so on.

Edit: A real example from the Perl data structures (perldsc) man page:
print "it turns out that $TV{$family}{lead} has ";
print scalar ( @{ $TV{$family}{kids} } ), " kids named ";
print join (", ", map { $_->{name} } @{ $TV{$family}{kids} } );
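A rough Python equivalent of the perldsc snippet above. The data is assumed to be shaped like the %TV example from that man page; note that one uniform indexing syntax covers get, len, and iterate:

```python
# Nested dict/list shaped like the %TV example in the perldsc man page.
TV = {
    "simpsons": {
        "lead": "homer",
        "kids": [{"name": "bart"}, {"name": "lisa"}, {"name": "maggie"}],
    }
}
family = "simpsons"

kids = TV[family]["kids"]  # same plain subscripting at every level of nesting
print(f"it turns out that {TV[family]['lead']} has", end=" ")
print(len(kids), "kids named", end=" ")
print(", ".join(kid["name"] for kid in kids))
```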
With pandas, you can write dataframe["A"] + dataframe["B"] and get a Series. You can make things that look like numbers or arrays but actually do something different.

Both Python and Perl made bold transitions: Python from 2 to 3, and Perl from 5 to 6.
The Python transition was difficult but ultimately successful. On the other hand, Perl rolled the dice and lost. In some alternate universe it might have been the other way around.
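On the overloading point: a minimal sketch of how a class can "look like an array" just by defining __add__, loosely in the spirit of a pandas Series (Column is a hypothetical name, not a real pandas type):

```python
# A container that applies + elementwise, the way a Series or ndarray does.
class Column:
    def __init__(self, values):
        self.values = list(values)

    def __add__(self, other):
        # a + b dispatches here, so "numeric-looking" syntax can mean
        # something entirely different (elementwise combination).
        return Column(a + b for a, b in zip(self.values, other.values))

a = Column([1, 2, 3])
b = Column([10, 20, 30])
print((a + b).values)  # → [11, 22, 33]
```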
Here's a bunch of things I was dealing with on a daily basis at the time:
* No revision control - sometimes files got deleted and research was lost
* Code that worked by accident, e.g. accessing the first element in an array using @array[$0] instead of @array[0], which worked only because $0 evaluates to "myscript.pl", a string that numifies to 0 when used as an index, so @array["myscript.pl"] returns the first element of the array.
* Some people just liked their code left-indented (read: no indentation at all anywhere)
Most of the code was Perl and Matlab. Perl was hard to read but good for text processing, not so much numerical processing; Matlab was good for numerical processing, but it was slow and bloated.

When Python came along, with batteries included and with science-friendly libraries like numpy and scikit (so not just text processing, but proper numerical processing), you could get away with coding all your stuff in Python (though a lot of stuff is still Matlab). And on top of that it was easier to learn and to read than Perl, and it was fast enough. It was a no-brainer, so newcomers stopped learning Perl.
That said, Python's biggest contribution to this world, in my opinion, was forcing people to indent their goddamn code, because, as super-smart as these science guys generally were, they were also super-stubborn.
As for me, I moved from academia into the industry and it's so much nicer to work with other software developers. Sometimes I wonder if I should stop making some billionaire richer and go back to contributing to the scientific field, but, unfortunately, I have bills to pay. Maybe when I retire.
I do come across some perl every now and then, but it looks like Bash on steroids to me (perhaps because of the $variables), and I never see any DataFrames or similar structures that look familiar to me.
Idk, Perl never occurred to me and nobody ever recommended it. Is Perl good for data science? Do you have any examples? I never ran into "efficiency problems" with Python btw; I'm not using it at that scale. At what scale would I notice this? I have run out of RAM at times, but that's usually when a dataset on disk is already larger than my RAM. But then I usually find a tool to deal with the data anyway (for example pyosmium for super large osm/o5m files).
Edit: I feel that Python also gives me other nice things to get started with, things like Django and Snakemake. This also leads me to recommend Python to other people; it's a broad basis for a lot of stuff. That's why, even though some people recommended R when I got started, I chose Python anyway. I have no regrets, unless you are going to blow my mind with some examples...
Python also got heavyweight libraries such as Numpy and Pandas, which put it in front. Perl does not have such well-known libraries as far as I know.
> Naturally far more efficient
What does this mean exactly? One big reason for Python’s success in data science is numpy, which is far more efficient (especially on large data) than vanilla Python. I’m unaware of the state of the Perl ecosystem, does it have something similar?
> how come Perl lags far behind or gets dwarfed in data science by Python?
Perl seems to be lagging behind in general, no? Is there anything specific about data science where Perl should be shining?
In cgi-script mode, Perl had to start a new process and compile all your code for each request. There was mod_perl, which was more efficient, but you frequently struggled with memory leaks and other reliability problems.
PHP came out, and then Apache Tomcat, web hosting systems for Ruby, etc., all of which had efficiency similar to mod_perl but easy and reliable environments to work in. There were also many modules in CPAN essential to web development (HTML escaping, for instance) whose bugs were not getting fixed, and that added to the feeling that Perl was slipping behind.
As for more general scripting I think people found Python was better. Even though it is cross-platform, Perl has a strong UNIX feel to it. Python doesn't feel like it belongs to Windows, UNIX or any other environment, rather it feels comfortable anywhere.
I mean, even the author of Learning Perl, said "sometimes Perl looks like line noise to the uninitiated, but to the seasoned Perl programmer, it looks like checksummed line noise with a mission in life."
However, it might be a bit too much if you are from other fields and just want to get your sums in order.
Then a whole ecosystem was built on top of numpy, with scipy, pandas, PIL, et cetera. Everything that uses n-dimensional arrays used numpy as their base and as a result all those things can be combined trivially. That's very powerful.
Then later came ipython, a much improved interactive shell, and Web based notebooks that are very useful for data science work.
That the language involved is Python isn't even important, imo. Numpy + ecosystem replaced Matlab. All Python has to be is a better language than Matlab, and it is.
One thing to consider is that scientists are not necessarily programmers. There is an overlap, especially in data science, but from my experience they tend to write terrible code. That's not meant to be dismissive; it is just a different skill set. There would be no need for professional programmers if scientists could do better, and vice versa.
And Python is very interesting in that it is actually difficult to write terrible code with it: forced indentation and clean syntax certainly help, and it heavily promotes the one "pythonic" way. Contrast that with Perl's "there is more than one way to do it" philosophy.
It is not that you can't make a huge mess with Python, but it only tends to happen at the intermediate level, like when you are starting to write libraries but have not yet reached mastery.
perl doesn’t have the numerical chops to keep up, and if it started to fix that now, it has 20 years of headwind to fight through for probably marginal gains. Good luck.
However, the community is untalented; they historically produce very bad code, and it all started with Matt's Script Archive. They create horrible sites, too: PerlMonks is a good example, and both its looks and its usability are horrible. The code posted by PerlMonks users is mostly very bad. CPAN, often pointed to as a positive, is mostly a poorly documented repository of very bad code. There are very few usable modules on CPAN.
Perl community has also written some of the worst technical books ever.
Talent attracts talent, that is the reason Perl is dead for any kind of usage and Python is popular.
For that matter, I'm not sure what you mean when you say Perl is "naturally more efficient" than Python. There's nothing about Perl that makes it easy to run faster, and the ability to write incomprehensible one-liners is not a very satisfying measure of "efficiency".
As for why people choose Python over Perl: Perl is a pain in the ass in a lot of ways, and I say this having written thousands of lines of Perl back in the day. Dollar signs on variable names? Obviously bad code turning out to be syntactically correct but doing weird stuff, because of strange, irregular legacy syntax rules? Library code being unreadable because of the aforementioned incomprehensible one-liners?
Python is bad enough about not finding errors until the code blows up in weird ways, but Perl is worse.
In this way, the statistical packages and plotting software in Python is better than Perl. I would say R is even better than Python and Perl for certain statistical analyses and quickly plotting complex data in different ways.
Perl might beat other languages in wrangling certain types of data like comma separated values or other 2D arrays in terms of writing expressive one-liners. Perhaps that is what you mean by "naturally efficient"? How much one can do in one line of code?
There is the "time to first plot" issue, and I found that the "efficient as C, easy as Python" motto really means "efficient as C XOR easy as Python", but all in all, it's very easy to write stuff very cleanly, the path to efficiency is quite natural if you know where to look, and the metaprogramming makes it that much more powerful.
System administration never fully replaced awk/sed/bash with Perl and the new wave was all configuration management, like chef and puppet.
Python was considered to be a "clean", algol-style language, so universities started teaching it twenty years ago. Only logical after teaching Pascal for decades. Students kept using Python, so now there are lots of data science projects around.
It's been a while since I was in that field, but I suspect those kinds of low-level operations are now heavily optimized in faster languages, as the sequences to operate on became longer and the operations more complex.
One of the main issues I have with perl is the "there is more than one way to do it" slogan, which, for me, means that each person who writes perl writes practically a different language. This makes the bar for starting a new project much higher, even if you have used perl before.
As noted above, Python generally has a less cryptic syntax and tends to be much easier for non-experts to read.
The main barrier to good communication in data science is the translation between the real-world goals, data, algorithms, assumptions and priors, scientific and statistical methods being applied, testing, the deployed model's performance and ongoing monitoring and management, etc., and the actual code and artifacts produced to achieve those ends.
Python wins because there is a wide ecosystem around those translation efforts.
* Visualization and statistical profiling are first class citizens, probably the killer app in terms of communicating difficult mathematical concepts
* Easy to extend, very "framework-friendly", so you can "speak the same language" between data engineering, DS, analysts, MLOps folks
* Needless to say, network effects of a community used to Pythonic idioms
If I were writing something where I actually cared about low-level performance (a 'let's see what we can get the compiler to inline and unroll' sort of code), I guess I'd
1) start by writing pseudocode
2) code it in C or Fortran.
The fact that there's a runnable version of pseudocode called Python means that often people will stop at the first step, realize computers are incredibly fast, and be happy enough with not writing the Fortran (just sprinkle in some NumPy for the crunchy bits).
Lots of cases can be handled with large calls to heavily tuned libraries anyway, where most programmers won't beat the library in C or Fortran, let alone Perl.
Quick addendum: data science != computer science. Most data scientists learn coding on top of another skillset, not as their primary area of expertise, so things like under-the-hood efficiency are often second-order concerns next to learnability, ease of use, and maintenance.
Perl is.... perl. If you read it it's not at all obvious what's happening. They don't believe in the one and only one obvious way principle.
What? No it isn't
> how come Perl lags far behind
Because nobody wants to use it, same reason it lags far behind in literally every other field in programming, too
If you actually want to understand this, learn why CGI doesn't point at Perl anymore. That was its last real stronghold.
(Only half joking, everything is dwarfed by Python in the data science community.)
> Naturally far more efficient
Most data science languages are interpreted front-ends over a few fast lower-level routines. Same with J, for instance - it has adequate performance as well.
Also because Python is readable by most software engineers and data scientists, unlike R.
[1] Correct (do what it's supposed to do). [2] Maintainable (understandable, debuggable). [3] Fast.
In that order.
That's why I won't touch it.
At the time, I did actually use perl a fair bit for data munging. Perl is a lot nicer for anything that required lots of system calls and involved a lot of pure text processing, but that only goes so far. The standard pattern was pre-process ascii data in perl, write ascii or simple binary formats out to disk, invoke some system executable (often something written in fortran) on the file you've written out, read back in the output. Perl is definitely nicer than python for that workflow. However, that workflow has severe limitations. The roundtrip to disk / stdout / etc is pretty crappy for some things.
There really weren't good numeric data containers in perl, at least that I was aware of at the time. Even before numpy, there was Numeric. Numpy/Numeric focus on C-like in-memory arrays that can be semi-directly passed into / referenced from low-level libraries. That's huge -- suddenly it's easy to manipulate large numeric datasets in memory and _maintain memory efficiency_. No linked lists, very clear rules about what creates intermediate copies, etc. You can then pass these directly into C / Fortran routines without a copy in many cases. (Okay, that last part is non-trivial, especially at the time, but very possible.)
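A small sketch of those "very clear rules" about copies, using current NumPy semantics: basic slices are views into the same buffer, fancy indexing allocates a fresh one, and the flat C-contiguous buffer is what gets handed to C/Fortran code.

```python
import numpy as np

a = np.arange(10, dtype=np.int64)

view = a[2:5]        # basic slice: no copy, shares memory with a
view[0] = 99
assert a[2] == 99    # writing through the view changed a

copy = a[[2, 3, 4]]  # fancy indexing: a fresh buffer
copy[0] = -1
assert a[2] == 99    # a is untouched

# The flat C-contiguous buffer is what low-level routines can consume.
print(a.flags["C_CONTIGUOUS"])  # → True
```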
Then there's plotting. Folks forget just how interactive matplotlib is, and was from the very early days. From the perl side, I was using gnuplot/etc (and even more of a domain specific tool called GMT). That meant static figures. Matplotlib meant I got an interactive figure and something that I could easily embed in Tk to make quick GUIs.
I also used Matlab heavily at the time, but it was pretty difficult for the things that needed to interact with everything else (read: old F77 routines and proprietary domain-specific data processing tools). Licensing was also an issue, as there were a limited number of matlab licenses, and you couldn't reliably count on being able to check one out, especially for cron-esque jobs.
Python bridged the two. You had a matlab like environment, decent data munging ability, a good language, and also a good environment for building other tools. This was all possible in Perl, in principle, but the key tools weren't there in Perl, even almost 20 years ago. Basically Perl couldn't replace Matlab easily and Python could.
So why didn't they get built in Perl instead of Python initially? I suspect the short answer is operator overloading. Python is _really_ nice for that, and it's a very nice way of having flexible array manipulation syntax. Second to that is that Python is more readable, and readability matters in the long term.
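A minimal sketch of the hooks involved (Vec is a hypothetical class, not a real library type): Python maps a + b, a * k, and a[i:j] onto __add__, __mul__, and __getitem__, which is exactly what lets a library define flexible array-manipulation syntax.

```python
class Vec:
    def __init__(self, data):
        self.data = list(data)

    def __add__(self, other):   # v + w, elementwise
        return Vec(x + y for x, y in zip(self.data, other.data))

    def __mul__(self, k):       # v * scalar
        return Vec(x * k for x in self.data)

    def __getitem__(self, idx):  # v[i] and v[i:j] both route here
        got = self.data[idx]
        return Vec(got) if isinstance(idx, slice) else got

v = (Vec([1, 2, 3]) + Vec([4, 5, 6])) * 2
print(v.data)       # → [10, 14, 18]
print(v[1:3].data)  # → [14, 18]
```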
Also, don't discount how big of a deal having Tk support by default in python is, though. Yeah, sure, these days folks completely ignore desktop GUIs, but at the time web apps were pretty irrelevant. Desktop GUIs were everything. Being able to whip up a quick reusable gui data processing application that a random lab assistant or new grad student could easily use was/is a very big deal, and that was way easier in Python than most other things, especially at the time.