I recently quit academia (social sciences) and I plan to transition into a data science career within the next year or so. This is the part I liked about my academic job and the stuff I am good at (statistics, analytical problem solver, data wrangling/modeling).
Reading through job ads the technical skill palette for DS seems overwhelming: python (+pandas, scikit, ...), R, docker, k8s, PowerBI, Tableau, PostgreSQL, DevOps/pipelines, different cloud providers, maybe add some javascript, various ML toolkits (Tensorflow, etc.).
I have 15 years of experience as a statistician (R/STATA/Linux/SQL) and a sabbatical year in front of me. Which new skill(s) should I learn/prioritize?
Edit: I have a PhD.
Thanks!
Skills for someone like you to work on - Python, the Python data ecosystem, machine learning, deep learning, being a good software developer.
Things not to worry about right now - Kubernetes, DevOps, cloud providers.
You don't need JavaScript. Don't learn any Tableau/PowerBI and don't apply for jobs that require them unless you want a more analytics/business intelligence focused role. Or do learn them if you want to go in that direction but those jobs are quite different even if they have the same job title.
(If a job description asks for TensorFlow/PyTorch and PowerBI/Tableau, it means that they have no idea what they're looking for whatsoever.)
Maybe I should have started there - figure out if you want a more analytics/product/decision-making kind of role or more of an applied ML kind of role and then focus on that skillset. For the applied ML kind of data science job, you need the skills that I listed, for the other kind you need the stats background that you already have, some SQL, much less coding and a couple of BI tools.
- python DS ecosystem; fundamentals: numpy pandas matplotlib seaborn sklearn scipy. From there you can branch in many different directions - interactive visualization libraries (e.g. plotly / bokeh), stats / probability stuff (statsmodels / pymc3), NLP, sklearn addons, ML explainability, ...
- solid "software engineering" - writing good code, unit tests, documentation, logging, basics of deploying a service
- TF / pytorch if you want to get into the deep learning hype
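To make the "software engineering" bullet above a bit more concrete, here is a minimal sketch of what a documented, logged, unit-tested helper might look like. The function name `clip_to_range` and its behaviour are made up for illustration and are not from any particular library; only the standard `logging` and `unittest` modules are used.

```python
import logging
import unittest

logger = logging.getLogger(__name__)


def clip_to_range(values, lower, upper):
    """Return a new list with each value clipped into [lower, upper].

    `values` is expected to be a list of numbers.
    Raises ValueError if lower is greater than upper.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    clipped = [min(max(v, lower), upper) for v in values]
    # Log how many values were changed so that surprises show up early.
    n_changed = sum(1 for a, b in zip(values, clipped) if a != b)
    logger.info("clipped %d of %d values", n_changed, len(values))
    return clipped


class ClipToRangeTest(unittest.TestCase):
    def test_values_outside_bounds_are_clipped(self):
        self.assertEqual(clip_to_range([-5, 0, 3, 99], 0, 10), [0, 0, 3, 10])

    def test_invalid_bounds_raise(self):
        with self.assertRaises(ValueError):
            clip_to_range([1], 5, 1)
```

Saved as a module, the tests can be run with `python -m unittest <module name>`. The point is less this particular function than the habit: a docstring, a log line, and a test for both the normal case and the error case.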
Best of luck, and more importantly, enjoy!
The thing about data sciencey stuff is that the data you're working with will often be extraordinarily messy. It's not uncommon to realize two weeks into a project that you made a terrible cleanup assumption on day one... then have to run hours and hours of mostly-hand-executed ETL work again. And then again when you realize you forgot that you ran a unix one-liner on a random input file, but didn't write it down anywhere.
So, one of the most important things to learn off the bat is learning how to clean data programmatically, specifically with the goal of making sure that your cleanup is repeatable at any step of the way. You want to be able to get to a point where you feel confident that you can mostly trivially recover after deleting all cache/temp files, tables, etc. Makefiles are great for this.
You'll save a lot of time in the long run if you can get good at this.
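As a rough illustration of repeatable cleanup (a sketch only, in Python rather than a Makefile), the idea is that every derived file can be deleted and regenerated from the raw input by one function call. The column names `name` and `age`, the file name `clean.csv`, and the rule of dropping rows with a non-numeric age are all hypothetical.

```python
import csv
import shutil
from pathlib import Path


def clean_row(row):
    """Apply the cleanup rules to one raw row; return None to drop it."""
    name = row["name"].strip().title()
    age = row["age"].strip()
    if not age.isdigit():
        return None  # hypothetical rule: drop rows whose age is not a number
    return {"name": name, "age": int(age)}


def rebuild(raw_path, out_dir):
    """Regenerate all derived files from the raw input, from scratch."""
    raw_path, out_dir = Path(raw_path), Path(out_dir)
    # Derived data is disposable: wipe it and rebuild, never patch by hand.
    shutil.rmtree(out_dir, ignore_errors=True)
    out_dir.mkdir(parents=True)
    out_path = out_dir / "clean.csv"
    with raw_path.open(newline="") as src, out_path.open("w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=["name", "age"])
        writer.writeheader()
        for row in csv.DictReader(src):
            cleaned = clean_row(row)
            if cleaned is not None:
                writer.writerow(cleaned)
    return out_path
```

Because the raw file is never modified and every rule lives in `clean_row`, changing a bad day-one assumption means editing one function and calling `rebuild` again.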
- Data Scientist MIGHT mean Applied ML Scientist (important distinction)
- Data Scientist MIGHT mean Data Engineer
- Data Scientist MIGHT mean ML Engineer
- Data Scientist MIGHT mean Data Analyst
- Data Scientist MIGHT mean Statistician
- Data Scientist MIGHT mean Product Analyst
The traditional idea of a data scientist for the last decade or so is someone who is able to do insight extraction, create models (ML and otherwise) and build dashboards and presentations. In practice this has mostly proved impractical and little value is being extracted from the role, so a mature organization will properly break out the responsibilities into the functional areas mentioned above.
I think it's super important that you reach out to industry data scientists at companies or in domains you're interested in and ask them what it is that they actually do on the daily. Be careful with most data science roles as they are really just data analyst roles in disguise.
Very few true data "science" roles exist, and I'd argue that you might not actually want to work in those roles since they likely exist in companies that have no idea what they want out of them.
That being said I think the 2 absolutely crucial technical skills to have for almost any modern data related job will be:
- Python + Pandas
- SQL
That's really going to be the technical foundation to make a data career in industry.
The breadth of other skills and required knowledge is too much for any single person or post to tell you.
I'm not entirely sure what your PhD is in, but if it has an associated domain in industry and if you're actually interested in that domain, I would recommend starting there and seeing what Data Scientists with similar academic backgrounds as you might be doing. LinkedIn is a great place to find people and connect!
Best of luck.
Where are you based, btw? I have a ton of Parallel Computing experience (PhD too). My weakness is on the math/stats side. Happy to give more specific advice. The market is hot right now... don't delay.
Self study is not a job.
It's attractive because it's easier than looking for a job.
Or to put it another way, job hunting is the key skill for any career transition.
Good luck.
Aside from programmatic and cloud tools as identified in your post, one of the biggest hurdles is whittling down your academic CV into a resume. Spending time re-framing your academic accomplishments in the short form will be the best time investment for getting in for interviews. I ended up following the Google XYZ resume formula: https://www.inc.com/bill-murphy-jr/google-recruiters-say-the... It kind of hurts to distill your academic achievements into "Published [X] peer reviewed papers [Y] by driving the analysis [Z]", but I think it really helped me start getting calls vs. desk rejects. Relatedly, only include publications that either highlight your expertise for a specific job posting or highlight your expertise in statistics in a way that could set you apart from other candidates.
Python is the main language to focus on because it's great for working with data and for general scripting needs (working with files, etc). For data, it covers everything from basic data access to a lot of the math and statistics that you would find in R.
I would recommend some JavaScript so that you have the ability to easily read it, and because I feel that learning multiple languages improves your skills overall in each language. It doesn't have to take long: perhaps a few days of focused learning (or even a day or less).
Recently I have been working on something that uses a lot of statistics, analytics, etc., and have been using Python the most for it, plus some SQL, as all major databases now support PERCENT_RANK and other statistical functions. For my project I'm using a little JavaScript for work such as web scraping, but most of the work is done in Python and the final reporting is in SQL.
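For anyone who hasn't used the percent-rank window function mentioned above, here is a small sketch driven from Python. It uses SQLite through the standard `sqlite3` module purely as a stand-in database, and assumes the bundled SQLite is version 3.25 or newer (the first release with window functions). The `scores` table and its contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?)",
    [("a", 10.0), ("b", 20.0), ("c", 30.0), ("d", 40.0), ("e", 50.0)],
)

# PERCENT_RANK() is (rank - 1) / (number of rows - 1) within the window.
rows = conn.execute(
    """
    SELECT name, score, PERCENT_RANK() OVER (ORDER BY score) AS pct_rank
    FROM scores
    ORDER BY score
    """
).fetchall()

for name, score, pct_rank in rows:
    print(name, score, pct_rank)
```

With five distinct scores the ranks come out evenly spaced from 0.0 to 1.0. Other databases expose the same function with essentially the same `OVER (...)` syntax, though details such as tie handling are worth checking in each one's documentation.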
Python is also great for Machine Learning. Even for basic API access with TensorFlow I prefer using Python over their JavaScript API.
Good luck with your new career!
In terms of prioritisation, Python and general software engineering skills multiply your ability to deliver stuff but also learn and experiment with new libraries. If all you knew was logistic regression or gradient boosting, but you could deliver a whole pipeline and insights on top of it, you’d be extremely valuable to most businesses. You’re also set to fill those roles in future startups you found yourself, where needed.
It’s true you can also do world class research and never leave Jupyter - that’s valid too, but probably more hit and miss as a career.
Generally the highest value for your background will be Python (do a bit of leetcode), Pandas and Sklearn. PyTorch would also complement your existing skill set.
A lot of the other stuff, like cloud, DevOps etc., you can just learn on the job. It all differs by where you work. It’s also the shiniest stuff that you can get distracted by and waste time on for less value.
- Think about what are your strongest applicable skills
- Talk to as many DSes/MLEs/MLOps engineers as you can
- Experiment with various fields (watch videos, OSS work etc)
- Find companies that you actually want to work for
- Find out what DS means at those places, what they do, and whether that's what you want to do. Ask as many questions about the details of the job as possible at the interview; the hiring manager will be glad you want to avoid getting a job you are not committed to.
Essentially: Do your homework then make a decision that is good for _you_ rather than trying to fit to an abstract idea.
> The Amazon Scholars program has broadened opportunities for academics to join Amazon in a flexible capacity, in particular part-time arrangements and sabbaticals.
https://www.amazon.science/scholars#:~:text=Amazon%20Scholar....
Source: https://www.kaggle.com/code/nomilk/data-science-language-and...
That aside, I work with many data scientists at clients. Most companies are still Windows shops. Some may get Linux. I've never seen a data science group using STATA successfully. R is ok, but in my experience is falling out of favor rapidly. While R is great for data science, its relevance as the glue for other things is not so great. I would softly advise Python.
I wouldn't bother learning viz tools like tableau beyond basic familiarity enough to slap it on your resume as something you've touched. It's all company specific. Same for the cloud and pipeline shit.
Think hard about what kinds of problems you want to solve. Most problems, that is, every single problem I have worked on, are poor fits for neural networks. PyTorch has never been relevant. Real world business problems just don't really benefit from that kind of stuff all that much. A real data science value add is picking up the low hanging fruit by being smarter about decisions that used to be made on gut instinct or whatever. Unless you really want to work on computer vision or whatever, it's just not something you need to bother with. I typically end up using a LightGBM model (gradient-boosted decision trees) for pretty much everything at the end of the day, which is basically just a fancier take on a random forest.
Many data science shops fail to achieve anything because the data scientists are too complacent. Be a business person. Show a willingness to engage on problems and grill business folks about how they make decisions, and discuss how your model could change that process to add value. Don't make book reports on your findings and expect them to figure out how to use them. It's so, so common to have data scientists who don't feel qualified to take that part of the job on, and so they build nice models and visuals that everyone applauds and that then collect dust on a shelf. Every output should clearly dictate a path to generating value, be it money or some other worthy metric.
Showing that you're someone who can use their data science to solve real problems will be much more appealing in interviews than being someone who can say they do data science things the best. IMO, a good question to ask would be "I want to make sure I'm joining a group that has the power and support to really influence how decisions get made. Can you give some examples of the work that the business has adopted from your outputs?". That works both from a virtue signaling perspective and out of a genuine desire to avoid joining a back office data science skunkworks that nobody listens to.
edit: I'm on the east coast, and not interested in working for big tech, for context. All of my examples here are from experience working with "normal" companies from dozens of industries, but no west coast tech.
(1) Do the people hiring really have an actual real need?
(2) Do these people really know what skills they need or are they asking for all the skills they can think of to have confidence that the people they hire will have enough skills, no matter what the need?
(3) Given a long list of skills, there are nearly no jobs that actually need and will use all those skills. In a job that uses only a few of them, the set of skills that the person still knows and is proficient with will soon shrink. Then what happens to the career, that is, to the ability to have a chance to get hired from job ads that have a long list of skills?
(4) Don't forget that one reason for job ads with a collection of skills that is long and particular is because the people placing the ad already have a candidate for the job who claims to have those skills and someone is wondering how easy it would be to hire someone else with those skills.
(5) My experience is strongly that nearly no one in business with enough money to hire technical people actually (a) has work that they understand and that needs technical skills and (b) knows what skills are needed.
(6) Really, if they are recruiting for a long list of skills, then likely they are offering at best a version of gig work and not a real job and career.
(7) Learning a long list of skills prior to getting a job seems to mean there is something wrong with the job being applied for.
(8) My strong experience is that any very productive use of any specialized skills will totally torque off and threaten essentially everyone in the management chain all the way, literally, to the BoD.
(9) At Easter dinner this year, I learned a good lesson: There actually are employers who are ready, willing, eager, and able to hire people with just good, basic abilities and some background in some technical area for a job they have where: (a) They actually do have important work to be done and do understand the importance of that work. (b) They want their people to attack the work and learn any necessary material as they go. They don't ask the people they hire to have a long list of skills before they are hired.
(10) I'd suggest doing what you have to do to get a basic income now and otherwise start a business. In that business, learn the skills the business needs. Typically for the technical skills, the list might be particular to the business but otherwise is not very long.
Lesson: In the end, in nearly all of the US economy, to have a lot of money to pay people to have good jobs there has to be a business founder who found some important problem to solve, got a good solution, and got a lot of people to pay for that solution. So, BE one of those people.