Difficult Interview Question
I am interviewing candidates for a data engineering role we have. One of the most critical questions I ask is:
"How can you transfer a file to another machine?"
I can't get anyone in an interview to answer this. I never get sftp, scp, rsync, email, USB, NAS, or S3 buckets/gsutil. Nothing. Nope.
I want to get into cool topics, parallel transfers, etc., but nope.
Help. Is this question dated?
Is it relevant to the role? Has someone in that role/with that title performed that type of work in recent memory? If so, it's not dated.
Is this one of those file-under (pun intended) things where Gen-whatever doesn't know anything about files or filesystems because of their iPhone/mobile upbringing?
Now I'm really curious what kind of answers you're getting. If literally nothing, sounds like this is a great filter (but also makes me wonder what kind of candidates you're choosing to talk to).
If nothing else, I would assume they have used GitHub, or would say "drop it in Teams". Even if they have only ever used the cloud, there is "send a link from my Google Drive or OneDrive". My first answer would be FTP or rsync. Maybe they are overthinking it and assume they need to get it onto another machine without the owner knowing about it, IDK. Maybe it needs to be reworded. Or follow up if there is no answer: ask how they turned in their homework to professors, or how they collaborate with other developers on a project.
dang maybe I should pivot into data engineering. I'm great at that stuff. What is data engineering?
Maybe phrase it with a bit more specificity: "How can you transfer a file between pipelines, services, etc.?"
I suppose some people might freeze because it's so vague and they're wondering what you're getting at. Personally, I'd say, "Well, in most cases I use rsync or scp, but it depends on the situation."
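To make that concrete, here's a minimal sketch of what I'd usually reach for, assuming plain ssh access and using a made-up host name and paths:

    # one-off copy over ssh
    scp ./report.csv user@remotehost:/data/incoming/

    # rsync over ssh: can resume, and only re-sends what changed on retries
    rsync -avP ./report.csv user@remotehost:/data/incoming/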
Then I'd tell the story about how one time in the mid-90s I needed to transfer a huge (probably 100MB at the time) file between a Unix system and a Windows system in our office. After FTP, Samba, and a couple of other methods ran painfully slow for unknown reasons, I discovered that a DCC transfer in IRC flew at top speed.
But I'm not desperate for a job and trying not to say the wrong thing.
> I can't get anyone in an interview to answer this. I never get sftp, scp, rsync, email, USB, NAS, or S3 buckets/gsutil. Nothing. Nope.
They literally don't respond at all? To be fair, I'd presume it's some sort of trick question, but a few follow-up questions would resolve that quickly: how big is the file? Are there any security/privacy rules involved? Is there something unusual about the computers, and are they local "machines", cloud VMs, etc.? Here I am practicing brainteasers and leetcode, and I could get ahead by knowing you can send a file in an email.
Not dated, but with zero context like that, I'd assume it was a trick question trying to see if I can come up with types of scenarios where sneakernet outperforms digital transfers.
Seems like a great filter.
A while ago I was interviewing candidates for a senior frontend position. My filter question was to explain how to make a progress bar. Most candidates couldn't do this; one said it was "not what they were expecting" and that they were just expecting leetcode problems.
(for non frontend people, it's just a styled box in a box....)
Something bothers me about these questions. In the real world, when you're solving a problem, you have so much context. These questions are like waking up from a coma and you're in a video game and you don't even know the rules.
Obviously in the real world you need to ask follow up questions sometimes, but you have at least some context for orientation.
Sounds like a great weed out question if part of the job is moving data around.
What do they say when you ask?
Do people not start asking questions, like: over what medium? Is there direct IP connectivity, or NAT/a firewall in between? How long's the link? How big is the file? I would try to set some parameters if people aren't asking or are struggling.
data engineering doesn't do data transfers... different subjects altogether
I would answer it like this:
For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.
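The rough shape of it, with a made-up host and credentials just to illustrate (and in practice it's exactly as clunky as it sounds):

    # mount the FTP account as a local directory (host and credentials are invented)
    mkdir -p ~/ftpmount
    curlftpfs ftp://user:password@ftp.example.com/ ~/ftpmount

    # then keep a repository on the mount and sync files through that
    svnadmin create ~/ftpmount/repo
    svn checkout "file://$HOME/ftpmount/repo" ~/work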
Maybe if they are having problems, offer a clearly sub-optimal way to do it, and try to get them to talk about why it's not great or what better ways there might be. That might help uncover their thought patterns or understanding. Such options could be FTP, an email attachment, or the body of a single HTTP POST.
They should be able to come up with thoughts about security, speed, reliability, etc.
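For instance, the HTTP option could be as crude as this (the endpoint is made up):

    # push the raw file as the request body (curl -T uploads the file as-is)
    curl -T ./bigfile.csv http://example.com/upload

    # or as a multipart form POST
    curl -F "file=@./bigfile.csv" http://example.com/upload

A decent candidate should quickly spot what's missing: no auth, no encryption, nothing to resume if it dies halfway through.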
Seems like a fantastic filter question to me. It's basic, it's not an unfair or trick question, there are multiple valid options that the candidate can respond with - you're not fishing for your single pet solution - and a strong candidate could demonstrate ability by going into more depth and mentioning multiple options and explaining when one option would be preferred vs others. Keep asking it.
If you're seeing a high rate of candidates who can't answer it at all, that suggests that the population of candidates applying for your roles is a really poor fit for the role.
Maybe it's worth exploring whether there are ways you could change how the roles are advertised to reach a different population of candidates, or perhaps ask this basic question earlier in your hiring pipeline -- e.g. as an automated screening question that's a prerequisite to interviewing with humans.
Why not word it slightly differently:
“Tell me about a time you moved a file from one machine to another…”
Then have follow-ups ready to go to introduce constraints or issues that would explore the depth of their knowledge. Say for example, “What if that file had been 1000x bigger?” Or whatever.
I think you’ll get more of what you’re looking for with this approach.
These kinds of discussions remind me of those times running a TTRPG when there's a puzzle to solve and 4+ engineering students can't figure out a back-of-the-cereal-box type puzzle.
So there is something going on that makes problem solving in that kind of situation very challenging.
It's like it's hard to concentrate when the lizard brain is playing bongos on the Big Red Danger button it has access to.
I once asked a senior developer, "How will you figure out which class/method is causing the server to crash?" They gave me all sorts of answers other than "I'll add appropriate log statements and check the logs."
Anyone who has actually worked for a while with large data sets should be able to answer your question. But since it "sounds simple", it probably causes most folks to freeze or go off on a tangent.
Unfortunately, because of how our industry treats interviews, people "prepare" for it and that puts even good folks in the mindset of prepared answers rather than practical answers.
You know what, just wait. When the person comes along you will know it.
If I'm trying to transfer a public ssh key for the first time I sometimes use netcat ;)
(Never nc straight into >> ~/.ssh/authorized_keys, though. Write to another file first so you can check it's actually your key, just to be sure.)
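A sketch of the shape of it, with a made-up host and port (and note that flag syntax varies between netcat variants; some want nc -l 9001 without -p):

    # on the receiving machine: listen and write to a scratch file, NOT authorized_keys
    nc -l -p 9001 > /tmp/incoming_key.pub

    # on the sending machine
    nc receiving-host 9001 < ~/.ssh/id_ed25519.pub

    # inspect /tmp/incoming_key.pub, then append it to ~/.ssh/authorized_keys yourself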
Another machine, or ... another IP address [that is virtually a machine in a data center]?
This is because your question is an infrastructure question, not a data engineering question.
Sure it would be better if the engineers knew about infrastructure, but they don't have to know about it.
And honestly, the weeds of physical data transfer between computers are actually very complex, and I think data engineers should leave it to the pros rather than screwing around with something that's guaranteed to fail on security, performance, or caching.
Data engineers should know about how to deal with data content, not data transport. If that's what your role is about, you should re-brand it.
How large is the file? Past a certain size it's faster to move it by truck than over the available internet channel.
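Rough back-of-the-envelope with assumed numbers: 100 TB over a fully saturated 1 Gbps link is already more than a week of transfer time, which is where shipping drives starts to win.

    # 100 TB ~ 8 * 10^14 bits; 1 Gbps ~ 10^9 bits/s
    echo $(( 100 * 8 * 10**12 / 10**9 / 86400 ))   # ~9 days, ignoring all overhead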