One organization recently gave me a take home with the following demands:
---
Here are 150k long-form (6000+ words) documents, and a list of labels.
Please use a recent transformer model as the vectorization/representation layer to train a multilabel classifier on this data set.
You can use Colab and its free GPU tier, but we won't pay for any GPU/TPU time.
Also, please compare this solution to other algorithms (linear SVM, XGBoost) and write 1000 words about the performance tradeoffs.
---
I'm not a deep learning expert, and I assumed that transformer models are basically limited to short-form text, with the exception of the Longformer and Big Bird architectures, and I was under the impression that those are pretty memory-intensive. Other models are limited to looking at only the first 512 tokens or so. And I'm not even sure Colab's free tier can handle this.
Is this too much? Part of me is really excited to try this but another part of me is already imagining the compute time/space required to run this thing.
That means plugging the data into the tool they use all the time, adjusting a few parameters, and writing the 1000 words while waiting for the results. That's possible because they know the tradeoffs they made as they made them.
If it looks like a hard and interesting challenge, the candidate they want isn't you.
Nothing wrong with that.
They want someone who can knock this class of problems out before lunch on a Tuesday when the boss asks at 10am.
Re: the length problem, some ideas:
a) Just use the first part of each doc. For most humans, in most cases, that would be enough to do the classification.
b) Mix head and tail of each doc (i.e. cut out as much of the middle as necessary).
c) Split each doc into as many parts as needed, classify all of them, and then go with the majority vote.
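The three strategies above can be sketched in a few lines. This is a minimal illustration, not a full pipeline: it assumes whitespace "tokens" and a stand-in `classify` callable where a real setup would use a transformer tokenizer and model, and the 512 limit is the typical context size mentioned above.

```python
from collections import Counter

MAX_TOKENS = 512  # typical transformer context limit

def head(tokens, limit=MAX_TOKENS):
    """Strategy (a): keep only the first `limit` tokens."""
    return tokens[:limit]

def head_tail(tokens, limit=MAX_TOKENS, head_frac=0.5):
    """Strategy (b): keep head and tail, cutting out the middle."""
    if len(tokens) <= limit:
        return tokens
    n_head = int(limit * head_frac)
    n_tail = limit - n_head
    return tokens[:n_head] + tokens[-n_tail:]

def chunk_vote(tokens, classify, limit=MAX_TOKENS):
    """Strategy (c): classify every chunk, return the majority label."""
    chunks = [tokens[i:i + limit] for i in range(0, len(tokens), limit)]
    votes = Counter(classify(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]

# Toy demo: a 700-"token" doc and a dummy classifier that
# returns the most frequent token in a chunk as its label.
doc = ("spam " * 100 + "ham " * 600).split()
label = chunk_vote(doc, classify=lambda c: max(set(c), key=c.count))
```

For multilabel classification, strategy (c) would vote per label (e.g. keep any label predicted for more than half the chunks) rather than picking a single winner.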
Good luck with the interview!
> write 1000 words about the performance tradeoffs.
Big no. And it shouldn't be a close decision for you - this is way too much work to do for someone you aren't working for. Who in 2022 is asking for 1000 words as an explanation? And who is actually reading it???
Fascinating stuff, thanks for sharing.
(I kind of wish take homes were less common and more companies were doing LC for data science and ML)
Regarding jstx1's question of who was going to read the 1000 words: obviously, an NLP text summarization package that will spit out a number from 1 to 5.