What do you look for when hiring an entry-level data scientist? Would a master’s in Da… by Eduardo Arino de la Rubia
Answer by Eduardo Arino de la Rubia:
I think that realistically, either one of them is enough to get me to look at a resume, but absolutely positively neither of them is enough to motivate me to make a hiring decision. I believe there is a misconception about what hiring managers are looking for in entry-level data scientists, and the atmosphere of credentialism does a great disservice to people interested in making the transition into a data science career. What I look for the most is some signal that the junior data scientist:
- Has the drive and determination to be a self-directed learner
- Understands the fundamentals of “enough” programming
- Understands how to analyze data when the goals and metrics are not explicit or time boxed
Let’s put aside the need for some level of formal training; that is a non-negotiable baseline. You have to understand enough mathematics and statistics to know when you are getting yourself into trouble, enough data management practice to know how to access data, and enough machine learning to make the appropriate series of tradeoffs in model development and validation. That is table stakes. What makes one candidate stand out above the others, however, is everything else surrounding these core concepts.
Drive and Determination to be a Self-Directed Learner
Learning in class is easier than learning on your own. You have a professor who is ostensibly paid to teach you a corpus of agreed-upon material, who does so on a schedule dictated by a syllabus, and who evaluates you using something approaching best practices. Having a stellar academic record which shows you succeed in traditional pedagogy is great, but insufficient. I need to see that you have learned and applied things outside of the traditional canon. If you come from a traditional statistics program, I want to see that you have branched out into some non-statistical approaches. If you come from a background in operations research, I want to see that you have completed some project leveraging NLP, and so on. I wholly believe that in data science one of the keys to success is recognizing when your efforts will be “leveled up” by taking the time to sharpen the blade. No one tells you that six months from now, understanding how not to overfit a GBM will be relevant; you have to have the intuition and the desire to understand it on your own. This is often showcased through talks at local meetups, blog posts, and GitHub projects. I don’t require world-class scholarship, by any stretch of the imagination, but I do require some signal that you understand that the education you were given is an arbitrarily chosen curriculum, and that the set of things outside of that curriculum is worth your scholarship.
Fundamentals of “enough” programming
I have had the good fortune of being a CS lab assistant a number of times in my life, teaching people the basics of programming. I have also mentored a number of students through introductory MOOCs for programming. An unpopular position I hold is that not everyone is capable of mustering the time/energy/interest/frustration tolerance/luck to be able to learn enough code to build things. I am not making a statement about intelligence or natural ability; I am simply arguing that learning to program is an awful journey that we have not yet learned to teach well, and that unfortunately a significant percentage of individuals who set out on the journey do not make it. Junior data scientists are squarely in the danger zone regarding this particular skill. Very few academic programs or bootcamps have enough time in their curriculums to devote sufficient energy to writing code. While a data scientist does not need to be able to engineer beautiful systems, a panel of data scientists at a recent conference agreed that they need to be able to write “about 500 lines of coherent code.” This is a nontrivial amount, well above the bar for many junior data science applicants. The strongest signal that you are capable of doing enough programming is simply a strong GitHub presence. A profile with projects in many stages of completion (some sketches, some completed and documented projects) will immediately put a junior data scientist at the top of the resume pile. To be clear, none of these projects have to be groundbreaking, but they must show that the junior data scientist is capable of progressing from idea to “working artifact.”
Another incredibly strong signal is a history of collaboration on GitHub. If a junior data scientist has contributed bug reports to open source projects, that is a strong signal that they understand a model of collaboration that is valuable and almost directly transferable to most data science organizations. Submitting issues to open source project maintainers, with reproducible examples, and in a perfect scenario with pull requests (even if they are incomplete and rejected), nearly always guarantees that a junior data scientist will receive a callback to further the discussion. This is someone who understands enough about programming that they have made it past the activation energy.
Analyze data when the goals and metrics are not explicit or time boxed.
Finally, a junior data scientist who has completed an analysis independently, and has produced quality artifacts with a compelling narrative and findings, is a top-tier candidate. Data science suffers from an ambiguity that many are uncomfortable operating in. Except in the most advanced organizations, data scientists often have to find their own way around complex, poorly documented data sources, with ambiguous goals and feedback loops, which often confound and obscure signal and any insights. A data scientist who has taken a public dataset and leveraged it in a non-intuitive way to elucidate some previously obscured insight is valuable. I look for curious people who have taken datasets and, outside of the schema of a course or top-down initiative, found time to build an analysis, clean data, find features, train models, and communicate something interesting. The “golden example” of such a thing (though he is far from a junior data scientist) would be something like David Robinson’s analysis of Trump’s tweets on Android/iPhone devices. This required collection of a dataset and nontrivial analysis, and yet it provided some wonderful insights. A blog post of this nature, a project, or an analysis shows me that you were able to deliver a data-driven insight independently. As a junior data scientist, hopefully you will be in a position to receive mentorship and direction, so you won’t have to generate all your own activation energy; however, knowing you can be trusted to do so independently is a powerful hiring signal.
I think we have reached a stage in data science hiring where there is a preponderance of junior talent attempting to make the transition. However, due to a mismatch in expectations from academic and bootcamp programs, and sadly hyperbolic market signals as to the nature and quantity of these jobs, the ratio of candidates to candidates-who-will-succeed is not great. I want signals that let me know that you will be productive in the things that they don’t teach you in school. I want to know that you understand how to be independent, how to write code, and how to drive to insights when everyone is busy and no one has time to mentor you. A master’s degree and a bootcamp certification are both signals that I will take into account, but neither is make-or-break. For me, it’s everything else around your CV that motivates me to take the conversation further.