Data science is hot! Changing your job title to data scientist on LinkedIn will trigger a small army of recruiters to swarm you with all kinds of awesome job offers. So in terms of career prospects, jumping on the data science hype train seems like a good bet. But what skills do you need in order to be a good data scientist? No wait, let's rephrase that question: what is data science in the first place?
This might seem like a silly question, but if we dig a little deeper we find that data science does not always mean the same thing. Ask different people and it can be anything from market research to hardcore artificial intelligence. Although both of these extremes build on utilizing data, the skills needed for either discipline vary wildly.
So what does data science mean in the context of bol.com? This series of blog posts is meant to shed some light on what we, the data scientists of bol.com, think data science is and what a good data scientist should be able to do. We have identified four key aspects that should be central to the skill set of every bol.com data scientist, and asked some of our team members how these aspects feature in their day-to-day work.
This first blog post delves into Machine Learning. Machine Learning provides an extensive toolbox of mathematical methods and models that allow us to make predictions based on data. This ability to predict rather than describe is what sets data science apart from other data-driven departments within bol.com. Any future bol.com data scientist should therefore be familiar with the basics of Machine Learning. But how do we put this Machine Learning toolbox into practice? Our team members will elaborate on this.
Machine Learning and Deep Learning are nowadays alluring and misleading terms. Most people will tell you that some algorithms can solve everything by learning everything themselves. And yes, there are algorithms, like neural nets, that let a machine iteratively learn how to represent the transformation from input to output. Yes, these algorithms can represent almost any function. But you don't need a shotgun to kill a mosquito. In most cases a simple newspaper can kill that mosquito just as well. And chances are it will be less hassle to arrange for a newspaper than it will be to arrange for a shotgun. It all depends on your problem, your data, and your features. Sometimes simple, boring logistic regression can kill the mosquito. So it is good to have intuition about the models and to know what they expect as input and output. Make an educated guess based on your knowledge, which can come from statistics, calculus, linear algebra, or just plain experience. And, maybe, in some cases you do need the shotgun.
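To make the "newspaper first" idea concrete, here is a minimal sketch (using scikit-learn and synthetic data, so the dataset and numbers are purely illustrative) of establishing a plain logistic regression baseline before reaching for anything heavier:

```python
# A simple, "boring" logistic regression baseline on synthetic data.
# Only if this underperforms does it make sense to reach for the shotgun.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for a real problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the baseline and check how far it already gets you.
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"logistic regression accuracy: {baseline.score(X_test, y_test):.2f}")
```

If the baseline is already good enough for the business problem, you are done; if not, its score is the bar any fancier model has to beat.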
Machine Learning, to me, is the process in which you teach a program to make decisions by feeding it meaningful data and a bunch of statistical procedures. In this process different questions arise. Why do we want a machine to make this decision rather than a human? What data do we have available, what data is useful, and how do we make it applicable to our use case? Which statistical models can be used and which can't? How do we prevent bias and variance from creeping into our trained model? How do we evaluate the performance of a machine's decision? Can you tune that performance? And if it is performing well, how do you implement such a machine-learned model in a scalable and robust way? These are all questions that need to be asked when putting Machine Learning into practice. And as a Data Scientist you might not be an expert on all of these subjects, but you should be able to carry this process through from start to finish.
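Several of the steps above (splitting the data, guarding against overfitting, tuning, and evaluating) can be sketched in a few lines. This is a hedged illustration with scikit-learn on synthetic data, not a prescription of any particular bol.com pipeline:

```python
# Sketch of the train / tune / evaluate loop: cross-validated tuning on the
# training set, final evaluation on data the model has never seen.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tune a hyperparameter with 5-fold cross-validation, which guards against
# variance creeping in from overfitting to a single lucky split.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"max_depth": [2, 5, None]},
    cv=5,
)
search.fit(X_train, y_train)

print(f"best params: {search.best_params_}")
print(f"held-out accuracy: {search.score(X_test, y_test):.2f}")
```

The deployment question (serving the model in a scalable, robust way) starts where this snippet ends.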
I think the main purpose of a Data Scientist is to identify the nuggets of value within a vast sea of data. Being able to understand and reason with the data within the context that it has been created in is critical. The tools and models that you use to harvest the value from data at scale are to some extent irrelevant, but of course they should be properly understood and applied. Within a Data Science project, gathering (and exploiting) insights is at least as important in tackling a problem as boosting the performance of a model. Therefore I prefer to use transparent models in the early stages of a project: they give me maximum information on the task at hand, which I can use to reach the end goal faster.
I am not formally trained in Machine Learning. I come from an astronomy background, and although I have tons of experience with data and analyses, I only came into contact with actual Machine Learning relatively recently. Because of that I have a less theoretical outlook on Machine Learning and rely more on intuition. Understanding the concept behind a methodology or technique is much more important to me than knowing the mathematical details. Apart from that, I try to work as data-driven as possible. Your model is only ever going to be as good as the data you put in, so understanding the potential and limitations of your data is just as important to me as knowing which model to use.
Nowadays, applying a Machine Learning algorithm to figure out a certain pattern behind a dataset can be as easy as executing one line of code. However, just because there is an easy "solution" does not mean that this is all there is to it. For instance, how can it be that a simple log transformation of the number of orders increases the forecasting accuracy by a stunning 10%? Isn't this something that your favorite "Swiss-army-knife" algorithm should figure out by itself? How about which features to include? Does the old adage "the more, the better" always hold? Our approach at bol.com is that we do not adapt the problem to "The" algorithm; rather, we choose the best algorithm for "The" problem at hand. Because making the wrong choices or not fully understanding your problem can have consequences. If you are bad at programming, your code does not run and you do not create an impact. However, if you are bad at statistics, your conclusions could be wrong and you might create a negative impact!
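Why can a log transform of the target help so much? Order counts are typically skewed, with multiplicative rather than additive noise, which simple models handle poorly on the raw scale. The sketch below illustrates the effect on synthetic, hypothetical data (it does not reproduce the 10% figure mentioned above):

```python
# Illustration: fitting a linear model on a skewed, exponentially growing
# target versus on its log transform. The log scale turns multiplicative
# noise into additive noise, which linear models handle far better.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(1000, 1))
# Hypothetical "orders" that grow exponentially with the feature,
# with multiplicative noise.
orders = np.exp(0.5 * X[:, 0] + rng.normal(0, 0.3, size=1000))

X_tr, X_te, y_tr, y_te = train_test_split(X, orders, random_state=42)

# Fit once on the raw target, once on log(orders); back-transform the latter.
raw_model = LinearRegression().fit(X_tr, y_tr)
log_model = LinearRegression().fit(X_tr, np.log(y_tr))

mae_raw = mean_absolute_error(y_te, raw_model.predict(X_te))
mae_log = mean_absolute_error(y_te, np.exp(log_model.predict(X_te)))
print(f"MAE raw target: {mae_raw:.2f}, MAE log target: {mae_log:.2f}")
```

The point is not the specific numbers but the mechanism: the "Swiss-army-knife" algorithm does not know your target is multiplicative; you have to tell it, and that requires understanding your problem.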