Data science is hot! Changing your job title to data scientist on LinkedIn will trigger a small army of recruiters to swarm you with all kinds of awesome job offers. So in terms of career prospects, jumping on the data science hype train seems like a good bet. But what skills do you need in order to be a good data scientist? No wait, let’s rephrase that question: what is data science in the first place?
This might seem like a silly question, but if we dig a little deeper we find that data science does not always mean the same thing. Depending on who you ask, it can be anything from market research to hardcore artificial intelligence. Although both of these extremes build on utilizing data, the skills needed for each differ wildly.
So what does data science mean in the context of bol.com? This series of blog posts is meant to shed some light on what we, the data scientists of bol.com, think data science is and what a good data scientist should be able to do. We have identified four key aspects that should be central in the skillset of every bol.com data scientist and asked some of our team members how these aspects feature in their day-to-day work.
When you go to the bol.com website, what you see is only the tip of the iceberg. The inner workings of bol.com are a complicated web of microservices, each doing its own essential thing to keep bol.com running. Data science is part of this landscape, and whatever we think up will need to live in it. This means that programming is a daily part of our lives. In this second post of our series, we ask our data scientists how they experience this.
Without bringing your solution to reality, your practical added value is zero. We love to see our work making a difference and people using our products and services. Therefore, programming is a crucial part of a data scientist’s work. Of course, different people have different strengths and preferences. However, the least we expect is that you can create a proof of concept and then limit the possibility of misunderstanding when delegating tasks to bring it to production. You wouldn’t just instruct somebody on how to raise your baby and call it a day, would you?
In the last few years, practically every machine learning or deep learning model has become available in a package (Python, Java, R, C++, etc.), so it is easy to apply what you know. However, if you do not know how to transform your data, or even how to split it into a train and a test set in your preferred programming language, then your skills are not yet at the level that people expect. You need to know how to take your ideas from a basic proof of concept script to a stable final script that can easily be turned into a production-ready service. Tip: if you want to improve both your programming skills and your machine learning skills, try to build an algorithm from scratch without any package.
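To make the tip above concrete, here is a minimal sketch of what "from scratch" could look like, using only the Python standard library. It is an illustration rather than how we do it in production: the function names (`train_test_split`, `fit_line`, `mean_squared_error`), the 20% test fraction, and the toy data are all our own assumptions, and the "algorithm" is plain one-variable least squares.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of `rows` and return (train, test) lists.

    The default fraction and seed are illustrative assumptions.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def mean_squared_error(points, slope, intercept):
    """Average squared difference between predictions and targets."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in points) / len(points)


# Toy, noise-free data generated from y = 3x + 2 (made up for this example).
data = [(float(x), 3.0 * x + 2.0) for x in range(20)]
train, test = train_test_split(data)
slope, intercept = fit_line(train)
test_error = mean_squared_error(test, slope, intercept)
```

The error is measured on the held-out `test` rows only, never on the rows used for fitting. Once this feels natural, swapping in a package version of each step is straightforward, and you will know what the package is doing for you.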
Every handyman needs a toolkit. For a Data Scientist these are damn fast machines (hardware) and cutting-edge applications (software). To get hardware and software to work together smoothly, sound knowledge of programming is paramount. Although every Data Scientist will probably end up with Python one day, as it is regarded as the go-to programming language for Data Science, experience with another programming language (e.g. Java, R, C) can already give you a solid basis to get going. The most important thing is that you understand how to handle a machine, know its limitations, and are able to instruct it to run statistical procedures that learn from data.
For me, computers are just big calculators: a tool that I need to control to do the work I like. My primary interest is getting the correct results I need to build a minimum viable product. I expand my skills and knowledge as I go, when I need to, but it would be nice to become a full-stack Data Scientist and deliver turnkey solutions at some point.
Solving a problem conceptually and theoretically is fine, but solving it in practice is king. It is therefore important that you can translate what you have in your head into working code as a proof of concept. It may well happen that your awesome solution on paper breaks down when it comes into contact with reality, because reality is harsh and imposes serious limitations on, for example, available computing resources. Since we work with large amounts of data, available memory can become a problem even with conceptually simple solutions. How to deal with these issues is something that should always be in the back of your head when tackling a problem. Also, always remember that a proof of concept that works on your laptop is not the end of the line.
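As a small illustration of the memory point above, here is a sketch of one common way to keep a conceptually simple computation from loading everything at once: stream the values one by one and keep only running totals. It uses Welford's online algorithm for mean and variance. The `price` column, the function names, and the in-memory sample standing in for a large file are all assumptions made up for this example.

```python
import csv
import io


def read_prices(lines):
    """Yield one float per CSV row, so only one row is held at a time.

    The 'price' column name is a hypothetical example.
    """
    for row in csv.DictReader(lines):
        yield float(row["price"])


def running_mean_and_variance(values):
    """Welford's online algorithm: one pass, constant memory.

    Returns (count, mean, population variance).
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for value in values:
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
    if count == 0:
        raise ValueError("no values to summarise")
    return count, mean, m2 / count


# A tiny in-memory stand-in for a file that would not fit in memory.
sample = io.StringIO("price\n1\n2\n3\n4\n5\n")
count, mean, variance = running_mean_and_variance(read_prices(sample))
```

Because `read_prices` is a generator, the same code works on an open file handle of any size. Whether streaming, sampling, or a distributed approach fits best depends on the problem, so treat this as one option among several.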