Wasting money with data science
Investing money in the latest hype is normal. Wasting it once the hype has passed is dangerous.
The majority of the companies around me are wasting money with data science.
They see overwhelming evidence that data science[1] is changing sectors and creating new business opportunities. Just looking at the Dutch landscape, there is no doubt that teams around us are using data science to create value. Off the top of my head: Bol.com, Uber (Eats), Booking.com, ING, NPO, Marktplaats, Quby, etc.
But for each of them, there's a handful of companies that are not successful and are, in fact, wasting their resources on data science.
It all starts with not understanding what data science can do to add value to the bottom line and, most importantly, which enablers are needed to make it possible.
Let's look at an example. I am sure you can adapt it to your company.
A hospital wants to predict, based on the information gathered during intake, whether patients entering the emergency room will be hospitalized.
The prediction allows the hospital to better plan the resources of the various departments. This will, in turn, lead to cost savings.
Some data is gathered, given to data scientists, and — after two weeks — the first demo takes place. The results are promising, but they need a bit more time.
Fine. After all, the data was messy: they had to clean it up and go back to the source a couple of times.
Two weeks pass and the new results are even nicer. With 70% accuracy, they can predict if a patient will go home after their visit to the emergency room.
This is much better than random (50%)! A full-fledged pilot starts.
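Whether 70% really is "much better than random" depends on how the outcomes are split: the 50% figure only holds if about half of the patients end up hospitalized. A minimal sketch, assuming a made-up 80/20 split (illustrative numbers, not hospital data), compares the model with a baseline that always predicts the most common outcome:

```python
from collections import Counter


def majority_baseline_accuracy(outcomes):
    """Accuracy of always predicting the most frequent outcome."""
    counts = Counter(outcomes)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(outcomes)


# Hypothetical intake history: 80 patients went home, 20 were hospitalized.
outcomes = ["home"] * 80 + ["hospitalized"] * 20

baseline = majority_baseline_accuracy(outcomes)
model_accuracy = 0.70  # the figure from the demo in the story

print(f"baseline: {baseline:.0%}, model: {model_accuracy:.0%}")
if model_accuracy <= baseline:
    print("The model does not beat always guessing the most common outcome.")
```

With a split like this one, the 70% model would lose to a rule that sends everybody home, which is worth knowing before a pilot starts.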
They are faced with a couple of challenges to go from model to data product:
- How to send the source data to the model is unclear;
- Where the model should run;
- The hospital operations need to change, as the intake happens with pen and paper;
- They realize that without knowing to which department the patient will go, they won't add any value;
- To predict the department, the model needs the diagnosis. But once the diagnosis gets typed in the computer, the patient has reached their destination: the model is useless!
If you think this is unusual, I cannot tell you how many proofs of concept (PoCs) I have witnessed that suffer from (some of) the same weaknesses:
- No clear business case;
- No data platform where data pipelines can be created;
- No awareness of the impact on operations (the pen and paper in the hospital example);
- No realization that a model is useful only if the predictions are timely;
- No clear hand-over mechanism once the first iteration of a model is finished (i.e. where it will run and whose responsibility it will be).
The list goes on, but you get the gist.
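The timeliness weakness is the easiest one to check before any modelling starts. A minimal sketch, assuming hypothetical feature names and made-up availability times (none of these come from a real hospital system), lists the inputs that only arrive after the prediction is needed:

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    # Minutes after the patient arrives at which the value sits in a system
    # the model can read. Hypothetical numbers for illustration.
    available_after_min: int


def late_features(features, decision_deadline_min):
    """Return the names of features that arrive after the decision is due."""
    return [f.name for f in features if f.available_after_min > decision_deadline_min]


features = [
    Feature("age", 5),
    Feature("triage_score", 15),
    Feature("diagnosis", 180),
]

# Assumption: the departments need the prediction within 30 minutes of arrival.
print(late_features(features, decision_deadline_min=30))
```

Any feature this check flags cannot be used in production, however much it helps accuracy on historical data.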
What do you need to make it all happen? I can think of at least these roles:
- Data Engineer (pipelines and platform) and Lead Data Engineer;
- Data Scientist and Lead Data Scientist;
- Data-science-savvy Product Owner (define and refine the business case);
- System Administrators to monitor models in production, etc.;
- Software developers to embed or integrate the data product with other business applications, websites, apps;
- Database administrators from the other departments to open up the databases, etc.
On the "operational" side, you need:
- A data platform, where pipelines run and where data lands;
- A data-driven mentality where data and knowledge can flow freely between organizational silos;
- A data science workflow: how to improve the model once the first iteration is running, how to hand the model over, how the business should give input, how to close the feedback cycle, etc.
Counting just the roles needed for a team that can deliver data-driven models/products, I come up with the following:
- 2 Data Engineers, 1 Lead
- 1 Data Scientist, 1 Lead
- 1 Product Owner
- 2 System Administrators (that's low for redundancy, but still).
In the Netherlands we can assume the competitive landscape pushes the cost of each role to around 100.000 EUR/year (including social costs). I'm probably being a bit conservative here.
Adding the platform cost (let's round it up to 100.000 EUR/year), the eight roles plus the platform come to 900.000 EUR/year: call it 1M EUR/year.
What does this money buy you? Let's say the team has enough throughput to deliver 5 models/year (roughly 1 model every 2 months, allowing for holidays, training, and conferences).
These 5 models would be unmaintained: you don't add new features, you don't maintain data pipelines, etc. Doing all of the above in a robust fashion probably reduces the throughput to maybe 3 models/year. At some point, though, you will need more people to maintain older models, or they will stop working.
Let's not forget the software engineers, DBAs, and subject matter experts from the other departments who need to be involved for all this data and knowledge to flow. Approximately 500.000 EUR/year?
Let's look at the math again: 3M EUR for 6 data science cases in production (1.5M EUR/year and 3 cases/year, for 2 years). The first cases will probably hit production after 6-12 months, as the platform, data pipelines, dev and production environments, and so on need to be there first.
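For those who want to plug in their own numbers, here is the same back-of-the-envelope calculation as a small sketch. The figures are the rough estimates from the text; the variable names are mine, and the team total uses the rounded 1M EUR rather than the exact 900.000 EUR.

```python
ROLE_COST = 100_000          # EUR/year per role, as assumed in the text
PLATFORM_COST = 100_000      # EUR/year, as assumed in the text
OTHER_DEPARTMENTS = 500_000  # EUR/year, the rough guess from the text

team = {
    "data engineers (2 + 1 lead)": 3,
    "data scientists (1 + 1 lead)": 2,
    "product owner": 1,
    "system administrators": 2,
}

headcount = sum(team.values())
exact_team_cost = headcount * ROLE_COST + PLATFORM_COST  # before rounding
rounded_team_cost = 1_000_000                            # the rounded figure used in the text

yearly_cost = rounded_team_cost + OTHER_DEPARTMENTS
years = 2
models_per_year = 3  # maintained models, not the optimistic 5

total_cost = yearly_cost * years
models = models_per_year * years
cost_per_model = total_cost / models

print(f"headcount: {headcount}, exact team cost: {exact_team_cost:,} EUR/year")
print(f"total over {years} years: {total_cost:,} EUR for {models} models")
print(f"cost per model in production: {cost_per_model:,.0f} EUR")
```

Under these assumptions each model in production costs about half a million euros, before it has earned anything back.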
Most Dutch companies do not just invest 3M EUR for 6 data science cases. Why?
- A serious executive buy-in is missing: a buy-in that involves at least 3M EUR;
- There is no data strategy: people start doing things, without a clear picture of what all the above entails. Without a strategy, it is hard to invest the right amount of money in something that, on paper, has the potential to disrupt the industry and give a clear competitive advantage to the company. In practice, however, all this is potential and not a sure investment.
So what happens? People start projects that are dead on arrival. They still waste huge amounts of money, but no value is added to the bottom line.
This is the whole reason large companies buy start-ups. By having just 1-2 roles dedicated to a core data product, by taking on large amounts of technical debt and more, start-ups are able to prove the business case quickly. Once that is done, there are two options:
- The start-up needs an even larger amount of money to pay the technical debt off and scale;
- They get bought and the buyer will spend that money.
How do you fix it all?
First: make a data strategy. Which areas would benefit most from data science, how to do hiring, how to become data-driven, etc. Especially: how much money is needed to get the flywheel spinning?[2]
Second: collect the business cases and prioritize them by value. Are they feasible up until the moment where they generate money?
Third: Get external help to validate the 1-2 best use cases. Good consultants can quickly show if something is feasible, what the expected accuracy is (give or take), the run time, and so on.
Fourth: Build the platform, formalize the hiring, etc. This will maybe take 6-8 months. Do not skip the Lead DE and DS: you should start with them![3]
Fifth: Get the cases into production. This will take 3-4 months the first time, as a lot needs to be planned, adapted, etc.
Sixth: Evaluate the business cases, refine them, and be sure to have a "lessons learned" moment.
Seventh: On-board new business cases, rinse and repeat.
Is it really a fix?
It is easy to write a blog post with seven simple steps that take more than one year and a couple million euros.
It is much harder to implement all seven steps.
This is why executive buy-in is as important as the data strategy. Without it, it's impossible to endure all the small and large fires along the way, and there will be many, especially when starting out.
Are all these roles needed?
Some data science projects are mislabelled. They are glorified data analysis projects, where a bit of SQL, a bit of visualization, and — maybe — a bit of Python will be enough for a dashboard, a report, an Excel file.
If the project is more data analysis than data science, then don't worry: most of what I wrote above does not apply.
I still think you're wasting money though: you probably hired data scientists — more expensive than data analysts if we believe the various rumours and surveys — and have a semi-functional data platform — more complex than your run-of-the-mill data warehouse — that you don't really need: you were doing data analysis before Just Fine©️.
Is GoDataDriven the right partner to start with the seven steps?
Yes (this was an easy one).
If you want more ramblings, follow me on Twitter: I'm gglanzani there!
[1] I will use Data Science as a synonym of Machine Learning for the purpose of this post, although this is not factually accurate.
[2] Start immediately with hiring. In the Netherlands this takes a lot of time. Nobody new to the field is aware of this, and the HR departments of large corporations are often not up to the task.
[3] Are Leads too expensive? Yes, they are. Less than external consultants though, and hopefully as capable as them. Also be sure to hire engineers first. You have a model written by the external consultants that can be put into production; engineers alone can do that, assuming the consultants did a good job.