The Biggest Mistakes Made by Data Scientists

While the tools may change, the mistakes stay the same. Here are four common issues that IT leaders should be aware of when managing data science teams.

Guest Commentary

November 22, 2019


In 2019, companies looking to gain an edge on competitors and insight into customers and trends have come to rely more heavily on data scientists to inform their business decisions. A good data scientist is invaluable to a company with any online presence. They will assess and interpret complex information and build out machine learning algorithms. 

Data volume keeps growing, and the amount of skill and effort needed to create data-driven initiatives is certainly keeping pace with that growth. Mistakes can produce huge consequences and, while the tools may change, the mistakes stay the same. Over the course of my career I’ve seen every permutation of these common mistakes, and my hope here is to help you identify and avoid them within your own teams.

Mistake #1: Lack of coding skills

This one may seem obvious, but you would be amazed at the number of people who feel data science is a career completely removed from the practice of coding. The core work of data science is, and really has always been, building a model with a long script. The quality of that script (or lack thereof) has far-reaching consequences, from scalability to the robustness of the model when it goes into production.

An excellent data scientist must also be a good programmer. My rule is: a senior data scientist must possess a mid-level software engineer’s coding skill and a mid-level data scientist should be on par with a junior software engineer.

Mistake #2: Lack of defensive mindset

The adage goes “the best offense is a good defense” and, while sports rarely overlap with code, in this case the saying is apt. Teams need to emphasize the mindset: “How wrong can the model be on a bad day?”

A single mistake can have financial and legal consequences for the company. If you don’t test and retest your code with a defensive mindset, it will almost certainly contain errors.

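In machine learning, people use performance metrics like precision, RMSE, and MAE. Those metrics summarize average behavior across a data set. They say little about how bad the worst individual prediction can be, so they are no replacement for defensive testing.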
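To make the point concrete, here is a minimal Python sketch of a defensive check that sits next to the usual averages. It is an illustration only: the function names `error_report` and `check_worst_case`, the sample numbers, and the error limit of 20 are all hypothetical and would need to reflect what a bad prediction actually costs your business.

```python
import math


def error_report(y_true, y_pred):
    """Summarize average and worst-case absolute errors for a regression model."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    abs_errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    n = len(abs_errors)
    return {
        "mae": sum(abs_errors) / n,                            # average error
        "rmse": math.sqrt(sum(e * e for e in abs_errors) / n),  # average, penalizing large errors
        "max_abs_error": max(abs_errors),                      # the "bad day"
    }


def check_worst_case(report, max_allowed_error):
    """Return True only if the single worst prediction is within the limit."""
    return report["max_abs_error"] <= max_allowed_error


# Hypothetical data: 99 perfect predictions and one that is badly wrong.
y_true = [100.0] * 100
y_pred = [100.0] * 99 + [0.0]

report = error_report(y_true, y_pred)
passes = check_worst_case(report, max_allowed_error=20.0)  # assumed business limit
```

In this made-up case the MAE is 1 and the RMSE is 10, both of which look acceptable against a target of 100, yet one prediction is off by the full 100 and the worst-case check fails. That single prediction is the one a customer or a regulator would notice.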

Mistake #3: Poor use of time on data cleansing

In my career, I have trusted my data science teams’ data exploration skills, and I have rarely seen a data scientist make a data mistake. They have all been smart and prudent.

I have, however, seen numerous cases where they spend several weeks looking at the data and put off building the end-to-end ML software. That is too much time spent on data cleansing at the expense of building the end-to-end flow.

I see a huge difference between a computer science-trained data scientist and a physics-trained data scientist in how quickly they start writing code. I come from physics, but I strongly prefer the “let’s write some code” approach.

Until you build the ship, you will not find the many unforeseen holes that could sink you later. I would also anticipate that project managers will have little patience for troubleshooting numerous errors late in the project. They need something to show the product leaders by fixed deadlines.
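One way to picture “building the ship” early is a skeleton in which every stage exists from day one, however crude. The Python sketch below is a hypothetical example, not any team's real pipeline: the stage names, the toy data, and the one-parameter baseline model are assumptions chosen to keep it short.

```python
def load_data():
    # Hypothetical stand-in for a real data source; one row is incomplete.
    return [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (None, 5.0), (4.0, 8.1)]


def clean(rows):
    # Deliberately minimal cleansing: drop incomplete rows and move on.
    return [(x, y) for x, y in rows if x is not None and y is not None]


def train(rows):
    # Crude baseline: least-squares fit of y = slope * x through the origin.
    denominator = sum(x * x for x, _ in rows)
    if denominator == 0:
        raise ValueError("cannot fit a slope when every x is zero")
    return sum(x * y for x, y in rows) / denominator


def predict(slope, x):
    return slope * x


def evaluate(slope, rows):
    # Report the worst single error rather than only an average.
    return max(abs(y - predict(slope, x)) for x, y in rows)


def run_pipeline():
    # Every stage runs end to end, so gaps between stages show up early.
    rows = clean(load_data())
    slope = train(rows)
    return {
        "n_rows": len(rows),
        "slope": slope,
        "max_abs_error": evaluate(slope, rows),
    }


result = run_pipeline()
```

Each function can later be replaced with something better without changing the overall flow, and the project manager has a working result to show from the first week.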

Mistake #4: Time wasted on studying individual models

When a data scientist spends too much time studying individual models, he or she can lose sight of how the models should talk to each other. A dynamic pricing project can easily affect an ad bidding project, which normally does not know the price that the person clicking the ad will be offered. This question certainly belongs to the senior data scientists and their managers.

To prove useful, collected data has to be acted on. It’s up to the data scientist to help his or her company move through digital transformation by monitoring, testing, performing robust analytics, and building machine learning infrastructure to improve business practices and solve problems. Helping your data scientists with the points above will let them better support the company.


Xin Heng is VP of Data at Punchh, Inc., in San Mateo, California, where his team's primary responsibility is to build world-class data solutions that drive the growth of both Punchh and its business partners. Prior to joining Punchh, Heng was the Head of Data Science at StubHub and Data Science Manager at Uber. He holds a Ph.D. in electrical engineering from the California Institute of Technology and a Master of Financial Engineering from the Walter Haas School of Business at the University of California, Berkeley. His Twitter handle: @xheng123

 

