An ideal world versus the real world
Now that we have spent a good amount of time building this beautiful data product, which can help the business treat customers differently based on the value they bring to the table, let's look at what we expect from it versus what it can actually do.
Reusability and sharing
Reusability is a common problem in the IT industry. We have a great data product in front of us, along with the graphs we built during exploration and the features we generated for our model. All of these could be reused by other data scientists, analysts, and data engineers. In its current state, can the work be reused? The answer is maybe. Data scientists can share the notebook itself, create a presentation, and so on. But there is no way for somebody looking for, say, customer segmentation or RFM features, which could be very useful in other models, to discover that this work exists. So, if another data scientist or ML engineer is building a model that needs the same features, the only option they are left with is to reinvent the wheel. The new model may be built with the same, more accurate, or less accurate RFM features, depending on how the data scientist generates them. Either way, the development of the second model could have been accelerated if there had been a better way to discover and reuse the work. Also, as the saying goes, two heads are better than one. Collaboration would have benefitted both the data scientists and the business.
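To make the reuse problem concrete, here is a minimal sketch of RFM feature generation pulled out of a notebook cell into a standalone function that a second team could import instead of rewriting. It is an illustration only, not the code from the previous section: the `Order` record, its field names, and the `compute_rfm` function are hypothetical, and plain Python is used to keep the sketch dependency-free.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Order:
    # Hypothetical order record; the field names are assumptions for this sketch.
    customer_id: str
    order_date: date
    amount: float


def compute_rfm(orders, as_of):
    """Return {customer_id: {"recency_days", "frequency", "monetary"}}."""
    # Group the orders by customer.
    by_customer = defaultdict(list)
    for order in orders:
        by_customer[order.customer_id].append(order)

    features = {}
    for customer_id, customer_orders in by_customer.items():
        last_order = max(o.order_date for o in customer_orders)
        features[customer_id] = {
            # Recency: days between the reference date and the latest order.
            "recency_days": (as_of - last_order).days,
            # Frequency: number of orders placed by the customer.
            "frequency": len(customer_orders),
            # Monetary: total amount spent by the customer.
            "monetary": sum(o.amount for o in customer_orders),
        }
    return features


orders = [
    Order("c1", date(2021, 1, 5), 20.0),
    Order("c1", date(2021, 1, 20), 30.0),
    Order("c2", date(2021, 1, 10), 15.0),
]
rfm = compute_rfm(orders, as_of=date(2021, 1, 31))
print(rfm["c1"])
```

Even with a function like this, another data scientist still has to know that it exists and where to find it, which is exactly the discovery gap described above.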
Everything in a notebook
Data science is a unique skill that is different from software engineering. Though some data scientists have a software engineering background, the needs of the role may push them away from software engineering practices. Because data scientists spend most of their time in the data exploration and model building phases, and the amount of data they deal with is huge, integrated development environments (IDEs) may not be sufficient. Data processing could run for days if we had to do exploration, feature engineering, and model building on a personal Mac or PC. Data scientists also need the flexibility to use different programming languages, such as Python, Scala, R, and SQL, and to add commands dynamically during analysis. That is one of the reasons why there are so many notebook platform providers, including Jupyter, Databricks, and SageMaker.
Since data product/model development is different from traditional software development, it is almost always impossible to ship experimental code to production without additional work. Most data scientists start their work in a notebook and build everything in the same way as we did in the previous section. A few standard practices and tools, such as a feature store, will not only help them break down the model building process into multiple production-ready notebooks but can also help them avoid re-processing data, debug issues more easily, and reuse code.
Now that we understand the reality of ML development, let's briefly go through the most time-consuming stages of ML.