Feature Store for Machine Learning

By: Jayanth Kumar M J

Overview of this book

A feature store is one of the storage layers in machine learning (ML) operations, where data scientists and ML engineers can store transformed and curated features for ML models. This makes the features available for model training, inference (batch and online), and reuse in other ML pipelines. Knowing how to use feature stores to their fullest potential can save you a lot of time and effort, and this book teaches what you need to know to get started. Feature Store for Machine Learning is for data scientists who want to learn how to use feature stores to share and reuse each other's work and expertise. You’ll be able to implement practices that help eliminate reprocessing of data, make models reproducible, and reduce duplication of work, thus shortening the time to production of the ML model. While this book offers some theoretical groundwork for developers who are just getting to grips with feature stores, there's plenty of practical know-how for those ready to put their knowledge to work, with a hands-on approach to implementation and the associated methodologies. By the end of this book, you’ll understand why feature stores are essential and how to use them in your ML projects, both on your local system and in the cloud.

An ideal world versus the real world

Now that we have spent a good amount of time building this beautiful data product that can help the business treat customers differently based on the value they bring to the table, let's look at what we expect from this versus what it can do.

Reusability and sharing

Reusability is one of the common problems in the IT industry. We have this great data product in front of us, along with the graphs we built during exploration and the features we generated for our model. All of these could be reused by other data scientists, analysts, and data engineers. But in its current state, can it be reused? The answer is maybe. Data scientists can share the notebook itself, create a presentation, and so on. But there is no way for somebody who is looking for, say, customer segmentation or RFM features, which could be very useful in other models, to discover that this work exists. So, if another data scientist or ML engineer is building a model that needs the same features, the only option they are left with is to reinvent the wheel. The new model may be built with the same, more accurate, or less accurate RFM features, depending on how the data scientist generates them. However, the development of the second model could have been accelerated if there had been a better way to discover and reuse the work. Also, as the saying goes, two heads are better than one. A collaboration would have benefitted both the data scientists and the business.
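To make the discovery problem concrete, here is a minimal sketch of what "a better way to discover and reuse the work" might look like. It is not the API of any real feature store: FeatureSet, FeatureCatalog, and compute_rfm are hypothetical names invented for this illustration, the orders are made-up sample data, and the catalog lives in memory, whereas a real feature store would persist features and their metadata.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FeatureSet:
    """Hypothetical container: feature values plus searchable metadata."""
    name: str
    description: str
    tags: set
    rows: dict  # entity id -> dict of feature values


class FeatureCatalog:
    """Hypothetical in-memory catalog (assumption: real stores persist this)."""

    def __init__(self):
        self._sets = {}

    def register(self, feature_set):
        self._sets[feature_set.name] = feature_set

    def search(self, keyword):
        # Match the keyword against the description and every tag.
        keyword = keyword.lower()
        return [
            fs.name
            for fs in self._sets.values()
            if keyword in fs.description.lower()
            or any(keyword in tag.lower() for tag in fs.tags)
        ]

    def get(self, name):
        return self._sets[name]


def compute_rfm(orders, as_of):
    """Compute recency, frequency, and monetary value per customer.

    orders: iterable of (customer_id, order_date, amount) tuples.
    """
    rfm = {}
    for customer_id, order_date, amount in orders:
        days_ago = (as_of - order_date).days
        current = rfm.get(customer_id)
        if current is None:
            rfm[customer_id] = {
                "recency_days": days_ago,
                "frequency": 1,
                "monetary": amount,
            }
        else:
            # Recency is the gap to the most recent order.
            current["recency_days"] = min(current["recency_days"], days_ago)
            current["frequency"] += 1
            current["monetary"] += amount
    return rfm


# The first data scientist computes the features once and registers them.
orders = [
    ("c1", date(2024, 1, 5), 40.0),
    ("c1", date(2024, 1, 20), 60.0),
    ("c2", date(2023, 12, 1), 15.0),
]
catalog = FeatureCatalog()
catalog.register(FeatureSet(
    name="customer_rfm",
    description="Recency, frequency, and monetary value per customer",
    tags={"rfm", "customer segmentation"},
    rows=compute_rfm(orders, as_of=date(2024, 1, 31)),
))

# A second data scientist discovers and reuses them instead of recomputing.
found = catalog.search("rfm")
c1_features = catalog.get(found[0]).rows["c1"]
```

The point of the sketch is the metadata, not the arithmetic: once features carry a name, a description, and tags, a colleague can search for "rfm" or "segmentation" and pull the existing values rather than rebuilding them from a shared notebook.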

Everything in a notebook

Data science is a unique skill that is different from software engineering. Though some data scientists might have a software engineering background, the needs of the role itself may push them away from software engineering practices. Because data scientists spend most of their time in the data exploration and model building phases, integrated development environments (IDEs) may not be sufficient: the amount of data they deal with is huge, and exploration, feature engineering, and model building could run for days on a personal Mac or PC. They also need the flexibility to use different programming languages such as Python, Scala, R, and SQL, and to add commands dynamically during analysis. That is one of the reasons why there are so many notebook platform providers, including Jupyter, Databricks, and SageMaker.

Since data product/model development is different from traditional software development, it is almost always impossible to ship the experimental code to production without additional work. Most data scientists start their work in a notebook and build everything in the same way as we did in the previous section. A few standard practices and tools, such as a feature store, will not only help them break down the model building process into multiple production-ready notebooks but can also help them avoid reprocessing data, simplify debugging, and reuse code.

Now that we understand the reality of ML development, let's briefly go through the most time-consuming stages of ML.