Feature Store for Machine Learning

By: Jayanth Kumar M J

Overview of this book

Feature store is one of the storage layers in machine learning (ML) operations, where data scientists and ML engineers can store transformed and curated features for ML models. This makes the features available for model training, inference (batch and online), and reuse in other ML pipelines. Knowing how to utilize feature stores to their fullest potential can save you a lot of time and effort, and this book will teach you everything you need to know to get started.

Feature Store for Machine Learning is for data scientists who want to learn how to use feature stores to share and reuse each other's work and expertise. You'll be able to implement practices that help eliminate the reprocessing of data, provide model reproducibility, and reduce duplication of work, thus improving the time to production of ML models.

While this ML book offers some theoretical groundwork for developers who are just getting to grips with feature stores, there's plenty of practical know-how for those ready to put their knowledge to work. With a hands-on approach to implementation and associated methodologies, you'll get up and running in no time. By the end of this book, you'll have understood why feature stores are essential and how to use them in your ML projects, both on your local system and in the cloud.
Table of Contents (13 chapters)

Section 1 – Why Do We Need a Feature Store?
Section 2 – A Feature Store in Action
Section 3 – Alternatives, Best Practices, and a Use Case

The most time-consuming stages of ML

In the first section of this chapter, we went through the different stages of the ML life cycle. Let's look at some of the stages in more detail and consider their level of complexity and the time we should spend on each of them.

Figuring out the dataset

Once we have a problem statement, the next step is to figure out the dataset we need to solve the problem. In the example we followed, we knew where the dataset was, as it was given to us. In the real world, however, it is not that simple. Since each organization has its own way of warehousing data, finding the data you need may be straightforward or may take forever. Most organizations run data catalog services such as Amundsen, Atlan, and Azure Data Catalog to make their datasets easily discoverable, but the tools are only as good as the way they are used and the people using them. So, the point I'm making here is that it's not always easy to find the data you are looking for. On top of this, there is access control to consider: even if you figure out the dataset that's needed for the problem, it is highly likely that you won't have access to it unless you have used it before. Figuring out access will be another major roadblock.

Data exploration and feature engineering

Data exploration: once you figure out the dataset, the next biggest task is to figure out the dataset again! You read that right: for a data scientist, the next biggest task is to make sure that the dataset they've picked is the right one for the problem. This involves cleaning the data, imputing missing values, transforming data, plotting different graphs, finding correlations, checking for data skew, and more. The catch is that if the data scientists find that something is not right, they go back to the previous step, look for more datasets, and try again.
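As a rough sketch of what this exploration loop looks like in code, the following uses pandas on a made-up dataset (the column names and values are illustrative assumptions, not data from the book) to impute missing values, check correlations, and measure skew:

```python
import pandas as pd
import numpy as np

# Hypothetical customer dataset; columns and values are illustrative only.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 38, 29],
    "income": [40_000, 52_000, 61_000, np.nan, 75_000, 48_000],
    "purchases": [2, 5, 3, 8, 7, 4],
})

# Basic cleaning: impute missing values with each column's median.
df = df.fillna(df.median(numeric_only=True))

# Check how candidate features correlate with the target column.
corr = df.corr(numeric_only=True)["purchases"]

# Check for skew, which might suggest a transformation is needed.
skew = df["income"].skew()

print(corr.round(2))
print(round(skew, 2))
```

If something looks off at this point, such as a feature with no correlation to the target or a heavily skewed distribution, that is exactly the signal to go back and look for more or better data.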

Feature engineering is not easy either; domain knowledge becomes key to building the feature set used to train the model. If you are a data scientist who has been working on pricing and promotion models for the past few years, you will know which datasets and features result in a better model than a data scientist who has spent those years on customer value models would, and vice versa. Let's try an exercise to see whether feature engineering is easy and whether domain knowledge plays a key role. Have a look at the following screenshot and see if you can recognize the animals:

Figure 1.13 – A person holding a dog and a cat

I'm sure you know what these animals are, but let's take a step back and see how we correctly identified them. When we looked at the figure, our subconscious did feature engineering. It could have picked features such as two ears, two eyes, a nose, a head, and a tail. Instead, it picked much more sophisticated features, such as the shape of the face, the shape of the eyes, the shape of the nose, and the color and texture of the fur. If it had picked the first set of features, both animals would have turned out to be the same, which would be an example of bad feature engineering and a bad model. Since it chose the latter, we identified them as different animals, which is an example of good feature engineering and a good model.

But another question we need to answer is: when did we develop expertise in animal identification? Well, maybe it started with our kindergarten teachers. We all remember some version of the first 100 animals we learned about from our teachers, parents, brothers, and sisters. We didn't get all of them right at first, but eventually we did. We gained expertise over time.

Now, what if, instead of a picture of a cat and a dog, it was a picture of two snakes, and our job was to identify which one is venomous and which is not? Though all of us could identify them as snakes, almost none of us would be able to tell which one is venomous, unless we had been snake charmers before.

Hence, domain expertise becomes crucial in feature engineering. Just like the data exploration stage, if we are not happy with the features, we are back to square one, which involves looking for more data and better features.

Modeling to production and monitoring

Once we've gotten through the preceding stages, taking the model to production is still very time-consuming unless the right infrastructure is ready and waiting. For a model to run in production, it needs a processing platform to run the data cleaning and feature engineering code, and an orchestration framework to run the feature engineering pipeline on a scheduled or event-driven basis. We also need a way to store and retrieve features securely, in some cases at low latency. If the model is transactional, it must be packaged so that consumers can access it securely, perhaps as a REST endpoint. The deployed model should also scale to serve the incoming traffic.
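To make the serving requirements concrete, here is a minimal sketch of what an online inference handler might do: look up precomputed features at low latency, score them, and return a response. The in-memory feature store, the toy linear model, and the entity key are all illustrative assumptions, not a real implementation; in practice the handler would sit behind a web framework and call a real feature store client.

```python
import json

# Stand-in for a low-latency online feature store lookup (assumption).
FEATURE_STORE = {
    "customer_42": {"avg_order_value": 37.5, "orders_last_30d": 4},
}

def get_online_features(entity_id: str) -> dict:
    """Retrieve precomputed features at serving time (stubbed here)."""
    return FEATURE_STORE[entity_id]

def predict(features: dict) -> float:
    """Toy linear model standing in for the trained artifact."""
    return 0.1 * features["avg_order_value"] + 0.5 * features["orders_last_30d"]

def handle_request(body: str) -> str:
    """What a REST endpoint handler would do for one scoring request."""
    entity_id = json.loads(body)["entity_id"]
    score = predict(get_online_features(entity_id))
    return json.dumps({"entity_id": entity_id, "score": round(score, 2)})
```

The key design point this sketch illustrates is the separation of concerns: feature computation happens offline in the pipeline, while the endpoint only does a fast lookup and a model call.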

Model and data monitoring are crucial aspects too. As model performance directly affects the business, you must know in advance which metrics will indicate that the model needs to be retrained. Beyond model monitoring, the dataset itself needs to be monitored for skew. For example, in an e-commerce business, traffic and purchase patterns may change frequently based on seasonality, trends, and other factors. Catching these changes early affects the business positively. Hence, data and feature monitoring are key in taking the model to production.
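As a rough illustration of this kind of data monitoring, the sketch below flags a feature whose live mean has drifted far from its training distribution. The two-standard-deviation threshold and the sample values are assumptions for illustration; production systems typically use proper statistical tests such as the Kolmogorov–Smirnov test or the population stability index.

```python
import statistics

def feature_drifted(train_values, live_values, threshold=2.0):
    """Flag drift when the live mean shifts more than `threshold`
    training standard deviations away from the training mean.
    (A simple heuristic, not a production-grade drift test.)"""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    live_mu = statistics.mean(live_values)
    return abs(live_mu - mu) > threshold * sigma

# Seasonal shift in e-commerce order values: holiday traffic looks
# very different from the data the model was trained on.
train = [20, 22, 19, 21, 20, 23, 18]
holiday = [45, 50, 48, 52, 47]
print(feature_drifted(train, holiday))
```

A check like this, run on each feature as fresh data arrives, is one way to surface the seasonal and trend-driven shifts described above before they silently degrade the model.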