Designing systems at scale
To build on the ideas presented in Chapter 5, Deployment Patterns and Tools, and in this chapter, we should now consider some of the ways in which the scaling capabilities we have discussed can be employed to maximum effect in your ML engineering projects.
Scaling should be thought of in terms of increasing the throughput of analyses or inferences, or the size of the data that can be processed. In most cases, there is no real difference in the kind of analysis or solution you can develop. This means that applying scaling tools and techniques successfully depends more on selecting the processes that will actually benefit from them, even once we account for the overheads these tools introduce. That is what we will discuss now in this section, so that you have a few guiding principles to revisit when it comes to making your own scaling decisions.
As discussed in several places throughout this book, the pipelines you develop for your ML projects will usually have to have stages that cover the following tasks:
- Ingestion/pre-processing
- Feature engineering (if different from above)
- Model training
- Model inference
- Application layer
Parallelization or distribution can help in many of these steps, but usually in different ways. For ingestion/pre-processing, if you are operating in a large scheduled batch setting, then the ability to scale to larger datasets in a distributed manner is going to be of huge benefit. In this case, the use of Apache Spark will make sense. For feature engineering, similarly, the main bottleneck is in processing large amounts of data as we perform the transformations, so again Spark is useful here.

The compute-intensive steps for training ML models that we discussed in detail in Chapter 3, From Model to Model Factory, are very amenable to frameworks that are optimized for this intensive computation, irrespective of the data size. This is where Ray comes in, as discussed in the previous sections. Ray also means that you can neatly parallelize your hyperparameter tuning if you need to do that too. Note that you could run these steps in Spark as well, but Ray’s low task overheads and its distributed state management mean that it is particularly amenable to splitting up these compute-intensive tasks; Spark, on the other hand, has centralized state and scheduling management.

Finally, when it comes to the inference and application layers, where we produce and surface the results of the ML model, we need to think about the requirements of the specific use case. As an example, if you want to serve your model as a REST API endpoint, we showed in the previous section how Ray’s distribution model and API can facilitate this very easily, but this would not make sense to do in Spark. If, however, the model results are to be produced in large batches, then either Spark or Ray may make sense. Also, as alluded to for the feature engineering and ingestion steps, if the end result should be transformed in large batches as well, perhaps into a specific data model such as a star schema, then performing that transformation in Spark may again make sense due to the data scale requirements of this task.
Let’s make this a bit more concrete by considering a potential example taken from industry. Many organizations with a retail element will analyze transactions and customer data in order to determine whether the customer is likely to churn. Let’s explore some of the decisions we can make to design and develop this solution with a particular focus on the questions of scaling up using the tools and techniques we have covered in this chapter.
First, we have the ingestion of the data. For this scenario, we will assume that the customer data, including interactions with different applications and systems, is processed at the end of the business day and numbers millions of records. This data contains numerical and categorical values and these need to be processed in order to feed into the downstream ML algorithm. If the data is partitioned by date, and maybe some other feature of the data, then this plays really naturally into the use of Spark, as you can read this into a Spark DataFrame and use the partitions to parallelize the data processing steps.
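As a rough illustration, the following sketch shows what this ingestion step could look like in PySpark. The bucket path, the `event_date` partition column, and the other field names are purely hypothetical stand-ins for whatever your organization’s data landing zone actually provides:

```python
from pyspark.sql import SparkSession

# Assumed example: daily customer interaction data stored as
# date-partitioned Parquet under a hypothetical storage path.
spark = SparkSession.builder.appName("churn-ingestion").getOrCreate()

customers = (
    spark.read
    .parquet("s3://my-bucket/customer-interactions/")  # hypothetical location
    .filter("event_date = '2024-01-31'")  # partition pruning on the date column
)

# Basic pre-processing: drop obvious duplicates and rows with no customer ID
customers = customers.dropDuplicates(["customer_id", "event_id"]).dropna(
    subset=["customer_id"]
)
```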
Next, we have the feature engineering. If we are using a Spark DataFrame in the first step, then we can apply our transformation logic using the base PySpark syntax we discussed earlier in this chapter. For example, if we want to apply some feature transformations available from Scikit-Learn or another ML library, we can wrap these in UDFs and apply them at the scale we need. The data can then be exported in our chosen data format using the PySpark API. For the customer churn model, this could mean a combination of encoding of categorical variables and scaling of numerical variables, in line with the techniques explored in Chapter 3, From Model to Model Factory.
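A minimal sketch of this pattern is shown below, continuing from the `customers` DataFrame and `spark` session in the previous snippet. It broadcasts a pre-fitted Scikit-Learn StandardScaler and applies it through a pandas UDF; the `total_spend` column and the output path are hypothetical:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
from sklearn.preprocessing import StandardScaler

# Fit the scaler once on a sample pulled to the driver, so that every
# partition applies exactly the same transformation.
sample = customers.select("total_spend").sample(fraction=0.1).toPandas()
fitted_scaler = StandardScaler().fit(sample.to_numpy())
scaler_broadcast = spark.sparkContext.broadcast(fitted_scaler)

@pandas_udf(DoubleType())
def scale_spend(spend: pd.Series) -> pd.Series:
    # Applied in parallel to each batch of rows on the executors
    scaled = scaler_broadcast.value.transform(spend.to_numpy().reshape(-1, 1))
    return pd.Series(scaled.ravel())

features = customers.withColumn("total_spend_scaled", scale_spend("total_spend"))
features.write.mode("overwrite").parquet("s3://my-bucket/churn-features/")
```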
Switching to the training of the model, we are now moving from data-intensive to compute-intensive tasks. This means it is natural to start using Ray for model training, as you can easily set up parallel tasks to train models with different hyperparameter settings and distribute the training steps as well. There are particular benefits to using Ray for training deep learning or tree-based models, as these are algorithms that are amenable to parallelization. So, if we are performing classification using one of the models available in Spark ML, then this can be done in a few lines in Spark, but if we are using something else, we will likely need to start wrapping it in UDFs. Ray is far more library-agnostic, but again the benefits really come if we are using a neural network in PyTorch or TensorFlow, or using XGBoost or LightGBM, as these parallelize more naturally.
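The following sketch shows one way this could look with plain Ray tasks, using XGBoost and a synthetic dataset as a stand-in for the real churn features; the hyperparameter grid and scoring choices are illustrative only:

```python
import ray
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

ray.init()  # connects to an existing cluster if configured, else starts locally

# Hypothetical stand-in for the engineered churn features; in practice you
# would load the output of the feature engineering step.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
X_ref, y_ref = ray.put(X), ray.put(y)

@ray.remote
def train_and_score(params: dict, X, y):
    """Train one candidate model and return its cross-validated AUC."""
    model = XGBClassifier(**params)
    score = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    return params, score

candidate_params = [
    {"max_depth": depth, "n_estimators": n_trees}
    for depth in (3, 6, 9)
    for n_trees in (100, 300)
]

# Each candidate trains as an independent Ray task, spread across the cluster
futures = [train_and_score.remote(p, X_ref, y_ref) for p in candidate_params]
best_params, best_score = max(ray.get(futures), key=lambda result: result[1])
print(best_params, best_score)
```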
Finally, onto the model inference step. In a batch setting, it is less clear which framework is the winner here. Using UDFs or the core PySpark APIs, you can easily create a very scalable batch prediction stage using Apache Spark and your Spark cluster. This is essentially because prediction on a large batch is really just another large-scale data transformation, which is where Spark excels. If, however, you wish to serve your model as an endpoint that can scale across a cluster, this is where Ray has very easy-to-use capabilities, as shown in the Scaling your serving layer with Ray section. Spark does not have a facility for creating endpoints in this way, and the scheduling and task overheads required to get a Spark job up and running mean that it would not be worth running Spark on the small packets of data that come in as requests like this.
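As a sketch of the Spark batch route, the snippet below broadcasts a previously persisted model and scores the feature table with a pandas UDF. The model artifact name, feature columns, and storage paths are all hypothetical:

```python
import joblib
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("churn-batch-inference").getOrCreate()

# Hypothetical artifacts: a classifier persisted earlier with joblib and the
# feature table written out by the feature engineering step.
trained_model = joblib.load("churn_model.joblib")
model_broadcast = spark.sparkContext.broadcast(trained_model)
features = spark.read.parquet("s3://my-bucket/churn-features/")

@pandas_udf(DoubleType())
def predict_churn(spend: pd.Series, recency: pd.Series) -> pd.Series:
    # Each batch of rows is scored in one vectorized call on the executor
    batch = pd.DataFrame({"spend": spend, "recency": recency}).to_numpy()
    return pd.Series(model_broadcast.value.predict_proba(batch)[:, 1])

scored = features.withColumn(
    "churn_probability",
    predict_churn("total_spend_scaled", "days_since_last_purchase"),
)
scored.write.mode("overwrite").parquet("s3://my-bucket/churn-scores/")
```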
For the customer churn example, this may mean that if we want to perform a churn classification on the whole customer base, Spark provides a nice way to process all of that data and leverage concepts like the underlying data partitions. You can still do this in Ray, but the lower-level API may mean it is slightly more work. Note that we can create this serving layer using many other mechanisms, as discussed in Chapter 5, Deployment Patterns and Tools, and the section on Spinning up serverless infrastructure in this chapter. Chapter 8, Building an Example ML Microservice, will also cover in detail how to use Kubernetes to scale out a deployment of an ML endpoint.
Finally, I have called the last stage the application layer to cover any “last mile” integrations between the output system and downstream systems in the solution. In this case, Spark does not really have a role to play, since it can really be thought of as a large-scale data transformation engine. Ray, on the other hand, has more of a philosophy of general Python acceleration, so if there are tasks that would benefit from parallelization in the backend of your applications, such as data retrieval, general calculations, simulation, or some other process, then the likelihood is that you can still use Ray in some capacity, although there may be other tools available. So, in the customer churn example, Ray could be used for performing analysis at the level of individual customers, doing this in parallel before serving the results through a Ray Serve endpoint.
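As a simple illustration of this kind of “last mile” parallelization, the sketch below fans a hypothetical per-customer summary calculation out across Ray workers; the input data and the summary logic are placeholders for whatever your application actually needs:

```python
import ray

ray.init()

@ray.remote
def summarize_customer(customer_id: str, events: list) -> dict:
    """Hypothetical per-customer calculation, e.g. aggregating recent
    activity into a summary the application layer can display."""
    total_spend = sum(event.get("amount", 0.0) for event in events)
    return {
        "customer_id": customer_id,
        "n_events": len(events),
        "total_spend": total_spend,
    }

# In practice this would come from your serving data store
events_by_customer = {
    "cust-001": [{"amount": 10.0}, {"amount": 25.5}],
    "cust-002": [{"amount": 99.0}],
}

futures = [
    summarize_customer.remote(customer_id, events)
    for customer_id, events in events_by_customer.items()
]
summaries = ray.get(futures)  # runs in parallel across available workers
```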
The point of going through this high-level example was to highlight the points along your ML engineering project where you can make choices about how to scale effectively. I like to say that there are often no right answers, but very often wrong answers. What I mean by this is that there are often several ways to build a good solution that are equally valid and may leverage different tools. The important thing is to avoid the biggest pitfalls and dead ends. Hopefully, the example gives some indication of how you can apply this thinking to scaling up your ML solutions.
IMPORTANT NOTE
Although I have presented a lot of questions here in terms of Spark vs. Ray, with a nod to Kubernetes as a more base infrastructure scaling option, there is now the ability to combine Spark and Ray through the use of RayDP. This toolkit allows you to run Spark jobs on Ray clusters, so it nicely allows you to still use Ray as your base scaling layer but then leverage the Spark APIs and capabilities where Spark excels. RayDP was introduced in 2021 and is in active development, so this is definitely a capability to watch. For more information, see the project repository here: https://github.com/oap-project/raydp.
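As an indication of what this looks like, the sketch below starts a Spark session on top of a running Ray instance with RayDP; the exact arguments and defaults may vary between RayDP versions, so treat it as illustrative rather than definitive:

```python
import ray
import raydp

ray.init()

# raydp.init_spark launches Spark executors as Ray actors and returns a
# regular SparkSession, so the PySpark code from the earlier sketches
# can run on a Ray cluster largely unchanged.
spark = raydp.init_spark(
    app_name="churn-on-ray",
    num_executors=2,
    executor_cores=2,
    executor_memory="2GB",
)

df = spark.range(1000)
print(df.count())

raydp.stop_spark()
ray.shutdown()
```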
This concludes our look at how we can start to apply some of the scaling techniques we have discussed to our ML use cases.
We will now finish the chapter with a brief summary of what we have covered in the last few pages.