When developing a production ML system, you are unlikely to be handed training data in a ready-to-process format. Production ML systems are typically part of larger application systems, and the data you use will probably originate from several different sources. The training set for an ML algorithm might be a subset of your larger database, combined with images hosted on a content delivery network (CDN) and event data from an Elasticsearch server. In our examples we have been given an isolated training set, but in the real world we need to generate the training set in an automated and repeatable manner.
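To make the idea of assembling a training set from several sources concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than part of any real system: the `users` table schema, the shape of the event records (a plain list standing in for the results of an Elasticsearch query), the `cdn.example.com` base URL, and the function names. An in-memory SQLite database stands in for the application database so the example runs on its own.

```python
import sqlite3
from collections import Counter


def load_users(conn):
    """Select the subset of the application database used for training.

    The `users` table and its columns are hypothetical.
    """
    rows = conn.execute(
        "SELECT id, age FROM users WHERE active = 1 ORDER BY id"
    ).fetchall()
    return {user_id: {"age": age} for user_id, age in rows}


def count_events(events):
    """Aggregate raw event records into a per-user event count.

    `events` stands in for documents returned by an event store query.
    """
    return Counter(event["user_id"] for event in events)


def build_training_set(conn, events, image_base_url):
    """Join the three sources into one list of training rows.

    Rows are sorted by user id, so the same inputs always produce
    the same output in the same order.
    """
    users = load_users(conn)
    event_counts = count_events(events)
    return [
        {
            "user_id": user_id,
            "age": fields["age"],
            "event_count": event_counts.get(user_id, 0),
            # Hypothetical naming scheme for images hosted on a CDN.
            "image_url": f"{image_base_url}/{user_id}.jpg",
        }
        for user_id, fields in sorted(users.items())
    ]


def make_demo_database():
    """Create a small in-memory database with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER, active INTEGER)"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?, ?)",
        [(1, 34, 1), (2, 27, 0), (3, 45, 1)],
    )
    return conn


# Made-up event records for demonstration only.
DEMO_EVENTS = [
    {"user_id": 1, "type": "click"},
    {"user_id": 3, "type": "view"},
    {"user_id": 1, "type": "view"},
]

if __name__ == "__main__":
    demo_conn = make_demo_database()
    for row in build_training_set(
        demo_conn, DEMO_EVENTS, "https://cdn.example.com/avatars"
    ):
        print(row)
```

Because the selection query, the aggregation, and the join are all ordinary functions with no hidden state, rerunning the script against the same sources yields the same training set, which is the property that "automated and repeatable" asks for. A real system would replace the demo database and event list with live connections.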
The process of ushering data through various stages of a life cycle is called data pipelining. Data pipelining may include data selectors that run SQL or Elasticsearch queries for objects, event subscriptions which allow data...