Data Layers

An AI system consists of multiple data storage layers connected by Extract, Transform, and Load (ETL) or Extract, Load, and Transform (ELT) pipelines. Each storage solution has its own requirements, depending on the type of data that is stored and how it is used. The following figure shows this concept:

Figure 2.3: Conceptual overview of the data layers in a typical AI solution

From a high-level viewpoint, the backend (and thus, the storage systems) of an AI solution is split up into three parts or layers:

  • Raw data layer: Contains copies of files from the source systems; also known as the staging area.
  • Historical data layer: The core of a data-driven system, containing an overview of data from multiple source systems gathered over time. By stacking the data rather than replacing or updating old values, history is preserved and time travel (querying the data as it was at a point in the past) becomes possible.
  • Analytics data layer: A set of tools used to access the data in the historical data layer, such as cache tables, views (virtual or materialized), and queries. (A minimal sketch of the three layers follows this list.)
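To make the pattern concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and dates (raw_customers, hist_customers, load_date, and so on) are illustrative assumptions rather than anything prescribed by this book; the point is the append-only historical table that enables time travel and the virtual view that acts as a simple analytics layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Raw data layer: a staging table holding today's export from a source system.
conn.execute("CREATE TABLE raw_customers (customer_id INT, city TEXT)")
conn.executemany("INSERT INTO raw_customers VALUES (?, ?)",
                 [(1, "Amsterdam"), (2, "London")])

# Historical data layer: append-only; every load is stacked with a load_date,
# so old values are never overwritten and history is preserved.
conn.execute("CREATE TABLE hist_customers (customer_id INT, city TEXT, load_date TEXT)")
conn.execute("INSERT INTO hist_customers "
             "SELECT customer_id, city, '2023-01-01' FROM raw_customers")

# A later load: customer 2 has moved, so we stack a new row instead of updating.
conn.execute("INSERT INTO hist_customers VALUES (2, 'Paris', '2023-02-01')")

# Analytics data layer: a virtual view (just a query) exposing the latest state.
conn.execute("""
    CREATE VIEW current_customers AS
    SELECT customer_id, city
    FROM hist_customers h
    WHERE load_date = (SELECT MAX(load_date) FROM hist_customers
                       WHERE customer_id = h.customer_id)
""")
print(conn.execute("SELECT * FROM current_customers ORDER BY customer_id").fetchall())
# [(1, 'Amsterdam'), (2, 'Paris')]

# Time travel: query the data as it was on a given date in the past.
as_of = "2023-01-15"
print(conn.execute("""
    SELECT customer_id, city
    FROM hist_customers h
    WHERE load_date = (SELECT MAX(load_date) FROM hist_customers
                       WHERE customer_id = h.customer_id AND load_date <= ?)
    ORDER BY customer_id
""", (as_of,)).fetchall())
# [(1, 'Amsterdam'), (2, 'London')]  -- customer 2 was still in London on that date
```

In production, these layers typically live in separate systems (for example, object storage for the raw layer, a data vault for the historical layer, and a query engine on top), but the append-and-filter pattern stays the same.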

These three layers contain the data in production. For model development and training, data can be offloaded into a dedicated model development environment such as a Databricks cluster or an Amazon SageMaker instance. In that case, an extra layer can be added:

  • Model training layer: A set of tools (databases, file stores, machine learning frameworks, and so on) that allows data scientists to build models and train them with massive amounts of data.
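The sketch below is a minimal, self-contained illustration of this layer rather than the book's own example: it trains a small linear model with PyTorch (the framework used elsewhere in this book) on synthetic data that stands in for a feature export from the analytics layer.

```python
import torch
from torch import nn

# In practice, these features would be exported from the analytics layer into the
# model development environment; random data stands in here so the sketch runs on its own.
X = torch.randn(1000, 8)                                  # 1,000 samples, 8 features
y = X @ torch.randn(8, 1) + 0.1 * torch.randn(1000, 1)    # synthetic target

model = nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

print(f"final training loss: {loss.item():.4f}")
```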

For scenarios where data is not ingested in batches but streamed in continuously, such as a system that processes sensor data from machines in a factory, specific infrastructure and software must be set up. In those cases, a new layer takes over the role of the raw data layer:

  • Streaming data layer: An event bus that can store large amounts of continuously inflowing data streams, combined with a streaming data engine that reads data from the event bus in real time and analyzes it. The streaming engine can also read from and write to data stores in other layers, for example, to combine real-time events from the event bus with historical customer data from a historical data view. (A minimal sketch of this pattern follows.)
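As a minimal illustration of this layer (and not a production setup), the sketch below simulates the event bus with a Python generator and the streaming engine with a plain loop that enriches each incoming event with historical customer data. In a real system, the event bus would be something like Apache Kafka and the engine something like Spark Structured Streaming or Flink; all names and values in the sketch are illustrative.

```python
import time
from typing import Dict, Iterator

# Historical data layer stand-in: customer attributes gathered over time.
customer_history: Dict[int, Dict[str, str]] = {
    1: {"segment": "retail", "country": "NL"},
    2: {"segment": "business", "country": "UK"},
}

def event_bus() -> Iterator[dict]:
    """Simulates a stream of usage events arriving on an event bus."""
    events = [
        {"customer_id": 1, "bytes_used": 1200},
        {"customer_id": 2, "bytes_used": 5400},
        {"customer_id": 1, "bytes_used": 300},
    ]
    for event in events:
        time.sleep(0.1)  # pretend events trickle in over time
        yield event

def streaming_engine() -> None:
    """Reads events in (near) real time and joins them with historical data."""
    for event in event_bus():
        history = customer_history.get(event["customer_id"], {})
        enriched = {**event, **history}
        print("enriched event:", enriched)

streaming_engine()
```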

Depending on the requirements for data storage and analysis, a different set of technologies can be chosen for each layer. The data stores don't have to be physical file stores or databases; an in-memory database, a graph database, or even a virtual view (just queries) can serve as a proper data storage mechanism. Working with large datasets in complex machine learning algorithms requires special attention, since the models to be trained need big datasets to be stored and used, but only for a relatively short period of time.

To summarize, the data layers of an AI solution resemble those of many other data-driven architectures, with the addition of the model training layer and possibly the streaming data layer. But there is a bigger shift happening, namely the move from data warehouses to data lakes. We'll explore that paradigm shift in the next part of this chapter.

From Data Warehouse to Data Lake

The common architecture of a data processing system is shifting from traditional data warehouses that run on-premises toward modern data lakes in the cloud. This shift is being driven by a new technology wave that started in the "big data" era with Hadoop, around 2006. Since then, more and more technology has arrived that makes it easier to store and process large datasets and to build models efficiently, making use of advanced concepts such as distributed data, in-memory storage, and graphics processing units (GPUs). Many organizations saw the opportunities of these new technologies and started migrating their report-driven data warehouses toward data lakes, with the aim of becoming more predictive and getting more value out of their data.

Alongside this technology shift, we see a move toward more virtual data layers. Since computing power has become cheap and parallel processing is now common practice, it's possible to virtualize the analytics layer instead of storing its data on disk. The following figure illustrates this; while the patterns and layers are almost the same, the technology and approach differ:

Figure 2.4: Data pipelines in a data warehouse and a data lake

The following table highlights the similarities and differences between the data platforms:

Figure 2.5: Comparison between data warehouses and data lakes

Note

The preceding table has been written to highlight the differences between data warehouses and data lakes, but there is a "gray zone": many data solutions don't fall neatly into either extreme, for example, ETL pipelines that run in the cloud.

To understand data layers better, let's revisit our bank example. The architects of PacktBank started by defining a new data lake architecture that should be the data source for all AI projects. Since the bank did not foresee any streaming projects in the short term, the focus was on batch processing; every source system had to upload a daily export to the data lake. This data was then transformed into a relational data vault with time-travel capabilities. Finally, a set of views allowed quick access to the historical data for a set of use cases. The data scientists built models on top of the data from these views, which was temporarily stored in a model development environment (Amazon SageMaker). Where needed, the data scientists could also access the raw data. All these access points were regulated with role-based access control and monitored extensively. The following figure shows the final architecture of the solution:

Figure 2.6: Sample data lake architecture
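As an illustration of the daily batch uploads in this architecture, the following sketch shows how a source system might drop its export into an object-store-based raw data layer, partitioned by load date. The bucket and path names are hypothetical, and the book's figure does not prescribe Amazon S3 specifically; the sketch only assumes an object store reachable through boto3 with credentials already configured.

```python
from datetime import date

import boto3  # assumes AWS credentials are configured in the environment

# Hypothetical names -- not PacktBank's actual setup.
BUCKET = "packtbank-data-lake"
SOURCE_SYSTEM = "core_banking"

def upload_daily_export(local_path: str) -> str:
    """Uploads one source system's daily export into the raw data layer,
    partitioned by load date so each day's batch is kept separately."""
    key = f"raw/{SOURCE_SYSTEM}/load_date={date.today().isoformat()}/export.csv"
    s3 = boto3.client("s3")
    s3.upload_file(local_path, BUCKET, key)
    return key

# Example: upload_daily_export("/tmp/core_banking_export.csv")
```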

Exercise 2.01: Designing a Layered Architecture for an AI System

The purpose of this exercise is to get acquainted with the layered architecture that is common to large AI systems.

For this exercise, imagine that you must design a new system for a telecom organization; let's call it PacktTelecom. The company wants to analyze internet traffic, call data, and text messages on a daily basis and perform predictive analysis on the dataset to forecast the load on the network. The data itself is produced by the clients of the company, who use their smartphones on the network. The company is not interested in the content of the traffic itself, but most of the metadata is of interest. The AI system is considered part of a new data lake, which will be created with the aim of supporting many similar use cases in the future. Data from multiple sources should be combined, analyzed in reports, and made available to websites and mobile apps through a set of APIs.

Now, answer the following questions for this use case:

  1. What is the data source of the use case? Which systems are producing data, and in which way are they sending data to the data lake?

    The primary data source is the clients' smartphones. They are continuously connected to the company's network and send metadata (for example, about internet traffic and text messages) to the core systems. These core systems send a daily batch of data to the data lake; if required, this can be replaced with real-time streaming at a later stage.

  2. Which data layers should you use for the use case? Is it a streaming infrastructure or a more traditional data warehousing scenario?

    A raw data layer stores the daily batches that are sent from the core systems. A historical data layer is used to build a historical overview per customer. An analytics layer is used to query the data in an efficient way. At a later stage, a streaming infrastructure can be realized to replace the daily batches.

  3. What data preparation steps need to be done in the ETL process to get the raw data in shape so that it's useful to work with?

    The raw data needs to be cleaned; the content of the messages and traffic must be removed. Some metadata might have to be added from other data sources, for example, client information or sales data (a minimal sketch of these steps follows the answers).

  4. Are there any models that need to be trained? If so, which layer are they getting their data from?

    To forecast the network load, a machine learning model needs to be created that is trained on the daily data. This data is gathered from the historical data layer.
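Here is a minimal sketch of the preparation steps from answer 3, using pandas: the message content is dropped and client metadata is joined in from another source. All column names and values are hypothetical.

```python
import pandas as pd

# Hypothetical raw daily batch of text-message events.
raw_messages = pd.DataFrame({
    "client_id": [1, 2, 1],
    "timestamp": pd.to_datetime(["2023-05-01 09:00", "2023-05-01 09:05", "2023-05-01 09:10"]),
    "content": ["hi", "meeting at 10", "ok"],   # content must not be kept
    "size_bytes": [120, 480, 90],
})

# Hypothetical client reference data from another source system.
clients = pd.DataFrame({
    "client_id": [1, 2],
    "region": ["north", "south"],
    "subscription": ["prepaid", "postpaid"],
})

# Cleaning: drop the message content, keeping only the metadata.
prepared = raw_messages.drop(columns=["content"])

# Enrichment: add client metadata from another data source.
prepared = prepared.merge(clients, on="client_id", how="left")

print(prepared)
```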

By completing this exercise, you have reasoned about the layers in an AI system and (partly) designed an architecture.

Requirements per Infrastructure Layer

Let's dive into the requirements for each part of a data solution. Based on the requirements of each layer, an architect can choose the appropriate technology options for that layer. We'll list the most important requirements per category here.

Some requirements apply to all data layers and can, therefore, be considered generic; for example, scalability and security are always important for a data-driven system. However, we've chosen to list them for each layer separately, because each layer has its own specific points of attention for these generic requirements.

The following table highlights the most important requirements per data layer:

Figure 2.7: Important requirements per data layer
