Architecting Data-Intensive Applications

By: Anuj Kumar

Overview of this book

Are you an architect or a developer who looks at your own applications gingerly while browsing through Facebook and applauding it silently for its data-intensive, yet fluent and efficient, behaviour? This book is your gateway to build smart data-intensive systems by incorporating the core data-intensive architectural principles, patterns, and techniques directly into your application architecture.

This book starts by taking you through the primary design challenges involved with architecting data-intensive applications. You will learn how to implement data curation and data dissemination, depending on the volume of your data. You will then implement your application architecture one step at a time. You will get to grips with implementing the correct message delivery protocols and creating a data layer that doesn’t fail when running high traffic. This book will show you how you can divide your application into layers, each of which adheres to the single responsibility principle. By the end of this book, you will learn to streamline your thoughts and make the right choice in terms of technologies and architectural principles based on the problem at hand.

What constitutes a data ecosystem?


Nowadays, data comes from a variety of sources, at varying speeds, and in a number of different formats. Understanding data and its relevance is the most important task for any data-driven organization. 

To understand the importance of data, the data czars in an organization should look at all possible sources of data that may be important to them. Being far-sighted helps, although, given the pace of modern society, it is almost impossible to gather data from every relevant source. Hence, it is important that the people identifying relevant data sources are also well aware of the business landscape in which they operate. This knowledge will help tremendously in averting problems later. They should also be aware that data can be sourced from both inside and outside an organization, since, at the broadest level, data is first classified as either internal or external.

Given the opportunity, an organization should first convert its internal data into information. Handling internal data first helps the organization to understand its importance early in the life cycle, without needing to set up a complex system, thereby making the process agile. It also gives the technical team an opportunity to understand what technology and architecture would be most appropriate in their situation. Distinguishing internal from external data also helps organizations not only to put a reliability rating on data, but also to define any security rules in connection with it.

So, what are the different sources of data that an organization can utilize to its benefit? The following diagram depicts a part of the landscape that constitutes the data ecosystem. I say "a part" because the landscape is so huge that listing all of it would not be possible:

The data sources mentioned above can be categorized as internal or external, depending upon the business segment in which an organization operates. For example, for an organization such as Facebook, all the social media-related data on its website constitutes an internal source, whereas the same data represents an external source for an advertising firm.

As you may have already noticed, the preceding set of data can broadly be classified into three sub-categories:

Structured data

This type of data has a well-defined structure that can be parsed easily by any standard parser, and it usually comes with a schema that defines that structure. For example, incoming data in XML format with an associated XML schema constitutes what is known as structured data. Examples of such data include Customer Relationship Management (CRM) data and Enterprise Resource Planning (ERP) data.
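As a minimal sketch of what "parsed easily" means in practice, the following Python snippet reads a small XML customer record using only the standard library. The record and its element names (`customer`, `id`, `name`, `email`) are hypothetical, and the required-field check is only a stand-in for real XML Schema validation, which the standard library does not provide.

```python
import xml.etree.ElementTree as ET

# Hypothetical CRM record; the element names are assumptions for illustration.
CUSTOMER_XML = """
<customer>
  <id>42</id>
  <name>Jane Doe</name>
  <email>jane@example.com</email>
</customer>
"""

# Stand-in for a schema: the fields every record is expected to carry.
REQUIRED_FIELDS = ("id", "name", "email")


def parse_customer(xml_text: str) -> dict:
    """Parse one customer record and fail loudly if a required field is missing."""
    root = ET.fromstring(xml_text.strip())
    record = {}
    for field in REQUIRED_FIELDS:
        element = root.find(field)
        if element is None or element.text is None:
            raise ValueError(f"missing required field: {field}")
        record[field] = element.text.strip()
    record["id"] = int(record["id"])  # the assumed schema types id as an integer
    return record


customer = parse_customer(CUSTOMER_XML)
print(customer)
```

Because the structure is known in advance, the parser can reject a malformed record immediately instead of guessing at its meaning.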

Semi-structured data

Semi-structured data has some recognizable organization but no formal schema associated with it. Log data from different machines can be regarded as semi-structured data. For example, a firewall log statement consists of the following fields as a minimum: the timestamp, host IP, destination IP, host port, and destination port, as well as some free text describing the event that resulted in the log statement being generated.
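To make this concrete, here is a minimal sketch that extracts those fields from a single firewall log line with a regular expression. The log layout shown is an assumption made for illustration; real firewalls each use their own formats, so the pattern would need adapting.

```python
import re
from typing import Optional

# Hypothetical layout (an assumption, not a real vendor format):
# "<timestamp> <host_ip>:<host_port> -> <dest_ip>:<dest_port> <free text>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+"
    r"(?P<host_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<host_port>\d+)\s+->\s+"
    r"(?P<dest_ip>\d{1,3}(?:\.\d{1,3}){3}):(?P<dest_port>\d+)\s+"
    r"(?P<message>.*)$"
)


def parse_firewall_line(line: str) -> Optional[dict]:
    """Return the fixed fields plus the free-text message, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["host_port"] = int(fields["host_port"])
    fields["dest_port"] = int(fields["dest_port"])
    return fields


line = "2018-07-01T10:15:30Z 10.0.0.5:51234 -> 203.0.113.7:443 DENY outbound connection blocked by policy"
event = parse_firewall_line(line)
print(event)
```

The fixed fields can be recovered reliably, while the trailing message remains free text that needs further interpretation. That mix is what makes the data semi-structured.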

Unstructured data

Finally, we have data that is unstructured. When I say unstructured, what I really mean is that it is hard to derive any structured information directly from the data itself. It does not mean that we can't get information from unstructured data. Examples of unstructured data include video files, audio files, and blogs; most of the data generated on social media also falls under this category.

One thing to note about any kind of data is that, more often than not, each piece of data will have metadata associated with it. For example, when we take a picture using our cellphone, the picture itself constitutes the data, whereas its properties, such as when it was taken, where it was taken, what the focal length was, its brightness, and whether it was modified by software such as Adobe Photoshop, constitute its metadata.
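The separation between data and metadata can be sketched as two small record types. The class and field names below are hypothetical and chosen only to mirror the cellphone picture example; they do not follow any real image metadata standard.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class PhotoMetadata:
    """Properties that describe the picture rather than form part of it."""
    taken_at: datetime
    location: Optional[Tuple[float, float]]  # (latitude, longitude), if recorded
    focal_length_mm: float
    brightness: float
    modified_by: Optional[str] = None  # name of any editing software, if known


@dataclass(frozen=True)
class Photo:
    pixels: bytes            # the data itself
    metadata: PhotoMetadata  # data about the data


photo = Photo(
    pixels=b"\x89PNG...",  # placeholder bytes, not a real image
    metadata=PhotoMetadata(
        taken_at=datetime(2018, 7, 1, 10, 15, 30),
        location=(52.37, 4.90),
        focal_length_mm=4.2,
        brightness=0.61,
    ),
)

# Metadata can be queried without touching the image payload.
print(photo.metadata.taken_at.year, photo.metadata.modified_by)
```

Keeping the two apart means the metadata can be indexed and searched even when the payload itself is unstructured.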

Sometimes, it is also difficult to clearly categorize data. Consider a security firm that sells hardware appliances, which are installed at the customer's location and collect access log data. The data belongs to the end customer, who has given permission for it to be used for a specific purpose: detecting security threats. Thus, even though the data resides with the security firm, it cannot be used (without consent) for any purpose other than detecting a threat for that specific customer.

This brings us to our next topic: data sharing.

Data sharing

Whenever we collect data from an external source, there is always a clause about how that data can be used. At times, this aspect is implicit, but there are times when you need to provide an explicit mechanism for how the data can be shared by the collecting organization, both within and outside the organization. This distinction becomes important when data is shared between specific organizations. For example, one particular financial institution may decide to share certain information with another financial institution because both are part of a larger consortium that requires them to work collectively towards combating cyber threats. Now, the data on cyber threats that is collected and shared by these organizations may come with certain restrictions. Namely:

  • When should the shared data be used?
  • How may this data be shared with other parties, both within and outside an organization?

Organizations can formalize such a sharing agreement in numerous ways. Two that are defined and used by many organizations are:

  • The traffic light protocol
  • The information exchange policy framework from first.org

Let's discuss each of these briefly.

Traffic light protocol

The traffic light protocol (hereinafter referred to as TLP, https://www.us-cert.gov/tlp and https://www.first.org/tlp) is a set of designations used to ensure that sensitive information is shared with the appropriate audience. TLP was created to facilitate the increased sharing of information between organizations. It employs four colors to indicate the expected sharing boundaries to be applied by the recipient(s):

  • RED
  • AMBER
  • GREEN
  • WHITE

TLP provides a simple and intuitive schema for indicating when and how sensitive information can be shared, thereby facilitating more frequent and effective collaboration. TLP is not a control marking or classification scheme. TLP was not designed to handle licensing terms, handling and encryption rules, and restrictions on action or instrumentation of information. TLP labels and their definitions are not intended to have any effect on freedom of information or sunshine laws in any jurisdiction.

TLP is optimized for ease of adoption, human readability, and person-to-person sharing; it may be used in automated sharing exchanges, but is not optimized for such use.

The source is responsible for ensuring that recipients of TLP information understand and can follow TLP sharing guidance.

If a recipient needs to share the information more widely than is indicated by the original TLP designation, they must obtain explicit permission from the original source.

The United States Computer Emergency Readiness Team provides the following definition of TLP, along with its usage and sharing guidelines:

TLP:RED (not for disclosure, restricted to participants only)

  • When it should be used: Sources may use TLP:RED when information cannot be effectively acted upon by additional parties, and could impact on a party's privacy, reputation, or operations if misused.
  • How it may be shared: Recipients may not share TLP:RED information with any parties outside of the specific exchange, meeting, or conversation in which it was originally disclosed. In the context of a meeting, for example, TLP:RED information is limited to those present at the meeting. In most circumstances, TLP:RED should be exchanged verbally or in person.

TLP:AMBER (limited disclosure, restricted to participants' organizations)

  • When it should be used: Sources may use TLP:AMBER when information requires support to be effectively acted upon, yet carries risks to privacy, reputation, or operations if shared outside of the organizations involved.
  • How it may be shared: Recipients may only share TLP:AMBER information with members of their own organization, and with clients or customers who need to know the information to protect themselves or prevent further harm. Sources are at liberty to specify additional intended limits associated with the sharing: these must be adhered to.

TLP:GREEN (limited disclosure, restricted to the community)

  • When it should be used: Sources may use TLP:GREEN when information is useful for making all participating organizations, as well as peers within the broader community or sector, aware.
  • How it may be shared: Recipients may share TLP:GREEN information with peers and partner organizations within their sector or community, but not via publicly accessible channels. Information in this category can be circulated widely within a particular community. TLP:GREEN information may not be released outside of the community.

TLP:WHITE (disclosure is not limited)

  • When it should be used: Sources may use TLP:WHITE when information carries minimal or no foreseeable risk of misuse, in accordance with applicable rules and procedures for public release.
  • How it may be shared: Subject to standard copyright rules, TLP:WHITE information may be distributed without restriction.

Remember that this is guidance and not a rule. Therefore, if an organization feels the need to have further types of restrictions, it may certainly do so, provided the receiving entity is either aware of them or is not exposed to the extension.
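The sharing guidance above can be approximated in code as a lookup from each TLP color to the widest audience it permits. The following is only a sketch under stated assumptions: the `Audience` tiers and the `may_share` helper are hypothetical names introduced here, and the model deliberately ignores nuances such as the need-to-know condition for clients under TLP:AMBER or any additional limits a source specifies.

```python
from enum import Enum


class TLP(Enum):
    RED = "RED"
    AMBER = "AMBER"
    GREEN = "GREEN"
    WHITE = "WHITE"


class Audience(Enum):
    """Hypothetical audience tiers, ordered from narrowest to widest."""
    PARTICIPANTS = 1  # those present in the original exchange or meeting
    ORGANIZATION = 2  # the recipient's own organization and its need-to-know clients
    COMMUNITY = 3     # peers and partners within the sector or community
    PUBLIC = 4        # anyone, including publicly accessible channels


# Widest audience each designation permits, following the guidance above.
MAX_AUDIENCE = {
    TLP.RED: Audience.PARTICIPANTS,
    TLP.AMBER: Audience.ORGANIZATION,
    TLP.GREEN: Audience.COMMUNITY,
    TLP.WHITE: Audience.PUBLIC,
}


def may_share(label: TLP, audience: Audience, source_permission: bool = False) -> bool:
    """Return True if information with this label may reach the given audience.

    Explicit permission from the original source overrides the default boundary.
    """
    if source_permission:
        return True
    return audience.value <= MAX_AUDIENCE[label].value


print(may_share(TLP.GREEN, Audience.ORGANIZATION))
print(may_share(TLP.GREEN, Audience.PUBLIC))
```

An organization that extends TLP with its own restrictions could add entries to the lookup table, as long as the receiving parties know what the extra labels mean.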