Data Models and Structured Data

When you build an analytical solution, the first thing you need to do is build a data model. A data model is an overview of the data sources you will be using, their relationships with other data sources, where exactly the data from each source will be fetched from, and in what form (such as an Excel file, a database, or JSON from an internet source).

Note

Keep in mind that the data model evolves as data sources and processes change.

A data model can contain data of the following three types:

Structured data: Also known as completely structured or well-structured data, this is the simplest form of data to manage. The data is arranged in a flat tabular form, with each value stored under the attribute it belongs to. There is a unique column, known as an index, for easy and quick access to the data, and there are no duplicate columns. For example, in Figure 1.1, employee_id is the unique column. Using the values in this column, you can run SQL queries and quickly access the data at a specific row and column in the dataset. Furthermore, there are no empty rows or missing entries, which makes this dataset quite easy to work with. What makes structured data so ubiquitous and easy to analyze is that it is stored in a standardized tabular format that makes adding, updating, and deleting entries easy and programmable. With structured data, you may not have to put in much effort during the data preparation and cleaning stage.
Data stored in relational databases such as MySQL and Amazon Redshift is an example of structured data:

Figure 1.1: Data in a MySQL table
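To make the role of the unique column concrete, here is a minimal sketch of a keyed lookup. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL, and the table name, column names other than employee_id, and row values are all hypothetical rather than taken from Figure 1.1:

```python
import sqlite3

# In-memory SQLite database used as a stand-in for a MySQL table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, "  # the unique column used for lookups
    "name TEXT NOT NULL, "
    "department TEXT NOT NULL)"
)

# Hypothetical rows: every row has a value for every attribute.
rows = [(1, "Asha", "Marketing"), (2, "Ben", "Sales"), (3, "Chloe", "Finance")]
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", rows)

# Access one column of one row through the unique key.
department = conn.execute(
    "SELECT department FROM employees WHERE employee_id = ?", (2,)
).fetchone()[0]
print(department)

conn.close()
```

Because employee_id is declared as the primary key, the database rejects a second row with the same value, which is what guarantees that the lookup returns at most one row.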

Semi-structured data: You will not find semi-structured data stored in a strict tabular layout like the one you saw in Figure 1.1. However, it still has its own hierarchies that group its elements and establish relationships between them. For example, the metadata of a song may include information about the cover art, the artist, the song length, and even the lyrics. You can search for the artist's name and find the song you want. Such data does not have a fixed hierarchy mapping a unique column to rows in an expected format, and yet you can find the information you need.
Another example of semi-structured data is a JSON file. JSON files are self-describing and can be understood easily. In Figure 1.2, you can see a JSON file that contains personally identifiable information about Jack Jones.
Semi-structured data is commonly stored in NoSQL databases.

Figure 1.2: Data in a JSON file
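As a rough illustration of what "self-describing" means, the following sketch parses a small JSON document with Python's built-in json module. The field names and values are invented for illustration and are not the contents of Figure 1.2:

```python
import json

# Hypothetical record; the keys and values are illustrative only.
raw = """
{
  "name": {"first": "Jack", "last": "Jones"},
  "age": 34,
  "emails": ["jack@example.com", "jj@example.org"]
}
"""

record = json.loads(raw)

# Elements are reached by following the hierarchy of keys,
# not by a row and column position.
last_name = record["name"]["last"]
email_count = len(record["emails"])
print(last_name, email_count)
```

Each value travels with its own key, so a reader (or a script) can tell what a value means without a separate schema, even though different records may nest or repeat fields differently.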

Unstructured data: Unstructured data may not be tabular, and even if it is tabular, the number of attributes or columns per observation may be completely arbitrary. The same data could be represented in different ways, and the attributes might not match each other, with values leaking into other parts.
For example, think of reviews of various products stored in rows of an Excel sheet, or a dump of the latest tweets from a company's Twitter profile. We can only search for specific keywords in that data; we cannot readily map it onto the fixed columns of a relational database, nor can we establish a concrete hierarchy between different elements or rows. Unstructured data can be stored as text files, CSV files, Excel files, images, and audio clips.

Marketing data traditionally comprises all three of these data types. Most data points originate from different data sources, which causes several problems: the values of a field can be of different lengths, the values of a field in one source may not match those in another because the field names differ, and some rows may have missing values for some of the fields.
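The three problems listed above can be sketched with plain Python dictionaries. The source names, field names, and the three-character ID width are all assumptions made for this example:

```python
# Two hypothetical sources describing customers with different field names.
crm_rows = [{"cust_id": "007", "email": "a@example.com"}]
web_rows = [{"CustomerID": "42", "Email": None}]

# Map each source-specific field name to one shared name.
FIELD_MAP = {
    "cust_id": "customer_id",
    "CustomerID": "customer_id",
    "email": "email",
    "Email": "email",
}

def normalize(row):
    """Rename fields and pad IDs so that values have a common length."""
    out = {FIELD_MAP[key]: value for key, value in row.items()}
    out["customer_id"] = str(out["customer_id"]).zfill(3)
    return out

combined = [normalize(row) for row in crm_rows + web_rows]

# Flag rows that are missing a value instead of silently dropping them.
missing_email = [row["customer_id"] for row in combined if row["email"] is None]
print(combined)
print(missing_email)
```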

You'll soon learn how to effectively tackle such problems with your data using Python. The following diagram illustrates what a data model for marketing analytics looks like. The data model comprises all kinds of data: structured data such as databases (top), semi-structured data such as JSON (middle), and unstructured data such as Excel files (bottom):

Figure 1.3: Data model for marketing analytics

As the data model becomes more complex, the probability of having bad data increases. For example, a marketing analyst working with the demographic details of customers can mistakenly read the age of a customer as a text string instead of a number (integer). In such situations, the analysis would go haywire, as the analyst cannot apply aggregation functions such as finding the average age of the customers. These situations can be avoided by running a proper data quality check to ensure that the data chosen for further analysis is of the correct data type.
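A data type check of this kind might look like the following minimal sketch, which assumes pandas is installed and uses a made-up table in which age has been read as text:

```python
import pandas as pd

# Hypothetical customer data in which age was read as text, not integers.
customers = pd.DataFrame(
    {"customer_id": [101, 102, 103], "age": ["34", "29", "42"]}
)

# Quality check: is the column numeric before we aggregate it?
age_was_numeric = pd.api.types.is_numeric_dtype(customers["age"])

if not age_was_numeric:
    # to_numeric raises ValueError on values it cannot parse, so bad
    # entries surface here instead of distorting the analysis later.
    customers["age"] = pd.to_numeric(customers["age"])

mean_age = customers["age"].mean()
print(age_was_numeric, mean_age)
```

Letting the conversion fail loudly is one design choice; passing errors="coerce" to pd.to_numeric would instead turn unparseable entries into missing values, which then need to be handled explicitly.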

This is where programming languages such as Python come into play. Python is a general-purpose programming language that integrates with almost every platform and helps automate data production and analysis.

Apart from helping you understand patterns and give data at least a basic structure, Python scripts can enforce that each attribute in the data model receives a value of the right type. The following diagram illustrates how most marketing analytics pipelines today structure different kinds of data by passing them through scripts to make the data at least semi-structured:

Figure 1.4: Data model of most marketing analytics that use Python

By making use of such structure-enforcing scripts, you will have a data model of semi-structured data coming in with expected values in the right fields; however, the data is not yet in the best possible format for analytics. If you can completely structure your data (that is, arrange it in flat tables, with the right value pointing to the right attribute and no nesting), it will be easy to see how every data point compares to the others with the help of common fields. You can easily get a feel for the data (that is, see in what range most values lie, identify the clear outliers, and so on) by simply scrolling through it.
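As one possible way to go from nested, semi-structured records to a flat table, the sketch below uses pandas.json_normalize. The records and their field names are hypothetical:

```python
import pandas as pd

# Hypothetical semi-structured records with a nested "name" field.
records = [
    {"customer_id": 1, "name": {"first": "Jack", "last": "Jones"}, "age": 34},
    {"customer_id": 2, "name": {"first": "Mia", "last": "Chen"}, "age": 29},
]

# Flatten the nesting: each nested key becomes its own column,
# named by joining the path with the separator.
flat = pd.json_normalize(records, sep="_")

print(sorted(flat.columns))
print(flat)
```

Once the data is flat, every row has the same attributes, so range checks and comparisons across customers become ordinary column operations.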

While many tools can be used to convert data from an unstructured or semi-structured format to a fully structured format (for example, Spark, STATA, and SAS), the tool most widely used for data science is pandas. It can be integrated with practically any framework, has rich functionality and minimal cost, and is easy to use for our use case.

Data Science for Marketing Analytics [Instructor Edition]

By : Tommy Blanchard, Pranshu Bhatnagar, Debasish Behera
