Designing a preliminary data model

Earlier in this chapter, we made two simple assumptions about data modeling:

Most of the time, a star schema design will provide the most efficient storage and query performance for business intelligence data models
Basic ML models, such as the ones you can build in this book, are usually created with a flattened table

Now that you have a grasp of the underlying data and requirements, it is time to think about the data model for your FAA Wildlife Strike data solution. Logically, you can describe your tables of data as follows:

STRIKE_REPORTS (from wildlife.accdb): Each row represents a report that was filed. The table of data contains both descriptive values (date, location, and type) along with values that can be summed up and averaged (height and costs).
Engine Codes (from read-me.xls): This contains information about the aircraft engines that can be tied to STRIKE_REPORTS.
Aircraft Type (from read-me.xls): This contains information about the aircraft that can be tied to STRIKE_REPORTS.
Engine Position (from read-me.xls): This contains information about the aircraft engine positions that can be tied to STRIKE_REPORTS.

At this point, you are faced with some data model choices. No matter what decision you make, some people might question your architecture, since there is no perfect design. Depending on how end users will use the data, the data model design may change. This book will demonstrate some of the differences in data model designs for ML models versus traditional BI designs. At a high level, there are three basic approaches you can take in Power BI:

Flatten: You can flatten all the data onto a single table by joining Engine Codes, Aircraft Type, and Engine Position onto STRIKE_REPORTS.
Star schema: You can build out a true star schema with STRIKE_REPORTS as a fact table and Engine Codes, Aircraft Type, and Engine Position as dimension tables. Some additional data from STRIKE_REPORTS would also be broken out into separate dimension tables. For example, AIRPORT_ID, AIRPORT, STATE, and FAAREGION could be separate dimension tables.
Hybrid design: You can build out a hybrid design using both a flattened and star schema design pattern for the sake of practicality and ease of use.

Let’s look at each of these in turn.

Flattening the data

Flattening the FAA Wildlife Strike reports’ data would require joining the Engine Codes, Aircraft Type, and Engine Position tables onto the STRIKE_REPORTS table so that everything is on one big flat table of data. The result would be something that looks like this:

Figure 1.12 – Tables on the left are combined to form a single table on the right

The following table contains some, but not all, of the pros and cons of a flattened table of data:

Pros	Cons
Simplicity No joins needed for code Commonly used by data scientists Can compress well with columnar databases No relational data models for business users	Repetitive data can lead to an inefficient storage footprint Limitations for queries with advanced logic Less flexibility for future change and evolution of solution Complex logical queries can be less efficient

Figure 1.13 – Pros and cons of a flattened table for BI

Next, let’s look at the star schema.

Star schema

A true star schema built to best practices would include relationships between Engine Codes, Aircraft Type, and Engine Position with the STRIKE_REPORTS table. It would also break off parts of the STRIKE_REPORTS table into smaller dimension tables. The following figure is a representation of the approach for a true star schema. There may be more dimension tables that would need to be broken off of the STRIKE_REPORTS table in addition to Location and Species, but this is an example of how it might look:

Figure 1.14 – Tables on the left are combined into a star schema, and some data is split off into new dimension tables

The following table contains some, but not all, of the pros and cons of a true star schema design:

Pros	Cons
Tables often line up with business logic Balance of minimal data duplication and efficient queries Usually expandable if the scope of the solution grows and new data is introduced Traditionally considered the gold standard for BI data models	With modern tools, the benefits of reducing data duplication are less impactful versus older tools Complicated ETL Machine learning models are usually trained with flat tables of data Don’t always scale well with very large data volumes having tens of billions of rows

Figure 1.15 – Pros and cons of a star schema for BI

Hybrid design

For the FAA Wildlife Strike data, combining aspects of a flattened design and a star schema is also an option. At the time of this book’s writing, the entire STRIKE_REPORTS table is fewer than 300,000 rows and has fewer than 100 columns. Only two columns contain verbose free text, so data volume is not an issue when using Power BI. For this particular use case, the differences in data storage requirements between flattened and star schema data models are minimal. With data volumes of this small size, you can design the data model to meet the needs of the solution without some of the performance concerns that would be introduced for data sources with tens of millions of rows or hundreds of columns containing free text fields. Columns of data left on a transaction table that will be used as categories are technically called degenerate dimensions. A hybrid design could look something like the following example:

Figure 1.16 – Tables from the left are enhanced and combined into a star schema with some descriptive data still in the fact table

The following table contains some, but not all, of the pros and cons of a hybrid design:

Pros	Cons
Rapid prototyping Less logic in the data transformation layer Flexible design	Possibly less performant than a star schema for traditional BI Additional logical complexity for users versus a big flat table Data will still need to be flattened out for machine learning Not perfect for either BI or ML, but a compromise between the two

Figure 1.17 – Pros and cons of a hybrid design and considerations for additional data

Before finalizing a preliminary logical design for your FAA Wildlife Strike solution, take a step back to think about the data and the requirements. You can review the expected deliverables from earlier in the chapter, including an analytic report and predictions of damage, size, and height.

In addition to the FAA Wildlife Strike data you’ve been using, what other data might be useful for the solution? Also, what is the effort to get the data? Here are a few examples that you could research:

Additional Data Sources	Level of Effort
Date-based table of aggregations such as Month, Quarter, Season, and Holidays	Easy
Time-based table of aggregations such as hour, AM/PM, and so on	Easy
Data for flights that didn’t have a wildlife strike could provide a baseline for the percentage of flights with strikes	Difficult
Weather data that could be mapped to the date and time of wildlife strikes	Difficult
Additional data about wildlife species such as weight ranges, habitat ranges, and so on	Difficult

Figure 1.18 – Additional potential data sources for the solution

Additional flight, weather, and wildlife data could provide greater analytic and predictive value for the solution. However, adding those sources would require quite a bit of effort that exceeds the scope of your project and the length of this book. If the initial project goes well, you can circle back to these options for future enhancements.

There may be value in adding a Time table to the solution, so open up Power Query and take another look at the TIME column. Notice that 95% of the entries are empty:

Figure 1.19 – 95% of the values for TIME are empty

Due to a lack of complete data, you decide to leave a Time table out of the initial build.

How about a Date table so that you can roll up data by week, month, quarter, year, holidays, weekends, and more? Looking at the INCIDENT_DATE column in Power Query, it is populated for every entry in the preview:

Figure 1.20 – INCIDENT_DATE is fully populated with date values

INCIDENT_DATE can be used as a key for connecting to a Date table containing many different date-based aggregations. You decide to pull in a Date table for the architecture. The resulting preliminary data model will now look as follows:

Figure 1.21 – A Date table is added to the preliminary data model

The Date table was not present in the source data, but in your reporting model, it will allow you to slice and dice data by day, week, month, quarter, year, weekend, and more. When you explore data in future chapters, it will add new ways to dive into and explore date-based trends. The Date table will be added in Chapter 2.

In the final section of the chapter, we’ll look at what else we need to take into account for ML.

Tech Concepts

Programming languages

Tech Tools

Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.

Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.

50+ new titles added per month and exclusive early access to books as they are being written.

Power BI Machine Learning and OpenAI

By : Greg Beaumont

Power BI Machine Learning and OpenAI

By: Greg Beaumont

Overview of this book

Designing a preliminary data model

Flattening the data

Star schema

Hybrid design

Power BI Machine Learning and OpenAI

By : Greg Beaumont

Power BI Machine Learning and OpenAI

By: Greg Beaumont

Overview of this book

Designing a preliminary data model

Flattening the data

Star schema

Hybrid design

Confirmation

Buy this book with your credits?

Submit Your Feedback

Create a Free Account To Continue Reading

Sign in to activate your 7-day free access