Principles of Strategic Data Science

By: Peter Prevos

Overview of this book

Mathematics and computer science form an integral part of data science, and understanding them is crucial for efficiently managing data. This book is designed to take you through the entire data science pipeline and help you join the dots between mathematics, programming, and business analysis. You’ll start by learning what data science is and how organizations can use it to revolutionize the way they use their data. The book then covers the criteria for the soundness of data products and demonstrates how to effectively visualize information. As you progress, you’ll discover the strategic aspects of data science by exploring the five-phase framework that enables you to enhance the value you extract from data. Toward the concluding chapters, you’ll understand the role of a data science manager in helping an organization take the data-driven approach. By the end of this book, you’ll have a good understanding of data science and how it can enable you to extract value from your data.

The Elements of Data Science

Now that we have defined data science within the context of managing a business, we can start describing the elements of data science. The best way to unpack the art and craft of data science is through Drew Conway's often-cited Venn diagram, as shown in Figure 1.3 (Conway, D. (2010). The data science Venn diagram. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram. Downloaded 27 January 2019).

Conway defines three competencies that a data scientist, or a data science team as a collective, needs to possess. The diagram positions data science as an interdisciplinary activity with three dimensions: domain knowledge, mathematics, and computer science. A data scientist is somebody who understands the subject matter under consideration in mathematical terms and writes computer code to solve problems.

Figure 1.3: Conway’s data science Venn diagram

Domain Knowledge

The most significant skill within a data science function is domain knowledge. While the results of advanced applied mathematics such as machine learning are impressive, without understanding the reality that these models describe, they are devoid of meaning and can cause more harm than good. Anyone analyzing a problem needs to understand the context of the issues and the potential solutions. The subject of data science is not the data itself, but the reality this data describes. Data science is about things and people in the real world, not about numbers and algorithms.

A domain expert understands the impact of any confounding variables on the outcomes. An experienced subject-matter expert can quickly perform a sanity check on the process and results of the analysis. Domain knowledge is essential because each area of expertise uses a different paradigm to understand the world.

Each domain of human enquiry or activity has different methodologies to collect and analyze data. Analyzing objective engineering data follows a different approach to subjective data about people or unstructured data in a corpus of text. The analyst needs to be familiar with the tools of the trade within the problem domain. The example of a graduate professional beating a team of machine learning experts with a linear regression shows the importance of domain knowledge.

Domain expertise can also become a source of bias and prevent innovative ways of looking at information. Solutions developed through systematic research can contradict long-held beliefs about a specific topic that are sometimes hard to shift. Implementing data science is thus as much a cultural process as it is a scientific one, which is the topic of Chapter 4, The Data-Driven Organization.

Mathematical Knowledge

The analyst uses mathematical skills to convert data into actionable insights. Mathematics consists of pure mathematics as a science, and applied mathematics that helps us to solve problems. The scope of applied mathematics is broad, and data science is opportunistic in choosing the most suitable method. Various types of regression models, graph theory, k-means clustering, decision trees, and so on, are some of the favorite tools of a data scientist. The creative application of complex applied mathematics is one of the two distinguishing factors between traditional business analysis and data science.

Combining subject-matter expertise with mathematical skills is the domain of traditional research and analysis. Conventional research is, however, evolving toward the principles of data science by adopting reproducible computer code and sharing source data through websites such as FigShare (https://figshare.com/).

Numbers are the foundations of mathematics, and the craft of quantitative science is to describe our analogue reality in a model that we can manipulate to predict the future. Mathematical skills are not necessarily about numbers, however; they can also revolve around logical relationships between words and concepts. Contemporary numerical methods help us to understand relationships between people, the logical structure of a text, and many other aspects beyond the realm of traditional numeric analysis.

Computer Science

Not that long ago, most of the information collected by an organization was stored on paper and archived in copious volumes of lever arch files. Analyzing this information was an arduous task that involved many hours of transcribing information into a format that was useful for analysis.

In the twenty-first century, almost all data is an electronic resource. To create value from this resource, data engineers extract it from a database, combine it with other sources, and clean the data before analysts can make sense of it. This requirement implies that a data scientist needs to have computing skills. Conway uses the term hacking skills, which many people interpret as negative. Conway is, however, not referring to a hacker in the sense of somebody who nefariously uses computers, but in the original meaning of the word as a developer with creative computing skills. The core competency of a hacker, developer, coder, or whatever other term might be preferable, is algorithmic thinking and understanding the logic of data structures. These competencies are vital in extracting and cleaning data to prepare it for the next step of the data science process.
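
The extract, combine, and clean steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than a recommended pipeline: the in-memory database, the customers table, the regions lookup, and all function names are hypothetical and exist only for this example.

```python
import sqlite3


def build_demo_database():
    """Create a small in-memory database (hypothetical data for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, usage REAL)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?, ?)",
        [(1, " alice ", 12.5), (2, "BOB", None), (3, "carol", 7.0), (3, "carol", 7.0)],
    )
    return conn


def extract(conn):
    """Extract the raw rows from the database, in a predictable order."""
    return conn.execute("SELECT id, name, usage FROM customers ORDER BY id").fetchall()


def combine(rows, regions):
    """Combine each row with a second source, keyed by customer id."""
    return [
        {"id": cid, "name": name, "usage": usage, "region": regions.get(cid)}
        for cid, name, usage in rows
    ]


def clean(records):
    """Drop incomplete and duplicate records, and normalize the names."""
    seen, cleaned = set(), []
    for rec in records:
        if rec["usage"] is None or rec["region"] is None or rec["id"] in seen:
            continue
        seen.add(rec["id"])
        cleaned.append({**rec, "name": rec["name"].strip().title()})
    return cleaned


conn = build_demo_database()
regions = {1: "north", 2: "south", 3: "east"}  # hypothetical second data source
tidy = clean(combine(extract(conn), regions))
print(tidy)
```

Because every step is an explicit function, the same preparation can be rerun on new data without repeating any manual work.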

The importance of hacking skills for a data scientist implies that we should move away from point-and-click systems and spreadsheets and instead write code in a suitable programming language. The flexibility and power of a programming language far exceed the capabilities of graphical user interfaces and lead to reproducible analysis, as discussed in Chapter 2, Good Data Science.

The mathematical interpretation of reality needs to be translated into computer code. One of the factors that propelled data science into popularity is that the available toolkit has grown substantially in the past ten years. Open source computing languages such as R and Python can implement complex algorithms that were previously the domain of specialized software and supercomputers. Open source software has accelerated innovation in how we analyze data and has placed complex machine learning within reach of anyone who is willing to learn the skills.

Conway defines the danger zone as the area where domain knowledge and computing skills combine, without a good grounding in mathematics. Somebody might have enough computing skills to be pushing buttons on a business intelligence platform or spreadsheet. The user-friendliness of some analysis platforms can be detrimental to the outcomes of the analysis because they create the illusion of accuracy. Point-and-click analysis hides the inner workings from the user, creating a black-box result. Although the data might be perfectly structured, valid and reliable, a wrongly applied analytical method leads to useless outcomes.
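
A small sketch can make the danger zone concrete. The example below is not taken from Conway or from this book; it assumes a made-up data set in which y depends entirely on x, but not in a straight line. A linear correlation coefficient, applied without checking whether the relationship is linear, reports no relationship at all.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient, which only measures linear association."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical, perfectly clean data: y is fully determined by x.
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]

# Wrongly applied method: a linear measure on a symmetric, curved relationship.
r_linear = pearson(xs, ys)

# Method that matches the structure of the data: correlate y with x squared.
r_squared_feature = pearson([x ** 2 for x in xs], ys)

print(r_linear, r_squared_feature)
```

The first coefficient is zero even though the data is valid and reliable, while the second is one. The data did not change between the two calculations; only the choice of method did.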

The Unicorn Data Scientist?

Conway's diagram is often cited in the literature on data science. His simple model helped to define the craft of data science. Other data scientists have proposed more complex models, but they all originate with Conway's basic idea.

The diagram illustrates that the difference between data science and traditional research skills or business analytics lies in the ability to understand and write code. A data scientist understands the problem they seek to resolve, has the mathematical expertise to analyze the problem, and possesses the computing skills to convert this knowledge into outcomes.

It could be argued that the so-called soft skills are missing from this picture. However, communication, managing people, facilitating change, and so on are competencies that belong to every professional who works in a complex environment, not just the data scientist.

Some critics of this idea point out that these people are unicorns: data scientists who possess all these skills are mythical employees who don't exist in the real world. Most data scientists start from either mathematics or computer science, after which it is hard to become a domain expert. This book is written from the point of view that we can breed unicorns by teaching domain experts how to write code and, where required, enhancing their mathematical skills.