Data analytics discussed in the previous section forms an important step in a data science project. In this section, we will explore the philosophy of data science and the standard life cycle of a data science project.
Data science is all about turning data into products. It is analytics and machine learning put into action to draw inferences and insights out of data. Data science is perceived as an advanced step beyond business intelligence, one that considers all aspects of Big Data.
The following diagram shows the various stages of the data science life cycle, from data availability/loading, through deriving and communicating data insights, to operationalizing the process.
In this phase, we identify the key stakeholders and their interests, key pain points, goals for the project and failure criteria, success criteria, and key risks involved.
Initial hypotheses need to be formed with the help of domain experts/key stakeholders; these would be the basis against which we validate the available data. As an initial step, we would need to come up with several variations of the hypotheses.
The formed hypotheses need a basic validation, and for this we would do a preliminary data exploration. We will deal with data exploration and processing at length in the later chapters.
As a part of this initial step, we identify the kind of data we require to solve the problem in context. We would need to consider the lifespan, volume, and type of the data. Usually, we would need access to the raw base data as opposed to processed/aggregated data. One of the important aspects of this phase is confirming that the required data is available. A detailed analysis would need to be done to identify how much historic data should be extracted for running the tests against the defined initial hypotheses. We would need to consider all the characteristics of Big Data, such as volume, varied data formats, data quality, and data influx speed. At the end of this phase, the final data scope is formed by seeking the required validations from domain experts.
The previous two phases define the analytic project scope that covers both business and data requirements. Now it's time for data exploration or transformation. It is also referred to as data preparation and of all the phases, this phase is the most iterative and time-consuming one.
During data exploration, it is important to keep in mind that there should be no interference with the ongoing organizational processes.
We start by gathering all the kinds of data identified in phase 2 to solve the problem defined in phase 1. This data can be structured, semi-structured, or unstructured, and is usually held in its raw format, as this allows us to try various modeling techniques and derive an optimal one.
Extract, Transform, and Load (ETL): This is all about transforming data against a set of business rules before loading it into a data sandbox for analysis.
Extract, Load, and Transform (ELT): In this case, the raw data is loaded into a data sandbox and then transformed as a part of the analysis. This option is generally recommended over ETL, because transforming the data before loading means cleaning it upfront, which can result in data condensation and loss.
Extract, Transform, Load, and Transform (ETLT): In this case, we would see two levels of transformations:
Level 1 transformation could include steps that involve reduction of data noise (irrelevant data)
Level 2 transformation is similar to what we understood in ELT
In both ELT and ETLT cases, we can gain the advantage of preserving the raw data. One basic assumption for this process is that data would be voluminous and the requirement for tools and processes would be defined on this assumption.
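The ELT option described above can be sketched in a few lines. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the data sandbox, and the table names (`raw_sales`, `sales_by_region`), the sample rows, and the cleaning rule are all hypothetical. The point it shows is that the transformation produces a new table while the raw data stays available.

```python
import sqlite3

# Hypothetical raw records; the missing amount and the "test" row stand in for noise.
RAW_ROWS = [
    ("2015-01-03", "north", "120.50"),
    ("2015-01-03", "south", None),
    ("2015-01-04", "north", "80.00"),
    ("2015-01-04", "test", "0"),
]


def load_raw(conn, rows):
    """Extract + Load: copy the records into the sandbox exactly as received."""
    conn.execute("CREATE TABLE raw_sales (day TEXT, region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)


def transform_in_sandbox(conn):
    """Transform: derive an analysis table; raw_sales itself is left untouched."""
    conn.execute(
        """
        CREATE TABLE sales_by_region AS
        SELECT region, SUM(CAST(amount AS REAL)) AS total
        FROM raw_sales
        WHERE amount IS NOT NULL AND region <> 'test'
        GROUP BY region
        """
    )


conn = sqlite3.connect(":memory:")  # assumption: in-memory DB plays the sandbox
load_raw(conn, RAW_ROWS)
transform_in_sandbox(conn)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
totals = dict(conn.execute("SELECT region, total FROM sales_by_region"))
print(raw_count, totals)
```

Because the cleaning rule lives in the transform step rather than in the load step, a different rule can be tried later against the same raw table. Under ETL, the rows filtered out before loading would no longer be in the sandbox.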
The idea is to have access to clean data in the database while still being able to analyze the data in its original form and explore its nuances. This phase requires domain experts and database specialists. Tools like Hadoop can be leveraged. We will learn more about the exploration/transformation techniques in the coming chapters.
In the model designing step, we would identify the appropriate/suitable model given a deep understanding of the requirement and data. This step involves understanding the attributes of data and the relationships. We will consider the inputs/data and then examine if these inputs correlate to the outcome we are trying to predict or analyze. As we aim to capture the most relevant variables/predictors, we would need to be vigilant for any data modeling or correlation problems. We can choose to analyze data using any of the many analytical techniques such as logistic regression, decision trees, neural networks, rule evolvers, and so on.
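One simple way to examine whether inputs correlate with the outcome is to rank them by their Pearson correlation coefficient. The sketch below is a minimal illustration, not a prescribed method: the column names and values are hypothetical, and Pearson correlation only captures linear relationships. Treating a constant column as having zero correlation is a choice made here for convenience, since the coefficient is undefined in that case.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: report a constant column as uncorrelated
    return cov / math.sqrt(var_x * var_y)


# Hypothetical inputs and outcome, for illustration only.
inputs = {
    "ad_spend": [1.0, 2.0, 3.0, 4.0, 5.0],
    "store_id": [7.0, 7.0, 7.0, 7.0, 7.0],
    "discount": [5.0, 3.0, 4.0, 1.0, 2.0],
}
outcome = [10.0, 20.0, 30.0, 40.0, 50.0]

# Rank inputs by the strength (absolute value) of their correlation with the outcome.
ranked = sorted(
    ((name, pearson(values, outcome)) for name, values in inputs.items()),
    key=lambda pair: abs(pair[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:+.2f}")
```

The same function can be applied to pairs of inputs: two predictors that correlate strongly with each other are one of the correlation problems to watch for before choosing a modeling technique.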
The next part of model design is the identification of the appropriate modeling technique. The focus will be on the kind of data we would be running through our models: structured, unstructured, or hybrid.
As a part of building the environment for modeling, we would define data sets for testing, training, and production. We would also define the most suitable hardware/software for running the tests, such as parallel processing capabilities.
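A minimal sketch of carving training and testing sets out of the available historical data follows. The function name `split_dataset`, the 70/30 ratio, and the fixed seed are assumptions made for illustration; production data is assumed to arrive later from the live system, so it is not part of the split.

```python
import random


def split_dataset(records, train_fraction=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and testing sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(records)  # copy, so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # hypothetical stand-in for 100 historical records
train, test = split_dataset(records)
print(len(train), len(test))
```

Shuffling before splitting avoids a test set made up only of the most recent or last-loaded records. For time-dependent data, a split by date may be more appropriate than a random one.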
The second step, executing the model, involves running the identified model against the data sets to verify the relevance of the model as well as the outcome. Based on the outcome, we may need to investigate additional data requirements and alternative approaches to solving the problem in context.
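Verifying the relevance of a model usually means comparing its results on held-out data against a trivial baseline. The sketch below assumes a hypothetical `threshold_model` standing in for whichever model was identified, and a small made-up test set. For brevity, the majority-class baseline is computed from the test labels themselves; in practice it would normally come from the training data.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the known labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def threshold_model(value, cutoff=50.0):
    """Hypothetical stand-in for the identified model: flag values above a cutoff."""
    return 1 if value > cutoff else 0


# Hypothetical held-out test set of (input value, known outcome) pairs.
test_set = [
    (10.0, 0), (35.0, 0), (48.0, 1), (52.0, 1),
    (70.0, 1), (90.0, 1), (20.0, 0), (45.0, 1),
]
labels = [y for _, y in test_set]

model_acc = accuracy([threshold_model(x) for x, _ in test_set], labels)

# Baseline: always predict the most common outcome.
majority = Counter(labels).most_common(1)[0][0]
baseline_acc = accuracy([majority] * len(labels), labels)

print(f"model: {model_acc:.3f}  baseline: {baseline_acc:.3f}")
if model_acc <= baseline_acc:
    print("Model adds nothing over the baseline; revisit data or approach.")
```

A model that cannot beat such a baseline is a signal to go back and look at additional data requirements or alternative approaches.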
Now comes the important part of the life cycle, communicating/publishing the key results/findings against the hypothesis defined in phase 1. We would consider presenting the caveats, assumptions, and limitations of the results. The results are summarized to be interpreted for a relevant target audience.
This phase requires identification of the right visualization techniques to best communicate the results. These results are then validated by the domain experts in the following phase.
An important outcome of this phase is the recommendations for future work.
In addition, this is the phase where you can underscore the business benefits of the work, and begin making the case to eventually put the logic into a live production environment.
By the end of this phase, we would have documented the key findings and major insights from the analysis. The resulting artifact will be the portion of the process most visible to outside stakeholders and sponsors, and hence should clearly articulate the results, methodology, and business value of the findings.
Execute a pilot of the previous formulation.
Run assessment of the outcome for benefits.
Publish the artifacts/insights.
Execute the model on production data.
Define/apply a sustenance model.