So why should you, a data developer, endeavor to think like (or more like) a data scientist? What is the significance of gaining an understanding of the whys and hows of statistics? Specifically, what might be the advantages of thinking like a data scientist?
The following are just a few of the reasons that support making the move toward data science:
- Developing a better approach to understanding data
- Using statistical thinking during program or database design
- Adding to your personal toolbox
- Increased marketability
- Perpetual learning
- Seeing the future
Whether you are a data developer, systems analyst, programmer, data scientist, or other business or technology professional, you need to be able to develop a comprehensive relationship with the data you are working with or for which you are designing an application or database schema.
Some might rely on the data specifications provided as part of the overall project plan or requirements, while others (usually those with more experience) may supplement their understanding by performing some generic queries on the data. Either way, this is seldom enough.
In fact, in industry case studies, unclear, misunderstood, or incomplete requirements or specifications consistently rank among the top five reasons for project failure or added risk.
Profiling data is a process, characteristic of data science, aimed at establishing data intimacy (a clear and concise grasp of the data and its internal relationships). Profiling data also establishes context, and there are several general contextual categories that can be used to augment and increase the value and understanding of data for any purpose or project.
These categories include the following:
- Definitions and explanations: These help gain additional information or attributes about data points within your data
- Comparisons: These help add a comparable value to a data point within your data
- Contrasts: These help add an opposite to a data point to see whether it suggests a different perspective
- Tendencies: These are typical mathematical calculations, summaries, or aggregations
- Dispersion: This includes mathematical calculations (or summaries) such as range, variance, and standard deviation, describing the spread of values around the center of a dataset (or a group within the data); a minimal sketch of both tendencies and dispersion follows this list
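As a concrete illustration of the last two categories, here is a minimal Python sketch using only the standard library's statistics module (the data values are made up for illustration):

```python
import statistics

# Hypothetical data points -- say, monthly order counts.
orders = [112, 98, 130, 121, 105, 99, 143, 118]

# Tendencies: typical calculations, summaries, or aggregations.
print("mean:   ", statistics.mean(orders))
print("median: ", statistics.median(orders))

# Dispersion: how spread out the values are around that center.
print("range:  ", max(orders) - min(orders))
print("variance:", statistics.variance(orders))  # sample variance
print("std dev: ", statistics.stdev(orders))     # sample standard deviation
```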
The process of creating a database design commonly involves several tasks carried out by the database designer (or data developer). Usually, the designer will perform the following steps (a minimal sketch follows the list):
- Identify what data will be kept in the database.
- Establish the relationships between the different data points.
- Create a logical data structure to be used on the basis of steps 1 and 2.
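As one way to express those three steps in code (the table and column names here are hypothetical), the following sketch uses Python's standard library sqlite3 module to prototype a logical structure: the data to be kept (customers and orders) and the relationship between them (a foreign key):

```python
import sqlite3

# An in-memory database is enough to prototype a logical design.
conn = sqlite3.connect(":memory:")

# Step 1: identify what data will be kept (customers, orders).
# Step 2: establish the relationship (each order belongs to one customer).
# Step 3: express the logical structure as a schema.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
    total       REAL NOT NULL
);
""")
conn.close()
```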
Even when designing an application program, a thorough understanding of how the data works is essential. Without understanding average or default values, the relationships between data points, groupings, and so on, the resulting application is at risk of failing.
One idea for applying statistical thinking to data design arises when limited real data is available. If enough data cannot be collected, you can create sample (test) data using a variety of sampling methods, such as probability sampling.
Note
A probability-based sample is created by constructing a list of the target population values, called a sample frame, and then applying a randomized process for selecting records from that frame, called a selection procedure. Think of this as creating a script to generate records of sample data based on your knowledge of the actual data, plus some statistical logic, to be used for testing your designs, as sketched below.
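A minimal Python sketch of that idea (the sample frame values are invented for illustration) uses random.sample as a simple selection procedure, drawing records from the frame without replacement:

```python
import random

# The sample frame: a list of (hypothetical) target population values,
# here standing in for known customer ages.
sample_frame = list(range(18, 80))

random.seed(42)  # make the generated test data reproducible

# The selection procedure: a randomized draw from the sample frame.
test_records = random.sample(sample_frame, k=10)
print(test_records)
```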
Finally, approach any problem with scientific or statistical methods, and odds are you'll produce better results.
In my experience, most data developers tend to lock onto a technology or tool based upon a variety of factors (some of which we mentioned earlier in this chapter), becoming increasingly familiar and (hopefully) more proficient with the product, tool, or technology, even across its continuously released newer versions. One might suspect (and probably be correct) that the more the developer uses the tool, the higher the skill level he or she establishes. Data scientists, however, seem to lock onto methodologies, practices, or concepts more than the actual tools and technologies they use to implement them.
This turning of focus (from tool to technique) changes one's mindset to asking what tool best serves my objective rather than how this tool serves my objective.
Note
The more tools you are exposed to, the broader your thinking will become as a developer or data scientist. The open source community provides outstanding tools you can download, learn, and use freely. One should adopt a mindset of what's next or new to learn, even if only to compare the features and functions of a new tool against your preferred one. We'll talk more about this in the perpetual learning section of this chapter.
An exciting example of a currently popular data developer (or data-enabling) tool is MarkLogic (http://www.marklogic.com/). This is an operational and transactional enterprise NoSQL database designed to integrate, store, manage, and search more data than ever before. MarkLogic received the 2017 DAVIES Award for best Data Development Tools. R and Python seem to be the top options for data scientists.
Note
It would not be appropriate to end this section without the mention of IBM Watson Analytics (https://www.ibm.com/watson/), currently transforming the way the industry thinks about statistical or cognitive thinking.
Data science is clearly an ever-evolving field with exponentially growing popularity. In fact, I'd guess that if you ask a dozen professionals, you'll most likely receive a dozen different definitions of what a data scientist is (and their place within a project or organization), but most likely, all would agree on their level of importance and that vast numbers of opportunities exist within the industry and the world today.
Data scientists face an unprecedented demand for more models, more insights... there's only one way to do that: they have to dramatically speed up the insights to action. In the future, data scientists must become more productive. That's the only way they're going to get more value from the data. -Gualtieri (https://www.datanami.com/2015/09/18/the-future-of-data-science/)
Data scientists are relatively hard to find today. If you do your research, you will find that today's data scientists may have a mixed background consisting of mathematics, programming, software design, experimental design, engineering, communication, and management skills. In practice, you'll see that most data scientists aren't specialists in any one aspect; rather, they possess varying levels of proficiency in several areas or backgrounds.
The role of the data scientist has unequivocally evolved since the field of statistics emerged over 1,200 years ago. Despite the term only existing since the turn of this century, it has already been labeled The Sexiest Job of the 21st Century, which, understandably, has created a queue of applicants stretched around the block. -Pearson (https://www.linkedin.com/pulse/evolution-data-scientist-chris-pearson)
The idea of continued assessment, or perpetual learning, is an important statistical concept to grasp; a common definition is learning enhanced skills of perception. For example, in statistics we can refer to the idea of cross-validation. This is a statistical approach for measuring (assessing) a statistical model's performance. The practice involves setting aside a set of validation values, running the model for a set number of rounds on sampled training data, and then averaging the results of each round to see how good a model (or approach) might be at solving a particular problem or meeting an objective.
The expectation here is that, given the performance results, adjustments can be made to tweak the model so that it can identify insights when used with the real or full population of data. Not only is this concept a practice the data developer should use for refining or fine-tuning a data design or data-driven application process, but it is also great life advice in the form of try, learn, adjust, and repeat.
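As a sketch of that try-learn-adjust loop, assuming scikit-learn is available (the dataset here is synthetic, invented for illustration), cross_val_score performs the rounds and averaging described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic sample dataset: a noisy linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1.0, size=100)

# Run five rounds (5-fold cross-validation) and average the results
# to assess how well the model might generalize to the full population.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("per-round R^2:", scores.round(3))
print("average R^2:  ", scores.mean().round(3))
```

If the averaged score disappoints, you adjust the model or the data design and run the rounds again: try, learn, adjust, and repeat.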
Note
The idea of model assessment is not unique to statistics. Data developers might consider this similar to the act of predicting SQL performance or perhaps the practice of an application walkthrough where an application is validated against the intent and purpose stated within its documented requirements.
Predictive modeling uses the statistics of data science to predict or foresee a result (actually, a probable result). This may sound a lot like fortune telling, but it is more about using cognitive reasoning to interpret information (mined from data) and draw a conclusion. In the way that a scientist might be described as someone who acts methodically, attempting to obtain knowledge or to learn, a data scientist might be thought of as someone trying to make predictions using statistics and (machine) learning.
Note
When we talk about predicting a result, it's really all about the probability of seeing a certain result. Probability deals with predicting the likelihood of future events, while statistics involves the analysis of the frequency of past events.
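Continuing with scikit-learn, and using made-up quarterly revenue figures, a minimal predictive-modeling sketch might fit a trend on past results (the statistics of past events) and project the next period (a probable future result, not a certainty):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Historical (hypothetical) quarterly revenue -- past events we can analyze.
quarters = np.arange(1, 9).reshape(-1, 1)                      # quarters 1..8
revenue = np.array([1.2, 1.3, 1.5, 1.4, 1.6, 1.7, 1.9, 2.0])  # in $M

# Fit a simple trend line to the historical results.
model = LinearRegression().fit(quarters, revenue)

# Predict the probable result for the next quarter -- a forward-looking
# estimate, not a guarantee.
next_q = model.predict(np.array([[9]]))
print(f"projected Q9 revenue: ${next_q[0]:.2f}M")
```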
If you are a data developer who has worked on projects serving an organization's office of finance, you may understand why a business leader would find value not just in reporting on financial results (even the most accurate results are still historical events) but also in being able to make educated assumptions about future performance.
Perhaps you can see that if you have a background in, and are responsible for, financial reporting, you can now take the step towards adding statistical predictions to those reports!