Hands-On Data Science with R

By: Vitor Bianchi Lanzetta, Doug Ortiz, Nataraj Dasgupta, Ricardo Anjoleto Farias

Overview of this book

R is one of the most widely used programming languages for statistical computing, and when it is paired with data science techniques, the combination helps tackle the complexities of unstructured, real-world datasets. This book covers the entire data science ecosystem for aspiring data scientists, taking you from zero to a level where you are confident enough to get hands-on with real-world data science problems. The book starts with an introduction to data science and introduces readers to popular R libraries for routine data science tasks. It covers all the important processes in data science, such as gathering data, cleaning it, and then uncovering patterns in it. You will explore machine learning algorithms, predictive analytical models, and finally deep learning algorithms. You will learn to use the most powerful visualization packages available in R so that you can easily derive insights from your data. Towards the end, you will also learn how to integrate R with Spark and Hadoop and perform large-scale data analytics without much complexity.

Active domains of data science

Data science plays a role in virtually all aspects of our day-to-day lives and is used across nearly all industries. The adoption of data science was largely spurred by the successes of start-ups such as Uber, Airbnb, and Facebook that rose rapidly and earned valuations of billions of dollars in a very short span of time.

Data generated by social media networks such as Facebook and Twitter, search engines such as Google and Yahoo!, and various other networks such as Pinterest and Instagram led to a deluge of information about the personal tastes, preferences, and habits of individuals. Companies leveraged this information using various machine learning techniques to gain insights.

For example, Natural Language Processing (NLP) is a machine learning technique used to analyze textual data, such as comments posted on public forums, to extract users' interests. The users are then shown ads relevant to their interests, generating sales from which companies earn ad revenue. Image recognition algorithms are used to automatically identify objects in images and serve the relevant images when users search for those objects on search engines.

The use of data science as a means not only to increase user engagement but also to increase revenue has become a widespread phenomenon. Data science is now prevalent in many domains, and the following sections discuss a few of the key industries in which it plays an important role today. The list is not all-inclusive.

Finance

Data science has been used in finance, especially in trading, for many decades. Investment banks, and trading desks in particular, have employed complex models to analyze data and make trading decisions. Some examples of data science as used in finance include:

  • Credit risk management: Assessing the creditworthiness of a user by analyzing the historical financial records, assets, and transactions of the user
  • Loan fraud: Identifying applications for credit or loans that may be fraudulent by analyzing the characteristics of the loan and the applicant
  • Market Basket Analysis: Understanding the correlation among stocks and other securities and formulating trading and hedging strategies
  • High-frequency trading: Analyzing trades and quotes to discover pricing inefficiencies and arbitrage opportunities
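
The idea of measuring correlation among securities, mentioned in the list above, can be illustrated with a small sketch. The following is a minimal example in plain Python, not a method taken from the book: the closing prices are invented, and the names `simple_returns`, `pearson_correlation`, `stock_a`, `stock_b`, and `stock_c` are hypothetical. It computes the Pearson correlation between return series, which is one common way to quantify how two securities move together.

```python
import math


def simple_returns(prices):
    """Convert a price series into period-over-period simple returns."""
    return [(curr - prev) / prev for prev, curr in zip(prices, prices[1:])]


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical closing prices, invented purely for illustration.
stock_a = [100.0, 102.0, 101.0, 105.0, 107.0]
stock_b = [50.0, 51.0, 50.5, 52.5, 53.5]   # always half of stock_a
stock_c = [80.0, 78.0, 79.0, 76.0, 75.0]   # tends to fall when stock_a rises

returns_a = simple_returns(stock_a)
corr_ab = pearson_correlation(returns_a, simple_returns(stock_b))
corr_ac = pearson_correlation(returns_a, simple_returns(stock_c))

print("A vs B:", round(corr_ab, 4))
print("A vs C:", round(corr_ac, 4))
```

A correlation close to 1 suggests that two securities tend to move together, while a negative value suggests that one might partly offset moves in the other, which is the kind of relationship a hedging strategy looks for. A real analysis would use far longer price histories than the five points assumed here.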

Healthcare

Healthcare and related fields, such as pharmaceuticals and life sciences, have also seen a gradual rise in the adoption and use of machine learning. A leading example has been IBM Watson. Developed in the late 2000s, IBM Watson rose to popularity after it won Jeopardy!, a popular US quiz show, in 2011. Today, IBM Watson is being used for clinical research, and several institutions have published preliminary results of success. (Source: http://www.ascopost.com/issues/june-25-2017/how-watson-for-oncology-is-advancing-personalized-patient-care/). The primary impediment to wider adoption has been the extremely high cost of using the system, usually with an uncertain return on investment. As a result, it is generally only well-capitalized companies that can invest in the technology.

More common uses of data science in healthcare include:

  • Epidemiology: Various machine learning techniques are being applied to prevent the spread of diseases and to solve other epidemiology-related use cases. A recent example, the use of clustering to detect the Ebola outbreak, received attention as one of the first times that machine learning was used very effectively in a medical use case. (Source: https://spectrum.ieee.org/tech-talk/biomedical/diagnostics/healthmap-algorithm-ebola-outbreak).
  • Health insurance fraud detection: The health insurance industry loses billions each year in the US due to fraudulent claims for insurance. Machine learning and, more generally, data science are being used to detect cases of fraud and reduce the losses incurred by leading health insurance firms. (Source: https://www.sciencedirect.com/science/article/pii/S1877042812036099).
  • Recommender engines: Algorithms that match patients with physicians are used to provide recommendations based on the patients' symptoms and the doctors' specialties.
  • Image recognition: Arguably the most common use of data science in healthcare, image recognition algorithms are used for a variety of cases ranging from the segmentation of malignant and non-malignant tumors to cell segmentation. (Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3159221/).

Pharmaceuticals

Although closely linked to the data science use cases in healthcare, data science use cases in pharma are geared toward the development of drugs, physician marketing, and treatment-related analysis. Examples of data science in pharma include the following:

Government

Data science is used by state and national governments for a wide range of uses. These include topics in cyber security, voter benefits, climate change, social causes, and other similar use cases that are geared toward public policy and public benefits.

Some examples include the following:

  • Climate change: A topic of wide public interest, climate change is the subject of extensive machine learning-related work being conducted around the globe to detect and understand its causes. (Source: https://toolkit.climate.gov).
  • Cyber security: The use of extremely advanced machine learning techniques for national cyber security is evident and well known all over the world, ever since such practices were disclosed by consultants at security firms a few years back. Security-related organizations employ some of the most advanced hardware and software stacks for detecting cyber threats and preventing hacking attempts. (Source: https://www.csoonline.com/article/2942083/big-data-security/cybersecurity-is-the-killer-app-for-big-data-analytics.html).
  • Social causes: The use of data science for a wide range of use cases geared toward social good is well known, owing to the several conferences that have been organized and papers that have been published on the topic. Examples include topics in urban analytics, power grids utilizing smart meters, and criminal justice. (Source: https://dssg.uchicago.edu/data-science-for-social-good-conference-2017/agenda/).

Manufacturing and retail

The manufacturing and retail industry has used data science to design better products, optimize pricing, and devise strategic marketing techniques. Some examples include the following:

Web industry

One of the earliest beneficiaries of data science was the web industry. Empowered by the collection of user-specific data from social networks, firms around the world employ algorithms to understand user behavior and generate targeted ads. Google, one of the earliest proponents of targeted ad marketing, today earns most of its revenue from ads: more than $95 billion in 2017. (Source: https://www.statista.com/statistics/266249/advertising-revenue-of-google/). The use of data science for web-related businesses is ubiquitous today, and companies such as Uber, Airbnb, Netflix, and Amazon have successfully navigated and made full use of this complex ecosystem, not only generating huge profits but also adding millions of new jobs, directly or indirectly, as a result.

  • Targeted ads: Click-through ads have been one of the prime areas of machine learning. By reading cookies saved on users' computers from various sites, other sites can assess the users' interests and accordingly decide which ads to serve when they visit new sites. As per online sources, internet advertising is valued at over $1 trillion and generated over 10 million jobs in 2017 alone. (Source: https://www.iab.com/insights/economic-value-advertising-supported-internet-ecosystem/).
  • Recommender engines: Netflix, Pandora, and other movie and audio streaming services use recommender engines to understand which movies or music the viewer or listener would be interested in and to make recommendations. The recommendations are often based on what other users with similar tastes have already watched or listened to, and they rely on recommender algorithms such as collaborative, content-based, and hybrid filtering.
  • Web design: Using A/B testing, mouse tracking, and other sophisticated techniques, web developers leverage data science to design better web pages, such as landing pages, and better websites in general. A/B testing, for instance, allows developers to decide between different versions of the same web page and deploy accordingly.
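
To make the A/B testing decision mentioned above more concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions rather than the book's own procedure: the visitor and conversion counts are invented, the function name `two_proportion_z_test` is hypothetical, and a pooled two-proportion z-test is only one of several ways to compare two page versions.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference between two conversion rates.

    conv_a, conv_b: number of conversions for versions A and B
    n_a, n_b:       number of visitors shown versions A and B
    Returns (z, p_value), where z > 0 means B converted better than A.
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical experiment: 5,000 visitors per version (invented numbers).
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(f"z = {z:.3f}, p-value = {p_value:.4f}")

# A common, but arbitrary, convention is to act when the p-value is below 0.05.
deploy_b = p_value < 0.05 and z > 0
print("Deploy version B:", deploy_b)
```

A small p-value suggests that the observed difference in conversion rates is unlikely to be due to chance alone, which would support deploying the better-performing version. In practice, the sample size and significance threshold should be fixed before the experiment starts.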

Other industries

Various other industries benefit from data science today. Its use has become so common that it would be impractical to list them all, but at a high level, some of the others include the following:

  • Oil and natural gas for oil production
  • Meteorology for understanding weather patterns
  • Space research for detecting and/or analyzing stars and galaxies
  • Utilities for energy production and energy savings
  • Biotechnology for research and finding new cures for diseases

In general, since data science and machine learning algorithms are not specific to any particular industry, it is entirely possible to apply them to creative use cases and derive business benefits.