Machine Learning with R - Third Edition

By: Brett Lantz

Overview of this book

Machine learning, at its core, is concerned with transforming data into actionable knowledge. R offers a powerful set of machine learning methods to quickly and easily gain insight from your data. Machine Learning with R, Third Edition provides a hands-on, readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, Brett Lantz teaches you everything you need to uncover key insights, make new predictions, and visualize your findings. This new 3rd edition updates the classic R data science book to R 3.6 with newer and better libraries, advice on ethical and bias issues in machine learning, and an introduction to deep learning. Find powerful new insights in your data; discover machine learning with R.

Uses and abuses of machine learning


Most people have heard of Deep Blue, the chess-playing computer that in 1997 became the first to win a match against a reigning world champion. Another famous computer, Watson, defeated two human opponents on the television trivia game show Jeopardy! in 2011. Based on these stunning accomplishments, some have speculated that computer intelligence will replace workers in information technology occupations, just as machines replaced workers in fields and assembly lines.

The truth is that even as machines reach such impressive milestones, they are still relatively limited in their ability to thoroughly understand a problem. They are pure intellectual horsepower without direction. A computer may be more capable than a human of finding subtle patterns in large databases, but it still needs a human to motivate the analysis and turn the result into meaningful action.

Note

Without completely discounting the achievements of Deep Blue and Watson, it is important to note that neither is even as intelligent as a typical five-year-old. For more on why "comparing smarts is a slippery business," see the Popular Science article FYI: Which Computer Is Smarter, Watson Or Deep Blue?, by Will Grunewald, 2012: https://www.popsci.com/science/article/2012-12/fyi-which-computer-smarter-watson-or-deep-blue.

Machines are not good at asking questions, or even knowing what questions to ask. They are much better at answering them, provided the question is stated in a way that the computer can comprehend. Present-day machine learning algorithms partner with people much like a bloodhound partners with its trainer: the dog's sense of smell may be many times stronger than its master's, but without being carefully directed, the hound may end up chasing its tail.

Figure 1.2: Machine learning algorithms are powerful tools that require careful direction

To better understand the real-world applications of machine learning, we'll now consider some cases where it has been used successfully, some places where it still has room for improvement, and some situations where it may do more harm than good.

Machine learning successes

Machine learning is most successful when it augments, rather than replaces, the specialized knowledge of a subject-matter expert. It works with medical doctors at the forefront of the fight to eradicate cancer; assists engineers and programmers with efforts to create smarter homes and automobiles; and helps social scientists to build knowledge of how societies function. Toward these ends, it is employed in countless businesses, scientific laboratories, hospitals, and governmental organizations. Any effort that generates or aggregates data likely employs at least one machine learning algorithm to help make sense of it.

Though it is impossible to list every use case for machine learning, a look at recent success stories identifies several prominent examples:

  • Identification of unwanted spam messages in email

  • Segmentation of customer behavior for targeted advertising

  • Forecasts of weather behavior and long-term climate changes

  • Reduction of fraudulent credit card transactions

  • Actuarial estimates of financial damage of storms and natural disasters

  • Prediction of popular election outcomes

  • Development of algorithms for auto-piloting drones and self-driving cars

  • Optimization of energy use in homes and office buildings

  • Projection of areas where criminal activity is most likely

  • Discovery of genetic sequences linked to diseases

By the end of this book, you will understand the basic machine learning algorithms that are employed to teach computers to perform these tasks. For now, it suffices to say that no matter what the context is, the machine learning process is the same. Regardless of the task, an algorithm takes data and identifies patterns that form the basis for further action.

The limits of machine learning

Although machine learning is used widely and has tremendous potential, it is important to understand its limits. Machine learning, at this time, emulates a relatively limited subset of the capabilities of the human brain. It offers little flexibility to extrapolate outside of strict parameters and knows no common sense. With this in mind, one should be extremely careful to recognize exactly what an algorithm has learned before setting it loose in the real world.

Without a lifetime of past experiences to build upon, computers are also limited in their ability to make simple inferences about logical next steps. Take, for instance, the banner advertisements seen on many websites. These are served according to patterns learned by data mining the browsing history of millions of users. Based on this data, the system infers that someone who views websites selling shoes is interested in buying shoes and should therefore see advertisements for shoes. The problem is that this becomes a never-ending cycle in which, even after shoes have been purchased, additional shoe advertisements are served, rather than advertisements for shoelaces and shoe polish.

Many people are familiar with the deficiencies of machine learning's ability to understand or translate language, or to recognize speech and handwriting. Perhaps the earliest example of this type of failure is in a 1994 episode of the television show The Simpsons, which showed a parody of the Apple Newton tablet. For its time, the Newton was known for its state-of-the-art handwriting recognition. Unfortunately for Apple, it would occasionally fail to great effect. The television episode illustrated this through a sequence in which a bully's note to "Beat up Martin" was misinterpreted by the Newton as "Eat up Martha."

Figure 1.3: Screen captures from Lisa on Ice, The Simpsons, 20th Century Fox (1994)

Machine processing of human language has improved enough since the Apple Newton that Google, Apple, and Microsoft are all confident in their ability to offer voice-activated virtual concierge services such as Google Assistant, Siri, and Cortana. Still, these services routinely struggle to answer relatively simple questions. Furthermore, online translation services sometimes misinterpret sentences that a toddler would readily understand, and the predictive text feature on many devices has led to a number of humorous "autocorrect fail" sites that illustrate computers' ability to understand basic language but completely misunderstand context.

Some of these mistakes are surely to be expected. Language is complicated, with multiple layers of text and subtext, and even human beings sometimes misunderstand context. Although machine learning is rapidly improving at language processing, its consistent shortcomings illustrate the important fact that machine learning is only as good as the data it has learned from. If context is not explicit in the input data, then just like a human, the computer will have to make its best guess from its limited set of past experiences.

Machine learning ethics

At its core, machine learning is simply a tool that assists us with making sense of the world's complex data. Like any tool, it can be used for good or for evil. Where machine learning goes most wrong is when it is applied so broadly, or so callously, that humans are treated as lab rats, automata, or mindless consumers. A process that may seem harmless can lead to unintended consequences when automated by an emotionless computer. For this reason, those using machine learning or data mining would be remiss not to at least briefly consider the ethical implications of the art.

Due to the relative youth of machine learning as a discipline and the speed at which it is progressing, the associated legal issues and social norms are often quite uncertain, and constantly in flux. Caution should be exercised when obtaining or analyzing data in order to avoid breaking laws; violating terms of service or data use agreements; or abusing the trust or violating the privacy of customers or the public.

Note

The informal corporate motto of Google, an organization that collects perhaps more data on individuals than any other, was at one time, "don't be evil." While this seems clear enough, it may not be sufficient. A better approach may be to follow the medical principle commonly associated with the Hippocratic Oath: "above all, do no harm."

Retailers routinely use machine learning for advertising, targeted promotions, inventory management, or the layout of the items in a store. Many have equipped checkout lanes with devices that print coupons for promotions based on a customer's buying history. In exchange for a bit of personal data, the customer receives discounts on the specific products he or she wants to buy. At first, this appears relatively harmless, but consider what happens when this practice is taken a bit further.

One possibly apocryphal tale concerns a large retailer in the United States that employed machine learning to identify expectant mothers for coupon mailings. The retailer hoped that if these mothers-to-be received substantial discounts, they would become loyal customers who would later purchase profitable items such as diapers, baby formula, and toys.

Equipped with machine learning methods, the retailer identified items in the customer purchase history that could be used to predict with a high degree of certainty not only whether a woman was pregnant, but also the approximate timing for when the baby was due.

After the retailer used this data for a promotional mailing, an angry man contacted the chain and demanded to know why his daughter received coupons for maternity items. He was furious that the retailer seemed to be encouraging teenage pregnancy! As the story goes, when the retail chain called to offer an apology, it was the father who ultimately apologized after confronting his daughter and discovering that she was indeed pregnant!

Whether completely true or not, the lesson learned from the preceding tale is that common sense should be applied before acting on the results of a machine learning analysis. This is particularly true in cases where sensitive information, such as health data, is concerned. With a bit more care, the retailer could have foreseen this scenario and used greater discretion when choosing how to reveal the pattern its machine learning analysis had discovered.

Note

For more detail on how retailers use machine learning to identify pregnancies, see the New York Times Magazine article, titled How Companies Learn Your Secrets, by Charles Duhigg, 2012: https://www.nytimes.com/2012/02/19/magazine/shopping-habits.html.

As machine learning algorithms are more widely applied, we find that computers may learn some unfortunate behaviors of human societies. Sadly, this includes perpetuating race or gender discrimination and reinforcing negative stereotypes. For example, researchers have found that Google's online advertising service is more likely to show ads for high-paying jobs to men than to women, and is more likely to display ads for criminal background checks to black people than to white people.

Proving that these types of missteps are not limited to Silicon Valley, a Twitter chatbot service developed by Microsoft was quickly taken offline after it began spreading Nazi and anti-feminist propaganda. Often, algorithms that at first seem "content neutral" quickly start to reflect majority beliefs or dominant ideologies. An algorithm created by Beauty.AI to reflect an objective conception of human beauty sparked controversy when it favored almost exclusively white people. Imagine the consequences if this had been applied to facial recognition software for criminal activity!

Note

For more information about the real-world consequences of machine learning and discrimination, see the New York Times article When Algorithms Discriminate, by Claire Cain Miller, 2015: https://www.nytimes.com/2015/07/10/upshot/when-algorithms-discriminate.html.

To limit the ability of algorithms to discriminate illegally, certain jurisdictions have well-intentioned laws that prevent the use of racial, ethnic, religious, or other protected class data for business reasons. However, excluding this data from a project may not be enough because machine learning algorithms can still inadvertently learn to discriminate. If a certain segment of people tends to live in a certain region, buys a certain product, or otherwise behaves in a way that uniquely identifies them as a group, machine learning algorithms can infer the protected information from other factors. In such cases, you may need to completely de-identify these people by excluding any potentially identifying data in addition to the already-protected statuses.
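
As a minimal sketch of the proxy effect described above, the following Python snippet uses a small hand-made dataset to show how a protected attribute that has been dropped from a project can still be recovered from an unprotected one. The region names, group labels, and counts are all hypothetical and chosen only for illustration; real data would be far messier, and a real model would be more complex than this majority-vote lookup.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made records: (region, protected_group).
# Imagine the protected_group column has been excluded from modeling,
# while the region column is still available to the algorithm.
records = (
    [("north", "A")] * 45 + [("north", "B")] * 5
    + [("south", "A")] * 8 + [("south", "B")] * 42
)


def fit_majority_by_region(rows):
    """Learn the most common protected group within each region."""
    counts = defaultdict(Counter)
    for region, group in rows:
        counts[region][group] += 1
    return {region: c.most_common(1)[0][0] for region, c in counts.items()}


def proxy_accuracy(rows, mapping):
    """Share of rows whose protected group is recovered from region alone."""
    hits = sum(1 for region, group in rows if mapping[region] == group)
    return hits / len(rows)


mapping = fit_majority_by_region(records)
accuracy = proxy_accuracy(records, mapping)
print(mapping, accuracy)  # {'north': 'A', 'south': 'B'} 0.87
```

In this toy example, region alone recovers the protected group for 87 of the 100 rows, which is why removing the protected column by itself may not prevent an algorithm from learning to treat the groups differently.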

Apart from the legal consequences, inappropriate use of data may hurt the bottom line. Customers may feel uncomfortable or become spooked if aspects of their lives they consider private are made public. In recent years, a number of high-profile web applications have experienced a mass exodus of users who felt exploited when the applications' terms of service agreements changed or their data was used for purposes beyond what the users had originally intended. The fact that privacy expectations differ by context, by age cohort, and by locale adds complexity to deciding the appropriate use of personal data. It would be wise to consider the cultural implications of your work before you begin your project, in addition to being aware of ever-more-restrictive regulations such as the European Union's newly implemented General Data Protection Regulation (GDPR) and the inevitable policies that will follow in its footsteps.

Note

The fact that you can use data for a particular end does not always mean that you should.

Finally, it is important to note that as machine learning algorithms become progressively more important to our everyday lives, there are greater incentives for nefarious actors to work to exploit them. Sometimes, attackers simply want to disrupt algorithms for laughs or notoriety, as in "Google bombing," the crowd-sourced method of tricking Google's algorithms into ranking a desired page highly.

Other times, the effects are more dramatic. A timely example of this is the recent wave of so-called fake news and election meddling, propagated via the manipulation of advertising and recommendation algorithms that target people according to their personality. To avoid giving such control to outsiders, when building machine learning systems, it is crucial to consider how they may be influenced by a determined individual or crowd.

Note

Social media scholar danah boyd (styled lowercase) presented a keynote at the Strata Data Conference 2017 in New York City that discussed the importance of hardening machine learning algorithms against attackers. For a recap, refer to: https://points.datasociety.net/your-data-is-being-manipulated-a7e31a83577b.

The consequences of malicious attacks on machine learning algorithms can also be deadly. Researchers have shown that by creating an "adversarial attack" that subtly distorts a street sign with carefully chosen graffiti, an attacker might cause an autonomous vehicle to misinterpret a stop sign, potentially resulting in a fatal crash. Even in the absence of ill intent, software bugs and human errors have already led to fatal accidents in autonomous vehicle technology from Uber and Tesla. With such examples in mind, it is both of the utmost importance and an ethical obligation for machine learning practitioners to consider how their algorithms will be used and abused in the real world.