
R Programming By Example

By : Omar Trejo Navarro

Overview of this book

R is a high-level statistical language widely used among statisticians and data miners to develop analytical applications. People doing data analysis often have great analytical skills but lack solid programming knowledge and are unfamiliar with the correct ways to use R. Based on version 3.4, this book will help you develop strong fundamentals when working with R by taking you through a series of full, representative examples, giving you a holistic view of the language. We begin with the basic installation and configuration of the R environment. As you progress through the exercises, you'll become thoroughly acquainted with R's features and its packages. With this book, you will learn the basic concepts of R programming, work efficiently with graphs, create publication-ready and interactive 3D graphs, and gain a better understanding of the data at hand. Detailed step-by-step instructions will enable you to get a clean set of data, produce good visualizations, and create reports for the results. The book also covers various methods of code profiling and performance enhancement through good programming practices, delegation, and parallelization. By the end of this book, you will know how to work efficiently with data, create quality visualizations and reports, and develop code that is modular, expressive, and maintainable.

What R is and what it isn't

When it comes to choosing software for statistical computing, it's tough to argue against R. Who could dislike a high-quality, cross-platform, open source statistical software product? It has an interactive console for exploratory work. It can run as a scripting language to replicate processes. It has many statistical models built in, so you don't have to reinvent the wheel, and when the base toolset is not enough, you have access to a rich ecosystem of external packages. And it's free! No wonder R has become a favorite in the age of data.
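To make this concrete, here is a minimal sketch of the interactive workflow using only base R and the bundled `mtcars` data set; the particular model and package name are illustrative, not part of the original text:

```r
# Fit a built-in statistical model interactively: a linear regression
# of fuel efficiency (mpg) on weight (wt), using base R's lm().
model <- lm(mpg ~ wt, data = mtcars)

# Inspect the estimated coefficients in the console.
summary(model)$coefficients

# When base R is not enough, external packages are one command away,
# for example:
# install.packages("ggplot2")
```

Everything above ships with a default R installation, which is part of what makes the interactive console so convenient for exploratory work.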

The inspiration for R – the S language

R was inspired by the S statistical language developed by John Chambers at AT&T's Bell Labs. The name S is an allusion to another programming language with a one-letter name that was also developed there, the famous C language. R itself was created by Ross Ihaka and Robert Gentleman in the Department of Statistics at the University of Auckland in 1991.

The general S philosophy sets the stage for the design of the R language itself, which many programmers coming from other programming languages find somewhat odd and confusing. In particular, it's important to realize that S was developed to make data analysis as easy as possible.

"We wanted users to be able to begin in an interactive environment, where they did not consciously think of programming. Then as their needs became clearer and their sophistication increased, they should be able to slide gradually into programming, when the language and system aspects would become more important."
– John Chambers

The key part here is the transition from analyst to developer. They wanted to build a language that could easily serve both types of users: a language suitable for interactive data analysis through a command line, but one that could also be used to program complex systems, like traditional programming languages.

It's no coincidence that this book is structured the same way. We will start with data analysis, and gradually move toward developing a full and complex system for information retrieval with a web application on top.

R is a high-quality statistical computing system

R is comparable, and often superior, to commercial products when it comes to programming capabilities, complex systems development, graphic production, and community ecosystems. Researchers in statistics and machine learning, as well as many other data-related disciplines, will often publish R packages to accompany their publications. This translates into immediate public access to the very latest statistical techniques and implementations. Whatever model or graphic you're trying to develop, chances are that someone has already tried it, and if not, you can at least learn from their efforts.

R is a flexible programming language

As we have seen, in addition to providing statistical tools, R is a general-purpose programming language. You can use R to extend its own functionality, automate processes that make use of complex systems, and much more. It incorporates features from other object-oriented programming languages and has strong foundations for functional programming, which is well suited to solving many of the challenges of data analysis. R allows the user to write powerful, concise, and descriptive code.
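As a small illustration of the functional style mentioned above (all of the functions used here are base R; the `square` helper is just an example):

```r
# In R, functions are first-class values that can be passed around.
square <- function(x) x ^ 2

# Apply a function over a vector instead of writing an explicit loop.
sapply(1:5, square)            # 1 4 9 16 25

# Compose behavior: Map applies square element-wise, Reduce folds
# the results together with `+`.
Reduce(`+`, Map(square, 1:5))  # 55
```

This style, applying and composing functions over whole vectors rather than mutating state in loops, is a large part of what makes idiomatic R code concise and descriptive.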

R is free, as in freedom and as in free beer

In many ways, a language is successful inasmuch as it creates a platform with which many people can create new things, and R has proven to be very successful in this regard. One key limitation of the S language was that it was only available in a commercial package, but R is free software. Free as in freedom, and free as in free beer.

The copyright for the primary source code of R is held by the R Foundation, and the code is published under the GNU General Public License (GPL). According to the Free Software Foundation (http://www.fsf.org/), with free software (free as in freedom) you are granted the following four freedoms:

  • Freedom 0: Run the program for any purpose
  • Freedom 1: Study how the program works and adapt it to your needs
  • Freedom 2: Redistribute copies so you can help your neighbor
  • Freedom 3: Improve the program and release your improvements to the public

These freedoms have allowed R to develop strong, prolific communities that include world-class statisticians and programmers as well as many volunteers who help improve and extend the language. They also allow R to be developed and maintained for all popular operating systems, and to be easily used by individuals and organizations who wish to do so, possibly sharing their findings in a way that others can replicate. Such is the power of free software.

What R is not good for

No programming language or system is perfect. R certainly has a number of drawbacks, the most common being that it can be painfully slow (when not used correctly). Keep in mind that R is essentially based on 40-year-old technology, going back to the original S system developed at Bell Labs. Therefore, several of its imperfections come from the fact that it was not built in anticipation of the data age we live in now. When R was born, disk space and RAM were very expensive and the internet was just getting started. Notions of large-scale data analysis and high-performance computing were rare.

Fast-forward to the present: hardware costs a fraction of what it used to, computing power is available online for pennies, and everyone is interested in collecting and analyzing data at large scale. This surge in data analysis has brought two of R's fundamental limitations to the forefront: it is single-threaded and memory-bound. These two characteristics drastically slow it down. Furthermore, R is an interpreted, dynamically typed language, which can make it even slower. Finally, R has object immutability and various ways to implement object-oriented programming, both of which can make it hard for people, especially those coming from other languages, to write high-quality code if they don't know how to deal with them. All of the characteristics mentioned in this paragraph are addressed in Chapter 9, Implementing an Efficient Simple Moving Average.
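To illustrate the immutability point, here is a small sketch of R's copy-on-modify semantics; the function name is hypothetical:

```r
# Assigning into a vector inside a function modifies a local copy,
# not the caller's object -- R objects behave as immutable values.
double_first <- function(v) {
  v[1] <- v[1] * 2   # triggers a copy; the caller's vector is untouched
  v
}

x <- c(1, 2, 3)
double_first(x)      # returns 2 2 3
x                    # still 1 2 3 -- unchanged
```

These copies are convenient for reasoning about code, but they are also one reason naive R code can be slow and memory-hungry: repeatedly "modifying" large objects can silently allocate copy after copy.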

A double-edged sword in R is that most of its users do not think of themselves as programmers and are more concerned with results than with process (which is not necessarily a bad thing). This means that much of the R code you can find online is written without regard for elegance, speed, or readability, since most R users do not revise their code to address these shortcomings. The result is code that is patchy and not rigorously tested, which in turn produces many edge cases that you must take into account when using low-quality packages. You will do well to keep this in mind.