Sign In Start Free Trial
Account

Add to playlist

Create a Playlist

Modal Close icon
You need to login to use this feature.
  • Book Overview & Buying R for Data Science
  • Table Of Contents Toc
R for Data Science

R for Data Science

By : Toomey
3.4 (5)
close
close
R for Data Science

R for Data Science

3.4 (5)
By: Toomey

Overview of this book

If you are a data analyst who has a firm grip on some advanced data analysis techniques and wants to learn how to leverage the features of R, this is the book for you. You should have some basic knowledge of the R language and should know about some data science topics.
Table of Contents (14 chapters)
close
close

Anomaly detection

We can use R programming to detect anomalies in a dataset. Anomaly detection can be used in a number of different areas, such as intrusion detection, fraud detection, system health, and so on. In R programming, these are called outliers. R programming allows the detection of outliers in a number of ways, as listed here:

  • Statistical tests
  • Depth-based approaches
  • Deviation-based approaches
  • Distance-based approaches
  • Density-based approaches
  • High-dimensional approaches

Show outliers

R programming has a function to display outliers: identify (in boxplot).

The boxplot function produces a box-and-whisker plot (see following graph). The boxplot function has a number of graphics options. For this example, we do not need to set any.

The identify function is a convenient method for marking points in a scatter plot. In R programming, box plot is a type of scatter plot.

Example

In this example, we need to generate a 100 random numbers and then plot the points in boxes.

Then, we mark the first outlier with it's identifier as follows:

> y <- rnorm(100)
> boxplot(y)
> identify(rep(1, length(y)), y, labels = seq_along(y))
Example

Note

Notice the 0 next to the outlier in the graph.

Example

The boxplot function automatically computes the outliers for a set as well.

First, we will generate a 100 random numbers as follows (note that this data is randomly generated, so your results may not be the same):

> x <- rnorm(100)

We can have a look at the summary information on the set using the following code:

> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-2.12000 -0.74790 -0.20060 -0.01711  0.49930  2.43200

Now, we can display the outliers using the following code:

> boxplot.stats(x)$out
[1] 2.420850 2.432033

The following code will graph the set and highlight the outliers:

> boxplot(x)
Example

Note

Notice the 0 next to the outlier in the graph.

We can generate a box plot of more familiar data showing the same issue with outliers using the built-in data for cars, as follows:

boxplot(mpg~cyl,data=mtcars, xlab="Cylinders", ylab="MPG")
Example

Another anomaly detection example

We can also use box plot's outlier detection when we have two dimensions. Note that we are forcing the issue by using a union of the outliers in x and y rather than an intersection. The point of the example is to display such points. The code is as follows:

> x <- rnorm(1000)
> y <- rnorm(1000)
> f <- data.frame(x,y)
> a <- boxplot.stats(x)$out
> b <- boxplot.stats(y)$out
> list <- union(a,b)
> plot(f)
> px <- f[f$x %in% a,]
> py <- f[f$y %in% b,]
> p <- rbind(px,py)
> par(new=TRUE)
> plot(p$x, p$y,cex=2,col=2)
Another anomaly detection example

While R did what we asked, the plot does not look right. We completely fabricated the data; in a real use case, you would need to use your domain expertise to determine whether these outliers were correct or not.

Calculating anomalies

Given the variety of what constitutes an anomaly, R programming has a mechanism that gives you complete control over it: write your own function that can be used to make a decision.

Usage

We can use the name function to create our own anomaly as shown here:

name <- function(parameters,…) {
  # determine what constitutes an anomaly
  return(df)
}

Here, the parameters are the values we need to use in the function. I am assuming we return a data frame from the function. The function could do anything.

Example 1

We will be using the iris data in this example, as shown here:

> data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")

If we decide an anomaly is present when sepal is under 4.5 or over 7.5, we could use a function as shown here:

> outliers <- function(data, low, high) {
>  outs <- subset(data, data$X5.1 < low | data$X5.1 > high)
>  return(outs)
>}

Then, we will get the following output:

> outliers(data, 4.5, 7.5)
    X5.1 X3.5 X1.4 X0.2    Iris.setosa
8    4.4  2.9  1.4  0.2    Iris-setosa
13   4.3  3.0  1.1  0.1    Iris-setosa
38   4.4  3.0  1.3  0.2    Iris-setosa
42   4.4  3.2  1.3  0.2    Iris-setosa
105  7.6  3.0  6.6  2.1 Iris-virginica
117  7.7  3.8  6.7  2.2 Iris-virginica
118  7.7  2.6  6.9  2.3 Iris-virginica
122  7.7  2.8  6.7  2.0 Iris-virginica
131  7.9  3.8  6.4  2.0 Iris-virginica
135  7.7  3.0  6.1  2.3 Iris-virginica

This gives us the flexibility of making slight adjustments to our criteria by passing different parameter values to the function in order to achieve the desired results.

Example 2

Another popular package is DMwR. It contains the lofactor function that can also be used to locate outliers. The DMwR package can be installed using the following command:

> install.packages("DMwR")
> library(DMwR)

We need to remove the species column from the data, as it is categorical against it data. This can be done by using the following command:

> nospecies <- data[,1:4]

Now, we determine the outliers in the frame:

> scores <- lofactor(nospecies, k=3)

Next, we take a look at their distribution:

> plot(density(scores))
Example 2

One point of interest is if there is some close equality amongst several of the outliers (that is, density of about 4).

CONTINUE READING
83
Tech Concepts
36
Programming languages
73
Tech Tools
Icon Unlimited access to the largest independent learning library in tech of over 8,000 expert-authored tech books and videos.
Icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Icon 50+ new titles added per month and exclusive early access to books as they are being written.
R for Data Science
notes
bookmark Notes and Bookmarks search Search in title playlist Add to playlist font-size Font size

Change the font size

margin-width Margin width

Change margin width

day-mode Day/Sepia/Night Modes

Change background colour

Close icon Search
Country selected

Close icon Your notes and bookmarks

Confirmation

Modal Close icon
claim successful

Buy this book with your credits?

Modal Close icon
Are you sure you want to buy this book with one of your credits?
Close
YES, BUY

Submit Your Feedback

Modal Close icon
Modal Close icon
Modal Close icon