We can use R to detect anomalies in a dataset. Anomaly detection is used in many areas, such as intrusion detection, fraud detection, and system health monitoring. In statistics, anomalous observations are usually called outliers. R supports several approaches to detecting outliers, as listed here:
Statistical tests
Depth-based approaches
Deviation-based approaches
Distance-based approaches
Density-based approaches
High-dimensional approaches
R has a function for marking outliers on a plot: identify (used here together with boxplot).
The boxplot function produces a box-and-whisker plot (see following graph). It has a number of graphics options, but for this example we do not need to set any.
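The identify function is a convenient method for marking points in a scatter plot. It is interactive: it waits for you to click near a point on a screen device and then labels that point. It also works on a box plot, because boxplot draws a single box at x = 1, so we can pass rep(1, length(y)) as the x coordinates.
In this example, we generate 100 random numbers and draw them as a box plot.
Then, we mark the first outlier with its identifier as follows: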
> y <- rnorm(100)
> boxplot(y)
> identify(rep(1, length(y)), y, labels = seq_along(y))
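Because identify needs an interactive graphics device, a script or report needs a different way to label the points. The following is a minimal sketch, not part of the original example, that labels the outliers with text instead; the seed value and the out_idx name are assumptions chosen for illustration.

```r
# Fix the seed so the same points are drawn each run (arbitrary value).
set.seed(123)
y <- rnorm(100)

# Draw the box plot; with a single vector the box is centred at x = 1.
boxplot(y)

# Indices of the values that boxplot.stats() reports as outliers.
out_idx <- which(y %in% boxplot.stats(y)$out)

# Label each outlier with its index, just to the right of the point.
if (length(out_idx) > 0) {
  text(x = rep(1, length(out_idx)), y = y[out_idx], labels = out_idx, pos = 4)
}
```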

The boxplot function automatically computes the outliers for a set as well.
First, we will generate 100 random numbers as follows (note that this data is randomly generated, so your results may not be the same):
> x <- rnorm(100)
We can have a look at the summary information on the set using the following code:
> summary(x)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-2.12000 -0.74790 -0.20060 -0.01711  0.49930  2.43200
Now, we can display the outliers using the following code:
> boxplot.stats(x)$out
[1] 2.420850 2.432033
The following code will graph the set and highlight the outliers:
> boxplot(x)
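The values reported by boxplot.stats follow the usual box plot rule: a point is flagged when it lies more than 1.5 times the hinge spread beyond the lower or upper hinge (1.5 is the default of the coef argument). As a rough sketch of that rule, with the seed and variable names chosen here purely for illustration:

```r
# Fix the seed so the example is repeatable (arbitrary value).
set.seed(42)
x <- rnorm(100)

# fivenum() returns min, lower hinge, median, upper hinge, max.
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Fences at 1.5 times the hinge spread beyond each hinge.
lower <- hinges[1] - 1.5 * spread
upper <- hinges[2] + 1.5 * spread

# Values outside the fences, compared with what boxplot.stats() reports.
manual_out <- x[x < lower | x > upper]
builtin_out <- boxplot.stats(x)$out
identical(manual_out, builtin_out)
```

Passing a different coef to boxplot.stats widens or narrows the fences in the same way.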

We can generate a box plot of more familiar data showing the same issue with outliers using the built-in mtcars data, as follows:
> boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "MPG")

We can also use the box plot's outlier detection when we have two dimensions. Note that we are forcing the issue by using a union of the outliers in x and y rather than an intersection. The point of the example is to display such points. The code is as follows:
> x <- rnorm(1000)
> y <- rnorm(1000)
> f <- data.frame(x,y)
> a <- boxplot.stats(x)$out
> b <- boxplot.stats(y)$out
> list <- union(a,b)
> plot(f)
> px <- f[f$x %in% a,]
> py <- f[f$y %in% b,]
> p <- rbind(px,py)
> par(new=TRUE)
> plot(p$x, p$y,cex=2,col=2)

While R did what we asked, the plot does not look right: after par(new=TRUE), the second plot call draws its own axes and labels over the first. We also completely fabricated the data; in a real use case, you would need to use your domain expertise to determine whether these outliers were correct or not.
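One way to keep the highlighted points on the same axes is to add them to the existing plot with points rather than starting a second plot. The sketch below assumes that approach; the seed and the is_out name are illustrative, and swapping | for & would select the intersection instead of the union.

```r
# Fix the seed so the example is repeatable (arbitrary value).
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
f <- data.frame(x, y)

# Outlying values in each dimension, as reported by the box plot rule.
a <- boxplot.stats(x)$out
b <- boxplot.stats(y)$out

# Union: a row is flagged if it is an outlier in x or in y.
is_out <- f$x %in% a | f$y %in% b

# Draw all points, then overlay the flagged ones on the same axes.
plot(f)
points(f$x[is_out], f$y[is_out], cex = 2, col = 2)
```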
Given the variety of what constitutes an anomaly, R programming has a mechanism that gives you complete control over it: write your own function that can be used to make a decision.
We can define our own anomaly-detection function by following this general template, where name is a placeholder for the function name:
name <- function(parameters, ...) {
  # determine what constitutes an anomaly
  return(df)
}
Here, the parameters are the values we need to use in the function. I am assuming we return a data frame from the function. The function could do anything.
We will be using the iris data in this example, as shown here:
> data <- read.csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
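The column names used in the examples that follow, such as X5.1, arise because the iris file has no header row and read.csv treats the first record as one by default. The sketch below illustrates this on the first two records of the file, inlined as text; the descriptive column names in the second call are hypothetical and are not used elsewhere in this section.

```r
# First two records of the iris file, inlined so the example needs no download.
sample_lines <- "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa"

# Default header = TRUE: the first record becomes the column names.
with_default <- read.csv(text = sample_lines)
names(with_default)

# header = FALSE keeps every record; these column names are hypothetical.
no_header <- read.csv(
  text = sample_lines,
  header = FALSE,
  col.names = c("sepal_length", "sepal_width", "petal_length",
                "petal_width", "species")
)
nrow(no_header)
```

Reading the file this way keeps the first observation, at the cost of shifting every row number by one relative to the output shown in this section.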
If we decide an anomaly is present when the sepal length (the first column, which read.csv has named X5.1) is under 4.5 or over 7.5, we could use a function as shown here:
> outliers <- function(data, low, high) {
+   outs <- subset(data, data$X5.1 < low | data$X5.1 > high)
+   return(outs)
+ }
Then, we will get the following output:
> outliers(data, 4.5, 7.5)
    X5.1 X3.5 X1.4 X0.2    Iris.setosa
8    4.4  2.9  1.4  0.2    Iris-setosa
13   4.3  3.0  1.1  0.1    Iris-setosa
38   4.4  3.0  1.3  0.2    Iris-setosa
42   4.4  3.2  1.3  0.2    Iris-setosa
105  7.6  3.0  6.6  2.1 Iris-virginica
117  7.7  3.8  6.7  2.2 Iris-virginica
118  7.7  2.6  6.9  2.3 Iris-virginica
122  7.7  2.8  6.7  2.0 Iris-virginica
131  7.9  3.8  6.4  2.0 Iris-virginica
135  7.7  3.0  6.1  2.3 Iris-virginica
This gives us the flexibility of making slight adjustments to our criteria by passing different parameter values to the function in order to achieve the desired results.
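The same idea extends to any column. The following sketch is an assumption about how one might generalise the function: outliers_by and its column argument are hypothetical names, and it uses the iris data frame that ships with R (with column names such as Sepal.Length) rather than the downloaded file.

```r
# Return the rows whose value in `column` falls outside [low, high].
# `outliers_by` is a hypothetical helper, not part of any package.
outliers_by <- function(data, column, low, high) {
  values <- data[[column]]
  data[values < low | values > high, , drop = FALSE]
}

# Built-in iris data: flag sepal lengths under 4.5 or over 7.5.
res <- outliers_by(iris, "Sepal.Length", 4.5, 7.5)
nrow(res)

# The same function with a different column and different thresholds.
wide <- outliers_by(iris, "Sepal.Width", 2.2, 4.0)
```

Taking the column name as a parameter means the thresholds and the variable being screened can both be adjusted without rewriting the function.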
Another option is the DMwR package. It contains the lofactor function, which computes a local outlier factor (LOF) score for each row and can also be used to locate outliers. The DMwR package can be installed and loaded using the following commands:
> install.packages("DMwR")
> library(DMwR)
We need to remove the species column from the data, because it is categorical and lofactor works on numeric data. This can be done by using the following command:
> nospecies <- data[,1:4]
Now, we compute an outlier score for each row in the frame:
> scores <- lofactor(nospecies, k=3)
Next, we take a look at their distribution:
> plot(density(scores))

One point of interest is whether several of the observations have nearly equal scores (that is, where the density reaches about 4).
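To give a feel for what the scores mean: a local outlier factor near 1 indicates a point about as dense as its neighbours, and larger values indicate a more isolated point. The following base-R sketch is not DMwR's implementation; it is a simplified version of the standard definition that takes exactly k neighbours and assumes no tied distances. The function name lof_sketch and the synthetic data are assumptions for illustration.

```r
# Simplified local outlier factor; assumes k < nrow(data) and no tied distances.
lof_sketch <- function(data, k) {
  d <- as.matrix(dist(data))   # pairwise Euclidean distances
  n <- nrow(d)
  diag(d) <- Inf               # a point is not its own neighbour

  # nn[i, ] holds the indices of the k nearest neighbours of point i.
  nn <- matrix(0L, nrow = n, ncol = k)
  for (i in seq_len(n)) nn[i, ] <- order(d[i, ])[seq_len(k)]

  # k-distance: distance from each point to its k-th nearest neighbour.
  kdist <- d[cbind(seq_len(n), nn[, k])]

  # Local reachability density: inverse of the mean reachability distance
  # from a point to its neighbours.
  lrd <- vapply(seq_len(n), function(p) {
    o <- nn[p, ]
    1 / mean(pmax(kdist[o], d[p, o]))
  }, numeric(1))

  # LOF: mean density of the neighbours relative to the point's own density.
  vapply(seq_len(n), function(p) mean(lrd[nn[p, ]]) / lrd[p], numeric(1))
}

# Synthetic data: 50 clustered points plus one point far from the cluster.
set.seed(7)
pts <- rbind(matrix(rnorm(100), ncol = 2), c(10, 10))
scores <- lof_sketch(pts, k = 3)

# Rows with the largest scores are the strongest outlier candidates.
head(order(scores, decreasing = TRUE), 3)
```

Sorting the scores in this way is a simple follow-up to the density plot when you want to inspect the individual rows behind its right-hand tail.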