Data types


There are several basic data types in R for handling different types of data and values:

  • numeric: The numeric data type is used to store real or decimal vectors and is identical to the double data type.

  • double: This data type can store and represent double precision vectors.

  • integer: This data type is used for representing 32-bit integer vectors.

  • character: This data type is used to represent character vectors, where each element can be a string of type character.

  • logical: The reserved words TRUE and FALSE are logical constants in the R language, while T and F are global variables initially set to these values. All four are of the logical data type.

  • complex: This data type is used to store and represent complex numbers.

  • factor: This type is used to represent nominal or categorical variables by storing the nominal values in a vector of integers ranging from 1 to n, where n is the number of distinct values of the variable. A vector of character strings for the actual variable values is then mapped to this vector of integers.

  • Miscellaneous: There are several other types, including NA, which denotes missing values in data; NaN, which denotes "not a number"; and ordered factors, which are used to represent ordinal variables.

Common functions for each data type include the as.*(…) and is.*(…) families, which are used for converting between data types (typecasting) and checking a data type, respectively.

For example, as.numeric(…) would typecast the data or vector indicated by the ellipses into the numeric type, and is.numeric(…) would check whether the data is of the numeric type.

Let us look at a few more examples for the various data types in the following code snippet to understand them better:

# typecasting and checking data types
> n <- c(3.5, 0.0, 1.7, 0.0)
> typeof(n)
[1] "double"
> is.numeric(n)
[1] TRUE
> is.double(n)
[1] TRUE
> is.integer(n)
[1] FALSE
> as.integer(n)
[1] 3 0 1 0
> as.logical(n)
[1]  TRUE FALSE  TRUE FALSE

# complex numbers
> comp <- 3 + 4i
> typeof(comp)
[1] "complex"

# factoring nominal variables
> size <- c(rep('large', 5), rep('small', 5), rep('medium', 3))
> size
 [1] "large"  "large"  "large"  "large"  "large"  "small"  "small"  "small"  "small"  "small" 
[11] "medium" "medium" "medium"
> size <- factor(size)
> size
 [1] large  large  large  large  large  small  small  small  small  small  medium medium medium
Levels: large medium small
> summary(size)
 large medium  small 
     5      3      5

The preceding examples should make the concepts clearer. Notice that non-zero numeric values are always coerced to TRUE and zero values to FALSE, as we can see from typecasting our numeric vector to logical. We will now dive into the various data structures in R.

Data structures

The base R system has several important core data structures which are extensively used in handling, processing, manipulating, and analyzing data. We will be talking about five important data structures, which can be classified according to the type of data they can store and their dimensionality. The classification is depicted in the following table:

Content type     Dimensionality     Data structure
Homogeneous      One-dimensional    Vector
Homogeneous      N-dimensional      Array
Homogeneous      Two-dimensional    Matrix
Heterogeneous    One-dimensional    List
Heterogeneous    Two-dimensional    DataFrame

The content type in the table depicts whether the data stored in the structure belongs to the same data type (homogeneous) or can be of different data types (heterogeneous). The dimensionality of the data structure is self-explanatory. We will now examine each data structure in further detail.
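
To make this classification concrete, here is a small illustrative sketch (not taken from the book's examples) that uses the class(…) function to check which structure an object is; note that recent versions of R (4.0 and later) report a matrix as both "matrix" and "array":

# checking the class of each core data structure
# atomic vectors report their data type as their class
> class(c(1, 2, 3))
[1] "numeric"
# arrays, matrices, lists and data frames report the structure itself
> class(array(1:8, dim = c(2, 2, 2)))
[1] "array"
> class(matrix(1:6, nrow = 2))
[1] "matrix"
> class(list(1, "a", TRUE))
[1] "list"
> class(data.frame(x = 1:3))
[1] "data.frame"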

Vectors

The vector is the most basic data structure in R, and by vectors here we mean atomic vectors. They can be used to represent any data in R, including input and output data. Vectors are usually created using the c(…) function, which is short for combine. Vectors can also be created in other ways, such as using the : operator or the seq(…) family of functions. Vectors are homogeneous: all elements always belong to a single data type, and the vector itself is a one-dimensional structure. The following snippet shows some vector representations:

> 1:5
[1] 1 2 3 4 5
> c(1,2,3,4,5)
[1] 1 2 3 4 5
> seq(1,5)
[1] 1 2 3 4 5
> seq_len(5)
[1] 1 2 3 4 5

You can also assign vectors to variables and perform different operations on them, including data manipulation, mathematical operations, transformations, and so on. We depict a few such examples in the following snippet:

# assigning two vectors to variables
> x <- 1:5
> y <- c(6,7,8,9,10)
> x
[1] 1 2 3 4 5
> y
[1]  6  7  8  9 10

# operating on vectors
> x + y
[1]  7  9 11 13 15
> sum(x)
[1] 15
> mean(x)
[1] 3
> x * y
[1]  6 14 24 36 50
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068

# indexing and slicing
> y[2:4]
[1] 7 8 9
> y[c(2,3,4)]
[1] 7 8 9

# naming vector elements
> names(x) <- c("one", "two", "three", "four", "five")
> x
  one   two three  four  five 
    1     2     3     4     5

The preceding snippet should give you a good flavor of what we can do with vectors. Try playing around with vectors and transforming and manipulating data with them!

Arrays

As shown in the table of data structures we mentioned earlier, arrays store homogeneous data but, unlike vectors, are N-dimensional structures. Matrices are a special case of arrays with two dimensions, but more on that later. It is difficult to display data with more than two dimensions on the screen, but R handles this by printing an array one two-dimensional slice at a time. The following example creates a three-dimensional array with dimensions 2x2x3:

# create a three-dimensional array
three.dim.array <- array(
    1:12,    # input data
    dim = c(2, 2, 3),   # dimensions
    dimnames = list(    # names of dimensions
        c("row1", "row2"),
        c("col1", "col2"),
        c("first.set", "second.set", "third.set")
    )
)

# view the array
> three.dim.array
, , first.set

     col1 col2
row1    1    3
row2    2    4

, , second.set

     col1 col2
row1    5    7
row2    6    8

, , third.set

     col1 col2
row1    9   11
row2   10   12

From the preceding output, you can see that R filled in the data in column-first (column-major) order in the three-dimensional array. We will now look at matrices in the following section.

Matrices

We've briefly mentioned that matrices are a special case of arrays with two dimensions. These two dimensions are represented by rows and columns. Just as we used the array(…) function in the previous section to create an array, we will be using the matrix(…) function to create matrices.

The following snippet creates a 4x3 matrix:

# create a matrix
mat <- matrix(
    1:12,   # data
    nrow = 4,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)

# view the matrix
> mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12

You can see from the preceding output that we have a 4x3 matrix, with 4 rows and 3 columns, and that the data was filled in row-wise by setting the byrow parameter in the matrix(…) function.

The following snippet shows some mathematical operations with matrices which you should be familiar with:

# initialize matrices
m1 <- matrix(
    1:9,   # data
    nrow = 3,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)
m2 <- matrix(
    10:18,   # data
    nrow = 3,  # num of rows
    ncol = 3,  # num of columns
    byrow = TRUE  # fill the elements row-wise
)

# matrix addition
> m1 + m2
     [,1] [,2] [,3]
[1,]   11   13   15
[2,]   17   19   21
[3,]   23   25   27

# matrix transpose
> t(m1)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

# matrix product
> m1 %*% m2
     [,1] [,2] [,3]
[1,]   84   90   96
[2,]  201  216  231
[3,]  318  342  366

We encourage you to try out more complex operations using matrices. See if you can find the inverse of a matrix.
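
As a hint for that exercise, the following is a minimal sketch (not part of the original text) using the built-in solve(…) function, which returns the inverse of an invertible square matrix. Note that m1 from the previous snippet is singular, so we use a different matrix here:

# inverting an invertible matrix with solve(...)
> m3 <- matrix(c(2, 1, 1, 1), nrow = 2, byrow = TRUE)
> m3
     [,1] [,2]
[1,]    2    1
[2,]    1    1
> solve(m3)
     [,1] [,2]
[1,]    1   -1
[2,]   -1    2

# multiplying a matrix by its inverse gives the identity matrix
> m3 %*% solve(m3)
     [,1] [,2]
[1,]    1    0
[2,]    0    1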

Lists

Lists are the other type of vector in R besides the atomic vectors we discussed earlier. The difference is that lists are heterogeneous: each element can hold a different type of data and can itself be a list, an atomic vector, an array, a matrix, or even a function. The following snippet shows us how to create lists:

# create sample list
list.sample <- list(
    nums = seq.int(1,5),
    languages = c("R", "Python", "Julia", "Java"),
    sin.func = sin
)

# view the list
> list.sample
$nums
[1] 1 2 3 4 5

$languages
[1] "R"      "Python" "Julia"  "Java"  

$sin.func
function (x)  .Primitive("sin")

# accessing individual list elements
> list.sample$languages
[1] "R"      "Python" "Julia"  "Java"  
> list.sample$sin.func(1.5708)
[1] 1

You can see from the preceding snippet that lists can hold different types of elements and accessing them is really easy.

Let us look at a few more operations with lists in the following snippet:

# initializing two lists
l1 <- list(nums = 1:5)
l2 <- list(
    languages = c("R", "Python", "Julia"),
    months = c("Jan", "Feb", "Mar")
)

# check lists and their type
> l1
$nums
[1] 1 2 3 4 5

> typeof(l1)
[1] "list"
> l2
$languages
[1] "R"      "Python" "Julia" 

$months
[1] "Jan" "Feb" "Mar"

> typeof(l2)
[1] "list"

# concatenating lists
> l3 <- c(l1, l2)
> l3
$nums
[1] 1 2 3 4 5

$languages
[1] "R"      "Python" "Julia" 

$months
[1] "Jan" "Feb" "Mar"

# converting list back to a vector
> v1 <- unlist(l1)
> v1
nums1 nums2 nums3 nums4 nums5 
    1     2     3     4     5 
> typeof(v1)
[1] "integer"

Now that we know how lists work, we will be moving on to the last and perhaps most widely used data structure in data processing and analysis, the DataFrame.

DataFrames

The DataFrame is a special data structure used to handle heterogeneous, two-dimensional data in tabular form. Such data tables have several observations, samples, or data points represented by rows, and attributes of each sample represented by columns. Each column can be thought of as a dimension of the dataset or as a vector. The DataFrame is very popular since it works easily with tabular data, notably spreadsheets.

The following snippet shows us how we can create DataFrames and examine their properties:

# create data frame
df <- data.frame(
  name =  c("Wade", "Steve", "Slade", "Bruce"),
  age = c(28, 85, 55, 45),
  job = c("IT", "HR", "HR", "CS")
)

# view the data frame
> df
   name age job
1  Wade  28  IT
2 Steve  85  HR
3 Slade  55  HR
4 Bruce  45  CS

# examine data frame properties
> class(df)
[1] "data.frame"
> str(df)
'data.frame':      4 obs. of  3 variables:
 $ name: Factor w/ 4 levels "Bruce","Slade",..: 4 3 2 1
 $ age : num  28 85 55 45
 $ job : Factor w/ 3 levels "CS","HR","IT": 3 2 2 1
> rownames(df)
[1] "1" "2" "3" "4"
> colnames(df)
[1] "name" "age" "job"
> dim(df)
[1] 4 3

You can see from the preceding snippet how DataFrames can represent tabular data, where each attribute is a dimension or column. You can also perform multiple operations on DataFrames, such as merging, concatenating, binding, subsetting, and so on. We will depict some of these operations in the following snippet:

# initialize two data frames
emp.details <- data.frame(
    empid = c('e001', 'e002', 'e003', 'e004'),
    name = c("Wade", "Steve", "Slade", "Bruce"),
    age = c(28, 85, 55, 45)
)
job.details <- data.frame(
    empid = c('e001', 'e002', 'e003', 'e004'),
    job = c("IT", "HR", "HR", "CS")
)

# view data frames
> emp.details
  empid  name age
1  e001  Wade  28
2  e002 Steve  85
3  e003 Slade  55
4  e004 Bruce  45
> job.details
  empid job
1  e001  IT
2  e002  HR
3  e003  HR
4  e004  CS

# binding and merging data frames
> cbind(emp.details, job.details)
  empid  name age empid job
1  e001  Wade  28  e001  IT
2  e002 Steve  85  e002  HR
3  e003 Slade  55  e003  HR
4  e004 Bruce  45  e004  CS
> merge(emp.details, job.details, by='empid')
  empid  name age job
1  e001  Wade  28  IT
2  e002 Steve  85  HR
3  e003 Slade  55  HR
4  e004 Bruce  45  CS

# subsetting data frame
> subset(emp.details, age > 50)
  empid  name age
2  e002 Steve  85
3  e003 Slade  55

Now that we have a good grasp of data structures, we will look at concepts related to functions in R in the next section.

Functions

So far we have dealt with various variables, data types, and structures for storing data. Functions are just another type of object in R, albeit a special one that allows us to operate on data and perform actions with it. Functions are useful for modularizing code and separating concerns where needed, by dedicating specific actions and operations to different functions and implementing the logic needed for each action inside its function. We will be talking about two types of functions in this section: built-in functions and user-defined functions.

Built-in functions

There are several functions which come with the base installation of R and its core packages. You can access these built-in functions directly using the function name and you can get more functions as you install newer packages. We depict operations using a few built-in functions in the following snippet:

> sqrt(7)
[1] 2.645751
> mean(1:5)
[1] 3
> sum(1:5)
[1] 15
> sqrt(1:5)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068
> runif(5)
[1] 0.8880760 0.2925848 0.9240165 0.6535002 0.1891892
> rnorm(5)
[1]  1.90901035 -1.55611066 -0.40784306 -1.88185230  0.02035915

You can see from the previous examples that functions such as sqrt(…), mean(…), and sum(…) are built-in and pre-implemented. They can be used anytime in R without the need to define these functions explicitly or load other packages.

User-defined functions

While built-in functions are good, you will often need to incorporate your own algorithms, logic, and processes for solving a problem. That is where you need to build your own functions. Typically, a function has three main components, as follows:

  • The environment(…), which contains the location map of the defined function and its variables

  • The formals(…), which depict the list of arguments used to call the function

  • The body(…), which contains the code inside the function implementing its core logic

Some depictions of user-defined functions are shown in the following code snippet:

# define the function
square <- function(data){
  return (data^2)
}

# inspect function components
> environment(square)
<environment: R_GlobalEnv>

> formals(square)
$data


> body(square)
{
    return(data^2)
}

# execute the function on data
> square(1:5)
[1]  1  4  9 16 25
> square(12)
[1] 144

We can see how user-defined functions are created using function(…), how their various components can be examined as discussed earlier, and how they can be used to operate on data.
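
As a further illustrative sketch (not from the original text), user-defined functions can also take multiple arguments, including arguments with default values that are used when the caller does not supply them; the power function below is a hypothetical example:

# function with a default argument value
> power <- function(data, exponent = 2){
+     return (data^exponent)
+ }
> power(1:5)
[1]  1  4  9 16 25
> power(1:5, exponent = 3)
[1]   1   8  27  64 125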

Controlling code flow

When writing complete applications and scripts using R, the flow and execution of code is very important. The flow of code depends on the statements, functions, variables, and data structures used in the code, which in turn follow from the algorithms, business logic, and other rules for solving the problem at hand. There are several constructs which can be used to control the flow of code, and we will primarily discuss the following two:

  • Looping constructs

  • Conditional constructs

We will start with looking at various looping constructs for executing the same sections of code multiple times.

Looping constructs

Looping constructs are used to execute code blocks or sections repeatedly as needed. Usually, a loop keeps executing the code block in its scope until a specific condition is met or a statement such as break is used. There are three main types of loops in R:

  • for

  • while

  • repeat

We will explore all the three constructs with examples in the following code snippet:

# for loop
> for (i in 1:5) {
+     cat(paste(i," "))
+ }
1  2  3  4  5  

> sum <- 0
> for (i in 1:10){
+     sum <- sum + i
+ }

> sum
[1] 55

# while loop
> n <- 1
> while (n <= 5){
+     cat(paste(n, " "))
+     n <- n + 1
+ }
1  2  3  4  5  

# repeat loop
> i <- 1
> repeat{
+     cat(paste(i, " "))
+     if (i >= 5){
+         break  # break out of the infinite loop
+     }
+     i <- i + 1
+ }
1  2  3  4  5  

An important point to remember here is that, with larger amounts of data, vectorized constructs are usually faster than explicit loops; we will cover some of them in the Advanced operations section later.
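
To make this concrete, here is a small illustrative comparison (a sketch, not from the original text) showing the summation loop from the earlier snippet rewritten as a single vectorized call, along with a vectorized element-wise operation:

# vectorized equivalent of the summation loop shown earlier
> sum(1:10)
[1] 55

# element-wise operations are also vectorized
> (1:10)^2
 [1]   1   4   9  16  25  36  49  64  81 100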

Conditional constructs

There are several conditional constructs which help us execute and control the flow of code based on user-defined rules and conditions. This is very useful when we do not want to execute all the code blocks in a script sequentially, but only want to execute specific code blocks if and only if specific conditions are met or not met.

There are mainly four constructs which are used frequently in R:

  • if or if…else

  • if…else if…else

  • ifelse(…)

  • switch(…)

The last two are functions, whereas the first two are statements using the if, if…else, and if…else if…else syntax. We will look at them in the following code snippet with examples:

# using if
> num = 10
> if (num == 10){
+     cat('The number was 10')
+ }
The number was 10

# using if-else
> num = 5
> if (num == 10){
+     cat('The number was 10')
+ } else{
+     cat('The number was not 10')
+ }
The number was not 10

# using if-else if-else
> if (num == 10){
+     cat('The number was 10')
+ } else if (num == 5){
+     cat('The number was 5')
+ } else{
+     cat('No match found')
+ }
The number was 5

# using ifelse(...) function
> ifelse(num == 10, "Number was 10", "Number was not 10")
[1] "Number was not 10"

# using switch(...) function
> for (num in c("5","10","15")){
+     cat(
+         switch(num,
+             "5" = "five",
+             "7" = "seven",
+             "10" = "ten",
+             "No match found"
+     ), "\n")
+ }
five 
ten 
No match found

From the preceding snippet, we can see that switch(…) has a default option which can return a user-defined value when no match is found when evaluating the condition.

Advanced operations

We can perform several advanced vectorized operations in R, which is useful when dealing with large amounts of data and improves code performance in terms of execution time. Some advanced constructs in the apply family of functions will be covered in this section, as follows:

  • apply: Evaluates a function on the boundaries or margins of an array

  • lapply: Loops over a list and evaluates a function on each element

  • sapply: A version of the lapply(…) function which simplifies its results where possible

  • tapply: Evaluates a function over subsets of a vector

  • mapply: A multivariate version of the sapply(…) function

Let's look at how each of these functions work in further detail.

apply

As we mentioned earlier, the apply(…) function is used mainly to evaluate any defined function over the margins or boundaries of an array or matrix.

An important point to note here is that there are dedicated aggregation functions, rowSums(…), rowMeans(…), colSums(…), and colMeans(…), which are implemented far more efficiently than the equivalent apply(…) calls and should be preferred when operating on large arrays; a timing sketch at the end of this section illustrates the difference.

The following snippet depicts some aggregation functions being applied to a matrix:

# creating a 4x4 matrix
> mat <- matrix(1:16, nrow=4, ncol=4)

# view the matrix
> mat
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

# row sums
> apply(mat, 1, sum)
[1] 28 32 36 40
> rowSums(mat)
[1] 28 32 36 40

# row means
> apply(mat, 1, mean)
[1]  7  8  9 10
> rowMeans(mat)
[1]  7  8  9 10

# col sums
> apply(mat, 2, sum)
[1] 10 26 42 58
> colSums(mat)
[1] 10 26 42 58
 
# col means
> apply(mat, 2, mean)
[1]  2.5  6.5 10.5 14.5
> colMeans(mat)
[1]  2.5  6.5 10.5 14.5
 
# row quantiles
> apply(mat, 1, quantile, probs=c(0.25, 0.5, 0.75))
    [,1] [,2] [,3] [,4]
25%    4    5    6    7
50%    7    8    9   10
75%   10   11   12   13

You can see aggregations taking place without the need of extra looping constructs.
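
To get a feel for the performance difference mentioned in the note above, the following rough sketch (not from the original text; actual timings vary by machine, so no output is shown) compares apply(…) with rowSums(…) on a larger matrix using system.time(…):

# compare apply(...) and rowSums(...) on a large random matrix;
# rowSums(...) typically finishes in a small fraction of the time
big.mat <- matrix(rnorm(2000000), nrow = 200000, ncol = 10)
system.time(apply(big.mat, 1, sum))
system.time(rowSums(big.mat))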

lapply

The lapply(…) function takes a list and a function as input parameters. It then evaluates that function over each element of the list. If the input is not a list, it is coerced to one using the as.list(…) function before the output is returned. All operations are vectorized, and we will see an example in the following snippet:

# create and view a list of elements
> l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
> l
$nums
 [1]  1  2  3  4  5  6  7  8  9 10

$even
[1]  2  4  6  8 10

$odd
[1] 1 3 5 7 9

# use lapply on the list
> lapply(l, sum)
$nums
[1] 55

$even
[1] 30

$odd
[1] 25

sapply

The sapply(…) function is quite similar to the lapply(…) function; the only difference is that it always tries to simplify the final result of the computation. If every element of the result has length 1, sapply(…) returns a vector; if every element has the same length greater than 1, a matrix is returned; and if it is not able to simplify the result, we end up with the same output as lapply(…). The following example will make things clearer:

# create and view a sample list
> l <- list(nums=1:10, even=seq(2,10,2), odd=seq(1,10,2))
> l
$nums
 [1]  1  2  3  4  5  6  7  8  9 10

$even
[1]  2  4  6  8 10

$odd
[1] 1 3 5 7 9

# observe differences between lapply and sapply
> lapply(l, mean)
$nums
[1] 5.5

$even
[1] 6

$odd
[1] 5
> typeof(lapply(l, mean))
[1] "list"

> sapply(l, mean)
nums even  odd 
 5.5  6.0  5.0
> typeof(sapply(l, mean))
[1] "double"

tapply

The tapply(…) function is used to evaluate a function over specific subsets of an input vector, where the subsets are defined by a user-supplied grouping factor.

The following example depicts the same:

> data <- 1:30
> data
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
> groups <- gl(3, 10)
> groups
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(data, groups, sum)
  1   2   3 
 55 155 255 
> tapply(data, groups, sum, simplify = FALSE)
$'1'
[1] 55

$'2'
[1] 155

$'3'
[1] 255

mapply

The mapply(…) function is used to evaluate a function in parallel over sets of arguments. It is basically a multivariate version of the sapply(…) function.

The following example shows how we can build a list of vectors easily with mapply(…), compared to calling the rep(…) function multiple times:

> list(rep(1,4), rep(2,3), rep(3,2), rep(4,1))
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

> mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1

[[2]]
[1] 2 2 2

[[3]]
[1] 3 3

[[4]]
[1] 4

Visualizing data

One of the most important aspects of data analytics is to depict meaningful insights with crisp and concise visualizations. Data visualization is one of the most important parts of exploratory data analysis, as well as an important medium for presenting results from any analyses. There are three main popular plotting systems in R:

  • The base plotting system which comes with the basic installation of R

  • The lattice plotting system which produces better looking plots than the base plotting system

  • The ggplot2 package which is based on the grammar of graphics and produces beautiful, publication quality visualizations

The following snippet depicts visualizations using all three plotting systems on the popular iris dataset:

# load the data
> data(iris)

# base plotting system
> boxplot(Sepal.Length~Species,data=iris, 
+         xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This gives us a set of boxplots using the base plotting system as depicted in the following plot:

# lattice plotting system (the lattice package must be loaded first)
> library(lattice)
> bwplot(Sepal.Length~Species,data=iris, xlab="Species", ylab="Sepal Length", main="Iris Boxplot")

This snippet helps us in creating a set of boxplots for the various species using the lattice plotting system:

# ggplot2 plotting system (the ggplot2 package must be loaded first)
> library(ggplot2)
> ggplot(data=iris, aes(x=Species, y=Sepal.Length)) + geom_boxplot(aes(fill=Species)) + 
+     ylab("Sepal Length") + ggtitle("Iris Boxplot") +
+     stat_summary(fun.y=mean, geom="point", shape=5, size=4) + theme_bw()

This code snippet gives us the following boxplots using the ggplot2 plotting system:

You can see from the preceding visualizations how each plotting system works and compare the plots across each system. Feel free to experiment with each system and visualize your own data!

Next steps

This quick refresher on getting started with R should gear you up for the upcoming chapters and also make you more comfortable with the R ecosystem, its syntax, and its features. It's important to remember that you can always get help with any aspect of R in a variety of ways. Besides that, packages are perhaps going to be the most important tool in your arsenal for analyzing data. We will briefly list some important commands for each of these two aspects which may come in handy in the future.

Getting help

R has thousands of packages, functions, constructs, and structures! Hence it is impossible for anyone to keep track of and remember them all. Luckily, help in R is readily available, and you can always get detailed information and documentation with regard to R, its utilities, features, and packages using the following commands:

  • help(<any_R_object>) or ?<any_R_object>: This provides help on any R object including functions, packages, data types, and so on

  • example(<function_name>): This provides a quick example for the mentioned function

  • apropos(<any_string>): This lists all functions containing the any_string term
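
For instance, a quick illustrative session using these utilities might look like the following sketch (not from the original text; the output is omitted here, as it depends on your installation and loaded packages):

# getting help on the mean function
> ?mean
> help(mean)
> example(mean)
> apropos("mean")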

Managing packages

R, being an open-source and community-driven language, has tons of packages to help you with your analyses, which are free to download, install, and use.

In the R community, the term library is often used interchangeably with the term package.

R provides the following utilities to manage packages:

  • install.packages(…): This installs a package from the Comprehensive R Archive Network (CRAN). CRAN helps maintain and distribute the various versions and documentation of R and its packages

  • .libPaths(…): This gets or sets the library paths that R searches through when installing and loading packages

  • installed.packages(lib.loc=): This lists installed packages

  • update.packages(lib.loc=): This updates installed packages

  • remove.packages(…): This removes a package

  • path.package(…): This returns the path(s) of the packages loaded in the current session

  • library(…): This loads a package in a script to use its functions or utilities

  • library(help=): This lists the functions and documentation available in a package
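
As a small illustrative sketch (not from the original text), a typical package workflow using these utilities looks like the following; the ggplot2 package is used here purely as an example:

# install a package from CRAN (requires an internet connection)
install.packages("ggplot2")

# load the installed package into the current session
library(ggplot2)

# view the package's documentation index and available functions
library(help = "ggplot2")

# remove the package if it is no longer needed
remove.packages("ggplot2")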

This brings us to the end of our section on getting started with R and we will now look at some aspects of analytics, machine learning and text analytics.