Jupyter for Data Science

By: Dan Toomey

Overview of this book

Jupyter Notebook is a web-based environment that enables interactive computing in notebook documents. It allows you to create documents that contain live code, equations, and visualizations. This book is a comprehensive guide to getting started with data science using the popular Jupyter Notebook. If you are familiar with Jupyter Notebook and want to learn how to use its capabilities to perform various data science tasks, this is the book for you! From data exploration to visualization, this book will take you through every step of the way in implementing an effective data science pipeline using Jupyter. You will also see how you can utilize Jupyter's features to share your documents and code with your colleagues. The book also explains how Python 3, R, and Julia can be integrated with Jupyter for various data science tasks. By the end of this book, you will comfortably leverage the power of Jupyter to perform various tasks in data science successfully.

A first look at the Jupyter user interface


We can jump right in and see what Jupyter has to offer. A Jupyter screen looks like this:

Note

So, Jupyter is deployed as a website that can be accessed on your machine (or can be accessed like any other website across the internet).

We see the URL of the page, http://localhost:8888/tree. localhost is the standard hostname that refers to your own machine, so the address points to a web server running locally (here on port 8888). The page we are accessing on that server is the tree display, which is the default. It mirrors the layout of the projects within Jupyter: objects are displayed in a tree layout much like Windows File Explorer. The main page lists a number of projects; each project is its own subdirectory with its own contents. Depending on where you start Jupyter, the existing contents of the current directory will be included in the display as well.

Detailing the Jupyter tabs

On the web page, we have the soon to be familiar Jupyter logo and three tabs:

  • Files
  • Running
  • Clusters

The Files tab lists the objects available to Jupyter. The files used by Jupyter are stored as regular files on your disk. Jupyter's contents manager knows how to handle the different types of files and programs you are using. You can see the Jupyter files when you use Windows Explorer to view your file contents (they have an .ipynb file extension). You can see non-Jupyter files listed in the Jupyter window as well.
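Because notebooks are regular files, they can also be inspected outside of Jupyter. The following is a minimal sketch, assuming only that a notebook is a JSON document with a top-level cells list; the function names list_notebooks and count_cells are made up for this illustration:

```python
import json
from pathlib import Path


def list_notebooks(directory):
    """Return the names of the .ipynb files directly inside `directory`, sorted."""
    return sorted(p.name for p in Path(directory).glob("*.ipynb"))


def count_cells(notebook_path):
    """Open a notebook as ordinary JSON and count its cells."""
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)
    return len(notebook.get("cells", []))
```

Calling list_notebooks(".") from the directory where Jupyter was started should return the same notebook names that the Files tab shows for that directory.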

The Running tab lists the notebooks that have been started. Jupyter keeps track of which notebooks are running. This tab allows you to control which notebooks are running at any time.

The Clusters tab is for environments where several machines are in use for running Jupyter.

Note

Cluster implementations of Jupyter are a topic worthy of their own, dedicated materials.

What actions can I perform with Jupyter?

Next, we see:

  • A prompt Select items to perform action
  • An Upload button
  • A New pull-down menu
  • A Refresh icon

The prompt tells you that you can select multiple items and then perform the same action on all of them. Most of the following actions (in the menus) can be performed over a single item or a selected set of items.

The Upload button will present a prompt to select a file to upload to Jupyter. This would typically be used to move a data file into the project for access in the case where Jupyter is running as a website in a remote location where you can't just copy the file to the disk where Jupyter is running.

The New pull down menu presents a list of choices of the different kinds of Jupyter projects (kernels) that are available:

We can see the list of objects that Jupyter knows how to create:

  • Text File: Create a text file for use in this folder. For example, if the notebook were to import a file you may create the file using this feature.
  • Folder: Yes, just like in Windows File Explorer.
  • Terminals Unavailable: Grayed out; this feature can be used in a *nix (Unix-like) environment.
  • Notebooks: Grayed out; this is not really a file type, but a heading for the different types of notebooks that this installation knows how to create.
  • Julia 0.4.5: Creates a Julia notebook where the coding is in the Julia language.
  • Python 3: Creates a notebook where the coding is in the Python language. This is the default.
  • R: Creates a notebook where the coding is in the R language.
  • Depending on which kernels you have installed in your installation, you may see other notebook types listed.

What objects can Jupyter manipulate?

If we started one of the notebooks (it would automatically be selected in the Jupyter object list) and now looked at the pulldown of actions against the objects selected we would see a display like the following:

We see that the menu action has changed to Rename, as that is the most likely action to be taken on one file and we have an icon to delete the project as well (the trashcan icon).

The item count is now 1 (we have one object selected in the list). The icon for the one item is a filled-in blue square (denoting that it is a running project), and there is a familiar Home icon to bring us back to the Jupyter home page display in the previous screenshot.

The object's menu has choices for:

  • Folders: select the folders available
  • All Notebooks: select the Jupyter Notebooks
  • Running: select the running Jupyter Notebooks
  • Files: select the files in the directory

If we scroll down in the object display, we see a little different information in the list of objects available. Each of the objects listed has a type (denoted by the icon shape associated) and a name assigned by the user when it was created.

Each of the objects is a Jupyter project that can be accessed, shared, and moved on its own. Every project has a full name, as entered by the user creating the project, and an icon that portrays this entry as a project. We will see other Jupyter icons corresponding to other project components, as follows:

Viewing the Jupyter project display

If we pull down the New menu and select Python 3, Jupyter would create a new Python notebook and move to display its contents. We would see a display like the following:

We have created a new Jupyter Notebook and are in its display. The logo is there. The title defaults to Untitled, which we can change by clicking on it. There is an (autosaved) marker that tells you Jupyter has automatically stored your notebook to disk (and will continue to do so regularly as you work on it).

We now have a menu bar and a denotation that this notebook is using Python 3 as its source language. The menu choices are:

  • File: Standard file operations
  • Edit: For editing cell contents (more to come)
  • View: To change the display of the notebook
  • Insert: To insert a cell in the notebook
  • Cell: To change the format, usage of a cell
  • Kernel: To adjust the kernel used for the notebook
  • Help: To bring up the help system for Jupyter

File menu

The File menu has the following choices:

  • New Notebook: Similar to the pull down from the home page.
  • Open...: Open a notebook.
  • Make a Copy...: Copy a notebook.
  • Rename...: Rename a notebook.
  • Save and Checkpoint: Save the current notebook at a checkpoint. Checkpoints are specific points in a notebook's history that you want to maintain in order to return to a checkpoint if you change your mind about a recent set of changes.
  • Print Preview: Similar to any print preview that you have used otherwise.
  • Download as: Allows you to store the notebook in a variety of formats. The most notable formats would be PDF or HTML, which would allow you to share the notebook with users that do not have access to Jupyter.
  • Trusted Notebook: (The feature is grayed out). When a notebook is opened by a user, the server computes a signature with the user's key, and compares it with the signature stored in the notebook's metadata. If the signature matches, HTML and JavaScript output in the notebook will be trusted at load, otherwise it will be untrusted.
  • Close and Halt: Close the current notebook and stop it running in the Jupyter system.
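The signature check described under Trusted Notebook can be pictured with a keyed hash. This is only a sketch of the idea, not Jupyter's actual implementation: the use of HMAC-SHA256 over the serialized notebook, and the names sign_notebook and is_trusted, are assumptions made for illustration.

```python
import hashlib
import hmac
import json


def sign_notebook(notebook, secret_key):
    """Return a hex HMAC-SHA256 digest of the notebook's JSON content."""
    payload = json.dumps(notebook, sort_keys=True).encode("utf-8")
    return hmac.new(secret_key, payload, hashlib.sha256).hexdigest()


def is_trusted(notebook, stored_signature, secret_key):
    """Compare a freshly computed signature with the stored one."""
    fresh = sign_notebook(notebook, secret_key)
    return hmac.compare_digest(fresh, stored_signature)
```

Any change to the notebook content, or a different user key, produces a different digest, so the comparison fails and the output would be treated as untrusted.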

Edit menu

The Edit menu has the following choices:

  • Cut Cells: Typical cut operation.
  • Copy Cells: Copies the selected cells to a memory buffer so that they can later be pasted into another location in the notebook, as in other GUI applications.
  • Paste Cells Above: If you have selected a cell and if you have copied a cell, this option will not be grayed out and will paste the buffered cell above the current cell.
  • Paste Cells Below: Similar to the previous option.
  • Delete Cells: Will delete the selected cells.
  • Undo Delete Cells.
  • Split Cell: There is a style issue here, regarding how many statements you put into a cell. Many times, you will start with one cell containing a number of statements and split that cell up many times to break off individual or groups of statements into their own cell.
  • Merge Cell Above: Combine the current cell with the one above it.
  • Merge Cell Below: Similar to the previous option.
  • Move Cell Up: Move the current cell before the one above it.
  • Move Cell Down.
  • Edit Notebook Metadata: For advanced users to modify the notebook's metadata (for example, the kernel and language information that Jupyter stores for your notebook).
  • Find and Replace: Locate specific text within cells and possibly replace.
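To make Split Cell and the Merge Cell options concrete, here is a small sketch that treats a cell as a plain dictionary with a source string. The dictionary layout loosely follows the notebook file format, and the helper names split_cell and merge_cells are invented for this example; the Jupyter user interface performs these operations for you.

```python
def split_cell(cell, line_index):
    """Split one cell into two at `line_index`, keeping the cell type."""
    lines = cell["source"].splitlines()
    top = dict(cell, source="\n".join(lines[:line_index]))
    bottom = dict(cell, source="\n".join(lines[line_index:]))
    return top, bottom


def merge_cells(first, second):
    """Merge two cells into one, joining their source with a newline."""
    return dict(first, source=first["source"] + "\n" + second["source"])


cell = {"cell_type": "code", "source": "a = 1\nb = 2\nprint(a + b)"}
# Break the print statement off into its own cell.
setup_cell, print_cell = split_cell(cell, 2)
```

Merging setup_cell and print_cell again gives back a cell with the original three statements.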

View menu

The View menu has the following choices:

  • Toggle Header: Toggle the display of the Jupyter header
  • Toggle Toolbar: Toggle the display of the Jupyter toolbar
  • Cell Toolbar: Change the displayed items for the cell being edited:
    • None: Don't display a cell toolbar
    • Edit Metadata: Edit a cell's metadata directly
    • Raw Cell Format: Edit the cell raw format as used by Jupyter
    • Slideshow: Walk through the cells in a slideshow manner

Insert menu

The Insert menu has the following choices:

  • Insert Cell Above: Insert a new, empty cell above the current cell
  • Insert Cell Below: Insert a new, empty cell below the current cell

Cell menu

The Cell menu has the following choices:

  • Run Cells: Runs the selected cell(s)
  • Run Cells and Select Below: Runs the selected cell(s) and then selects the cell below
  • Run Cells and Insert Below: Runs the selected cell(s) and adds a blank cell below
  • Run All: Runs all of the cells
  • Run All Above: Runs all of the cells above the current
  • Run All Below: Runs all of the cells below the current
  • Cell Type: Changes the type of the selected cell(s) to:
    • Code: This is the default; the cell is expected to contain language statements
    • Markdown: The cell contains Markdown (which may include HTML), typically used to document and present the notebook in the best manner
    • Raw NBConvert: The cell contains plain text that is passed through unmodified when the notebook is converted with nbconvert
  • Current Outputs: Toggle, scroll, or clear the outputs of the selected cell(s)
  • All Output: The same choices, applied to the outputs of every cell
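The three cell types are easier to picture when you see how they might be stored. The sketch below builds a tiny notebook structure by hand with the standard json module; the field names follow version 4 of the notebook file format as I understand it, and make_cell is a hypothetical helper for this example, not part of Jupyter.

```python
import json


def make_cell(cell_type, source):
    """Build a minimal notebook-style cell dictionary."""
    cell = {"cell_type": cell_type, "metadata": {}, "source": source}
    if cell_type == "code":
        # Code cells also record their outputs and an execution count.
        cell["outputs"] = []
        cell["execution_count"] = None
    return cell


notebook = {
    "cells": [
        make_cell("markdown", "# A heading written in Markdown"),
        make_cell("code", "print('hello')"),
        make_cell("raw", "passed through unchanged on conversion"),
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

# The text that would be written to an .ipynb file.
notebook_json = json.dumps(notebook, indent=1)
```

Changing a cell's type from the Cell menu amounts to changing its cell_type entry; only code cells carry outputs.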

Kernel menu

The Kernel menu is used to control the underlying language engine used by the notebook. The menu choices are as follows. I think many of the choices in this menu are used very little:

  • Interrupt: Stops the computation that is currently running; the underlying language engine stays alive and keeps its state
  • Restart: Restarts the underlying language engine
  • Restart & Clear Output
  • Restart & Run All
  • Reconnect: Re-establishes the connection between the notebook page and its kernel, for example, after the connection has been lost
  • Change kernel: Changes the language used in this notebook to one available in your installation

Help menu

The help menu displays the help options for Jupyter and language context choices. For example, in our Python notebook we see choices for common Python libraries that may be used:

Icon toolbar

Just below the regular menu is an icon toolbar with many of the commonly used menu items for faster use, as seen in this view:

The icons correspond to the previous menu choices (listed in order of appearance):

  • File/Save the current notebook
  • Insert cell below
  • Cut current cells
  • Copy the current cells
  • Paste cells below
  • Move selected cells up
  • Move selected cells down
  • Run the selected cells
  • Interrupt the kernel
  • Restart kernel
  • List of formats we can apply to the current cells
  • An icon to open a command palette with descriptive names
  • An icon to open the cell toolbar

How does it look when we execute scripts?

If we were to provide a name for the notebook, enter a simple Python script, and execute the notebook cells, we would see a display like this:

The script is:

name = "Dan Toomey"
state = "MA"
print(name + " lives in " + state)

We assign a value to the name and state variables and then print them out.

If you notice, I have placed the statements into two different cells. This is just for readability. They could all be in the same cell or three different cells.

Each code cell is labeled with a number in brackets. This is an execution count rather than a line number: it starts at 1 for the first cell you run and grows each time a cell is executed, so the labels need not match the position of the cells (as you can see, the first cell is labeled cell 2 in the display).

Below the second cell, we have non-editable display results. Jupyter always displays any corresponding output of a cell just below. This could include error information as well.
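The behavior described here (cells sharing variables, a growing cell number, and output shown per cell) can be imitated with a toy model. This is a sketch only: ToyKernel is an invented class that runs Python source with exec, which is far simpler than what a real Jupyter kernel does.

```python
import contextlib
import io


class ToyKernel:
    """A toy stand-in for a kernel: shared namespace plus an execution counter."""

    def __init__(self):
        self.namespace = {}
        self.execution_count = 0

    def run_cell(self, source):
        """Execute `source`, returning (execution_count, captured_output)."""
        self.execution_count += 1
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return self.execution_count, buffer.getvalue()


kernel = ToyKernel()
# First cell: assignments only, so there is no output.
kernel.run_cell('name = "Dan Toomey"\nstate = "MA"')
# Second cell: uses the variables defined by the first cell.
count, output = kernel.run_cell('print(name + " lives in " + state)')
```

Running the first cell a second time would give it the label 3, which is how a cell near the top of a notebook can end up with a higher number than the cells below it.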

Industry data science usage

This book is about Jupyter and data science. We have the introduction to Jupyter. Now, we can look at data science practices and then see how the two concepts work together.

Data science is used in many industries. It is interesting to note the predominant technologies involved and algorithms used by industry. We can see the same technologies available within Jupyter.

Some of the industries that are larger users of data science include:

Industry           Larger data science use                       Technology/algorithms
Finance            Hedge funds                                   Python
Gambling           Establish odds                                R
Insurance          Measure and price risk                        Domino (R)
Retail banking     Risk, customer analytics, product analytics   R
Mining             Smart exploration, yield optimization         Python
Consumer products  Pricing and distribution                      R
Healthcare         Drug discovery and trials                     Python

Note

All of these data science investigations could be done in Jupyter, as the languages used are fully supported.

Real life examples

In this section, we see several examples taken from current industry focus and apply them in Jupyter to demonstrate its utility.

Finance, Python - European call option valuation

There is an example of this at https://www.safaribooksonline.com/library/view/python-for-finance/9781491945360/ch03.html, which is taken from the book Python for Finance by Yves Hilpisch. The model used is fairly standard for finance work.

We want to arrive at the theoretical value of a call option. A call option is the right to buy a security, such as IBM stock, at a specific (strike) price within a certain time frame. The option is priced based on the riskiness or volatility of the security in relation to the strike price and current price. The example uses a European option, which can only be exercised at maturity; this simplifies the problem set.

The example uses the Black-Scholes model for option valuation, where we have:

  • Initial stock index level S0 = 100
  • Strike price of the European call option K = 105
  • Time-to-maturity T = 1 year
  • Constant, riskless short rate r = 5%
  • Constant volatility σ  = 20%

These elements make up the following formula:
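Written to match the ST line of the script further down (with z a standard normally distributed random variable), the index level at maturity is:

```latex
S_T = S_0 \exp\left( \left( r - \tfrac{1}{2}\sigma^2 \right) T + \sigma \sqrt{T}\, z \right)
```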

The algorithm used is as follows:

  1. Draw I (pseudo) random numbers from the standard normal distribution.
  2. Calculate all resulting index levels at maturity ST(i) for given z(i) in the previous equation. Calculate all inner values of the option at maturity as hT(i) = max(ST(i) - K,0).
  3. Estimate the option present value via the Monte Carlo estimator given in the following equation:
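Written to match the C0 line of the script that follows, the Monte Carlo estimator discounts the average inner value over the I simulated paths:

```latex
C_0 \approx e^{-rT} \, \frac{1}{I} \sum_{i=1}^{I} h_T(i)
```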

The script is as follows. We use numpy for the intense mathematics used. The rest of the coding is typical:

import numpy as np
# set parameters
S0 = 100.
K = 105.
T = 1.0
r = 0.05
sigma = 0.2
# how many samples we are using
I = 100000
np.random.seed(103)
# draw I standard normal random numbers
z = np.random.standard_normal(I)
# index levels at maturity
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
# inner values of the option at maturity
hT = np.maximum(ST - K, 0)
# Monte Carlo estimator of the present value
C0 = np.exp(-r * T) * np.sum(hT) / I
# tell user results
print("Value of the European Call Option %5.3f" % C0)

The results under Jupyter are as shown in the following screenshot:

The 8.071 value is close to the published expected value of 8.019; the difference is due to variance in the random numbers used. (I am seeding the random number generator to have reproducible results.)
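A useful cross-check for simulated values like these is the closed-form Black-Scholes-Merton price for the same parameters. The sketch below is not part of the original example: the function names are my own, and the standard library's erf is used for the normal distribution so that nothing beyond Python itself is needed.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def bsm_call_value(S0, K, T, r, sigma):
    """Closed-form Black-Scholes-Merton value of a European call option."""
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)


# Same parameters as the Monte Carlo script.
analytic_value = bsm_call_value(100.0, 105.0, 1.0, 0.05, 0.2)
print("Closed-form value of the European call option %5.3f" % analytic_value)
```

The closed-form value is roughly 8.02, so both simulated figures sit within a few hundredths of it, as expected for 100,000 samples.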

Finance, Python - Monte Carlo pricing

Another algorithm in popular use is Monte Carlo simulation. In Monte Carlo, as the name of the gambling resort implies, we simulate a number of chances taken in a scenario where we know the percentage outcomes of the different results, but do not know exactly what will happen in the next N chances. We can see this model being used at http://www.codeandfinance.com/pricing-options-monte-carlo.html. In this example, we are using Black-Scholes again, but in a different direct method where we see individual steps.

The coding is as follows. Note the imports near the top of the code: rather than pulling in just the functions you want from a library, the script imports each entire module and qualifies every call with the module name (math.exp, math.sqrt, random.gauss). As the notebook runs Python 3, the loop uses range rather than xrange:

import datetime
import random  # provides gauss
import math    # provides exp, sqrt
random.seed(103)
def generate_asset_price(S,v,r,T):
    return S * math.exp((r - 0.5 * v**2) * T + v * math.sqrt(T) * random.gauss(0,1.0))
def call_payoff(S_T,K):
    return max(0.0,S_T-K)
S = 857.29 # underlying price
v = 0.2076 # vol of 20.76%
r = 0.0014 # rate of 0.14%
T = (datetime.date(2013,9,21) - datetime.date(2013,9,3)).days / 365.0
K = 860.
simulations = 90000
payoffs = []
discount_factor = math.exp(-r * T)
for i in range(simulations):
    S_T = generate_asset_price(S,v,r,T)
    payoffs.append(
        call_payoff(S_T, K)
    )
price = discount_factor * (sum(payoffs) / float(simulations))
print('Price: %.4f' % price)

The results under Jupyter are shown as follows:

The result price of 14.4452 is close to the published value 14.5069.
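How close is "close"? One way to judge is to compute the standard error of the Monte Carlo estimate alongside the price. The following is a sketch under my own assumptions (a separate random.Random generator, 20,000 simulations, and the function name mc_call_price), using the same market parameters as the script above:

```python
import math
import random


def mc_call_price(S, K, v, r, T, simulations, seed=None):
    """Monte Carlo price of a European call plus the estimate's standard error."""
    rng = random.Random(seed)
    discount = math.exp(-r * T)
    drift = (r - 0.5 * v ** 2) * T
    vol = v * math.sqrt(T)
    total = 0.0
    total_sq = 0.0
    for _ in range(simulations):
        S_T = S * math.exp(drift + vol * rng.gauss(0.0, 1.0))
        payoff = discount * max(0.0, S_T - K)
        total += payoff
        total_sq += payoff * payoff
    mean = total / simulations
    # Sample variance of the discounted payoffs.
    variance = (total_sq / simulations - mean ** 2) * simulations / (simulations - 1)
    return mean, math.sqrt(variance / simulations)


# 18 days between 2013-09-03 and 2013-09-21, as in the script above.
price, std_error = mc_call_price(857.29, 860.0, 0.2076, 0.0014, 18 / 365.0,
                                 20000, seed=103)
print('Price: %.4f +/- %.4f' % (price, std_error))
```

Two runs whose prices differ by less than a couple of standard errors are consistent with each other; increasing the number of simulations shrinks the standard error in proportion to the square root of the sample count.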

Gambling, R - betting analysis

Some of the gambling games are really coin flips, with 50/50 chances of success. Along those lines we have coding from http://forumserver.twoplustwo.com/25/probability/flipping-coins-getting-3-row-1233506/ that determines the probability of a series of heads or tails in a coin flip, with a trigger that can be used if you know the coin/game is biased towards one result or the other.

We have the following script:

##############################################
# Biased/unbiased recursion of heads OR tails
##############################################
import numpy as np
import math
N = 14     # number of flips
m = 3      # length of run (must be > 1 and <= N/2)
p = 0.5    # P(heads)
prob = np.repeat(0.0,N)
h = np.repeat(0.0,N)
t = np.repeat(0.0,N)
h[m] = math.pow(p,m)
t[m] = math.pow(1-p,m)
prob[m] = h[m] + t[m]
for n in range(m+1,2*m):
  h[n] = (1-p)*math.pow(p,m)
  t[n] = p*math.pow(1-p,m)
  prob[n] = prob[n-1] + h[n] + t[n]
for n in range(2*m,N):
  h[n] = ((1-p) - t[n-m] - prob[n-m-1]*(1-p))*math.pow(p,m)
  t[n] = (p - h[n-m] - prob[n-m-1]*p)*math.pow(1-p,m)
  prob[n] = prob[n-1] + h[n] + t[n]
prob[N-1]

The preceding code produces the following output in Jupyter:

We end up with the probability of getting a run of three heads or three tails in a row with an unbiased game. In this case, there is a 92% chance (within the range we have considered, 14 flips).
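A figure like this can be checked independently by brute force. The sketch below is not taken from the forum post: it simply enumerates every possible sequence of fair coin flips and counts those containing a run, which is only practical for small N and does not cover the biased case. The function names are my own.

```python
from itertools import product


def has_run(flips, m):
    """Return True if `flips` contains m or more identical results in a row (m >= 2)."""
    run = 1
    for previous, current in zip(flips, flips[1:]):
        run = run + 1 if current == previous else 1
        if run >= m:
            return True
    return False


def run_probability(N, m):
    """Exact probability of a run of at least m heads or tails in N fair flips."""
    hits = sum(has_run(flips, m) for flips in product("HT", repeat=N))
    return hits / 2 ** N


probability = run_probability(14, 3)
print("%.4f" % probability)
```

For 14 flips and runs of three, the enumeration gives about 0.926, in line with the figure quoted above.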

Insurance, R - non-life insurance pricing

We have an example of using R to come up with the pricing for non-life products, specifically mopeds, at http://www.cybaea.net/journal/2012/03/13/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM/. The code first creates a table of the statistics available for the product line, then compares the pricing to actual statistics in use.

The first part of the code that accumulates the data is as follows:

con <- url("http://www2.math.su.se/~esbj/GLMbook/moppe.sas")
data <- readLines(con, n = 200L, warn = FALSE, encoding = "unknown")
close(con)
## Find the data range
data.start <- grep("^cards;", data) + 1L
data.end   <- grep("^;", data[data.start:999L]) + data.start - 2L
table.1.2  <- read.table(text = data[data.start:data.end],
                         header = FALSE,
                         sep = "",
                         quote = "",
                         col.names = c("premiekl", "moptva", "zon", "dur",
                                       "medskad", "antskad", "riskpre", "helpre", "cell"),
                         na.strings = NULL,
                         colClasses = c(rep("factor", 3), "numeric",
                                        rep("integer", 4), "NULL"),
                         comment.char = "")
rm(con, data, data.start, data.end)
# Remainder of Script adds comments/descriptions
comment(table.1.2) <-
  c("Title: Partial casco moped insurance from Wasa insurance, 1994--1999",
    "Source: http://www2.math.su.se/~esbj/GLMbook/moppe.sas",
    "Copyright: http://www2.math.su.se/~esbj/GLMbook/")
## See the SAS code for this derived field
table.1.2$skadfre = with(table.1.2, antskad / dur)
## English language column names as comments:
comment(table.1.2$premiekl) <-
  c("Name: Class",
    "Code: 1=Weight over 60kg and more than 2 gears",
    "Code: 2=Other")
comment(table.1.2$moptva)   <-
  c("Name: Age",
    "Code: 1=At most 1 year",
    "Code: 2=2 years or more")
comment(table.1.2$zon)      <-
  c("Name: Zone",
    "Code: 1=Central and semi-central parts of Sweden's three largest cities",
    "Code: 2=suburbs and middle-sized towns",
    "Code: 3=Lesser towns, except those in 5 or 7",
    "Code: 4=Small towns and countryside, except 5--7",
    "Code: 5=Northern towns",
    "Code: 6=Northern countryside",
    "Code: 7=Gotland (Sweden's largest island)")
comment(table.1.2$dur)      <-
  c("Name: Duration",
    "Unit: year")
comment(table.1.2$medskad)  <-
  c("Name: Claim severity",
    "Unit: SEK")
comment(table.1.2$antskad)  <- "Name: No. claims"
comment(table.1.2$riskpre)  <-
  c("Name: Pure premium",
    "Unit: SEK")
comment(table.1.2$helpre)   <-
  c("Name: Actual premium",
    "Note: The premium for one year according to the tariff in force 1999",
    "Unit: SEK")
comment(table.1.2$skadfre)  <-
  c("Name: Claim frequency",
    "Unit: /year")
## Save results for later
save(table.1.2, file = "table.1.2.RData")
## Print the table (not as pretty as the book)
print(table.1.2)

The resultant first 10 rows of the table are as follows:

   premiekl moptva zon    dur medskad antskad riskpre helpre    skadfre
1         1      1   1   62.9   18256      17    4936   2049 0.27027027
2         1      1   2  112.9   13632       7     845   1230 0.06200177
3         1      1   3  133.1   20877       9    1411    762 0.06761833
4         1      1   4  376.6   13045       7     242    396 0.01858736
5         1      1   5    9.4       0       0       0    990 0.00000000
6         1      1   6   70.8   15000       1     212    594 0.01412429
7         1      1   7    4.4    8018       1    1829    396 0.22727273
8         1      2   1  352.1    8232      52    1216   1229 0.14768532
9         1      2   2  840.1    7418      69     609    738 0.08213308
10        1      2   3 1378.3    7318      75     398    457 0.05441486

Then, we go through each product/statistics to determine whether the pricing for a product is in line with others. Note, the repos = clause on the install.packages statement is a fairly new addition to R:

# make sure the packages we want to use are installed
install.packages(c("data.table", "foreach", "ggplot2"), dependencies = TRUE, repos = "http://cran.us.r-project.org")
# load the data table we need
if (!exists("table.1.2"))
  load("table.1.2.RData")
library("foreach")
## We are looking to reproduce table 2.7 which we start building here,
## add columns for our results.
table27 <-
  data.frame(rating.factor =
               c(rep("Vehicle class", nlevels(table.1.2$premiekl)),
                 rep("Vehicle age",   nlevels(table.1.2$moptva)),
                 rep("Zone",          nlevels(table.1.2$zon))),
             class =
               c(levels(table.1.2$premiekl),
                 levels(table.1.2$moptva),
                 levels(table.1.2$zon)),
             stringsAsFactors = FALSE)
## Calculate duration per rating factor level and also set the
## contrasts (using the same idiom as in the code for the previous
## chapter). We use foreach here to execute the loop both for its
## side-effect (setting the contrasts) and to accumulate the sums.
# new.cols are set to claims, sums, levels
new.cols <-
  foreach (rating.factor = c("premiekl", "moptva", "zon"),
           .combine = rbind) %do%
{
  nclaims <- tapply(table.1.2$antskad, table.1.2[[rating.factor]], sum)
  sums <- tapply(table.1.2$dur, table.1.2[[rating.factor]], sum)
  n.levels <- nlevels(table.1.2[[rating.factor]])
  contrasts(table.1.2[[rating.factor]]) <-
    contr.treatment(n.levels)[rank(-sums, ties.method = "first"), ]
  data.frame(duration = sums, n.claims = nclaims)
}
table27 <- cbind(table27, new.cols)
rm(new.cols)
#build frequency distribution
model.frequency <-
  glm(antskad ~ premiekl + moptva + zon + offset(log(dur)),
      data = table.1.2, family = poisson)
rels <- coef( model.frequency )
rels <- exp( rels[1] + rels[-1] ) / exp( rels[1] )
table27$rels.frequency <-
    c(c(1, rels[1])[rank(-table27$duration[1:2], ties.method = "first")],
    c(1, rels[2])[rank(-table27$duration[3:4], ties.method = "first")],
    c(1, rels[3:8])[rank(-table27$duration[5:11], ties.method = "first")])
# note the severities involved
model.severity <-
  glm(medskad ~ premiekl + moptva + zon,
      data = table.1.2[table.1.2$medskad > 0, ],
      family = Gamma("log"), weights = antskad)
rels <- coef( model.severity )
rels <- exp( rels[1] + rels[-1] ) / exp( rels[1] )
## Aside: For the canonical link function use
## rels <- rels[1] / (rels[1] + rels[-1])
table27$rels.severity <-
    c(c(1, rels[1])[rank(-table27$duration[1:2], ties.method = "first")],
    c(1, rels[2])[rank(-table27$duration[3:4], ties.method = "first")],
    c(1, rels[3:8])[rank(-table27$duration[5:11], ties.method = "first")])
table27$rels.pure.premium <- with(table27, rels.frequency * rels.severity)
print(table27, digits = 2)

The resultant display is as follows:

   rating.factor class duration n.claims rels.frequency rels.severity
1  Vehicle class     1     9833      391           1.00          1.00
2  Vehicle class     2     8825      395           0.78          0.55
11   Vehicle age     1     1918      141           1.55          1.79
21   Vehicle age     2    16740      645           1.00          1.00
12          Zone     1     1451      206           7.10          1.21
22          Zone     2     2486      209           4.17          1.07
3           Zone     3     2889      132           2.23          1.07
4           Zone     4    10069      207           1.00          1.00
5           Zone     5      246        6           1.20          1.21
6           Zone     6     1369       23           0.79          0.98
7           Zone     7      148        3           1.00          1.20
   rels.pure.premium
1               1.00
2               0.42
11              2.78
21              1.00
12              8.62
22              4.48
3               2.38
4               1.00
5               1.46
6               0.78
7               1.20

Here, we can see that some entries (rows 2 and 6) are priced very low in comparison to the statistics for that vehicle, whereas others are overpriced (rows 12 and 22).
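The relativities in this table come from exponentiating the GLM coefficients, as in the rels lines of the R script: with a log link, exp(intercept + coefficient) / exp(intercept) is the multiplicative effect of a level against the base level, and the pure premium relativity is the product of the frequency and severity relativities. Here is a small Python sketch of that arithmetic; the coefficient values are hypothetical and are not fitted to the moped data.

```python
import math


def relativities(coefficients):
    """Turn log-link GLM coefficients into relativities against the base level.

    coefficients[0] is the intercept; the rest are level effects.
    """
    intercept = coefficients[0]
    return [math.exp(intercept + c) / math.exp(intercept) for c in coefficients[1:]]


# Hypothetical coefficients, for illustration only.
frequency_coefs = [-3.0, -0.25, 0.45]
severity_coefs = [9.5, -0.60, 0.58]

rels_frequency = relativities(frequency_coefs)
rels_severity = relativities(severity_coefs)
# Pure premium relativity = frequency relativity * severity relativity.
rels_pure_premium = [f * s for f, s in zip(rels_frequency, rels_severity)]
```

The intercept cancels out, so each relativity is simply the exponential of its coefficient, and the base level always has a relativity of 1.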

Consumer products, R - marketing effectiveness

We take the example from a presentation I made at www.dantoomeysoftware.com/Using_R_for_Marketing_Research.pptx, looking at the effectiveness of different ad campaigns for grape fruit juice.

The code is as follows:

#library(s20x)
library(car)
#read the dataset from an existing .csv file
df <- read.csv("C:/Users/Dan/grapeJuice.csv",header=T)
#list the name of each variable (data column) and the first six rows of the dataset
head(df)
# basic statistics of the variables
summary(df)
#set the 1 by 2 layout plot window
par(mfrow = c(1,2))
# boxplot to check if there are outliers
boxplot(df$sales,horizontal = TRUE, xlab="sales")
# histogram to explore the data distribution shape
hist(df$sales,main="",xlab="sales",prob=T)
lines(density(df$sales),lty="dashed",lwd=2.5,col="red")
#divide the dataset into two sub dataset by ad_type
sales_ad_nature = subset(df,ad_type==0)
sales_ad_family = subset(df,ad_type==1)
#calculate the mean of sales with different ad_type
mean(sales_ad_nature$sales)
mean(sales_ad_family$sales)
#set the 1 by 2 layout plot window
par(mfrow = c(1,2))
# histogram to explore the data distribution shapes
hist(sales_ad_nature$sales,main="",xlab="sales with nature production theme ad",prob=T)
lines(density(sales_ad_nature$sales),lty="dashed",lwd=2.5,col="red")
hist(sales_ad_family$sales,main="",xlab="sales with family health caring theme ad",prob=T)
lines(density(sales_ad_family$sales),lty="dashed",lwd=2.5,col="red")

With output (several sections):

(raw data from file, first 6 rows):

  sales price ad_type price_apple price_cookies
1   222  9.83       0        7.36           8.8
2   201  9.72       1        7.43          9.62
3   247 10.15       1        7.66           8.9
4   169 10.04       0        7.57         10.26
5   317  8.38       1        7.33          9.54
6   227  9.74       0        7.51          9.49

Statistics on the data are as follows:

     sales           price           ad_type     price_apple
 Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300
 1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438
 Median :204.5   Median : 9.855   Median :0.5   Median :7.580
 Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659
 3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805
 Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290
 price_cookies
 Min.   : 8.790
 1st Qu.: 9.190
 Median : 9.515
 Mean   : 9.622
 3rd Qu.:10.140
 Max.   :10.580

The data shows the effectiveness of each campaign. Family sales are more effective:

  • 186.666666666667 (mean of nature sales)
  • 246.666666666667 (mean of family sales)

The difference is more pronounced on the histogram displays:
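The same split-and-average step can be sketched in Python. This example assumes only the six sample rows shown earlier rather than the full dataset, so its means differ from the full-data figures above; the function name mean_sales_by_ad_type is my own.

```python
from statistics import mean

# The six sample rows shown earlier, as (sales, ad_type) pairs.
sample_rows = [(222, 0), (201, 1), (247, 1), (169, 0), (317, 1), (227, 0)]


def mean_sales_by_ad_type(rows):
    """Return {ad_type: mean sales} for an iterable of (sales, ad_type) pairs."""
    groups = {}
    for sales, ad_type in rows:
        groups.setdefault(ad_type, []).append(sales)
    return {ad_type: mean(values) for ad_type, values in groups.items()}


sample_means = mean_sales_by_ad_type(sample_rows)
print(sample_means)
```

Even on this small sample, the family-themed ad (ad_type 1) averages higher sales (255) than the nature-themed ad (206), the same direction as the full-data result.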

Using Docker with Jupyter

Docker is a mechanism that allows you to have many complete virtual instances of an application in one machine. Docker is used by many software firms to provide a fully scalable implementation of their services, and support as many concurrent users as needed.

Prior mechanisms for dealing with multiple instances shared common resources (such as disk address space). Under Docker, each instance is a complete entity separate from all others.

Implementing Jupyter on a Docker environment allows multiple users to access their own Jupyter instance, without having to worry about interfering with someone else's calculations.

The key feature of Docker is allowing for a variable number of instances of your notebook to be in use at any time. The Docker control system can be set up to create new instances for every user that accesses your notebook. All of this is built into Docker without programming; just use the user interface to decide how to create instances.

There are two ways you can use Docker:

  • From a public service
  • Installing Docker on your machine

Using a public Docker service

There are several services out there. I think they work pretty much the same way: sign up for the service, upload your notebook, and monitor usage (the Docker control program tracks usage automatically). For example, if we use https://hub.docker.com/, we are really using a versioned repository for the Docker image that holds our notebook. Versioning is used in software development for tracking changes that are made over time. The repository also allows different access privileges for multiple users:

  1. First, sign up. This provides authentication to the service vendor.
  2. Create a repository—where you will keep your version of the notebook.
  3. You will need Docker installed on your machine to pull/push notebooks from/to your repository.

Note

Installing Docker is operating system dependent. Go to the https://www.docker.com/ home page for instructions for your machine.

  4. Upload (push) your Jupyter image to your repository.
  5. Access your notebook in the repository. You can share the address (URL) of your notebook with others under control of Docker, granting specific access rights to different users.
  6. From then on, it will work just as if it were running locally.
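The push and pull steps above can be sketched as a sequence of Docker command lines. This is only an illustration under stated assumptions: the image name my-jupyter and the repository exampleuser/my-notebooks are hypothetical placeholders, and we assume the image starts Jupyter on port 8888 inside the container. The helper functions are our own and are not part of Docker.

```python
import subprocess


def push_commands(local_image, repository, tag="latest"):
    """Commands to authenticate, tag a local image, and push it to a repository."""
    remote = f"{repository}:{tag}"
    return [
        ["docker", "login"],                    # authenticate with the service
        ["docker", "tag", local_image, remote], # name the image for the repository
        ["docker", "push", remote],             # upload the image
    ]


def pull_and_run_commands(repository, tag="latest", host_port=8888):
    """Commands to download an image and run it, exposing Jupyter's port."""
    remote = f"{repository}:{tag}"
    return [
        ["docker", "pull", remote],
        # Assumes Jupyter listens on port 8888 inside the container.
        ["docker", "run", "-p", f"{host_port}:8888", remote],
    ]


def run_all(commands):
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical names, shown without executing anything.
for command in push_commands("my-jupyter", "exampleuser/my-notebooks"):
    print(" ".join(command))
```

Calling run_all(push_commands(...)) would execute the commands, which requires Docker to be installed and an account on the service.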

Installing Docker on your machine

Docker on your local machine would only be a precursor to posting on a public Docker service, unless the machine you are installing Docker on is accessible by others.

Note

Another option is to have Docker installed on your machine. It works exactly like the previous case, except you are managing the Docker image space.

How to share notebooks with others

There are several ways to share Jupyter Notebooks with others:

  • Email
  • Place onto Google Drive
  • Share on GitHub
  • Store as HTML on a web server
  • Install Jupyter on a web server

Can you email a notebook?

In order to email your notebook, the notebook must be packaged in a mail-friendly format, sent to the recipient, and then converted back to the notebook format by the recipient. (A notebook .ipynb file is itself JSON text rather than a binary file, so it can also be sent as an ordinary attachment.)

Email attachments are normally converted to a well-defined MIME (Multipurpose Internet Mail Extensions) format. A program called nb2mail converts a notebook to a MIME message. The program is available at https://github.com/nfultz/nb2mail.

Usage is as follows:

  • Install nb2mail using the pip command (see website)
  • Convert your selected notebook to MIME format
  • Send it to the recipient
  • The recipient's MIME conversion process will store the file in the correct format (assuming they have also installed nb2mail)
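To make the MIME idea concrete, here is a minimal sketch that attaches a notebook to an email message using only Python's standard library. It does not use nb2mail and is not how nb2mail works internally; it simply shows a notebook travelling as a MIME attachment. The addresses and the tiny notebook are hypothetical, and the message is built but not sent.

```python
import json
from email.message import EmailMessage


def build_notebook_email(sender, recipient, notebook_name, notebook):
    """Return an email message with the notebook attached as JSON."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = recipient
    message["Subject"] = f"Notebook: {notebook_name}"
    message.set_content("The notebook is attached.")
    # A notebook is a JSON document; attach its bytes under the notebook MIME type.
    message.add_attachment(
        json.dumps(notebook).encode("utf-8"),
        maintype="application",
        subtype="x-ipynb+json",
        filename=notebook_name,
    )
    return message


# A hypothetical, empty notebook in the version 4 JSON layout.
example_notebook = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 2}

message = build_notebook_email(
    "author@example.com", "reader@example.com", "example.ipynb", example_notebook
)

# The recipient side: recover the notebook from the attachment.
attachment = next(message.iter_attachments())
recovered = json.loads(attachment.get_content())
print(attachment.get_filename(), recovered == example_notebook)
```

Sending the message would additionally require an SMTP server, for example through the standard smtplib module.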

Sharing a notebook on Google Drive

Google Drive can be used to store your notebook profile information. This might be used when combined with the previous emailing of a notebook to another user. The recipient could use a Google Drive profile that would preclude anyone without the profile information from interacting with the notebook.

You install the Python extension (from https://github.com/jupyter/jupyter-drive) using pip and then python -m. From then on, you access the notebooks with the Google Drive profiles, as ipython notebook -profile <profilename>.

Sharing on GitHub

GitHub (and others) allow you to place a notebook on their servers that, once there, can be accessed directly using the nbviewer. The server has the Python (and other language) support installed that is needed to display your notebook. The nbviewer gives read-only access to your notebook and is not interactive.

The nbviewer is available at https://github.com/jupyter/nbviewer. The site documents the specific parameters that need to be added to the ipython notebook command, such as the command to start the viewer.

Store as HTML on a web server

A built-in feature of notebooks is to export the notebook into different formats. One of those is HTML. In this manner, you could export the notebook into HTML and copy the file(s) onto your web server as changes are made.

The command is jupyter nbconvert <notebook name>.ipynb --to html.

Again, this would be a non-interactive, read-only version of your notebook.
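If you export regularly, the command above can be wrapped in a small script. The following is a sketch only: the helper functions and the file name analysis.ipynb are hypothetical, and running the export assumes the jupyter command is available on your path.

```python
import subprocess


def nbconvert_command(notebook_path, output_format="html"):
    """Build the nbconvert command line for one notebook."""
    return ["jupyter", "nbconvert", notebook_path, "--to", output_format]


def export_notebook(notebook_path, output_format="html"):
    """Run nbconvert; raises CalledProcessError if the conversion fails."""
    subprocess.run(nbconvert_command(notebook_path, output_format), check=True)


# Show the command for a hypothetical notebook without running it.
print(" ".join(nbconvert_command("analysis.ipynb")))
# prints: jupyter nbconvert analysis.ipynb --to html
```

Calling export_notebook("analysis.ipynb") would produce the HTML file, which you can then copy to your web server.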

Install Jupyter on a web server

Jupyter is deployed as a web application. If you have direct access to a web server, you could install Jupyter on the web server and create notebooks there. The notebooks would then be available to others in fully dynamic, interactive form.

Because you administer the web server, you also control access to it, and so can control who can access your notebook.

This is an advanced interaction that would require working with your webmaster to determine the correct approach.

How can you secure a notebook?

There are two aspects to security in Jupyter Notebooks:

  • Making sure only specific users can access your notebook
  • Making sure your notebook is not used to host malicious code

Access control

While many of the uses of Jupyter are solely for educating others, there are instances where the information being accessed is, and should remain, confidential. Jupyter allows you to put up barriers to entry to your notebook in several ways.

When we identify the user, we are authenticating that user. This is normally done by presenting a login challenge before allowing entry, where the user has to enter a username and password.

If the instance of Jupyter hosting your notebook is installed on a web server, you can use the web server's access control to limit access to your notebook. Further, most of the vendors that support notebook hosting provide a mechanism to limit access to specific users.

Malicious content

The other aspect of security is to make sure the contents of your notebooks are not malicious. You should make sure your notebook is safe, as follows:

  • Ensure that HTML is sanitized (look for malicious HTML code and neutralize it)
  • Do not allow your notebook to execute external JavaScript
  • Check that cell contents that may be malicious are challenged in a server environment
  • Sanitize the output of cells so that it does not produce unwanted effects on user machines
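As a rough illustration of the kind of check described above, the following sketch scans a notebook's JSON for cells whose Markdown source or HTML output contains a script tag. It is not a complete sanitizer and is not Jupyter's own mechanism; it only flags cells for review. The function names and the sample notebook are hypothetical.

```python
def as_text(value):
    """Notebook JSON stores text either as one string or as a list of lines."""
    return "".join(value) if isinstance(value, list) else str(value)


def find_script_cells(notebook):
    """Return the indices of cells whose Markdown or HTML output has a <script tag."""
    flagged = []
    for index, cell in enumerate(notebook.get("cells", [])):
        texts = []
        if cell.get("cell_type") == "markdown":
            texts.append(as_text(cell.get("source", "")))
        for output in cell.get("outputs", []):
            html = output.get("data", {}).get("text/html")
            if html is not None:
                texts.append(as_text(html))
        if any("<script" in text.lower() for text in texts):
            flagged.append(index)
    return flagged


# A hypothetical notebook: only the second cell carries a script in its output.
sample = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Sales report"]},
        {
            "cell_type": "code",
            "source": "display(widget)",
            "outputs": [
                {"output_type": "display_data",
                 "data": {"text/html": ["<script>alert(1)</script>"]}}
            ],
        },
        {
            "cell_type": "code",
            "source": "1 + 1",
            "outputs": [
                {"output_type": "execute_result", "data": {"text/plain": "2"}}
            ],
        },
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 2,
}

print(find_script_cells(sample))  # prints: [1]
```

A simple substring test like this is easy to evade, so treat it as a first pass before a proper HTML sanitizer rather than as a guarantee of safety.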