We can jump right in and see what Jupyter has to offer. A Jupyter screen looks like this:
Note
So, Jupyter is deployed as a website that can be accessed on your machine (or can be accessed like any other website across the internet).
We see the URL of the page, http://localhost:8888/tree. localhost is the standard hostname for your own machine, so the URL points at a web server running locally. The website we are accessing on the web server is in a tree display, which is the default. This matches the way projects are organized within Jupyter: objects are displayed in a tree layout, much like Windows File Explorer. The main page lists a number of projects; each project is its own subdirectory and contains a further breakdown of its content. Depending on where you start Jupyter, the existing contents of the current directory will be included in the display as well.
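To make the parts of that address concrete, here is a small sketch that splits the URL with Python's standard library. It only illustrates how the address is put together; nothing here is specific to Jupyter.

```python
from urllib.parse import urlsplit

# The address shown in the browser when Jupyter runs locally.
url = "http://localhost:8888/tree"
parts = urlsplit(url)

print(parts.hostname)  # host name: the local machine
print(parts.port)      # TCP port the web server listens on
print(parts.path)      # page being requested: the tree view
```

The host name and port identify the web server, and the path selects the tree display.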
On the web page, we have the soon to be familiar Jupyter logo and three tabs:
Files
Running
Clusters
The Files tab lists the objects available to Jupyter. The files used by Jupyter are stored as regular files on your disk. Jupyter provides context managers that know how to process the different types of files and programs you are using. You can see the Jupyter files when you use Windows Explorer to view your file contents (they have an .ipynb file extension). You can see non-Jupyter files listed in the Jupyter window as well.
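Because notebooks are ordinary files, they can be found with ordinary file tools. The following is a minimal sketch; the helper name list_notebooks is made up for this illustration and is not part of Jupyter.

```python
from pathlib import Path

def list_notebooks(directory="."):
    """Return the names of the notebook files (*.ipynb) in a directory, sorted."""
    return sorted(p.name for p in Path(directory).glob("*.ipynb"))

# Show the notebooks in the directory this script is run from.
print(list_notebooks("."))
```

Run from the directory where Jupyter was started, this should list the same notebook files that appear on the Files tab.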
The Running tab lists the notebooks that have been started. Jupyter keeps track of which notebooks are running. This tab allows you to control which notebooks are running at any time.
The Clusters tab is for environments where several machines are in use for running Jupyter.
Next, we see:
- A prompt: Select items to perform action
- An Upload button
- A New pull-down menu
- A Refresh icon
The prompt tells you that you can select multiple items and then perform the same action on all of them. Most of the following actions (in the menus) can be performed over a single item or a selected set of items.
The Upload button will present a prompt to select a file to upload to Jupyter. This would typically be used to move a data file into the project when Jupyter is running as a website in a remote location, where you can't just copy the file to the disk where Jupyter is running.
The New pull-down menu presents a list of choices of the different kinds of Jupyter projects (kernels) that are available:
We can see the list of objects that Jupyter knows how to create:
- Text File: Create a text file for use in this folder. For example, if the notebook were to import a file, you may create the file using this feature.
- Folder: Yes, just like in Windows File Explorer.
- Terminals Unavailable: Grayed out; this feature can be used in a *nix (Unix-like) environment.
- Notebooks: Grayed out; this is not really a file type, but a heading for the different types of notebooks that this installation knows how to create.
- Julia 0.4.5: Creates a Julia notebook where the coding is in the Julia language.
- Python 3: Creates a notebook where the coding is in the Python language. This is the default.
- R: Creates a notebook where the coding is in the R language.
- Depending on which kernels you have installed in your installation, you may see other notebook types listed.
If we started one of the notebooks (it would automatically be selected in the Jupyter object list) and now looked at the pulldown of actions against the objects selected we would see a display like the following:
We see that the menu action has changed to Rename, as that is the most likely action to be taken on one file, and we have an icon to delete the project as well (the trashcan icon).
The item count is now 1 (we have one object selected in the list), the icon for the one item is a filled-in blue square (denoting that it is a running project), and there is a familiar Home icon to bring us back to the Jupyter home page display in the previous screenshot.
The object's menu has choices for:
- Folders: select the folders available
- All Notebooks: select the Jupyter Notebooks
- Running: select the running Jupyter Notebooks
- Files: select the files in the directory
If we scroll down in the object display, we see a little different information in the list of objects available. Each of the objects listed has a type (denoted by the icon shape associated) and a name assigned by the user when it was created.
Each of the objects is a Jupyter project that can be accessed, shared, and moved on its own. Every project has a full name, as entered by the user creating the project, and an icon that portrays this entry as a project. We will see other Jupyter icons corresponding to other project components, as follows:
If we pull down the New menu and select Python 3, Jupyter would create a new Python notebook and move to display its contents. We would see a display like the following:
We have created a new Jupyter Notebook and are in its display. The logo is there. The title defaults to Untitled, which we can change by clicking on it. There is an (autosaved) marker that tells you Jupyter has automatically stored your notebook to disk (and will continue to do so regularly as you work on it).
We now have a menu bar and a denotation that this notebook is using Python 3 as its source language. The menu choices are:
- File: Standard file operations
- Edit: For editing cell contents (more to come)
- View: To change the display of the notebook
- Insert: To insert a cell in the notebook
- Cell: To change the format and usage of a cell
- Kernel: To adjust the kernel used for the notebook
- Help: To bring up the help system for Jupyter
The File menu has the following choices:
- New Notebook: Similar to the pull-down from the home page.
- Open...: Open a notebook.
- Make a Copy...: Copy a notebook.
- Rename...: Rename a notebook.
- Save and Checkpoint: Save the current notebook at a checkpoint. Checkpoints are specific points in a notebook's history that you want to keep, so that you can return to one if you change your mind about a recent set of changes.
- Print Preview: Similar to any print preview that you have used otherwise.
- Download as: Allows you to store the notebook in a variety of formats. The most notable formats would be PDF or HTML, which would allow you to share the notebook with users that do not have access to Jupyter.
- Trusted Notebook: (The feature is grayed out.) When a notebook is opened by a user, the server computes a signature with the user's key and compares it with the signature stored in the notebook's metadata. If the signature matches, HTML and JavaScript output in the notebook will be trusted at load; otherwise it will be untrusted.
- Close and Halt: Close the current notebook and stop it running in the Jupyter system.
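The signature check described under Trusted Notebook can be pictured with a keyed hash. The sketch below is a simplification under stated assumptions: the key, the toy notebook text, and the function names are all made up for illustration, and it does not reproduce how Jupyter actually computes or stores its signatures.

```python
import hashlib
import hmac

def sign_notebook(notebook_text, key):
    """Return a hex HMAC-SHA256 signature of the notebook text (illustrative only)."""
    return hmac.new(key, notebook_text.encode("utf-8"), hashlib.sha256).hexdigest()

def is_trusted(notebook_text, key, stored_signature):
    """True if the signature recomputed with the user's key matches the stored one."""
    return hmac.compare_digest(sign_notebook(notebook_text, key), stored_signature)

key = b"hypothetical-user-key"             # assumption: stands in for the user's secret key
notebook = '{"cells": [], "nbformat": 4}'  # assumption: toy notebook content
stored = sign_notebook(notebook, key)

print(is_trusted(notebook, key, stored))        # notebook unchanged since signing
print(is_trusted(notebook + " ", key, stored))  # content modified after signing
```

The point of using a key is that someone who edits the notebook cannot produce a matching signature without it.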
The Edit menu has the following choices:
- Cut Cells: Typical cut operation.
- Copy Cells: Copies the selected cells to a memory buffer for later pasting into another location in the notebook, as in the usual GUI copy operation.
- Paste Cells Above: If you have selected a cell and have copied a cell, this option will not be grayed out and will paste the buffered cell above the current cell.
- Paste Cells Below: Similar to the previous option.
- Delete Cells: Will delete the selected cells.
- Undo Delete Cells.
- Split Cell: There is a style issue here, regarding how many statements you put into a cell. Many times, you will start with one cell containing a number of statements and split that cell up many times to break off individual statements or groups of statements into their own cells.
- Merge Cell Above: Combine the current cell with the one above it.
- Merge Cell Below: Similar to the previous option.
- Move Cell Up: Move the current cell before the one above it.
- Move Cell Down.
- Edit Notebook Metadata: For advanced users to modify the internal metadata that Jupyter keeps for your notebook.
- Find and Replace: Locate specific text within cells and possibly replace it.
The View menu has the following choices:
- Toggle Header: Toggle the display of the Jupyter header
- Toggle Toolbar: Toggle the display of the Jupyter toolbar
- Cell Toolbar: Change the displayed items for the cell being edited:
  - None: Don't display a cell toolbar
  - Edit Metadata: Edit a cell's metadata directly
  - Raw Cell Format: Edit the cell raw format as used by Jupyter
  - Slideshow: Walk through the cells in a slideshow manner
The Insert menu has the following choices:
- Insert Cell Above: Insert a new, empty cell above the current cell
- Insert Cell Below: Same as the previous one, but below the current cell
The Cell menu has the following choices:
- Run Cells: Runs the selected cells
- Run Cells and Select Below: Runs the selected cells and then selects the cell below
- Run Cells and Insert Below: Runs the selected cells and adds a blank cell below
- Run All: Runs all of the cells
- Run All Above: Runs all of the cells above the current
- Run All Below: Runs all of the cells below the current
- Cell Type: Changes the type of the selected cell(s) to:
  - Code: This is the default; the cell would expect to have language statements
  - Markdown: The cell contains Markdown text (which may include HTML), typically used to present the notebook in the best manner (as it is a website, it has all of HTML available to it)
  - Raw NBConvert: This is an internal Jupyter format, basically plain text
- Current Outputs: Whether to clear or keep the outputs of the selected cells
- All Output: The same choices, applied to the outputs of all cells
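The three cell types are also visible in the notebook file itself, which is stored as JSON. The sketch below builds a tiny notebook by hand; the field names follow my understanding of the version 4 notebook format and the cell contents are invented, so treat it as an illustration rather than a reference.

```python
import json

# Assumption: a hand-written, minimal notebook in the version 4 file layout.
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "# A heading"},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": "print('hello')"},
        {"cell_type": "raw", "metadata": {}, "source": "passed through unchanged"},
    ],
}

text = json.dumps(notebook, indent=1)  # what would be written to a .ipynb file
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Changing the Cell Type in the menu amounts to changing the cell_type value of the selected cell.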
The Kernel menu is used to control the underlying language engine used by the notebook. The menu choices are as follows. I think many of the choices in this menu are used very little:
- Interrupt: Interrupts whatever the underlying language engine is currently running, leaving the engine itself alive
- Restart: Restarts the underlying language engine
- Restart & Clear Output
- Restart & Run All
- Reconnect: Re-establishes the connection between the notebook and its kernel, for example if the connection has been lost
- Change kernel: Changes the language used in this notebook to one available in your installation
The help menu displays the help options for Jupyter and language context choices. For example, in our Python notebook we see choices for common Python libraries that may be used:
Just below the regular menu is an icon toolbar with many of the commonly used menu items for faster use, as seen in this view:
The icons correspond to the previous menu choices (listed in order of appearance):
- File/Save the current notebook
- Insert cell below
- Cut current cells
- Copy the current cells
- Paste cells below
- Move selected cells up
- Move selected cells down
- Run from selected cells down
- Interrupt the kernel
- Restart kernel
- List of formats we can apply to the current cells
- An icon to open a command palette with descriptive names
- An icon to open the cell toolbar
If we were to provide a name for the notebook, enter a simple Python script, and execute the notebook cells, we would see a display like this:
The script is:
    name = "Dan Toomey"
    state = "MA"
    print(name + " lives in " + state)
We assign a value to the name and state variables and then print them out.
If you notice, I have placed the statements into two different cells. This is just for readability. They could all be in the same cell or three different cells.
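Each code cell is labeled with a number. This is an execution count rather than a line number: it starts at 1 for the first cell you run and grows every time a cell is executed, so re-running or moving cells around can leave the first cell with a higher number (as you can see, the first cell is labeled cell 2 in the display).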
Below the second cell, we have non-editable display results. Jupyter always displays any corresponding output of a cell just below. This could include error information as well.
This book is about Jupyter and data science. We have the introduction to Jupyter. Now, we can look at data science practices and then see how the two concepts work together.
Data science is used in many industries. It is interesting to note the predominant technologies involved and algorithms used by industry. We can see the same technologies available within Jupyter.
Some of the industries that are larger users of data science include:
Industry | Larger data science use | Technology/algorithms |
Finance | Hedge funds | Python |
Gambling | Establish odds | R |
Insurance | Measure and price risk | Domino (R) |
Retail banking | Risk, customer analytics, product analytics | R |
Mining | Smart exploration, yield optimization | Python |
Consumer products | Pricing and distribution | R |
Healthcare | Drug discovery and trials | Python |
In this section we see several examples taken from current industry focus and apply them in Jupyter to ensure its utility.
There is an example of this at https://www.safaribooksonline.com/library/view/python-for-finance/9781491945360/ch03.html, which is taken from the book Python for Finance by Yves Hilpisch. The model used is fairly standard for finance work.
We want to arrive at the theoretical value of a call option. A call option is the right to buy a security, such as IBM stock, at a specific (strike) price within a certain time frame. The option is priced based on the riskiness or volatility of the security in relation to the strike price and current price. The example uses a European option, which can only be exercised at maturity; this simplifies the problem set.
The example uses the Black-Scholes model for option valuation, where we have:
- Initial stock index level S0 = 100
- Strike price of the European call option K = 105
- Time-to-maturity T = 1 year
- Constant, riskless short rate r = 5%
- Constant volatility σ = 20%
These elements make up the following formula:
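The formula itself is not reproduced here. Judging from the script that follows, it is presumably the index level at maturity, where z is a standard normally distributed random variable:

```latex
S_T = S_0 \exp\left(\left(r - \tfrac{1}{2}\sigma^2\right) T + \sigma \sqrt{T}\, z\right)
```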
The algorithm used is as follows:
- Draw I (pseudo) random numbers from the standard normal distribution.
- Calculate all resulting index levels at maturity ST(i) for given z(i) in the previous equation. Calculate all inner values of the option at maturity as hT(i) = max(ST(i) - K,0).
- Estimate the option present value via the Monte Carlo estimator given in the following equation:
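The estimator is likewise not reproduced here. Reading it off the script that follows, it is presumably the discounted average of the simulated inner values:

```latex
C_0 \approx e^{-rT} \frac{1}{I} \sum_{i=1}^{I} h_T(i), \qquad h_T(i) = \max\left(S_T(i) - K,\ 0\right)
```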
The script is as follows. We use numpy for the intense mathematics used. The rest of the coding is typical:
    import numpy as np

    # set parameters
    S0 = 100.
    K = 105.
    T = 1.0
    r = 0.05
    sigma = 0.2

    # how many samples we are using
    I = 100000

    # seed the generator so the results are reproducible
    np.random.seed(103)
    z = np.random.standard_normal(I)
    ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
    hT = np.maximum(ST - K, 0)
    C0 = np.exp(-r * T) * np.sum(hT) / I

    # tell user results
    print ("Value of the European Call Option %5.3f" % C0)
The results under Jupyter are as shown in the following screenshot:
The 8.071 value is close to the published expected value of 8.019; the difference is due to variance in the random numbers used. (I am seeding the random number generator to have reproducible results.)
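As a further sanity check, the simulated value can be compared with the closed-form Black-Scholes price for the same parameters. This sketch is not part of the cited example; it uses only the standard library, and the function names are chosen for this illustration.

```python
from math import erf, exp, log, sqrt

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call_value(S0, K, T, r, sigma):
    """Closed-form Black-Scholes value of a European call option."""
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S0 * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

# Same parameters as the simulation above
value = bs_call_value(S0=100.0, K=105.0, T=1.0, r=0.05, sigma=0.2)
print("Closed-form value of the European Call Option %5.3f" % value)
```

The closed-form value comes out at roughly 8.02, so both simulated figures sit within sampling error of it.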
Another algorithm in popular use is Monte Carlo simulation. In Monte Carlo, as the name of the gambling resort implies, we simulate a number of chances taken in a scenario where we know the percentage outcomes of the different results, but do not know exactly what will happen in the next N chances. We can see this model being used at http://www.codeandfinance.com/pricing-options-monte-carlo.html. In this example, we are using Black-Scholes again, but in a different direct method where we see individual steps.
The coding is as follows. Note the changed imports near the top of the code: rather than just pulling in the functions you want from a library, you pull in the entire library and the coding refers to what is needed through the library name:
    import datetime
    import random  # provides gauss
    import math    # provides exp, sqrt

    random.seed(103)

    def generate_asset_price(S,v,r,T):
        # one simulated price at maturity
        return S * math.exp((r - 0.5 * v**2) * T + v * math.sqrt(T) * random.gauss(0,1.0))

    def call_payoff(S_T,K):
        return max(0.0,S_T-K)

    S = 857.29 # underlying price
    v = 0.2076 # vol of 20.76%
    r = 0.0014 # rate of 0.14%
    T = (datetime.date(2013,9,21) - datetime.date(2013,9,3)).days / 365.0
    K = 860.
    simulations = 90000
    payoffs = []
    discount_factor = math.exp(-r * T)

    for i in range(simulations):
        S_T = generate_asset_price(S,v,r,T)
        payoffs.append( call_payoff(S_T, K) )

    price = discount_factor * (sum(payoffs) / float(simulations))
    print ('Price: %.4f' % price)
The results under Jupyter are shown as follows:
The result price of 14.4452 is close to the published value 14.5069.
Some of the gambling games are really coin flips, with 50/50 chances of success. Along those lines we have coding from http://forumserver.twoplustwo.com/25/probability/flipping-coins-getting-3-row-1233506/ that determines the probability of a series of heads or tails in a coin flip, with a trigger that can be used if you know the coin/game is biased towards one result or the other.
We have the following script:
    ##############################################
    # Biased/unbiased recursion of heads OR tails
    ##############################################
    import numpy as np
    import math

    N = 14   # number of flips
    m = 3    # length of run (must be > 1 and <= N/2)
    p = 0.5  # P(heads)

    # index n holds the value after n flips, so we need entries 0..N
    prob = np.repeat(0.0,N+1)
    h = np.repeat(0.0,N+1)
    t = np.repeat(0.0,N+1)

    h[m] = math.pow(p,m)
    t[m] = math.pow(1-p,m)
    prob[m] = h[m] + t[m]

    for n in range(m+1,2*m):
        h[n] = (1-p)*math.pow(p,m)
        t[n] = p*math.pow(1-p,m)
        prob[n] = prob[n-1] + h[n] + t[n]

    for n in range(2*m,N+1):
        h[n] = ((1-p) - t[n-m] - prob[n-m-1]*(1-p))*math.pow(p,m)
        t[n] = (p - h[n-m] - prob[n-m-1]*p)*math.pow(1-p,m)
        prob[n] = prob[n-1] + h[n] + t[n]

    # probability of a run within N flips
    prob[N]
The preceding code produces the following output in Jupyter:
We end up with the probability of getting a run of three heads or three tails in a row with an unbiased game. In this case, there is about a 92% chance within 14 flips.
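One independent way to sanity-check that figure for a fair coin is brute force: enumerate every possible sequence of 14 flips and count those containing a run of three. This sketch is not from the cited forum post; it is a slower method, feasible only for small N, and the function names are made up for the illustration.

```python
from itertools import product

def has_run(flips, m):
    """True if the sequence contains m or more identical results in a row."""
    longest = current = 1
    for previous, this in zip(flips, flips[1:]):
        current = current + 1 if this == previous else 1
        longest = max(longest, current)
    return longest >= m

def run_probability(N, m):
    """Probability of a run of m heads or m tails in N flips of a fair coin."""
    sequences = list(product("HT", repeat=N))
    return sum(has_run(s, m) for s in sequences) / len(sequences)

print(run_probability(14, 3))
```

For 14 flips and runs of three, the enumeration gives a value between 0.92 and 0.93, in line with the chance quoted above.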
We have an example of using R to come up with the pricing for non-life products, specifically mopeds, at http://www.cybaea.net/journal/2012/03/13/R-code-for-Chapter-2-of-Non_Life-Insurance-Pricing-with-GLM/. The code first creates a table of the statistics available for the product line, then compares the pricing to actual statistics in use.
The first part of the code that accumulates the data is as follows:
    con <- url("http://www2.math.su.se/~esbj/GLMbook/moppe.sas")
    data <- readLines(con, n = 200L, warn = FALSE, encoding = "unknown")
    close(con)

    ## Find the data range
    data.start <- grep("^cards;", data) + 1L
    data.end <- grep("^;", data[data.start:999L]) + data.start - 2L
    table.1.2 <- read.table(text = data[data.start:data.end],
                            header = FALSE,
                            sep = "",
                            quote = "",
                            col.names = c("premiekl", "moptva", "zon", "dur",
                                          "medskad", "antskad", "riskpre", "helpre", "cell"),
                            na.strings = NULL,
                            colClasses = c(rep("factor", 3), "numeric",
                                           rep("integer", 4), "NULL"),
                            comment.char = "")
    rm(con, data, data.start, data.end)

    # Remainder of Script adds comments/descriptions
    comment(table.1.2) <-
        c("Title: Partial casco moped insurance from Wasa insurance, 1994--1999",
          "Source: http://www2.math.su.se/~esbj/GLMbook/moppe.sas",
          "Copyright: http://www2.math.su.se/~esbj/GLMbook/")

    ## See the SAS code for this derived field
    table.1.2$skadfre = with(table.1.2, antskad / dur)

    ## English language column names as comments:
    comment(table.1.2$premiekl) <-
        c("Name: Class",
          "Code: 1=Weight over 60kg and more than 2 gears",
          "Code: 2=Other")
    comment(table.1.2$moptva) <-
        c("Name: Age",
          "Code: 1=At most 1 year",
          "Code: 2=2 years or more")
    comment(table.1.2$zon) <-
        c("Name: Zone",
          "Code: 1=Central and semi-central parts of Sweden's three largest cities",
          "Code: 2=suburbs and middle-sized towns",
          "Code: 3=Lesser towns, except those in 5 or 7",
          "Code: 4=Small towns and countryside, except 5--7",
          "Code: 5=Northern towns",
          "Code: 6=Northern countryside",
          "Code: 7=Gotland (Sweden's largest island)")
    comment(table.1.2$dur) <-
        c("Name: Duration",
          "Unit: year")
    comment(table.1.2$medskad) <-
        c("Name: Claim severity",
          "Unit: SEK")
    comment(table.1.2$antskad) <- "Name: No. claims"
    comment(table.1.2$riskpre) <-
        c("Name: Pure premium",
          "Unit: SEK")
    comment(table.1.2$helpre) <-
        c("Name: Actual premium",
          "Note: The premium for one year according to the tariff in force 1999",
          "Unit: SEK")
    comment(table.1.2$skadfre) <-
        c("Name: Claim frequency",
          "Unit: /year")

    ## Save results for later
    save(table.1.2, file = "table.1.2.RData")

    ## Print the table (not as pretty as the book)
    print(table.1.2)
The resultant first 10 rows of the table are as follows:
       premiekl moptva zon    dur medskad antskad riskpre helpre    skadfre
    1         1      1   1   62.9   18256      17    4936   2049 0.27027027
    2         1      1   2  112.9   13632       7     845   1230 0.06200177
    3         1      1   3  133.1   20877       9    1411    762 0.06761833
    4         1      1   4  376.6   13045       7     242    396 0.01858736
    5         1      1   5    9.4       0       0       0    990 0.00000000
    6         1      1   6   70.8   15000       1     212    594 0.01412429
    7         1      1   7    4.4    8018       1    1829    396 0.22727273
    8         1      2   1  352.1    8232      52    1216   1229 0.14768532
    9         1      2   2  840.1    7418      69     609    738 0.08213308
    10        1      2   3 1378.3    7318      75     398    457 0.05441486
Then, we go through each product/statistic to determine whether the pricing for a product is in line with others. Note, the repos = clause on the install.packages statement is a fairly new addition to R:
    # make sure the packages we want to use are installed
    install.packages(c("data.table", "foreach", "ggplot2"),
                     dependencies = TRUE,
                     repos = "http://cran.us.r-project.org")

    # load the data table we need
    if (!exists("table.1.2")) load("table.1.2.RData")
    library("foreach")

    ## We are looking to reproduce table 2.7 which we start building here,
    ## add columns for our results.
    table27 <-
        data.frame(rating.factor =
                       c(rep("Vehicle class", nlevels(table.1.2$premiekl)),
                         rep("Vehicle age", nlevels(table.1.2$moptva)),
                         rep("Zone", nlevels(table.1.2$zon))),
                   class =
                       c(levels(table.1.2$premiekl),
                         levels(table.1.2$moptva),
                         levels(table.1.2$zon)),
                   stringsAsFactors = FALSE)

    ## Calculate duration per rating factor level and also set the
    ## contrasts (using the same idiom as in the code for the previous
    ## chapter). We use foreach here to execute the loop both for its
    ## side-effect (setting the contrasts) and to accumulate the sums.
    # new.cols are set to claims, sums, levels
    new.cols <-
        foreach (rating.factor = c("premiekl", "moptva", "zon"),
                 .combine = rbind) %do%
        {
            nclaims <- tapply(table.1.2$antskad, table.1.2[[rating.factor]], sum)
            sums <- tapply(table.1.2$dur, table.1.2[[rating.factor]], sum)
            n.levels <- nlevels(table.1.2[[rating.factor]])
            contrasts(table.1.2[[rating.factor]]) <-
                contr.treatment(n.levels)[rank(-sums, ties.method = "first"), ]
            data.frame(duration = sums, n.claims = nclaims)
        }
    table27 <- cbind(table27, new.cols)
    rm(new.cols)

    #build frequency distribution
    model.frequency <-
        glm(antskad ~ premiekl + moptva + zon + offset(log(dur)),
            data = table.1.2, family = poisson)
    rels <- coef( model.frequency )
    rels <- exp( rels[1] + rels[-1] ) / exp( rels[1] )
    table27$rels.frequency <-
        c(c(1, rels[1])[rank(-table27$duration[1:2], ties.method = "first")],
          c(1, rels[2])[rank(-table27$duration[3:4], ties.method = "first")],
          c(1, rels[3:8])[rank(-table27$duration[5:11], ties.method = "first")])

    # note the severities involved
    model.severity <-
        glm(medskad ~ premiekl + moptva + zon,
            data = table.1.2[table.1.2$medskad > 0, ],
            family = Gamma("log"), weights = antskad)
    rels <- coef( model.severity )
    rels <- exp( rels[1] + rels[-1] ) / exp( rels[1] )
    ## Aside: For the canonical link function use
    ## rels <- rels[1] / (rels[1] + rels[-1])
    table27$rels.severity <-
        c(c(1, rels[1])[rank(-table27$duration[1:2], ties.method = "first")],
          c(1, rels[2])[rank(-table27$duration[3:4], ties.method = "first")],
          c(1, rels[3:8])[rank(-table27$duration[5:11], ties.method = "first")])

    table27$rels.pure.premium <- with(table27, rels.frequency * rels.severity)
    print(table27, digits = 2)
The resultant display is as follows:
       rating.factor class duration n.claims rels.frequency rels.severity
    1  Vehicle class     1     9833      391           1.00          1.00
    2  Vehicle class     2     8825      395           0.78          0.55
    11   Vehicle age     1     1918      141           1.55          1.79
    21   Vehicle age     2    16740      645           1.00          1.00
    12          Zone     1     1451      206           7.10          1.21
    22          Zone     2     2486      209           4.17          1.07
    3           Zone     3     2889      132           2.23          1.07
    4           Zone     4    10069      207           1.00          1.00
    5           Zone     5      246        6           1.20          1.21
    6           Zone     6     1369       23           0.79          0.98
    7           Zone     7      148        3           1.00          1.20
       rels.pure.premium
    1               1.00
    2               0.42
    11              2.78
    21              1.00
    12              8.62
    22              4.48
    3               2.38
    4               1.00
    5               1.46
    6               0.78
    7               1.20
Here, we can see that some rating classes (rows 2 and 6) are priced very low in comparison to the statistics for that class, whereas others are overpriced (rows 12 and 22).
We take the example from a presentation I made at www.dantoomeysoftware.com/Using_R_for_Marketing_Research.pptx looking at the effectiveness of different ad campaigns for grape fruit juice.
The code is as follows:
    #library(s20x)
    library(car)

    #read the dataset from an existing .csv file
    df <- read.csv("C:/Users/Dan/grapeJuice.csv",header=T)

    #list the name of each variable (data column) and the first six rows of the dataset
    head(df)

    # basic statistics of the variables
    summary(df)

    #set the 1 by 2 layout plot window
    par(mfrow = c(1,2))

    # boxplot to check if there are outliers
    boxplot(df$sales,horizontal = TRUE, xlab="sales")

    # histogram to explore the data distribution shape
    hist(df$sales,main="",xlab="sales",prob=T)
    lines(density(df$sales),lty="dashed",lwd=2.5,col="red")

    #divide the dataset into two sub dataset by ad_type
    sales_ad_nature = subset(df,ad_type==0)
    sales_ad_family = subset(df,ad_type==1)

    #calculate the mean of sales with different ad_type
    mean(sales_ad_nature$sales)
    mean(sales_ad_family$sales)

    #set the 1 by 2 layout plot window
    par(mfrow = c(1,2))

    # histogram to explore the data distribution shapes
    hist(sales_ad_nature$sales,main="",xlab="sales with nature production theme ad",prob=T)
    lines(density(sales_ad_nature$sales),lty="dashed",lwd=2.5,col="red")
    hist(sales_ad_family$sales,main="",xlab="sales with family health caring theme ad",prob=T)
    lines(density(sales_ad_family$sales),lty="dashed",lwd=2.5,col="red")
With output (several sections):
(raw data from file, first six rows):
Row | sales | price | ad_type | price_apple | price_cookies |
1 | 222 | 9.83 | 0 | 7.36 | 8.8 |
2 | 201 | 9.72 | 1 | 7.43 | 9.62 |
3 | 247 | 10.15 | 1 | 7.66 | 8.9 |
4 | 169 | 10.04 | 0 | 7.57 | 10.26 |
5 | 317 | 8.38 | 1 | 7.33 | 9.54 |
6 | 227 | 9.74 | 0 | 7.51 | 9.49 |
Statistics on the data are as follows:
         sales           price           ad_type     price_apple
     Min.   :131.0   Min.   : 8.200   Min.   :0.0   Min.   :7.300
     1st Qu.:182.5   1st Qu.: 9.585   1st Qu.:0.0   1st Qu.:7.438
     Median :204.5   Median : 9.855   Median :0.5   Median :7.580
     Mean   :216.7   Mean   : 9.738   Mean   :0.5   Mean   :7.659
     3rd Qu.:244.2   3rd Qu.:10.268   3rd Qu.:1.0   3rd Qu.:7.805
     Max.   :335.0   Max.   :10.490   Max.   :1.0   Max.   :8.290
     price_cookies
     Min.   : 8.790
     1st Qu.: 9.190
     Median : 9.515
     Mean   : 9.622
     3rd Qu.:10.140
     Max.   :10.580
The data shows the effectiveness of each campaign. Family sales are more effective:
- 186.666666666667 (mean of nature sales)
- 246.666666666667 (mean of family sales)
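The same split-and-average step can be sketched in Python. The sketch uses only the rows listed in the raw-data table above, not the full file, so its means differ from the figures just quoted; the variable names are chosen for this illustration.

```python
from collections import defaultdict
from statistics import mean

# (sales, ad_type) pairs copied from the rows listed in the raw-data table;
# ad_type 0 = nature theme, 1 = family theme
rows = [(222, 0), (201, 1), (247, 1), (169, 0), (317, 1), (227, 0)]

# group the sales figures by ad type
sales_by_ad = defaultdict(list)
for sales, ad_type in rows:
    sales_by_ad[ad_type].append(sales)

# average each group
means = {ad_type: mean(values) for ad_type, values in sales_by_ad.items()}
print(means)
```

Even on this small sample the family-themed ad shows the higher mean (255 against 206).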
The difference is more pronounced on the histogram displays:
Docker is a mechanism that allows you to have many complete virtual instances of an application in one machine. Docker is used by many software firms to provide a fully scalable implementation of their services, and support as many concurrent users as needed.
Prior mechanisms for dealing with multiple instances shared common resources (such as disk address space). Under Docker, each instance is a complete entity separate from all others.
Implementing Jupyter on a Docker environment allows multiple users to access their own Jupyter instance, without having to worry about interfering with someone else's calculations.
The key feature of Docker is allowing for a variable number of instances of your notebook to be in use at any time. The Docker control system can be set up to create new instances for every user that accesses your notebook. All of this is built-in to Docker without programming; just use the user interface to decide how to create instances.
There are two ways you can use Docker:
- From a public service
- Installing Docker on your machine
There are several services out there. I think they work pretty much the same way: sign up for the service, upload your notebook, monitor usage (the Docker control program tracks usage automatically). For example, if we use https://hub.docker.com/ we are really using a version repository for our notebook. Versioning is used in software development for tracking changes that are made over time. This also allows for multiple user access privileges as well:
- First, sign up. This provides authentication to the service vendor.
- Create a repository—where you will keep your version of the notebook.
- You will need Docker installed on your machine to pull/push notebooks from/to your repository.
Note
Installing Docker is operating system dependent. Go to the https://www.docker.com/ home page for instructions for your machine.
- Upload (push) your Jupyter image to your repository.
- Access your notebook in the repository. You can share the address (URL) of your notebook with others under control of Docker, making specific access rights to different users.
- From then on, it will work just as if it were running locally.
Docker on your local machine would only be a precursor to posting on a public Docker service, unless the machine you are installing Docker on is accessible by others.
There are several ways to share Jupyter Notebooks with others:
- Email
- Place onto Google Drive
- Share on GitHub
- Store as HTML on a web server
- Install Jupyter on a web server
In order to email your notebook, the notebook must be converted to a plain text format, sent as an attachment to the recipient, and then the recipient must convert it back to the 'binary' notebook format.
Email attachments are normally converted to a well-defined MIME (Multipurpose Internet Mail Extensions) format. There is a program available, nb2mail, which converts the notebook to a notebook MIME format. The program is available at https://github.com/nfultz/nb2mail.
Usage is as follows:
- Install nb2mail using the pip command (see website)
- Convert your selected notebook to MIME format
- Send to recipient
- The recipient MIME conversion process will store the file in the correct fashion (assuming they have also installed nb2mail)
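To show what a MIME attachment looks like in general, here is a sketch that wraps notebook text in an email message using Python's standard library. This is not nb2mail and does not reproduce what it generates; the addresses, the toy notebook content, and the function name are all assumptions for the illustration.

```python
from email.message import EmailMessage

def notebook_to_email(notebook_json, filename, sender, recipient):
    """Build an email message carrying the notebook text as a MIME attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Notebook: " + filename
    msg.set_content("The notebook is attached.")
    # Attach the notebook bytes with a notebook-specific MIME type
    msg.add_attachment(notebook_json.encode("utf-8"),
                       maintype="application",
                       subtype="x-ipynb+json",
                       filename=filename)
    return msg

# Hypothetical notebook content and addresses
notebook_json = '{"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 4}'
message = notebook_to_email(notebook_json, "demo.ipynb",
                            "author@example.com", "reader@example.com")
print(message.get_content_type())
```

The resulting message could then be handed to any mail-sending routine; sending is left out of the sketch.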
Google Drive can be used to store your notebook profile information. This might be used when combined with the previous emailing of a notebook to another user. The recipient could use a Google Drive profile that would preclude anyone without the profile information from interacting with the notebook.
You install the Python extension (from https://github.com/jupyter/jupyter-drive) using pip and then python -m. From then on, you access the notebooks with the Google Drive profiles, as ipython notebook -profile <profilename>.
GitHub (and others) allow you to place a notebook on their servers that, once there, can be accessed directly using the nbviewer. The server has installed Python (and other language) coding needed to support your notebook. The nbviewer is a read-only use of your notebook, and is not interactive.
The nbviewer is available at https://github.com/jupyter/nbviewer. The site includes specific parameters which need to be added to the ipython notebook command, such as the command to start the viewer.
A built-in feature of notebooks is to export the notebook into different formats. One of those is HTML. In this manner, you could export the notebook into HTML and copy the file(s) onto your web server as changes are made.
The command is jupyter nbconvert <notebook name>.ipynb --to html.
Again, this would be a non-interactive, read-only version of your notebook.
Jupyter is deployed as a web application. If you have direct access to a web server, you could install Jupyter on the web server and create notebooks there; those notebooks would then be available to others and completely dynamic.
Because you run the web server, you also control access to it, and so can control who can access your notebook.
This is an advanced interaction that would require working with your webmaster to determine the correct approach.
There are two aspects to security in Jupyter Notebooks:
- Making sure only specific users can access your notebook
- Making sure your notebook is not used to host malicious coding
While many of the uses of Jupyter are solely for educating others, there are instances where the information being accessed is and should remain confidential. Jupyter allows you to put up barriers to entry to your notebook in several manners.
When we identify the user, we are authenticating that user. This is normally done by presenting a login challenge before allowing entry, where the user has to enter a username and password.
If the instance of Jupyter hosting your notebook is installed on a web server, you can use the web server's access control to limit access to your notebook. Further, most of the vendors that support notebook hosting provide a mechanism to limit access to specific users.
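The login challenge mentioned above comes down to comparing what the user typed with a stored secret. The sketch below shows one common way to do that comparison safely with Python's standard library. It is a generic illustration, not the mechanism Jupyter or any particular web server uses, and the user store and names are hypothetical.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    salt = salt if salt is not None else os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest

def check_login(username, password, users):
    """True only if the user exists and the password matches the stored hash."""
    if username not in users:
        return False
    salt, stored_digest = users[username]
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)

# Hypothetical user store: username -> (salt, password hash)
users = {"dan": hash_password("correct horse")}

print(check_login("dan", "correct horse", users))  # right password
print(check_login("dan", "wrong guess", users))    # wrong password
```

Storing a salted hash rather than the password itself means a leaked user store does not directly reveal the passwords.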
The other aspect of security is to make sure the contents of your notebooks are not malicious. You should make sure your notebook is safe, as follows:
- Ensure that HTML is sanitized (looking for malicious HTML coding and subverting it)
- Do not allow your notebook to execute external JavaScript
- Check that cell contents that may be malicious are challenged in a server environment
- Sanitize output of cells so as not to produce unwanted effects on user machines
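The bluntest form of the sanitizing steps above is to escape HTML special characters so that any markup is displayed as text rather than interpreted by the browser. The sketch below does exactly that with the standard library. It is an assumption-laden simplification: real sanitizers usually let safe tags through, and this is not how Jupyter itself sanitizes content.

```python
from html import escape

def sanitize_output(text):
    """Escape HTML special characters so the text is shown, not interpreted."""
    return escape(text, quote=True)

# Hypothetical cell output that tries to load external JavaScript
untrusted = '<script src="http://example.com/evil.js"></script>'
safe = sanitize_output(untrusted)
print(safe)
```

After escaping, the script tag is rendered as visible text, so the external JavaScript is never loaded.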