Index
A
- algorithms
- best practices / Best practices for algorithms
- Anaconda
- about / Anaconda
- ANOVA / Best practices for statistics
- applications and examples, predictive modelling
- about / Applications and examples of predictive modelling
- People also viewed feature, LinkedIn / LinkedIn's "People also viewed" feature, What it does?
- online ads, correct targeting / Correct targeting of online ads, How is it done?
- Santa Cruz predictive policing / Santa Cruz predictive policing
- smartphone user activity, determining / Determining the activity of a smartphone user using accelerometer data
- sport and fantasy leagues / Sport and fantasy leagues
B
- Bagging
- Bell Curve
- about / Cumulative density function
- best practices
- for coding / Best practices for coding
- for data handling / Best practices for data handling
- for algorithms / Best practices for algorithms
- for statistics / Best practices for statistics
- for business context / Best practices for business contexts
- best practices, for coding
- about / Best practices for coding
- codes, commenting / Commenting the codes
- functions, defining for substantial individual tasks / Defining functions for substantial individual tasks
- examples, of functions / Defining functions for substantial individual tasks, Example 3
- hard-coding of variables, avoiding / Avoid hard-coding of variables as much as possible
- version control / Version control
- standard libraries / Using standard libraries, methods, and formulas
- methods / Using standard libraries, methods, and formulas
- formulas / Using standard libraries, methods, and formulas
- boxplots
- business context
- best practices / Best practices for business contexts
C
- chi-square test
- about / Chi-square tests, Chi-square test
- usage / Chi-square tests
- clustering
- about / What is clustering?
- using / How is clustering used?
- cases / Why do we do clustering?
- clustering, fine-tuning
- about / Fine-tuning the clustering
- elbow method / The elbow method
- Silhouette Coefficient / Silhouette Coefficient
- clustering, implementing with Python
- about / Implementing clustering using Python
- dataset, importing / Importing and exploring the dataset
- dataset, exporting / Importing and exploring the dataset
- values in dataset, normalizing / Normalizing the values in the dataset
- hierarchical clustering, using scikit-learn / Hierarchical clustering using scikit-learn
- k-Means clustering, using scikit-learn / K-Means clustering using scikit-learn
- cluster, interpreting / Interpreting the cluster
- coding
- best practices / Best practices for coding
- contingency table
- about / Contingency tables
- creating / Contingency tables
- correlation
- about / Correlation
- correlation coefficient
- about / Correlation
- Correlation Matrix
- about / Correlation
- Cumulative Density Function
- about / Cumulative density function
- Customer Churn Model
D
- data
- versus oil / Introducing predictive modelling
- reading / Reading the data – variations and examples
- summary / Basics – summary, dimensions, and structure
- structure / Basics – summary, dimensions, and structure
- dimensions / Basics – summary, dimensions, and structure
- concatenating / Concatenating and appending data
- appending / Concatenating and appending data
- data collection
- data extraction
- Data frame
- about / Data frames
- data grouping
- about / Grouping the data – aggregation, filtering, and transformation
- illustration / Grouping the data – aggregation, filtering, and transformation
- aggregation / Aggregation
- filtering / Filtering
- transformation / Transformation
- miscellaneous operations / Miscellaneous operations
- data handling
- best practices / Best practices for data handling
- data importing, in Python
- about / Various methods of importing data in Python
- dataset, reading with read_csv method / Case 1 – reading a dataset using the read_csv method
- dataset, reading with open method / Case 2 – reading a dataset using the open method of Python
- dataset, reading from URL / Case 3 – reading data from a URL
- miscellaneous cases / Case 4 – miscellaneous cases
- dataset
- visualizing, by basic plotting / Visualizing a dataset by basic plotting
- subsetting / Subsetting a dataset
- columns, selecting / Selecting columns
- rows, selecting / Selecting rows
- combination of rows and columns, selecting / Selecting a combination of rows and columns
- new columns, creating / Creating new columns
- merging/joining / Merging/joining datasets
- dataset, reading with open method
- about / Case 2 – reading a dataset using the open method of Python
- reading line by line / Reading a dataset line by line
- delimiter, changing / Changing the delimiter of a dataset
- decision tree
- about / Introducing decision trees, A decision tree
- using / A decision tree
- mathematics / Understanding the mathematics behind decision trees
- decision tree, implementing with scikit-learn
- about / Implementing a decision tree with scikit-learn
- tree, visualizing / Visualizing the tree
- decision tree, cross-validating / Cross-validating and pruning the decision tree
- decision tree, pruning / Cross-validating and pruning the decision tree
- delimiter
- about / Delimiters
- distance matrix
- about / The distance matrix
- distances, between two observations
- Euclidean distance / Euclidean distance
- Manhattan distance / Manhattan distance
- Minkowski distance / Minkowski distance
- dummy data frame
- generating / Generating a dummy data frame
- dummy variables
- creating / Creating dummy variables
E
- elbow method / The elbow method
- Euclidean distance
- about / Euclidean distance
F
- F-statistics
- about / F-statistics
- significance / F-statistics
G
- guidelines, for selecting predictor variables
- R² / Summary of models
- p-values / Summary of models
- F-statistic / Summary of models
- RSE / Summary of models
- VIF / Summary of models
H
- Harvard Business Review (HBR)
- about / Introducing predictive modelling
- heteroscedasticity / Other considerations and assumptions for linear regression
- hierarchical clustering
- about / Hierarchical clustering
- histograms
- about / Histograms
- plotting / Histograms
- hypothesis testing
- about / Hypothesis testing
- null hypothesis, versus alternate hypothesis / Null versus alternate hypothesis
- Z-statistic / Z-statistic and t-statistic
- t-statistic / Z-statistic and t-statistic
- confidence intervals / Confidence intervals, significance levels, and p-values
- significance levels / Confidence intervals, significance levels, and p-values
- p-values / Confidence intervals, significance levels, and p-values
- types / Different kinds of hypothesis test
- step-by-step guide / A step-by-step guide to do a hypothesis test
- example / An example of a hypothesis test
- hypothesis tests
- left-tailed / Different kinds of hypothesis test
- right-tailed / Different kinds of hypothesis test
- two-tailed / Different kinds of hypothesis test
I
- IDEs, for Python
- about / IDEs for Python
- IDLE / IDEs for Python
- IPython Notebook / IDEs for Python
- Spyder / IDEs for Python
- IDLE
- about / IDEs for Python
- features / IDEs for Python
- Inner Join
- characteristics / Inner Join
- about / Inner Join
- example / An example of the Inner Join
- Interquartile Range (IQR) / Handling outliers
- intra-cluster distance / The elbow method
- IPython
- IPython Notebook
- about / IDEs for Python
- features / IDEs for Python
- issues handling, in linear regression
- about / Handling other issues in linear regression
- categorical variables, handling / Handling categorical variables
- variable, transforming to fit non-linear relations / Transforming a variable to fit non-linear relations
- outliers, handling / Handling outliers
J
- joins
- summarizing / Summary of Joins in terms of their length
K
- k-Means clustering
- about / K-means clustering
- knowledge matrix, predictive modelling
L
- left-tailed test
- Left Join
- characteristics / Left Join
- about / Left Join
- example / An example of the Left Join
- Likelihood Ratio Test statistic
- about / Likelihood Ratio Test statistic
- linear regression
- issues, handling / Handling other issues in linear regression
- considerations / Other considerations and assumptions for linear regression
- assumptions / Other considerations and assumptions for linear regression
- versus logistic regression / Linear regression versus logistic regression
- linear regression, implementing with Python
- about / Implementing linear regression with Python
- statsmodels library, using / Linear regression using the statsmodels library
- multiple linear regression / Multiple linear regression
- multi-collinearity / Multi-collinearity
- Variance Inflation Factor (VIF) / Variance Inflation Factor
- linkage methods
- about / Linkage methods
- single linkage / Single linkage
- complete linkage / Complete linkage
- average linkage / Average linkage
- centroid linkage / Centroid linkage
- Ward's method / Ward's method
- logistic regression
- logistic regression, with Python
- implementing / Implementing logistic regression with Python
- data, processing / Processing the data
- data exploration / Data exploration
- data visualization / Data visualization
- dummy variables, creating for categorical variables / Creating dummy variables for categorical variables
- feature selection / Feature selection
- model, implementing / Implementing the model
- logistic regression model
- validation / Model validation and evaluation
- evaluation / Model validation and evaluation
- cross validation / Cross validation
- logistic regression parameters
- about / Making sense of logistic regression parameters
- Wald test / Wald test
- Likelihood Ratio Test statistic / Likelihood Ratio Test statistic
- chi-square test / Chi-square test
M
- Manhattan distance
- about / Manhattan distance
- math, behind logistic regression
- about / Understanding the math behind logistic regression
- contingency tables / Contingency tables
- conditional probability / Conditional probability
- odds ratio / Odds ratio
- moving to logistic regression / Moving on to logistic regression from linear regression
- estimation, using Maximum Likelihood Method / Estimation using the Maximum Likelihood Method, Log likelihood function
- logistic regression model, building from scratch / Building the logistic regression model from scratch
- mathematics, behind clustering
- about / Mathematics behind clustering
- distances, between two observations / Distances between two observations
- distance matrix / The distance matrix
- distances, normalizing / Normalizing the distances
- linkage methods / Linkage methods
- hierarchical clustering / Hierarchical clustering
- k-Means clustering / K-means clustering
- mathematics, decision tree
- homogeneity / Homogeneity
- entropy / Entropy
- information gain / Information gain
- ID3 algorithm / ID3 algorithm to create a decision tree
- Gini index / Gini index
- Reduction in Variance / Reduction in Variance
- tree, pruning / Pruning a tree
- continuous numerical variable, handling / Handling a continuous numerical variable
- missing value of attribute, handling / Handling a missing value of an attribute
- maths, behind linear regression
- about / Understanding the maths behind linear regression
- simulated data, using / Linear regression using simulated data
- linear regression model, fitting / Fitting a linear regression model and checking its efficacy
- linear regression model efficacy, checking / Fitting a linear regression model and checking its efficacy
- optimum value of variable coefficients, finding / Finding the optimum value of variable coefficients
- matplotlib
- miles per gallon (mpg) / Transforming a variable to fit non-linear relations
- Minkowski distance
- about / Minkowski distance
- miscellaneous cases, data reading
- reading, from .xls or .xlsx file / Reading from an .xls or .xlsx file
- CSV or Excel file, writing to / Writing to a CSV or Excel file
- missing values
- handling / Handling missing values
- checking for / Checking for missing values
- about / What constitutes missing data?
- generating / How missing values are generated and propagated
- propagating / How missing values are generated and propagated
- treating / Treating missing values
- deletion / Deletion
- imputation / Imputation
- model validation
- about / Model validation, Model validation
- data split, training / Training and testing data split
- data split, testing / Training and testing data split
- models, summarizing / Summary of models
- guidelines, for selecting variables / Summary of models
- linear regression with scikit-learn / Linear regression with scikit-learn
- feature selection, with scikit-learn / Feature selection with scikit-learn
- Monte-Carlo simulation
- for finding value of pi / Using the Monte-Carlo simulation to find the value of pi
- multi-collinearity
- about / Multi-collinearity
N
- normal distribution
- about / Normal distribution
- null hypothesis
- versus alternate hypothesis / Null versus alternate hypothesis
- NumPy
O
- outliers
- about / Handling outliers
- handling / Handling outliers
P
- p-values
- about / p-values
- pandas
- parameters, random forest
- node size / Important parameters for random forests
- number of trees / Important parameters for random forests
- number of predictors sampled / Important parameters for random forests
- pip
- installing / Installing pip
- predictive analytics
- about / Introducing predictive modelling
- predictive modelling
- about / Introducing predictive modelling
- scope / Scope of predictive modelling
- statistical algorithms / Ensemble of statistical algorithms
- statistical tools / Statistical tools
- historical data / Historical data
- mathematical function / Mathematical function
- business context / Business context
- knowledge matrix / Knowledge matrix for predictive modelling
- task matrix / Task matrix for predictive modelling
- applications and examples / Applications and examples of predictive modelling
- predictor variables
- about / Multiple linear regression
- forward selection approach / Multiple linear regression
- backward selection approach / Multiple linear regression
- Probability Density Function
- about / Probability density function
- probability distributions
- about / Generating random numbers following probability distributions
- Probability Density Function / Probability density function
- Cumulative Density Function / Cumulative density function
- Python packages
- about / Python and its packages – download and installation
- Anaconda / Anaconda
- Standalone Python / Standalone Python
- installing / Installing a Python package
- installing, with pip / Installing Python packages with pip
- Python packages, for predictive modelling
- about / Python and its packages for predictive modelling
- pandas / Python and its packages for predictive modelling
- NumPy / Python and its packages for predictive modelling
- matplotlib / Python and its packages for predictive modelling
- IPython / Python and its packages for predictive modelling
- scikit-learn / Python and its packages for predictive modelling
R
- random forest
- implementing, using Python / Implementing a random forest using Python
- features / Why do random forests work?
- parameters / Important parameters for random forests
- random forest algorithm
- about / The random forest algorithm
- random forests
- random numbers
- about / Generating random numbers and their usage
- generating / Generating random numbers and their usage
- usage / Generating random numbers and their usage
- methods, for generating / Various methods for generating random numbers
- seeding / Seeding a random number
- generating, following probability distributions / Generating random numbers following probability distributions
- random sampling
- about / Random sampling – splitting a dataset in training and testing datasets
- dataset, testing / Random sampling – splitting a dataset in training and testing datasets
- dataset, splitting / Random sampling – splitting a dataset in training and testing datasets
- Customer Churn Model, using / Method 1 – using the Customer Churn Model
- sklearn, using / Method 2 – using sklearn
- shuffle function, using / Method 3 – using the shuffle function
- and central limit theorem / Random sampling and the central limit theorem
- read_csv method
- about / Case 1 – reading a dataset using the read_csv method, The read_csv method
- filepath / The read_csv method
- sep / The read_csv method
- dtype / The read_csv method
- header / The read_csv method
- names / The read_csv method
- skiprows / The read_csv method
- index_col / The read_csv method
- skip_blank_lines / The read_csv method
- na_filter / The read_csv method
- use cases / Use cases of the read_csv method
- Receiver Operating Characteristic (ROC) curve
- about / Model validation
- Recursive Feature Elimination (RFE) / Feature selection with scikit-learn
- regression tree algorithm
- about / Regression tree algorithm
- regression trees
- about / Understanding and implementing regression trees
- advantages / Regression tree algorithm
- implementing, with Python / Implementing a regression tree using Python
- Residual Standard Error (RSE)
- about / Residual Standard Error
- result parameters
- about / Making sense of result parameters
- p-values / p-values
- F-statistics / F-statistics
- Residual Standard Error (RSE) / Residual Standard Error
- retrospective analytics
- about / Introducing predictive modelling
- right-tailed test
- Right Join
- about / Right Join
- characteristics / Right Join
- example / An example of the Right Join
- ROC curve
- about / The ROC curve
- confusion matrix / Confusion matrix
S
- scatter plot
- about / Scatter plots
- plotting / Scatter plots
- scikit-learn
- Sensitivity (True Positive Rate) / The ROC curve
- shuffle function
- Silhouette Coefficient / Silhouette Coefficient
- sklearn
- using / Method 2 – using sklearn
- Specificity (True Negative Rate) / The ROC curve
- Spyder
- about / IDEs for Python
- features / IDEs for Python
- Standalone Python
- about / Standalone Python
- statistical algorithms, predictive modelling
- about / Ensemble of statistical algorithms
- supervised algorithms / Ensemble of statistical algorithms
- unsupervised algorithms / Ensemble of statistical algorithms
- statistics
- best practices / Best practices for statistics
T
- t-statistic
- about / Z-statistic and t-statistic
- t-test / Best practices for statistics
- t-test (Student-t distribution)
- about / Z-statistic and t-statistic
- task matrix, predictive modelling
- two-tailed test
U
- uniform distribution
- about / Uniform distribution
- use cases, read_csv method
- about / Use cases of the read_csv method
- directory address and filename, passing as variables / Passing the directory address and filename as variables
- .txt dataset, reading with comma delimiter / Reading a .txt dataset with a comma delimiter
- dataset column names, specifying from list / Specifying the column names of a dataset from a list
V
- value of pi
- calculating / Geometry and mathematics behind the calculation of pi
- Variance Inflation Factor (VIF)
- about / Variance Inflation Factor
W
- Wald test / Wald test
Z
- Z-statistic
- about / Z-statistic and t-statistic
- Z-test / Best practices for statistics
- Z-test (normal distribution)
- about / Z-statistic and t-statistic