Index
A
- actions, DataFrame
- about / Overview of DataFrame actions
- .show(...) action / The .show(...) action
- .collect() action / The .collect() action
- .take(...) action / The .take(...) action
- .toPandas() action / The .toPandas() action
- reference / See also
- Advanced Packaging Tool (APT) / How it works...
- Anaconda
- reference / Installing Python
- analysis of variance (ANOVA) / How to do it...
- apt-get tool / How it works...
B
- benchmark logistic regression model
- forest coverage types, predicting / Predicting forest coverage types, How it works..., There's more...
- binaries
- Spark, installing / Installing Spark from binaries, How to do it..., How it works...
- Bokeh
- references / There's more...
- breadth first search (BFS) algorithm / Finding the fewest number of connections
C
- .collect() action / The .collect() action
- Catalyst Optimizer / Introduction
- categorical features
- exploring / Categorical features
- reference / See also...
- census respondents
- hours of work, predicting / Predicting hours of work for census respondents, How it works...
- income levels, forecasting / Forecasting the income levels of census respondents, How it works..., There's more...
- chi-square test
- reference / How it works...
- classification
- about / How to do it...
- Linear discriminant analysis (LDA) / How to do it...
- LinearSVC / Introducing Estimators
- LogisticRegression / Introducing Estimators
- DecisionTreeClassifier / Introducing Estimators
- GBTClassifier / Introducing Estimators
- RandomForestClassifier / Introducing Estimators
- NaiveBayes / Introducing Estimators
- MultilayerPerceptronClassifier / Introducing Estimators
- OneVsRest / Introducing Estimators
- classification metrics
- reference / How it works...
- Cloudera
- about / Working with Cloudera Spark images
- virtual image, using / Working with Cloudera Spark images
- Cloudera QuickStart VM
- reference / How to do it...
- downloading / How to do it...
- configuring / How it works...
- clustering
- BisectingKMeans / Introducing Estimators
- KMeans / Introducing Estimators
- GaussianMixture / Introducing Estimators
- Latent Dirichlet Allocation (LDA) / Introducing Estimators
- about / Clustering forest cover types
- reference / See also
- clustering models
- building / Building a clustering model, There's more...
- collinearity / Selecting the most predictable features
- configuration options, Spark
- reference / See also
- continuous aggregation
- with structured streaming / Continuous aggregation with structured streaming, How to do it..., How it works...
- continuous variables
- discretizing / Discretizing continuous variables, How it works...
- standardizing / Standardizing continuous variables, How it works...
- correlation methods
- reference / There's more...
- correlations
- computing / Computing correlations, There's more...
- CSV
- data, reading from / From CSV
D
- .describe() transformation / The .summary() and .describe() transformations
- .distinct(...) transformation / The .distinct(...) transformation
- .dropDuplicates(...) transformation / The .dropDuplicates(...) transformation
- .dropna(...) transformation / The .dropna(...) transformation
- D3.js visualization
- reference / Visualizing the graph
- data
- reading, from files / Reading data from files, How to do it...
- loading / Loading the data, How it works..., There's more...
- exploring / Exploring the data, How it works..., There's more...
- testing / Testing the data, Getting ready, How it works...
- transforming / Transforming the data, How it works..., There's more...
- standardizing / Standardizing the data, How it works...
- Databricks
- reference / Introduction, How to do it...
- data formats
- reference / See also
- DataFrame
- creating / Creating DataFrames, How it works...
- data, reading from JSON / From JSON
- data, reading from CSV / From CSV
- used, for performance optimization / Performance optimizations, Getting ready, How it works..., There's more...
- interacting, with SQL / Using SQL to interact with DataFrames, How it works..., There's more...
- transformations / Overview of DataFrame transformations, How to do it...
- actions / Overview of DataFrame actions
- data splitting
- reference / See also
- descriptive statistics
- exploring / Exploring descriptive statistics, How it works...
- exploring, for aggregated columns / Descriptive statistics for aggregated columns
- reference link / See also
- Discrete Cosine Transform (DCT) / Introducing Transformers
- Discretized Streams (DStreams) / Understanding Spark Streaming, Understanding DStreams
- duplicated records, in full_removed DataFrame
- different IDs / Only IDs differ
- ID collisions / ID collisions
- duplicates
- handling / Handling duplicates, How it works...
E
- estimators
- about / Introducing Estimators
- linear SVM model / Introducing Estimators
- linear regression model / Introducing Estimators
- classification / Introducing Estimators
- regression / Introducing Estimators
- clustering / Introducing Estimators
- evaluation of clustering
- reference / See also
F
- .fillna(...) transformation / The .fillna(...) transformation
- .filter(...) transformation / The .filter(...) transformation
- .freqItems(...) transformation / The .freqItems(...) transformation
- features
- interactions, visualizing / Visualizing interactions between features, How it works..., There's more...
- extracting, from text / Extracting features from text, How it works..., There's more...
- features, regression
- Pearson's correlation / How to do it...
- analysis of variance (ANOVA) / How to do it...
- feature selection
- reference / See also
- files
- data, reading / Reading data from files, How to do it...
- .textFile(...) method / .textFile(...) method
- .map(...) method / .map(...) method
- performance / Partitions and performance
- partitions / Partitions and performance
- forest coverage types
- predicting, with benchmark logistic regression model / Predicting forest coverage types, How it works..., There's more...
- predicting, with random forest classifier / Predicting forest coverage types, How it works..., There's more...
- clustering / Clustering forest cover types, How it works...
- forest elevation
- predicting, with linear SVM model / Getting ready, How it works..., There's more...
- predicting, with linear regression model / Getting ready, How it works..., There's more...
- estimating, with gradient-boosted trees regressor / Estimating forest elevation, How it works..., There's more...
- estimating, with random forest regression model / Estimating forest elevation, How it works..., There's more...
G
- .groupBy(...) transformation / The .groupBy(...) transformation
- Generalized Linear Model (GLM) / Introducing Estimators
- global aggregations
- about / Understanding global aggregations
- implementing / Understanding global aggregations, How it works...
- Netcat window / Terminal 1 – Netcat window
- Spark Streaming window / Terminal 2 – Spark Streaming window
- Gradient-Boosted Trees (GBT) / Introducing Estimators
- gradient-boosted trees regressor
- forest elevation, estimating / Estimating forest elevation, How it works..., There's more...
- gradient descent
- reference / How it works...
- graph
- about / Introduction
- usage / Introduction
- data, preparing / Preparing the data, How to do it..., How it works..., There's more...
- building / Building the graph, How to do it..., How it works...
- queries, executing / Running queries against the graph, How to do it..., How it works...
- patterns, understanding with motifs / Understanding the graph, How it works...
- fewest number of connections, searching / Finding the fewest number of connections, How it works..., There's more...
- visualizing / Visualizing the graph, How to do it..., How it works...
- GraphFrames
- installing / Installing GraphFrames, Getting ready, How it works...
- improvements / Installing GraphFrames
- reference / Installing GraphFrames, How to do it...
- GraphFrames Spark package
- reference / How it works...
H
- histograms
- drawing / Drawing histograms, How it works..., There's more...
- reference / See also
- hyperparameters
- tuning / Tuning hyperparameters, How it works..., There's more...
I
- internal field separator (IFS) / How it works...
- interquartile range / How to do it...
- Inverse Document Frequency (IDF) / Introducing Transformers
J
- .join(...) transformation / The .join(...) transformation
- Java
- installing / Installing Java
- reference / Installing Java
- JSON
- data, reading from / From JSON
- Jupyter
- installing / Installing Jupyter, How it works..., There's more..., See also
- reference / How it works...
- session, configuring / Configuring a session in Jupyter, Getting ready, How it works..., There's more...
- Jupyter kernel
- reference / How it works...
- Jupyter Notebook / How it works...
K
- kernel
- about / How it works...
- reference / How it works...
L
- Latent Dirichlet Allocation (LDA) / Introducing Estimators, How it works...
- Linear discriminant analysis (LDA) / How to do it...
- linear regression model
- for predicting forest elevation / Getting ready, How it works..., There's more...
- linear SVM model
- for predicting forest elevation / Getting ready, How it works..., There's more...
- Livy REST API
- reference / See also
- local instance, Spark
- configuring / Configuring a local instance of Spark, How it works...
- spark.app.name parameter / How it works...
- spark.driver.cores parameter / How it works...
- spark.driver.memory parameter / How it works...
- spark.executor.cores parameter / How it works...
- spark.executor.memory parameter / How it works...
- spark.submit.pyFiles parameter / How it works...
- spark.submit.deployMode parameter / How it works...
- spark.pyspark.python parameter / How it works...
M
- machine learning (ML)
- about / Transforming the data
- reference / How it works...
- machine learning algorithms
- evaluation metrics, reference / See also
- Matplotlib
- reference link / How it works...
- Maven
- reference / Installing Maven
- installing / Installing Maven
- missing observations
- handling / Handling missing observations, How to do it..., There's more...
- missing observations per column
- handling / Missing observations per column
- missing observations per row
- handling / Missing observations per row
- motifs
- used, for understanding patterns in graph / Understanding the graph, How it works...
- multi-node instance, Spark
- configuring / Configuring a multi-node instance of Spark, Getting ready, How to do it..., How it works...
- multicollinearity
- reference / There's more...
N
- Netcat
- starting / Terminal 1 – Netcat window
- numerical features
- exploring / Numerical features
O
- .orderBy(...) transformation / The .orderBy(...) transformation
- on-time flight performance data
- reference / Preparing the data
- outliers
- handling / How it works...
- reference link / See also
P
- PageRank
- about / Using PageRank to determine airport ranking
- reference / Using PageRank to determine airport ranking, How it works...
- used, for determining airport ranking / Using PageRank to determine airport ranking, How it works...
- PATH
- updating / Updating PATH
- Pearson's correlation / How to do it...
- performance statistics
- computing / Computing performance statistics
- computing, with regression metrics / Regression metrics
- computing, with classification metrics / Classification metrics
- pip
- reference / There's more...
- about / There's more..., How it works...
- Pipelines
- about / Introducing Pipelines
- using / How to do it..., How it works...
- reference / See also
- Precision-Recall (PR) / Classification metrics
- predictable features
- selecting / Selecting the most predictable features, How it works..., There's more...
- correlations, checking / There's more...
- problems, dataset
- duplicated observations / Introduction
- missing observations / Introduction
- anomalous observations / Introduction
- encoding / Introduction
- untrustworthy answers / Introduction
- PySpark
- installing, with pip / There's more...
- pyspark.sql module
- reference / How it works...
- Python
- installing / Installing Python
- Python Package Index (PyPI)
- about / How it works...
- reference / See also
R
- .repartition(...) transformation / The .repartition(...) transformation
- R
- reference / Installing R
- installing / Installing R
- random forest classifier
- forest coverage types, predicting / Predicting forest coverage types, How it works..., There's more...
- random forest regression model
- forest elevation, estimating / Estimating forest elevation, How it works..., There's more...
- RDD actions
- overview / Overview of RDD actions, How to do it...
- .take(...) action / .take(...) action
- .collect() action / .collect() action
- .reduce(...) action / .reduce(...) action
- .count() action / .count() action
- .saveAsTextFile(...) action / .saveAsTextFile(...) action
- implementing / How it works...
- RDD transformations
- overview / Overview of RDD transformations, How to do it...
- reference / How to do it...
- .map(...) transformation / .map(...) transformation
- .filter(...) transformation / .filter(...) transformation
- .flatMap(...) transformation / .flatMap(...) transformation
- .distinct() transformation / .distinct() transformation
- .sample(...) transformation / .sample(...) transformation
- .join(...) transformation / .join(...) transformation
- .repartition(...) transformation / .repartition(...) transformation
- .zipWithIndex() transformation / .zipWithIndex() transformation
- .reduceByKey(...) transformation / .reduceByKey(...) transformation
- .sortByKey(...) transformation / .sortByKey(...) transformation
- .union(...) transformation / .union(...) transformation
- .mapPartitionsWithIndex(...) transformation / .mapPartitionsWithIndex(...) transformation
- implementing / How it works...
- Receiver Operating Characteristics (ROC) / Classification metrics
- redirection pipes
- reference / How it works...
- reflection
- schema, inferring / Inferring the schema using reflection, How it works..., See also
- regression
- about / How to do it...
- AFTSurvivalRegression / Introducing Estimators
- DecisionTreeRegressor / Introducing Estimators
- GBTRegressor / Introducing Estimators
- GeneralizedLinearRegression / Introducing Estimators
- IsotonicRegression / Introducing Estimators
- LinearRegression / Introducing Estimators
- RandomForestRegressor / Introducing Estimators
- regression metrics
- reference / How it works...
- Resilient Distributed Datasets (RDDs)
- about / Introduction
- creating / Creating RDDs, How it works...
- Spark context parallelize method / Spark context parallelize method
- .take(...) method / .take(...) method
- pitfalls / Pitfalls of using RDDs, How to do it..., How it works...
- accessing / Accessing underlying RDDs, How it works...
- creating, for training / Creating an RDD for training, How it works..., There's more...
- for classification / Classification
- for regression / Regression
S
- .select(...) transformation / The .select(...) transformation
- .show(...) action / The .show(...) action
- .summary() transformation / The .summary() and .describe() transformations
- Scala
- installing / Installing Scala
- schema
- inferring, with reflection / Inferring the schema using reflection, How it works..., See also
- reference / See also, See also
- specifying, programmatically / Specifying the schema programmatically, How it works..., See also
- silhouette metrics
- reference / How it works...
- skip-gram model
- reference / There's more...
- Spark
- about / Introduction
- features / Introduction
- requisites, installing / Installing Spark requirements, Getting ready, How it works...
- reference / How it works..., See also
- Java, installing / Installing Java
- Python, installing / Installing Python
- R, installing / Installing R
- Scala, installing / Installing Scala
- Maven, installing / Installing Maven
- PATH, updating / Updating PATH
- installing, from sources / Installing Spark from sources, How to do it..., How it works..., See also
- installing, from binaries / Installing Spark from binaries, How to do it..., How it works...
- local instance, configuring / Configuring a local instance of Spark, How it works...
- multi-node instance, configuring / Configuring a multi-node instance of Spark, Getting ready, How to do it..., How it works...
- Spark DataFrame / Creating DataFrames
- sparkmagic package
- installing / How it works...
- reference / How it works..., See also
- Spark Streaming
- about / Understanding Spark Streaming
- reference / How to do it...
- Netcat window, using / Terminal 1 – Netcat window
- PySpark Streaming application, creating / Terminal 2 – Spark Streaming window
- console application, creating / How it works..., There's more...
- Spark Streaming Context (SSC) / Understanding Spark Streaming
- SQL
- used, for interacting with DataFrame / Using SQL to interact with DataFrames, How it works..., There's more...
- statistical test
- reference / See also...
- stop words
- reference / How it works...
- structured streaming
- continuous aggregation / Continuous aggregation with structured streaming, How to do it..., How it works...
- Netcat window / Terminal 1 – Netcat window
- Spark Streaming window / Terminal 2 – Spark Streaming window
- SVM (Support Vector Machine) / Forecasting the income levels of census respondents
T
- .take(...) action / The .take(...) action
- .toPandas() action / The .toPandas() action
- temporary table
- creating / Creating a temporary table, How it works..., There's more...
- term frequency-inverse document frequency (TF-IDF) / How it works...
- text
- features, extracting / Extracting features from text, How it works..., There's more...
- topic
- assigning, to set of short paragraphs / Topic mining, How it works...
- train-validation split / There's more...
- transformations, DataFrame
- about / Overview of DataFrame transformations, How to do it...
- .select(...) transformation / The .select(...) transformation
- .filter(...) transformation / The .filter(...) transformation
- .groupBy(...) transformation / The .groupBy(...) transformation
- .orderBy(...) transformation / The .orderBy(...) transformation
- .withColumn(...) transformation / The .withColumn(...) transformation
- .join(...) transformation / The .join(...) transformation
- .unionAll(...) transformation / The .unionAll(...) transformation
- .distinct(...) transformation / The .distinct(...) transformation
- .repartition(...) transformation / The .repartition(...) transformation
- .fillna(...) transformation / The .fillna(...) transformation
- .dropna(...) transformation / The .dropna(...) transformation
- .dropDuplicates(...) transformation / The .dropDuplicates(...) transformation
- .describe() transformation / The .summary() and .describe() transformations
- .summary() transformation / The .summary() and .describe() transformations
- .freqItems(...) transformation / The .freqItems(...) transformation
- reference / See also
- transformers
- about / Introducing Transformers
- Binarizer / Introducing Transformers
- Bucketizer / Introducing Transformers
- ChiSqSelector / Introducing Transformers
- CountVectorizer / Introducing Transformers
- DCT / Introducing Transformers
- ElementwiseProduct / Introducing Transformers
- HashingTF / Introducing Transformers
- IDF / Introducing Transformers
- IndexToString / Introducing Transformers
- MaxAbsScaler / Introducing Transformers
- MinMaxScaler / Introducing Transformers
- NGram / Introducing Transformers
- Normalizer / Introducing Transformers
- OneHotEncoder / Introducing Transformers
- PCA / Introducing Transformers
- PolynomialExpansion / Introducing Transformers
- QuantileDiscretizer / Introducing Transformers
- RegexTokenizer / Introducing Transformers
- RFormula / Introducing Transformers
- SQLTransformer / Introducing Transformers
- StandardScaler / Introducing Transformers
- StopWordsRemover / Introducing Transformers
- StringIndexer / Introducing Transformers
- Tokenizer / Introducing Transformers
- VectorAssembler / Introducing Transformers
- VectorIndexer / Introducing Transformers
- VectorSlicer / Introducing Transformers
- Word2Vec / Introducing Transformers
- using / Getting ready, How it works...
- .VectorAssembler(...) method / There's more...
- reference / See also
U
- .unionAll(...) transformation / The .unionAll(...) transformation
- User Defined Functions (UDFs) / Performance optimizations
V
- vectorized UDFs
- reference / See also
- VirtualBox
- reference / Getting ready
- installation / Getting ready
W
- .withColumn(...) transformation / The .withColumn(...) transformation
- Word2Vec
- reference / There's more...