Index
A
- accumulate
- Alternating Least Squares (ALS) algorithm
- about / Recommendation
- reference link / Recommendation
- Amazon Machine Images (AMI) / Running Spark on EC2 with the scripts
- architecture, Spark SQL
- about / The Spark SQL architecture
B
- basic statistics, Spark MLlib examples
- about / Basic statistics
- broadcast
C
- Chef
- about / Deploying Spark with Chef (Opscode)
- URL / Deploying Spark with Chef (Opscode)
- Spark, deploying with / Deploying Spark with Chef (Opscode)
- URL, for cookbook / Deploying Spark with Chef (Opscode)
- classification, Spark MLlib examples
- about / Classification
- clustering, Spark MLlib examples
- about / Clustering
- code testable
- making / Making your code testable
- commands, quick start
- community developed packages
- about / Community developed packages
- concurrency, limitations
- about / Concurrency limitations
- memory usage, and garbage collection / Memory usage and garbage collection
- serialization / Serialization
- IDE integration / IDE integration
- custom serializers
- references / Serialization
D
- data
- loading, from S3 / Interactively loading data from S3
- loading, into RDD / Loading data into an RDD
- saving / Saving your data
- data files, GitHub
- reference link / SQL access to a simple data table
- directory
- organization / Directory organization and convention
- convention / Directory organization and convention
- references / Directory organization and convention
- doctest / Testing in Python
- double RDD functions
- about / Double RDD functions
- sampleStdev / Double RDD functions
- stats / Double RDD functions
- stdev / Double RDD functions
- sum / Double RDD functions
- variance / Double RDD functions
E
- EC2
- Spark, running on / Running Spark on EC2, Running Spark on EC2 with the scripts
- EC2 command line tools
- references / Running Spark on EC2 with the scripts
- EC2 scripts, Amazon
- Elastic MapReduce (EMR)
- Spark, deploying on / Deploying Spark on Elastic MapReduce
- ENSIME (ENhanced Scala Interaction Mode for Emacs)
- about / IDE integration
- URL / IDE integration
F
- files
- saving, to Parquet / Saving files to the Parquet format
- loading, to Parquet / Loading Parquet files
- flatMap function
- functions, for joining Pair RDDs
- about / Functions for joining PairRDDs
- coGroup / Functions for joining PairRDDs
- join / Functions for joining PairRDDs
- subtractByKey / Functions for joining PairRDDs
- functions, on JavaPairRDDs
- about / Functions on JavaPairRDDs
- cogroup / Functions on JavaPairRDDs
- combineByKey / Functions on JavaPairRDDs
- collectAsMap / Functions on JavaPairRDDs
- countByKey / Functions on JavaPairRDDs
- flatMapValues / Functions on JavaPairRDDs
- join / Functions on JavaPairRDDs
- keys / Functions on JavaPairRDDs
- lookup / Functions on JavaPairRDDs
- reduceByKey / Functions on JavaPairRDDs
- sortByKey / Functions on JavaPairRDDs
- values / Functions on JavaPairRDDs
G
- general RDD functions
- about / General RDD functions
- aggregate / General RDD functions
- cache / General RDD functions
- collect / General RDD functions
- count / General RDD functions
- countByValue / General RDD functions
- distinct / General RDD functions
- filter / General RDD functions
- filterWith / General RDD functions
- first / General RDD functions
- flatMap / General RDD functions
- fold / General RDD functions
- foreach / General RDD functions
- groupBy / General RDD functions
- keyBy / General RDD functions
- map / General RDD functions
- mapPartitions / General RDD functions
- mapPartitionsWithIndex / General RDD functions
- mapWith / General RDD functions
- persist / General RDD functions
- pipe / General RDD functions
- sample / General RDD functions
- takeSample / General RDD functions
- toDebugString / General RDD functions
- union / General RDD functions
- unpersist / General RDD functions
- zip / General RDD functions
- GitHub repository
- reference link, for data files / Spark MLlib examples
H
- HBase
- about / HBase
- data, loading / Loading from HBase
- data, saving / Saving to HBase
- metadata, obtaining / Other HBase operations
I
- Impala
- Parquet files, querying / Querying Parquet files with Impala
- interactions
- testing, with SparkContext / Testing interactions with SparkContext
J
- Java
- SparkContext object, creating in / Java
- RDD, manipulating in / Manipulating your RDD in Scala and Java
- using, as testing library / Testing in Java and Scala
- Java RDD functions
- about / Java RDD functions, Common Java RDD functions
- Spark Java function classes / Spark Java function classes
- common Java RDD functions / Common Java RDD functions
- cache / Common Java RDD functions
- coalesce / Common Java RDD functions
- collect / Common Java RDD functions
- count / Common Java RDD functions
- countByValue / Common Java RDD functions
- distinct / Common Java RDD functions
- filter / Common Java RDD functions
- first / Common Java RDD functions
- flatMap / Common Java RDD functions
- fold / Common Java RDD functions
- foreach / Common Java RDD functions
- groupBy / Common Java RDD functions
- map / Common Java RDD functions
- mapPartitions / Common Java RDD functions
- reduce / Common Java RDD functions
- sample / Common Java RDD functions
L
- lambda
- latest development source, Spark
- references / Downloading the source
- linear regression, Spark MLlib examples
- about / Linear regression
- logistic regression
- running, Spark shell used / Using the Spark shell to run logistic regression
- logs
- finding / Where to find logs
M
- mailing lists
- about / Mailing lists
- references / Mailing lists
- map
- map function
- massively parallel processing (MPP)
- Maven
- Spark job, building with / Building your Spark job with Maven
- Maven installation instructions
- references / Compiling the source with Maven
- Mesos
- about / Deploying Spark on Mesos
- Spark, deploying on / Deploying Spark on Mesos
- URL / Deploying Spark on Mesos
- metadata, SparkContext object
- about / SparkContext – metadata
- appName / SparkContext – metadata
- getConf / SparkContext – metadata
- getExecutorMemoryStatus / SparkContext – metadata
- master / SparkContext – metadata
- version / SparkContext – metadata
- methods, for combining JavaRDDs
- about / Methods for combining JavaRDDs
- subtract / Methods for combining JavaRDDs
- union / Methods for combining JavaRDDs
- zip / Methods for combining JavaRDDs
- multiple tables
- handling, with Spark SQL / Handling multiple tables with Spark SQL
N
- non-data-driven methods, SparkContext object
- addJar(path) / Shared Java and Scala APIs
- addFile(path) / Shared Java and Scala APIs
- stop() / Shared Java and Scala APIs
- clearFiles() / Shared Java and Scala APIs
- clearJars() / Shared Java and Scala APIs
P
- package index site
- reference link / Community developed packages
- pair RDD functions
- about / PairRDD functions
- collectAsMap / PairRDD functions
- reduceByKey / PairRDD functions
- countByKey / PairRDD functions
- join / PairRDD functions
- rightOuterJoin / PairRDD functions
- leftOuterJoin / PairRDD functions
- combineByKey / PairRDD functions
- zip / PairRDD functions
- groupByKey / PairRDD functions
- cogroup / PairRDD functions
- PairRDD functions
- about / Other PairRDD functions
- lookup / Other PairRDD functions
- mapValues / Other PairRDD functions
- collectAsMap / Other PairRDD functions
- countByKey / Other PairRDD functions
- partitionBy / Other PairRDD functions
- flatMapValues / Other PairRDD functions
- Parquet
- about / Parquet – an efficient and interoperable big data format
- files, saving / Saving files to the Parquet format
- files, loading / Loading Parquet files
- processed RDD, saving / Saving processed RDD in the Parquet format
- Parquet files
- querying, with Impala / Querying Parquet files with Impala
- Personal Package Archive (PPA) / Building your Spark project with sbt
- prebuilt distribution
- installing / Installing prebuilt distribution
- processed RDD
- saving, in Parquet / Saving processed RDD in the Parquet format
- PySpark / Testing in Python
- Python
- Spark shell, running in / Running Spark shell in Python
- SparkContext object, creating in / Python
- RDD, manipulating in / Manipulating your RDD in Python
- Python testing, of Spark / Testing in Python
Q
- QuickStart VM
R
- RDD
- about / RDDs
- data, loading into / Loading data into an RDD
- manipulating, in Scala / Manipulating your RDD in Scala and Java
- manipulating, in Java / Manipulating your RDD in Scala and Java
- manipulating, in Python / Manipulating your RDD in Python
- references / PairRDD functions
- recommendation, Spark MLlib examples
- about / Recommendation
- reference link / Recommendation
- reduce
- Resilient Distributed Dataset (RDD) / Spark topology, Loading a simple text file
- Run-Length Encoding (RLE)
S
- S3
- data, loading from / Interactively loading data from S3
- sbt
- Spark project, building with / Building your Spark project with sbt
- Scala
- SparkContext object, creating in / Scala
- RDD, manipulating in / Manipulating your RDD in Scala and Java
- Scala APIs / Shared Java and Scala APIs
- Scala RDD functions
- about / Scala RDD functions
- foldByKey / Scala RDD functions
- reduceByKey / Scala RDD functions
- groupByKey / Scala RDD functions
- ScalaTest
- using, as testing library / Testing in Java and Scala
- security
- about / A quick note on security
- shared Java APIs / Shared Java and Scala APIs
- simple text file
- loading / Loading a simple text file
- single machine
- about / A single machine
- source
- Spark, building from / Building Spark from source
- spam dataset, GitHub link
- Spark
- URL, for downloading / Installing prebuilt distribution
- building, from source / Building Spark from source
- URL, for building from source / Building Spark from source
- URL, for downloading latest source / Downloading the source
- installation, testing / Testing the installation
- running, on EC2 / Running Spark on EC2
- reference link, for running Spark on EC2 / Running Spark on EC2
- running on EC2, with scripts / Running Spark on EC2 with the scripts
- deploying, on Elastic MapReduce (EMR) / Deploying Spark on Elastic MapReduce
- deploying, with Chef / Deploying Spark with Chef (Opscode)
- deploying, on Mesos / Deploying Spark on Mesos
- URL, for configuration details on YARN / Spark on YARN
- standalone mode / Spark Standalone mode
- references / Building your Spark job with something else
- using, with other languages / Using Spark with other languages
- Spark, building from source
- about / Building Spark from source
- source, downloading / Downloading the source
- source, compiling with Maven / Compiling the source with Maven
- compilation switches / Compilation switches
- Spark, on YARN
- about / Spark on YARN
- SparkContext
- references / Python
- interactions, testing with / Testing interactions with SparkContext
- SparkContext object
- creating, in Scala / Scala
- creating, in Java / Java
- metadata / SparkContext – metadata
- creating, in Python / Python
- Spark documentation
- URL, for configuration / Memory usage and garbage collection
- URL, for RDDs / Memory usage and garbage collection
- Spark Java function classes
- about / Spark Java function classes
- Function<T,R> / Spark Java function classes
- DoubleFunction<T> / Spark Java function classes
- PairFunction<T, K, V> / Spark Java function classes
- FlatMapFunction<T, R> / Spark Java function classes
- PairFlatMapFunction<T, K, V> / Spark Java function classes
- DoubleFlatMapFunction<T> / Spark Java function classes
- Function2<T1, T2, R> / Spark Java function classes
- Spark job
- building, with Maven / Building your Spark job with Maven
- building / Building your Spark job with something else
- Spark machine learning algorithm table
- Spark MLlib examples
- about / Spark MLlib examples
- basic statistics / Basic statistics
- linear regression / Linear regression
- classification / Classification
- clustering / Clustering
- recommendation / Recommendation
- Spark project
- building, with sbt / Building your Spark project with sbt
- Spark shell
- used, for running logistic regression / Using the Spark shell to run logistic regression
- running, in Python / Running Spark shell in Python
- Spark SQL
- architecture / The Spark SQL architecture
- overview / Spark SQL how-to in a nutshell
- multiple tables, handling with / Handling multiple tables with Spark SQL
- references / Aftermath
- Spark SQL programming
- about / Spark SQL programming
- SQL access, to simple data table / SQL access to a simple data table
- Spark SQL programming guide
- reference link / Spark SQL programming
- Spark topology
- about / Spark topology
- SQL scripts, Northwind database
- reference link / Spark SQL programming
- standalone mode, Spark
- reference link / Spark Standalone mode
- standard RDD functions
- about / Standard RDD functions
- flatMap / Standard RDD functions
- mapPartitions / Standard RDD functions
- filter / Standard RDD functions
- distinct / Standard RDD functions
- union / Standard RDD functions
- cartesian / Standard RDD functions
- groupBy / Standard RDD functions
- pipe / Standard RDD functions
- foreach / Standard RDD functions
- reduce / Standard RDD functions
- fold / Standard RDD functions
- countByValue / Standard RDD functions
- take / Standard RDD functions
- partitionBy / Standard RDD functions
T
- testing
- references / Testing in Python
- type inference / Manipulating your RDD in Scala and Java
Y
- YARN
- about / Spark on YARN