Installing Spark from binaries works fine in most cases. For advanced use cases such as the following (among others), compiling from the source code is a better option:
Compiling for a specific Hadoop version
Adding the Hive integration
Adding the YARN integration
The following are the prerequisites for this recipe to work:
Java 1.6 or a later version
Maven 3.x
The following are the steps to build the Spark source code with Maven:
Increase MaxPermSize for the heap:
$ echo "export _JAVA_OPTIONS=\"-XX:MaxPermSize=1G\"" >> /home/hduser/.bashrc
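The effect of that line can be sketched with a throwaway rc file (using /tmp/demo_bashrc as a stand-in for ~/.bashrc, so nothing permanent is touched):

```shell
# Append the setting to a scratch rc file, source it, and confirm the
# variable is set in the current shell.
rc=/tmp/demo_bashrc
echo "export _JAVA_OPTIONS=\"-XX:MaxPermSize=1G\"" > "$rc"
. "$rc"
echo "$_JAVA_OPTIONS"
```

Any JVM started after this prints a "Picked up _JAVA_OPTIONS" notice on startup, confirming the option is in effect.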
Open a new terminal window and download the Spark source code from GitHub:
$ wget https://github.com/apache/spark/archive/branch-1.4.zip
Unpack the archive and rename the extracted directory (the GitHub archive unpacks to spark-branch-1.4):
$ unzip branch-1.4.zip
$ mv spark-branch-1.4 spark
Move to the spark directory:
$ cd spark
Compile the sources with the following flags: YARN enabled, Hadoop version 2.4, Hive enabled, and tests skipped for faster compilation:
$ mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -DskipTests clean package
Move back to the parent directory, then move the conf folder to /etc/spark so that it can later be made a symbolic link:
$ cd ..
$ sudo mv spark/conf /etc/spark
Move the spark directory to /opt as it's an add-on software package:
$ sudo mkdir -p /opt/infoobjects
$ sudo mv spark /opt/infoobjects/spark
Change the ownership of the spark home directory to root:
$ sudo chown -R root:root /opt/infoobjects/spark
Change the permissions of the spark home directory (0755 = user: rwx, group: r-x, world: r-x):
$ sudo chmod -R 755 /opt/infoobjects/spark
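What mode 755 grants can be verified on a scratch directory (the path below is illustrative, not part of the recipe):

```shell
# Create a scratch directory, apply the same mode, and read it back as
# an octal value with GNU stat.
mkdir -p /tmp/perm_demo
chmod 755 /tmp/perm_demo
stat -c '%a' /tmp/perm_demo
```

The read/execute bits for group and world let any local user browse and run Spark, while only root can modify the installation.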
Move to the spark home directory:
$ cd /opt/infoobjects/spark
Create a symbolic link:
$ sudo ln -s /etc/spark conf
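The symlink mechanics can be sketched with scratch paths (the real recipe links /opt/infoobjects/spark/conf to /etc/spark):

```shell
# Build a stand-in layout: a config directory elsewhere, plus a "conf"
# symlink pointing at it, then read the link target back.
mkdir -p /tmp/sym_demo/etc_spark
cd /tmp/sym_demo
rm -f conf
ln -s /tmp/sym_demo/etc_spark conf
readlink conf
```

Keeping the real configuration under /etc and only a link inside the install directory means the install tree can be replaced on upgrade without losing the configuration.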
Put the Spark executables in the path by editing .bashrc:
$ echo "export PATH=\$PATH:/opt/infoobjects/spark/bin" >> /home/hduser/.bashrc
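A subtlety with that echo: if $PATH is not escaped, the shell expands it when the command runs, freezing the current value of PATH into .bashrc; escaping it defers expansion until the file is sourced. A sketch with a scratch file:

```shell
# With the backslash, the literal string $PATH lands in the file and is
# only expanded each time the file is sourced.
echo "export PATH=\$PATH:/opt/infoobjects/spark/bin" > /tmp/path_demo
cat /tmp/path_demo
```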
Create the log directory in /var:
$ sudo mkdir -p /var/log/spark
Make hduser the owner of the Spark log directory:
$ sudo chown -R hduser:hduser /var/log/spark
Create the Spark tmp directory:
$ mkdir /tmp/spark
Configure Spark with the help of the following command lines:
$ cd /etc/spark
$ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
$ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
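Spark's launch scripts source spark-env.sh on startup, so the exports above become environment variables for the daemons. The effect can be checked by sourcing a demo copy (written to /tmp so /etc/spark is untouched):

```shell
# Write two of the exports to a scratch spark-env.sh, source it, and
# confirm the variables are set.
demo=/tmp/spark-env-demo.sh
echo "export SPARK_LOG_DIR=/var/log/spark" > "$demo"
echo "export SPARK_WORKER_DIR=/tmp/spark" >> "$demo"
. "$demo"
echo "$SPARK_LOG_DIR $SPARK_WORKER_DIR"
```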