Installing Spark from binaries works fine in most cases. For advanced cases, such as the following (though not limited to these), compiling from the source code is a better option:
- Compiling for a specific Hadoop version
- Adding the Hive integration
- Adding the YARN integration
The following are the prerequisites for this recipe to work:
- Java 1.8 or a later version
- Maven 3.x
The following are the steps to build the Spark source code with Maven:
- Increase MaxPermSize of the heap:
$ echo 'export _JAVA_OPTIONS="-XX:MaxPermSize=1G"' >> /home/hduser/.bashrc
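A quoting note: if the outer string were double-quoted, the inner double quotes around -XX:MaxPermSize=1G would terminate it and be lost; single-quoting the outer string preserves them. A minimal sketch against a throwaway file rather than the real .bashrc:

```shell
# The single-quoted outer string keeps the inner double quotes around
# -XX:MaxPermSize=1G intact. A temporary file stands in for
# /home/hduser/.bashrc.
tmprc=$(mktemp)
echo 'export _JAVA_OPTIONS="-XX:MaxPermSize=1G"' >> "$tmprc"
cat "$tmprc"   # prints: export _JAVA_OPTIONS="-XX:MaxPermSize=1G"
rm -f "$tmprc"
```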
- Open a new terminal window and download the Spark source code from GitHub:
$ wget https://github.com/apache/spark/archive/branch-2.1.zip
- Unpack the archive:
$ unzip branch-2.1.zip
- Rename the unzipped folder to spark:
$ mv spark-branch-2.1 spark
- Move to the spark directory:
$ cd spark
- Compile the sources with YARN and Hive enabled, against Hadoop version 2.7, skipping the tests for faster compilation:
$ mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
- Move the conf folder to /etc/spark so that it can be turned into a symbolic link:
$ sudo mv spark/conf /etc/spark
- Move the spark directory to /opt as it's an add-on software package:
$ sudo mv spark /opt/infoobjects/spark
- Change the ownership of the spark home directory to root:
$ sudo chown -R root:root /opt/infoobjects/spark
- Change the permissions of the spark home directory, namely 0755 = user:rwx group:r-x world:r-x:
$ sudo chmod -R 755 /opt/infoobjects/spark
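The octal mode maps to the rwx bits exactly as the step describes. A quick sketch on a throwaway directory (no sudo needed) confirms the mapping; stat -c is the GNU coreutils form:

```shell
# 755 -> owner rwx, group r-x, world r-x. Checked on a temporary directory
# rather than /opt/infoobjects/spark.
d=$(mktemp -d)
chmod 755 "$d"
stat -c '%a %A' "$d"   # prints: 755 drwxr-xr-x
rm -rf "$d"
```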
- Move to the spark home directory:
$ cd /opt/infoobjects/spark
- Create a symbolic link:
$ sudo ln -s /etc/spark conf
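The link makes /opt/infoobjects/spark/conf resolve to /etc/spark. The same arrangement can be sketched under a temporary directory, since the real paths need root:

```shell
# Mirror the conf -> /etc/spark arrangement under mktemp instead of /.
# etc-spark and spark-home are stand-in names for /etc/spark and the
# spark home directory.
base=$(mktemp -d)
mkdir "$base/etc-spark" "$base/spark-home"
ln -s "$base/etc-spark" "$base/spark-home/conf"
readlink "$base/spark-home/conf"   # prints the path of the etc-spark stand-in
rm -rf "$base"
```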
- Put the Spark executables in the path by editing .bashrc:
$ echo 'export PATH=$PATH:/opt/infoobjects/spark/bin' >> /home/hduser/.bashrc
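One subtlety here: inside double quotes, $PATH expands at the moment echo runs, freezing the current value into .bashrc; single quotes write the literal $PATH so it is re-expanded on every shell startup. A sketch against a throwaway file:

```shell
# Single quotes keep $PATH literal in the written line; double quotes would
# have baked in the PATH of the shell that ran echo. A temporary file stands
# in for /home/hduser/.bashrc.
tmprc=$(mktemp)
echo 'export PATH=$PATH:/opt/infoobjects/spark/bin' >> "$tmprc"
cat "$tmprc"   # prints: export PATH=$PATH:/opt/infoobjects/spark/bin
rm -f "$tmprc"
```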
- Create the log directory in /var:
$ sudo mkdir -p /var/log/spark
- Make hduser the owner of Spark's log directory:
$ sudo chown -R hduser:hduser /var/log/spark
- Create Spark's tmp directory:
$ mkdir /tmp/spark
- Configure Spark with the help of the following command lines:
$ cd /etc/spark
$ echo "export HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop" >> spark-env.sh
$ echo "export SPARK_LOG_DIR=/var/log/spark" >> spark-env.sh
$ echo "export SPARK_WORKER_DIR=/tmp/spark" >> spark-env.sh
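To confirm all four settings landed, the exports can be written and counted against a temporary copy of spark-env.sh; on a real install, point the directory variable at /etc/spark instead:

```shell
# Write the four exports to a throwaway spark-env.sh and verify the count.
# conf_dir is a temporary stand-in for /etc/spark.
conf_dir=$(mktemp -d)
for v in HADOOP_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop \
         YARN_CONF_DIR=/opt/infoobjects/hadoop/etc/hadoop \
         SPARK_LOG_DIR=/var/log/spark \
         SPARK_WORKER_DIR=/tmp/spark; do
  echo "export $v" >> "$conf_dir/spark-env.sh"
done
grep -c '^export ' "$conf_dir/spark-env.sh"   # prints: 4
rm -rf "$conf_dir"
```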