Apache Oozie Essentials

By: Jagat Singh

Overview of this book

As more organizations discover the use of big data analytics, interest in platforms that provide storage, computation, and analytic capabilities is growing rapidly. This calls for data management, and Hadoop caters to this need. Oozie fills the need for a Hadoop job scheduler, acting much like cron for Hadoop workloads. Apache Oozie Essentials starts with the basics, from installing and configuring Oozie from source code on your Hadoop cluster to managing complex clusters. You will learn how to create data ingestion and machine learning workflows. You will discover how to write workflows to run your MapReduce, Pig, Hive, and Sqoop scripts and schedule them to run at a specific time or for a specific business requirement using a coordinator. The book includes real-life examples and exercises to help you take your big data learning to the next level. Lastly, you'll get to grips with embedding Spark jobs, which can be used to run your machine learning models on Hadoop. By the end of the book, you will have a good knowledge of Apache Oozie. You will be capable of using Oozie to handle large Hadoop workflows and even improve the availability of your Hadoop environment.

Configuring Oozie in Hortonworks distribution


In this section, we will learn how to configure Oozie in the Hortonworks Hadoop distribution using Ambari. We will configure the Oozie server to use a MySQL database instead of the default Derby database to store all job information.

We will use a virtual machine to learn how to configure Oozie in the Hortonworks Hadoop distribution. Most other distributions, such as Cloudera and Pivotal, have similar steps.

Let's start with the following steps:

  1. If you don't have VirtualBox on your machine, then download and install VirtualBox from https://www.virtualbox.org/wiki/Downloads.

  2. Download the Hortonworks single node virtual machine from http://hortonworks.com/hdp/downloads/. It will take 1-2 hours depending upon your Internet connection speed.

    Tip

    It is always good to store virtual machine images in a common folder. For example, I keep mine in a folder such as ~/dev/vm/ on my machine. This makes virtual machine image management easier.

  3. After the download is complete, open VirtualBox and click on File | Import Appliance:

    Import appliance

  4. Click on the Import Appliance button, browse to the place where you downloaded the virtual machine image, and then click on Continue.

  5. Wait until VirtualBox has imported the new machine.

  6. Once you can see the machine is imported, click on Start machine in the virtual machine console.

  7. Once the machine has finished booting, you can log in to the Ambari dashboard by opening the URL http://127.0.0.1:8080 in your browser.

  8. Use admin as both the username and the password.

    It will take some time for all services to start up and report their status to Ambari. Once all the services have reported their status, have a glance at the Ambari console. It is also a good idea to stop the services that we are not using to reduce the load on the system.

  9. In the Ambari dashboard, click on the link named Oozie on the left side. You can see that Oozie has two components, Oozie Server and Oozie Client. Since we are using a single node cluster, both the server and the client are installed on the same machine. In a production environment, you will configure the Oozie server and clients separately on different machines. We use the client to submit jobs to the server. Before submitting a job, we tell the client where the server is located using the OOZIE_URL variable.

    Tip

    To avoid specifying the Oozie server manually on the client machine every time, you can set the environment variable OOZIE_URL in your bash_profile or environment file, depending on the operating system you use. Add the line export OOZIE_URL=http://oozieserver:11000/oozie, replacing oozieserver with the hostname of your Oozie server; in this book, oozieserver will be localhost.
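As a minimal sketch of the tip above, the following shell snippet appends the export line to a profile file only when it is not already present. The helper name persist_oozie_url is hypothetical (it is not part of Oozie), and the demo writes to a scratch file rather than your real ~/.bash_profile.

```shell
#!/bin/sh
# persist_oozie_url is a hypothetical helper name used for illustration only.
# It appends "export OOZIE_URL=<url>" to a profile file unless the exact
# line is already there, so running it twice does not create duplicates.
persist_oozie_url() {
  url="$1"
  profile="${2:-$HOME/.bash_profile}"
  line="export OOZIE_URL=${url}"
  touch "$profile"
  # -x matches the whole line, -F treats the pattern as a fixed string
  if ! grep -qxF "$line" "$profile"; then
    printf '%s\n' "$line" >> "$profile"
  fi
}

# Demo against a scratch file so the real profile is left untouched.
demo_profile="$(mktemp)"
persist_oozie_url "http://localhost:11000/oozie" "$demo_profile"
persist_oozie_url "http://localhost:11000/oozie" "$demo_profile"  # adds nothing
cat "$demo_profile"
```

Once the variable is exported in your session, client commands such as oozie admin -status can be run without passing the -oozie option each time.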

  10. Now click on the Config link at the top and we will configure the database as MySQL. The Oozie server will use MySQL to store the job information:

    Ambari Oozie configuration

  11. You may notice that, at this moment, the server is configured to use a Derby database. Derby is good for experimenting and testing, but not for running a production server. We will configure it to use a MySQL database.

  12. Log in to the virtual machine using SSH as follows:

    $ ssh root@127.0.0.1 -p 2222
    

    The default password is hadoop.

  13. After you log in to the SSH session, log in to MySQL:

    $ mysql -u root
    
  14. Since this is a test virtual machine, no password is configured for the MySQL root user. In production, the account will be password protected.

  15. At the MySQL prompt, execute the following SQL statements:

    CREATE USER 'oozie'@'%' IDENTIFIED BY 'hadoop';
    CREATE DATABASE oozie;
    GRANT ALL PRIVILEGES ON oozie.* TO 'oozie'@'%' WITH GRANT OPTION;
    

    The following output will be generated:

    Oozie database creation
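If you prefer to script step 15 rather than type it at the prompt, one possible sketch is shown below. The function name oozie_db_sql and its three arguments are assumptions made for illustration; the function only prints the same statements used above, so you can review them before piping them into the mysql client.

```shell
#!/bin/sh
# oozie_db_sql is a hypothetical helper: it prints the SQL from step 15
# with the database name, user name, and password filled in.
oozie_db_sql() {
  db="$1"
  user="$2"
  pass="$3"
  cat <<EOF
CREATE USER '${user}'@'%' IDENTIFIED BY '${pass}';
CREATE DATABASE ${db};
GRANT ALL PRIVILEGES ON ${db}.* TO '${user}'@'%' WITH GRANT OPTION;
EOF
}

# Print the statements for the values used in this section.
oozie_db_sql oozie oozie hadoop

# To apply them on the test virtual machine (root has no password there):
#   oozie_db_sql oozie oozie hadoop | mysql -u root
```

Printing first and applying second keeps the destructive part explicit, which is useful when the same script is later pointed at a non-test database.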

  16. To make Oozie work with MySQL, we need the JDBC driver for it. Let's download the MySQL JDBC driver from the MySQL JDBC jar download section and extract the archive to a folder such as /root/mysql inside the virtual machine:

    $ cd ~/
    $ mkdir mysql
    $ cd mysql
    $ # Download the MySQL JDBC Driver
    $ wget http://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.36.tar.gz
    $ # Extract tar
    $ tar -xvf mysql-connector-java-5.1.36.tar.gz
    $ # Tell Ambari that we got new MYSQL JDBC driver which it can use
    $ ambari-server setup --jdbc-db=mysql --jdbc-driver=/root/mysql/mysql-connector-java-5.1.36/mysql-connector-java-5.1.36-bin.jar
    
  17. In the Ambari dashboard, configure the MySQL database with the following details:

    Field name           Value
    Database Name        oozie
    Database Username    oozie
    Database Password    hadoop
    JDBC Driver Class    com.mysql.jdbc.Driver
    JDBC Database URL    jdbc:mysql://localhost:3306/${oozie.db.schema.name}?createDatabaseIfNotExist=true
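The JDBC Database URL contains the placeholder ${oozie.db.schema.name}, which stands for the database name entered above. The following sketch only illustrates that substitution in the shell; the variable names are assumptions, and the actual substitution is performed by Oozie and Ambari, not by this script.

```shell
#!/bin/sh
# Values taken from the settings in step 17. Single quotes keep the shell
# from trying to expand the ${...} placeholder itself.
db_name="oozie"
url_template='jdbc:mysql://localhost:3306/${oozie.db.schema.name}?createDatabaseIfNotExist=true'

# Replace the literal placeholder with the database name using sed.
resolved_url="$(printf '%s\n' "$url_template" \
  | sed 's|\${oozie\.db\.schema\.name}|'"$db_name"'|')"

echo "$resolved_url"
```

With the values from this section, the resolved URL points at the oozie database on the local MySQL server, which is the database created in step 15.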

  18. In the Ambari dashboard page, click on Test Connection. If the connection succeeds, you should see a green tick. We have now configured the Oozie server to use a MySQL database instead of Derby.

  19. Finally, to confirm that Oozie works properly, in another browser tab open the Oozie dashboard by entering the URL http://127.0.0.1:11000/oozie.
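The same check can be sketched from the command line using the admin status endpoint of the Oozie web services API, which returns a small JSON document such as {"systemMode":"NORMAL"}. The helper names parse_system_mode and check_oozie are hypothetical, and the demo below parses a sample response so that it runs even when no Oozie server is reachable.

```shell
#!/bin/sh
# parse_system_mode is a hypothetical helper: it extracts the value of
# "systemMode" from a JSON string such as {"systemMode":"NORMAL"}.
parse_system_mode() {
  printf '%s\n' "$1" \
    | sed -n 's/.*"systemMode"[[:space:]]*:[[:space:]]*"\([A-Z]*\)".*/\1/p'
}

# check_oozie is a hypothetical helper: it queries a running server
# (requires curl) and prints the reported system mode.
check_oozie() {
  base_url="${1:-http://127.0.0.1:11000/oozie}"
  body="$(curl -s --max-time 5 "$base_url/v1/admin/status")" || return 1
  parse_system_mode "$body"
}

# Offline demo with a sample response.
sample='{"systemMode":"NORMAL"}'
parse_system_mode "$sample"
```

On the virtual machine, calling check_oozie with no arguments queries the same address as the dashboard URL in step 19.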

This completes the first section, in which we learned how to configure Oozie in the Hortonworks distribution using Ambari.