A lot of important data lies in relational databases that Spark applications need to query. JdbcRDD is a Spark feature that allows relational tables to be loaded as RDDs. This recipe explains how to use JdbcRDD.
Spark SQL, which is introduced in the next chapter, includes a data source for JDBC. It should be preferred over the approach in this recipe because it returns results as DataFrames (also introduced in the next chapter), which can be easily processed by Spark SQL and joined with other data sources.
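For orientation, the Spark SQL alternative mentioned above looks roughly like the following sketch. The connection URL, database name, and credentials are placeholder assumptions, not values from this recipe:

```scala
// Sketch: loading a MySQL table as a DataFrame via the Spark SQL JDBC data source.
// Assumes a SQLContext named sqlContext and a reachable MySQL instance;
// the URL, database, user, and password below are hypothetical placeholders.
val personDF = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")  // placeholder URL
  .option("dbtable", "person")
  .option("user", "dbuser")          // placeholder credentials
  .option("password", "dbpassword")
  .load()

personDF.show()
```

Because the result is a DataFrame, it can be registered as a temporary table and joined against data loaded from any other source.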
Please make sure that the JDBC driver JAR is visible on the client node and on all the worker nodes on which the executors will run.
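One common way to make the driver JAR visible everywhere is to pass it at submission time. This is a sketch; the JAR path and version shown are assumptions, so substitute the actual location of your MySQL connector:

```shell
# Ship the JDBC driver to the driver and all executors at submit time.
# The path and connector version here are placeholders.
spark-submit --jars /path/to/mysql-connector-java-5.1.34.jar my-app.jar
```

Alternatively, the JAR can be placed on the classpath of every node, but the `--jars` flag avoids having to copy it to each worker by hand.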
Perform the following steps to load data from relational databases:
Create a table named person in MySQL using the following DDL (note that MySQL identifiers are quoted with backticks, not single quotes):

CREATE TABLE `person` (
  `person_id` int(11) NOT NULL AUTO_INCREMENT,
  `first_name` varchar(30) DEFAULT NULL,
  `last_name` varchar(30) DEFAULT NULL,
  `gender` char(1) DEFAULT NULL,
  PRIMARY...
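Once the table exists and is populated, it can be loaded as an RDD with JdbcRDD. The sketch below shows the general shape of the call; the connection URL, credentials, and ID bounds are placeholder assumptions. Note that the query must contain exactly two `?` placeholders, which Spark binds to the lower and upper bound of each partition:

```scala
import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

// Sketch: loading the person table as an RDD.
// Assumes a SparkContext named sc; the URL, credentials,
// and bounds (1 to 1000, 3 partitions) are hypothetical.
val personRDD = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:mysql://localhost:3306/mydb", "dbuser", "dbpassword"),
  "SELECT first_name, last_name FROM person WHERE person_id >= ? AND person_id <= ?",
  1,     // lower bound of person_id
  1000,  // upper bound of person_id
  3,     // number of partitions
  (rs: ResultSet) => (rs.getString("first_name"), rs.getString("last_name"))
)

personRDD.collect().foreach(println)
```

The bounds and partition count control how Spark splits the query: each partition issues the SQL with its own sub-range bound to the two `?` placeholders, so rows are fetched in parallel.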