In this recipe, we explore the RowMatrix facility provided by Spark. RowMatrix, as the name implies, is a row-oriented matrix; the catch is that it has no row index that can be defined and carried through its computational life cycle. The rows are held in RDDs, which provide distributed computing and resiliency with fault tolerance.
The matrix is made of rows of local vectors that are parallelized and distributed via an RDD. In short, each row is a local vector stored in an RDD, so the total number of columns is limited by the maximum size of a local vector. This is not an issue in most cases, but we mention it for completeness.
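The structure described above can be sketched as follows. This is a minimal illustration rather than the recipe's own listing: the object name RowMatrixSketch, the helper dimensions, the local[*] master setting, and the sample values are all assumptions made for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Hypothetical example object; the name is not taken from the recipe.
object RowMatrixSketch {

  // Builds a small RowMatrix and returns (number of rows, number of columns).
  def dimensions(sc: SparkContext): (Long, Long) = {
    // Each row is a local vector; the row has no index of its own.
    val rows: Seq[Vector] = Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 9.0),
      Vectors.dense(10.0, 11.0, 12.0)
    )

    // Distribute the local vectors as an RDD[Vector].
    val rowsRdd = sc.parallelize(rows)

    // Wrap the RDD in a RowMatrix; dimensions are computed from the data.
    val matrix = new RowMatrix(rowsRdd)
    (matrix.numRows(), matrix.numCols())
  }

  def main(args: Array[String]): Unit = {
    // Assumed local configuration for experimentation only.
    val conf = new SparkConf().setMaster("local[*]").setAppName("RowMatrixSketch")
    val sc = new SparkContext(conf)
    try {
      val (numRows, numCols) = dimensions(sc)
      println(s"rows = $numRows, cols = $numCols")
    } finally {
      sc.stop()
    }
  }
}
```

With the four three-element vectors used here, the matrix reports 4 rows and 3 columns. The column count comes from the length of the local vectors, which is why the width of a RowMatrix is bounded by what a single local vector can hold.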
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix...