In this recipe, we cover the IndexedRowMatrix, which is the first distributed matrix that we cover in this chapter. The primary advantage of IndexedRowMatrix is that an index is carried along with each row of the underlying RDD, which holds the data itself.
In the case of IndexedRowMatrix, the developer supplies an index that is permanently paired with a given row, which is very useful for random access. The index not only helps with random access but also identifies the row itself when performing join() operations.
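To make the pairing of index and row concrete, here is a minimal sketch, assuming Spark 2.x running in local mode. The object name IndexedRowMatrixSketch, the helper buildMatrix, the index values, and the labels RDD are all hypothetical and chosen only for illustration; they are not part of the recipe itself.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.SparkSession

// Hypothetical illustration; names and values are assumptions, not the recipe's code.
object IndexedRowMatrixSketch {

  // Build a small matrix whose rows carry developer-chosen Long indices.
  def buildMatrix(sc: SparkContext): IndexedRowMatrix = {
    val rows = sc.parallelize(Seq(
      IndexedRow(0L, Vectors.dense(1.0, 2.0, 3.0)),
      IndexedRow(7L, Vectors.dense(4.0, 5.0, 6.0)),
      IndexedRow(42L, Vectors.dense(7.0, 8.0, 9.0))
    ))
    new IndexedRowMatrix(rows)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("IndexedRowMatrixSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val matrix = buildMatrix(sc)

    // Key each row by its index so it can be looked up or joined.
    val byIndex = matrix.rows.map(r => (r.index, r.vector))

    // Random access: fetch the row(s) stored under index 7.
    byIndex.lookup(7L).foreach(println)

    // Join: the index identifies which row each label belongs to.
    val labels = sc.parallelize(Seq((7L, "seven"), (42L, "forty-two")))
    byIndex.join(labels).collect().foreach(println)

    spark.stop()
  }
}
```

The indices here are deliberately non-contiguous to show that the index is whatever the developer assigns, not a position generated by Spark.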
- Start a new project in IntelliJ or in an IDE of your choice. Make sure that the necessary JAR files are included.
- Import the necessary packages for vector and matrix manipulation:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry...