Sangala Shekhar Reddy: Apache Spark MLlib

Monday, July 10, 2017

Apache Spark MLlib - Sparse Vector

What is an Apache Spark Sparse Vector

A vector is a one-dimensional array of elements, and in a programming language it is usually implemented as one. A vector is said to be sparse when many of its elements are zero. Storing all of those zeros explicitly in an array wastes memory.

A better representation of a sparse vector stores only the positions of the non-zero elements together with their values.

Example: 3 1.2 2800 6.3 6000 10.0 50000 5.7

This means that position:

3 holds the value 1.2
2800 holds the value 6.3
6000 holds the value 10.0
50000 holds the value 5.7
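The position and value pairs above can be sketched in plain Java as two parallel arrays. This is only an illustration of the idea and does not use Spark; the class name SparseLookupSketch and the method valueAt are hypothetical names chosen for this example.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Spark.
final class SparseLookupSketch {
    // Positions that hold a non-zero value, kept in ascending order.
    static final int[] INDICES = {3, 2800, 6000, 50000};
    // VALUES[k] is the value stored at position INDICES[k].
    static final double[] VALUES = {1.2, 6.3, 10.0, 5.7};

    // Returns the value at the given position, or 0.0 if nothing is stored there.
    static double valueAt(int position) {
        // binarySearch returns a negative number when the position is absent.
        int k = Arrays.binarySearch(INDICES, position);
        return k >= 0 ? VALUES[k] : 0.0;
    }

    public static void main(String[] args) {
        System.out.println(valueAt(2800)); // 6.3
        System.out.println(valueAt(4));    // 0.0
    }
}
```

Only eight numbers are stored, yet any position can still be queried; every position that is not listed is treated as zero.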

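When you use a sparse vector in a programming language you also need to specify its size, which is the total number of elements including the zeros, not the number of stored values. The example above stores 4 values, but because its highest position is 50000, the size must be at least 50001 when positions are counted from zero, as they are in Spark.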

Representation of Sparse Vector in Spark

The Vectors class in org.apache.spark.mllib.linalg provides several factory methods for creating dense and sparse Vector instances.

The simplest way to create a sparse vector is:

sparse(int size, int[] indices, double[] values)

This method creates a sparse vector. The first argument is the size of the vector, the second is the array of indices at which a value exists, and the third is the array of values at those indices. The indices are zero-based and must be in strictly increasing order.

The remaining elements of the vector are zero.

Example:

Suppose we want to create the vector {0.0, 5.0, 0.0, 3.0, 4.0}. Using Spark's sparse vector API, it can be created as follows:

Vector sparseVector = Vectors.sparse(5, new int[] {1, 3, 4}, new double[] {5.0, 3.0, 4.0});

The same vector can also be created using the dense vector API:

Vector denseVector = Vectors.dense(0.0, 5.0, 0.0, 3.0, 4.0);
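To see why the sparse and dense calls describe the same vector, here is a minimal plain-Java sketch that expands a (size, indices, values) triple into a full array. It does not call Spark, and it is not how Spark implements its vectors internally; the class SparseToDenseSketch and the method toDense are hypothetical names used only for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration class, not part of Spark.
final class SparseToDenseSketch {
    // Expands (size, indices, values) into a full array of length `size`.
    static double[] toDense(int size, int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have the same length");
        }
        // Java initialises every element of a new double[] to 0.0.
        double[] dense = new double[size];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        return dense;
    }

    public static void main(String[] args) {
        double[] dense = toDense(5, new int[] {1, 3, 4}, new double[] {5.0, 3.0, 4.0});
        System.out.println(Arrays.toString(dense)); // [0.0, 5.0, 0.0, 3.0, 4.0]
    }
}
```

The expanded array matches the values passed to the dense call, which shows that the two forms differ only in how the zeros are stored.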
