Spark GraphX - Distributed Graph Computations

Spark GraphX is a graph processing framework built on top of Spark.

GraphX models graphs as property graphs where vertices and edges can have properties.

Caution

FIXME Diagram of a graph with friends.

GraphX comes with its own package org.apache.spark.graphx.

Tip	Import `org.apache.spark.graphx` package to work with GraphX. `import org.apache.spark.graphx._`

Graph

Graph abstract class represents a collection of vertices and edges.

abstract class Graph[VD: ClassTag, ED: ClassTag]

vertices attribute is of type VertexRDD while edges is of type EdgeRDD.

Graph can also be described by triplets (that is of type RDD[EdgeTriplet[VD, ED]]).

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq(
    (0L, "Jacek"),
    (1L, "Agata"),
    (2L, "Julian")))

val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(
    Edge(0L, 1L, "wife"),
    Edge(1L, 2L, "owner")
  ))

scala> val graph = Graph(vertices, edges)
graph: org.apache.spark.graphx.Graph[String,String] = org.apache.spark.graphx.impl.GraphImpl@5973e4ec

package object graphx

package object graphx defines two type aliases:

VertexId (Long) that represents a unique 64-bit vertex identifier.
PartitionID (Int) that is an identifier of a graph partition.

Standard GraphX API

Graph class comes with a small set of API.

Transformations
- mapVertices
- mapEdges
- mapTriplets
- reverse
- subgraph
- mask
- groupEdges
Joins
- outerJoinVertices
Computation
- aggregateMessages

Creating Graphs (Graph object)

Graph object comes with the following factory methods to create instances of Graph:

fromEdgeTuples
fromEdges
apply

Note	The default implementation of `Graph` is GraphImpl.

GraphOps - Graph Operations

GraphImpl

GraphImpl is the default implementation of Graph abstract class.

It lives in org.apache.spark.graphx.impl package.

OLD - perhaps soon to be removed

Apache Spark comes with a library for executing distributed computation on graph data, GraphX.

Apache Spark graph analytics
GraphX is a pure programming API
- missing a graphical UI to visually explore datasets
- Could TitanDB be a solution?

From the article Merging datasets using graph analytics:

Such a situation, in which we need to find the best matching in a weighted bipartite graph, poses what is known as the stable marriage problem. It is a classical problem that has a well-known solution, the Gale–Shapley algorithm.

A popular model of distributed computation on graphs known as Pregel was published by Google researchers in 2010. Pregel is based on passing messages along the graph edges in a series of iterations. Accordingly, it is a good fit for the Gale–Shapley algorithm, which starts with each “gentleman” (a vertex on one side of the bipartite graph) sending a marriage proposal to its most preferred single “lady” (a vertex on the other side of the bipartite graph). The “ladies” then marry their most preferred suitors, after which the process is repeated until there are no more proposals to be made.

The Apache Spark distributed computation engine includes GraphX, a library specifically made for executing distributed computation on graph data. GraphX provides an elegant Pregel interface but also permits more general computation that is not restricted to the message-passing pattern.

Spark GraphX — Distributed Graph Computations