Many modern business problems involve connections and relationships between entities, and are not solely based on discrete data. Graphs are powerful at representing complex interconnections, and graph data modeling is very effective and flexible when the number and depth of relationships increase exponentially.
The use cases for graph analytics are diverse: social networks, transportation routes, autonomous vehicles, cyber security, criminal networks, fraud detection, health research, epidemiology, and so forth.
This chapter contains the following information:
What is a Graph?
Graphs represent the interconnections between objects (vertices) and their relationships (edges). Example objects could be people, locations, cities, computers, or components on a circuit board. Example connections could be roads, circuits, cables, or interpersonal relationships. Edges can have directions and weights, for example the distance between towns.
Graphs can be small and easily traversed - as with a small group of friends - or extremely large and complex, similar to contacts in a modern-day social network.
Graph Analytics on Greenplum
Efficient processing of very large graphs can be challenging. Greenplum offers a suitable environment for this work for these key reasons:
- Using MADlib graph functions in Greenplum brings the graph computation close to where the data lives. Otherwise, large data sets need to be moved to a specialized graph database, requiring additional time and resources.
Specialized graph databases frequently use purpose-built languages. With Greenplum, you can invoke graph functions using the familiar SQL interface. For example, for the PageRank graph algorithm:
SELECT madlib.pagerank('vertex', -- Vertex table
'id', -- Vertex id column
'edge', -- Edge table
'src=src, dest=dest', -- Comma delimited string of edge arguments
'pagerank_out', -- Output table of PageRank
0.5); -- Damping factor
SELECT * FROM pagerank_out ORDER BY pagerank DESC;
A lot of data science problems are solved using a combination of models, with graphs being just one. Regression, clustering, and other methods available in Greenplum, make for a powerful combination.
Greenplum offers great benefits of scale, taking advantage of years of query execution and optimization research focused on large data sets.
Using Graph
Installing Graph Modules
To use the MADlib graph modules, install the version of MADlib corresponding to your Greenplum Database version. To download the software, access the VMware Tanzu Network. For Greenplum 6.x, see Installing MADlib.
Graph modules on MADlib support many algorithms.
Creating a Graph in Greenplum
To represent a graph in Greenplum, create tables that represent the vertices, edges, and their properties.
Using SQL, create the relevant tables in the database you want to use. This example uses testdb
:
gpadmin@mdw ~]$ psql
dev=# \c testdb
Create a table for vertices, called vertex
, and a table for edges and their weights, called edge
:
testdb=# DROP TABLE IF EXISTS vertex, edge;
testdb=# CREATE TABLE vertex(id INTEGER);
testdb=# CREATE TABLE edge(
src INTEGER,
dest INTEGER,
weight FLOAT8
);
Insert values related to your specific use case. For example :
testdb#=> INSERT INTO vertex VALUES
(0),
(1),
(2),
(3),
(4),
(5),
(6),
(7);
testdb#=> INSERT INTO edge VALUES
(0, 1, 1.0),
(0, 2, 1.0),
(0, 4, 10.0),
(1, 2, 2.0),
(1, 3, 10.0),
(2, 3, 1.0),
(2, 5, 1.0),
(2, 6, 3.0),
(3, 0, 1.0),
(4, 0, -2.0),
(5, 6, 1.0),
(6, 7, 1.0);
Now select the Graph Module that suits your analysis.
Graph Modules
This section lists the graph functions supported in MADlib. They include: All Pairs Shortest Path (APSP), Breadth-First Search, Hyperlink-Induced Topic Search (HITS), PageRank and Personalized PageRank, Single Source Shortest Path (SSSP), Weakly Connected Components, and Measures. Explore each algorithm using the example edge
and vertex
tables already created.
All Pairs Shortest Path (APSP)
The all pairs shortest paths (APSP) algorithm finds the length (summed weights) of the shortest paths between all pairs of vertices, such that the sum of the weights of the path edges is minimized.
The function is:
graph_apsp( vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
grouping_cols
)
For details on the parameters, with examples, see the All Pairs Shortest Path in the Apache MADlib documentation.
Breadth-First Search
Given a graph and a source vertex, the breadth-first search (BFS) algorithm finds all nodes reachable from the source vertex by searching / traversing the graph in a breadth-first manner.
The function is:
graph_bfs( vertex_table,
vertex_id,
edge_table,
edge_args,
source_vertex,
out_table,
max_distance,
directed,
grouping_cols
)
For details on the parameters, with examples, see the Breadth-First Search in the Apache MADlib documentation.
Hyperlink-Induced Topic Search (HITS)
The all pairs shortest paths (APSP) algorithm finds the length (summed weights) of the shortest paths between all pairs of vertices, such that the sum of the weights of the path edges is minimized.
The function is:
graph_apsp( vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
grouping_cols
)
For details on the parameters, with examples, see the Hyperlink-Induced Topic Search in the Apache MADlib documentation.
PageRank and Personalized PageRank
Given a graph, the PageRank algorithm outputs a probability distribution representing a person’s likelihood to arrive at any particular vertex while randomly traversing the graph.
MADlib graph also includes a personalized PageRank, where a notion of importance provides personalization to a query. For example, importance scores can be biased according to a specified set of graph vertices that are of interest or special in some way.
The function is:
pagerank( vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
damping_factor,
max_iter,
threshold,
grouping_cols,
personalization_vertices
)
For details on the parameters, with examples, see the PageRank in the Apache MADlib documentation.
Single Source Shortest Path (SSSP)
Given a graph and a source vertex, the single source shortest path (SSSP) algorithm finds a path from the source vertex to every other vertex in the graph, such that the sum of the weights of the path edges is minimized.
The function is:
graph_sssp ( vertex_table,
vertex_id,
edge_table,
edge_args,
source_vertex,
out_table,
grouping_cols
)
For details on the parameters, with examples, see the Single Source Shortest Path in the Apache MADlib documentation.
Weakly Connected Components
Given a directed graph, a weakly connected component (WCC) is a subgraph of the original graph where all vertices are connected to each other by some path, ignoring the direction of edges.
The function is:
weakly_connected_components(
vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
grouping_cols
)
For details on the parameters, with examples, see the Weakly Connected Components in the Apache MADlib documentation.
Measures
These algorithms relate to metrics computed on a graph and include: Average Path Length, Closeness Centrality , Graph Diameter, and In-Out Degree.
Average Path Length
This function computes the shortest path average between pairs of vertices. Average path length is based on “reachable target vertices”, so it averages the path lengths in each connected component and ignores infinite-length paths between unconnected vertices. If the user requires the average path length of a particular component, the weakly connected components function may be used to isolate the relevant vertices.
The function is:
graph_avg_path_length( apsp_table,
output_table
)
This function uses a previously run APSP (All Pairs Shortest Path) output. For details on the parameters, with examples, see the Average Path Length in the Apache MADlib documentation.
Closeness Centrality
The closeness centrality algorithm helps quantify how much information passes through a given vertex. The function returns various closeness centrality measures and the k-degree for a given subset of vertices.
The function is:
graph_closeness( apsp_table,
output_table,
vertex_filter_expr
)
This function uses a previously run APSP (All Pairs Shortest Path) output. For details on the parameters, with examples, see the Closeness in the Apache MADlib documentation.
Graph Diameter
Graph diameter is defined as the longest of all shortest paths in a graph. The function is:
graph_diameter( apsp_table,
output_table
)
This function uses a previously run APSP (All Pairs Shortest Path) output. For details on the parameters, with examples, see the Graph Diameter in the Apache MADlib documentation.
In-Out Degree
This function computes the degree of each node. The node degree is the number of edges adjacent to that node. The node in-degree is the number of edges pointing in to the node and node out-degree is the number of edges pointing out of the node.
The function is:
graph_vertex_degrees( vertex_table,
vertex_id,
edge_table,
edge_args,
out_table,
grouping_cols
)
For details on the parameters, with examples, see the In-out Degree page in the Apache MADlib documentation.
References
MADlib on Greenplum is at Machine Learning and Deep Learning using MADlib.
MADlib Apache web site and MADlib release notes are at http://madlib.apache.org/.
MADlib user documentation is at http://madlib.apache.org/documentation.html.