Classify Cora Dataset Using GCN

This tutorial shows how to train a GCN model on the Cora dataset. In this tutorial, you will learn how to:

Train a GCN model on the Cora citation dataset.
Use the trained GCN model to predict the label for some of the papers in the dataset.

The Dataset

The Cora dataset is a graph dataset about a citation network of scientific papers. It consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 1433 unique words.

Because a graph cannot be divided into independent batches, the data is not split into separate training and test tables. Instead, the entire graph is stored in two tables: a Node Table and an Edge Table. (The same layout can be used to store other graph data in a database.)

Here are the column descriptions of the Node Table:

Column	Description	Type
id	Id for the node.	Integer
node_name	Name for the node.	Text
features	Feature vector of the node represented in the `csv` format. An example would be “0,0,1”.	Text
label	Label for the node.	Text

The following command can be used to construct the Node Table.

CREATE TABLE cora.node (
        id INT,
        node_name TEXT,
        features  TEXT,
        label TEXT);

Here are the column descriptions of the Edge Table:

Column	Description	Type
id	Id for the edge.	Integer
from_node_id	Id for the from node of the edge.	Integer
to_node_id	Id for the to node of the edge.	Integer
weight	Weight for the edge.	Float

The following command can be used to construct the Edge Table.

CREATE TABLE cora.edge (
        id INT,
        from_node_id INT,
        to_node_id  INT,
        weight FLOAT);

You can take a quick look at the data by running the following standard SQL statements.

%%sqlflow
DESCRIBE cora.node;
DESCRIBE cora.edge;

%%sqlflow
SELECT * FROM cora.node LIMIT 10;

Train the GCN on the Cora dataset

Let’s train a GCN model!

Load data from Cora

You can load data from the database with a standard SQL command such as SELECT * FROM cora.node. However, because the GCN model operates on the whole graph, you have to load both the Node Table and the Edge Table at once.

To do so, use a SQL JOIN to select the data from the Node Table and the Edge Table together. The following command loads all the data needed to train the GCN.

%%sqlflow
SELECT cora.node.id, features, label as class, cora.edge.from_node_id, cora.edge.to_node_id FROM cora.node
LEFT JOIN cora.edge ON (cora.node.id = cora.edge.from_node_id OR cora.node.id = cora.edge.to_node_id)
ORDER BY cora.node.id;

The OR condition in the join matches each node against both endpoints of an edge, so every node is paired with its incoming as well as its outgoing edges. Without the OR condition, each node would be joined only with the edges in one direction, some connections would be missing, and the entire graph could not be constructed.
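To see the effect of the OR condition on a small scale, the sketch below runs the same kind of join against an in-memory SQLite database from Python. The tables `node` and `edge`, their three nodes and two edges, and the helper `edges_seen_per_node` are made up for illustration; they are not part of SQLFlow or of the Cora data.

```python
import sqlite3

# Toy stand-ins for cora.node and cora.edge. The table names are shortened
# and the rows are invented; this is an illustration, not the Cora data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (id INT, features TEXT, label TEXT);
    CREATE TABLE edge (id INT, from_node_id INT, to_node_id INT, weight FLOAT);
    INSERT INTO node VALUES (1, '0,1', 'A'), (2, '1,0', 'B'), (3, '1,1', 'A');
    INSERT INTO edge VALUES (1, 1, 2, 1.0), (2, 1, 3, 1.0);
""")


def edges_seen_per_node(join_condition):
    """Count how many edge rows each node is joined with."""
    query = f"""
        SELECT node.id, COUNT(edge.id)
        FROM node LEFT JOIN edge ON ({join_condition})
        GROUP BY node.id ORDER BY node.id
    """
    return dict(conn.execute(query).fetchall())


one_direction = edges_seen_per_node("node.id = edge.from_node_id")
both_directions = edges_seen_per_node(
    "node.id = edge.from_node_id OR node.id = edge.to_node_id"
)

print(one_direction)    # {1: 2, 2: 0, 3: 0}
print(both_directions)  # {1: 2, 2: 1, 3: 1}
```

In this toy graph, nodes 2 and 3 are only ever the target of an edge, so joining on from_node_id alone leaves them with no edge rows at all.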

The COLUMN clause tells SQLFlow how to interpret the comma-separated feature string: use COLUMN DENSE(features).
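As a rough picture of what a dense feature column amounts to, the following sketch converts one comma-separated string into a numeric vector. The function `parse_dense_features` is a hypothetical helper written for this example; it does not show how SQLFlow implements DENSE internally.

```python
def parse_dense_features(csv_text):
    """Turn a string such as "0,0,1" into a list of floats."""
    return [float(value) for value in csv_text.split(",")]


vector = parse_dense_features("0,0,1")
print(vector)  # [0.0, 0.0, 1.0]
```

For Cora, each such vector would have 1433 entries, one per dictionary word.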

The GCN model in SQLFlow builds the entire graph automatically from inputs in the following order: node.id, node.features, node.label, edge.from_node_id, edge.to_node_id. Make sure the columns are selected in this order; otherwise the GCN model will not run successfully.
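The sketch below illustrates how a graph could be reassembled from rows in that column order. It is an assumption about the general idea, not the code used by sqlflow_models.GCN; the function `build_graph` and the sample rows are invented for the example.

```python
from collections import defaultdict


def build_graph(rows):
    """Collect labels and undirected neighbor lists from joined rows.

    Each row is (node_id, features_csv, label, from_node_id, to_node_id).
    The last two values are None for a node that has no edges.
    """
    labels = {}
    neighbors = defaultdict(set)
    for node_id, _features_csv, label, from_id, to_id in rows:
        # The join repeats a node once per touching edge; the label is the same.
        labels[node_id] = label
        if from_id is not None and to_id is not None:
            # Treat every edge as undirected.
            neighbors[from_id].add(to_id)
            neighbors[to_id].add(from_id)
    return labels, {node: sorted(adj) for node, adj in neighbors.items()}


rows = [
    (1, "0,1", "A", 1, 2),
    (1, "0,1", "A", 1, 3),
    (2, "1,0", "B", 1, 2),
    (3, "1,1", "A", 1, 3),
]
labels, adjacency = build_graph(rows)
print(labels)     # {1: 'A', 2: 'B', 3: 'A'}
print(adjacency)  # {1: [2, 3], 2: [1], 3: [1]}
```

Because the function unpacks each row by position, swapping two columns in the SELECT would silently mix up ids, labels, and edge endpoints, which is why the order matters.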

Train GCN

Here is the table that lists all the parameters of the GCN model:

Parameter	Description	Type
nhid	Number of hidden units for GCN.	Integer
nclass	Number of classes in total which will be the output dimension.	Integer
epochs	Number of epochs for the model to be trained.	Integer
train_ratio	Fraction of the data to be used for training.	Float
eval_ratio	Fraction of the data to be used for evaluation.	Float
early_stopping	Whether to use early stopping during the training phase.	Boolean
dropout	The dropout rate.	Float
nlayer	Number of GCNLayer layers to be used in the model.	Integer
id_col	Name for the column in database to be used as the id of each node.	String
feature_col	Name for the column in database to be used as the features of each node.	String
from_node_col	Name for the column in database to be used as the from_node id of each edge.	String
to_node_col	Name for the column in database to be used as the to_node id of each edge.	String

After loading the dataset, you can train the GCN model with the following clause:

TO TRAIN sqlflow_models.GCN
WITH model.nhid=16, 
     model.nclass=7, 
     model.epochs=200, 
     model.train_ratio=0.15, 
     model.eval_ratio=0.2, 
     validation.metrics="CategoricalAccuracy"

You can specify the model parameters and training configuration through the WITH clause. For instance, you could set model.epochs to 100. model.train_ratio and model.eval_ratio indicate the proportions of the dataset to be used for training and evaluation respectively. You can also change settings such as model.nlayer, which sets the number of GCNLayer layers, and model.dropout, which sets the dropout rate of the model. (For more parameters, refer to the table above.)
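To get a feel for what the two ratios mean on Cora's 2708 nodes, the arithmetic can be sketched as follows. The helper `split_sizes` is hypothetical, and rounding down is an assumption; the model may round or select nodes differently.

```python
def split_sizes(num_nodes, train_ratio, eval_ratio):
    """Return (train, eval, remaining) node counts, rounding down."""
    num_train = int(num_nodes * train_ratio)
    num_eval = int(num_nodes * eval_ratio)
    return num_train, num_eval, num_nodes - num_train - num_eval


sizes = split_sizes(2708, train_ratio=0.15, eval_ratio=0.2)
print(sizes)  # (406, 541, 1761)
```

With the values used in this tutorial, roughly 400 nodes carry training labels and roughly 540 are held out for evaluation.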

Combining this with the data-loading query, you can train the GCN model using:

%%sqlflow
SELECT cora.node.id, features, label as class, cora.edge.from_node_id, cora.edge.to_node_id FROM cora.node
LEFT JOIN cora.edge ON (cora.node.id = cora.edge.from_node_id OR cora.node.id = cora.edge.to_node_id)
ORDER BY cora.node.id
TO TRAIN sqlflow_models.GCN
WITH model.nhid=16, model.nclass=7, 
     model.epochs=200, model.train_ratio=0.15, 
     model.eval_ratio=0.2, validation.metrics="CategoricalAccuracy"
COLUMN DENSE(features)
LABEL class
INTO sqlflow_models.gcn_model;

The training details are printed in the following format:

Epoch 100 loss=0.455858 accuracy=0.943350 val_acc=0.857934
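If you want to track these values programmatically, a line in this format can be parsed with a regular expression. The sketch below assumes every log line follows exactly the format shown above; `LOG_PATTERN` and `parse_log_line` are names invented for this example.

```python
import re

# Matches lines such as:
#   Epoch 100 loss=0.455858 accuracy=0.943350 val_acc=0.857934
LOG_PATTERN = re.compile(
    r"Epoch (?P<epoch>\d+) loss=(?P<loss>[\d.]+) "
    r"accuracy=(?P<accuracy>[\d.]+) val_acc=(?P<val_acc>[\d.]+)"
)


def parse_log_line(line):
    """Return the metrics in a training log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "epoch": int(match["epoch"]),
        "loss": float(match["loss"]),
        "accuracy": float(match["accuracy"]),
        "val_acc": float(match["val_acc"]),
    }


metrics = parse_log_line(
    "Epoch 100 loss=0.455858 accuracy=0.943350 val_acc=0.857934"
)
print(metrics["epoch"], metrics["val_acc"])  # 100 0.857934
```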

ATTENTION: if your tables use different column names for id, features, from_node_id, and to_node_id, you need to specify those names in the WITH clause so that the model can find the data:

WITH model.id_col='id', -- string to be the name for id of each node
     model.feature_col='features', -- ... name for feature column of each node
     model.from_node_col='from_node_id', -- ... name for from_node_id of each edge
     model.to_node_col='to_node_id' -- ... name for to_node_id of each edge

Predict the label of a paper in Cora dataset

To specify the prediction data, we use standard SQL statements like SELECT id FROM cora.node LIMIT 5.

The trained GCN model was stored at sqlflow_models.gcn_model in the previous step. You can select the prediction data and write the predicted results into the class column of the table cora.predict. Note that GCN only supports prediction by node id, because predictions for all nodes are computed during the training phase and stored by node id. You can write the following SQLFlow prediction statement:

%%sqlflow
SELECT id FROM cora.node TO PREDICT cora.predict.class USING sqlflow_models.gcn_model;

After the prediction, you can check the results with:

%%sqlflow
SELECT * FROM cora.predict LIMIT 5;

Evaluate the Trained GCN Model

With SQLFlow, you can also evaluate the model’s performance on the evaluation dataset. GCN generates the evaluation results during the training phase, so you can retrieve them with:

%%sqlflow
SELECT cora.node.id, features, label as class, cora.edge.from_node_id, cora.edge.to_node_id FROM cora.node
LEFT JOIN cora.edge ON (cora.node.id = cora.edge.from_node_id OR cora.node.id = cora.edge.to_node_id)
ORDER BY cora.node.id
WITH model.nhid=16, model.nclass=7, 
     model.epochs=200, model.train_ratio=0.15, 
     model.eval_ratio=0.2, validation.metrics="CategoricalAccuracy"
COLUMN DENSE(features)
TO EVALUATE sqlflow_models.gcn_model
INTO gcn_evaluation;

The evaluation results are stored in the table gcn_evaluation.