Loading and Saving of Model

Common scenarios for loading and saving a model are:

Save a model that has been trained for a while so that training can resume later.
Save a trained model for later reuse (such as Model Serving).

Strictly speaking, a model saved before its training is complete is called a checkpoint or snapshot, which differs from saving a fully trained model.

In OneFlow, however, the same interface saves a model whether or not its training has finished. The distinction among model, checkpoint, and snapshot found in other frameworks therefore does not exist in OneFlow.

In OneFlow, the interfaces for saving and loading models are under flow.checkpoint.

In this article, we will introduce:

How to create model parameters
How to save and load a model
Storage structure of a OneFlow model
How to finetune and extend a model

Use get_variable to Create/Obtain Model Parameters Object

We can use oneflow.get_variable to create or obtain an object that exchanges information with global job functions. By calling oneflow.get_all_variables and oneflow.load_variables, we can get or update the value of an object created by get_variable.

Because of this feature, objects created by get_variable are used to store model parameters. In fact, many high-level interfaces in OneFlow (such as oneflow.layers.conv2d) use get_variable internally to create model parameters.

Process

get_variable requires a name that identifies the created object.

If an object with that name already exists in the program, get_variable returns the existing object.

If no object with that name exists in the program, get_variable creates a blob object internally and returns it.
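The get-or-create behaviour described above can be sketched in plain Python. This is only an illustration of the lookup semantics, not OneFlow's implementation: the VariableRegistry class and its get_variable method are hypothetical names, and a Python list stands in for the blob object.

```python
# Illustrative sketch only: a name-keyed registry with get-or-create semantics.
# VariableRegistry is a hypothetical stand-in, not part of OneFlow.
class VariableRegistry:
    def __init__(self):
        self._variables = {}

    def get_variable(self, name, initializer=None):
        # Return the existing object if the name is already registered.
        if name in self._variables:
            return self._variables[name]
        # Otherwise create the object once, using the initializer if given.
        value = initializer() if initializer is not None else []
        self._variables[name] = value
        return value


registry = VariableRegistry()
first = registry.get_variable("myblob", initializer=lambda: [0.0, 0.0, 0.0])
second = registry.get_variable("myblob", initializer=lambda: [1.0, 1.0, 1.0])

# The second call finds the existing name, so its initializer is never run.
print(first is second)  # True
print(second)           # [0.0, 0.0, 0.0]
```

Because the name is the only key, two calls with the same name share one object, which is what lets a later lookup by name reach the same parameters.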

Use get_variable to Create an Object

The signature of oneflow.get_variable is:

def get_variable(
    name,
    shape=None,
    dtype=None,
    initializer=None,
    regularizer=None,
    trainable=None,
    model_name=None,
    random_seed=None,
    distribute=distribute_util.broadcast(),
)

The following excerpt shows how get_variable is used to create parameters when a network is built with oneflow.layers.conv2d:

    #...
    weight = flow.get_variable(
        weight_name if weight_name else name_prefix + "-weight",
        shape=weight_shape,
        dtype=inputs.dtype,
        initializer=kernel_initializer
        if kernel_initializer is not None
        else flow.constant_initializer(0),
        regularizer=kernel_regularizer,
        trainable=trainable,
        model_name="weight",
    )
    output = flow.nn.conv2d(
        inputs, weight, strides, padding, data_format, dilation_rate, groups=groups, name=name
    )
    #...

Initializer Setting

In the previous section, the initializer argument of get_variable specified how the parameters are initialized. OneFlow provides many initializers, which can be found in the oneflow module.

Under the static graph mechanism, we only set the initializer; the OneFlow framework then performs the parameter initialization automatically.

The initializers currently supported by OneFlow, along with the details of each algorithm, are listed in the OneFlow API documentation.

The Python Interface of OneFlow Models

We can use the following interfaces to get or update the values of variable objects created by oneflow.get_variable in job functions.

oneflow.get_all_variables : Get the variables of all job functions.
oneflow.load_variables : Update the variables in job functions.

oneflow.get_all_variables returns a dictionary. Each key is the name given when the variable was created, and each value is a tensor with a numpy() method that converts it to a numpy array.

For example, create an object named myblob in a job function:

@flow.global_function()
def job() -> tp.Numpy:
    ...
    myblob = flow.get_variable("myblob",
        shape=(3,3),
        initializer=flow.random_normal_initializer()
        )
    ...

If we want to print the value of myblob, we can call:

...
for epoch in range(20):
    ...
    job()
    all_variables = flow.get_all_variables()
    print(all_variables["myblob"].numpy())
    ...

flow.get_all_variables returns the dictionary, and all_variables["myblob"].numpy() takes the myblob object and converts it to a numpy array.

Conversely, we can use oneflow.load_variables to update the values of variables.

The signature of oneflow.load_variables is as follows:

def load_variables(value_dict, ignore_mismatch = True)

Before calling load_variables, we have to prepare a dictionary whose keys are the names specified when the variables were created and whose values are numpy arrays. Given this dictionary, load_variables finds each variable object in the job function by its key and updates its value.

For example:

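import numpy as np
import oneflow as flow
import oneflow.typing as tp


@flow.global_function(type="predict")
def job() -> tp.Numpy:
    # Create (or obtain) the variable named "myblob".
    myblob = flow.get_variable("myblob",
        shape=(3,3),
        initializer=flow.random_normal_initializer()
        )
    return myblob

# Overwrite the randomly initialized value with ones.
myvardict = {"myblob": np.ones((3,3)).astype(np.float32)}
flow.load_variables(myvardict)
print(flow.get_all_variables()["myblob"].numpy())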

Although we have chosen the random_normal_initializer initializer, flow.load_variables(myvardict) updates the value of myblob. The final output will be:

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]

Model Saving and Loading

We can save and load a model with the following methods:

oneflow.checkpoint.save : Save the model to the specified path.
oneflow.checkpoint.get : Load a model from the specified path.

The signature of save is shown below. It saves the model to the location given by path.

def save(path, var_dict=None)

If the optional parameter var_dict is not None, save writes the objects specified in var_dict to that path.
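One way to build such a var_dict is to filter a dictionary of variables by name. The sketch below shows only the filtering step: select_variables is a hypothetical helper, the strings are placeholder values, and it is an assumption that in real use the input would be the dictionary returned by flow.get_all_variables().

```python
def select_variables(all_variables, prefixes):
    """Return the entries whose name starts with one of the given prefixes."""
    prefixes = tuple(prefixes)
    return {
        name: value
        for name, value in all_variables.items()
        if name.startswith(prefixes)
    }


# Placeholder strings stand in for the real variable objects.
all_variables = {
    "dense1-weight": "w1",
    "dense1-bias": "b1",
    "dense2-weight": "w2",
    "dense2-bias": "b2",
}
var_dict = select_variables(all_variables, ["dense1-"])
print(sorted(var_dict))  # ['dense1-bias', 'dense1-weight']
```

The resulting dictionary keeps the original names as keys, so a subset saved this way can still be matched by name when it is loaded again.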

The signature of get is shown below. It loads the previously saved model from the location given by path.

def get(path)

It returns a dictionary that can be loaded into the model using load_variables:

flow.load_variables(flow.checkpoint.get(save_dir))

Attention:

The path passed to save must either be an empty directory or not exist. Otherwise, save reports an error (to prevent overwriting an existing saved model).
OneFlow models are stored under the specified path in a fixed structure. See the storage structure of OneFlow models below for more details.
OneFlow does not limit how often save is called, but saving too frequently increases the load on resources such as disk and bandwidth.
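The first and last points above can be combined in a training loop: save only every N steps, and give each save its own fresh directory. The following is a minimal sketch using only the standard library. The helper names, the step_NNNNNN naming scheme, and the SAVE_EVERY value are all hypothetical, and os.makedirs stands in for the real call to flow.checkpoint.save.

```python
import os
import tempfile


def is_safe_save_path(path):
    """Return True if path does not exist or is an empty directory."""
    if not os.path.exists(path):
        return True
    return os.path.isdir(path) and not os.listdir(path)


def checkpoint_dir(root, step):
    """Build a distinct directory name for each saved step."""
    return os.path.join(root, "step_{:06d}".format(step))


SAVE_EVERY = 100  # hypothetical interval; tune to your disk and bandwidth budget

with tempfile.TemporaryDirectory() as root:
    saved_steps = []
    for step in range(1, 301):
        if step % SAVE_EVERY != 0:
            continue
        path = checkpoint_dir(root, step)
        if is_safe_save_path(path):
            # Stand-in for flow.checkpoint.save(path) in a real training loop.
            os.makedirs(path)
            saved_steps.append(step)
    print(saved_steps)  # [100, 200, 300]
```

Checking the path before saving turns the error described above into an explicit decision in the training script, and a per-step directory name avoids ever reusing a non-empty path.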

The Structure of OneFlow Saved Model

A OneFlow model consists of the parameters of a network. For now, a OneFlow model contains no meta graph information. The directory where a model is saved has many subdirectories, each named after a variable in the model. For example, we first define the model:

def lenet(data, train=False):
    initializer = flow.truncated_normal(0.1)
    conv1 = flow.layers.conv2d(
        data,
        32,
        5,
        padding="SAME",
        activation=flow.nn.relu,
        name="conv1",
        kernel_initializer=initializer,
    )
    pool1 = flow.nn.max_pool2d(
        conv1, ksize=2, strides=2, padding="SAME", name="pool1", data_format="NCHW"
    )
    conv2 = flow.layers.conv2d(
        pool1,
        64,
        5,
        padding="SAME",
        activation=flow.nn.relu,
        name="conv2",
        kernel_initializer=initializer,
    )
    pool2 = flow.nn.max_pool2d(
        conv2, ksize=2, strides=2, padding="SAME", name="pool2", data_format="NCHW"
    )
    reshape = flow.reshape(pool2, [pool2.shape[0], -1])
    hidden = flow.layers.dense(
        reshape,
        512,
        activation=flow.nn.relu,
        kernel_initializer=initializer,
        name="dense1",
    )
    if train:
        hidden = flow.nn.dropout(hidden, rate=0.5, name="dropout")
    return flow.layers.dense(hidden, 10, kernel_initializer=initializer, name="dense2")

Assume that during training we call the following code to save the model:

flow.checkpoint.save('./lenet_models_name')

Then lenet_models_name and its subdirectories are as follows:

lenet_models_name/
├── conv1-bias
│   ├── meta
│   └── out
├── conv1-weight
│   ├── meta
│   └── out
├── conv2-bias
│   ├── meta
│   └── out
├── conv2-weight
│   ├── meta
│   └── out
├── dense1-bias
│   ├── meta
│   └── out
├── dense1-weight
│   ├── meta
│   └── out
├── dense2-bias
│   ├── meta
│   └── out
├── dense2-weight
│   ├── meta
│   └── out
├── snapshot_done
└── System-Train-TrainStep-train_job
    ├── meta
    └── out

We can see:

Each variable in the network of the job function corresponds to a subdirectory.
Each subdirectory contains the files out and meta: out stores the values of the network parameters in binary form, and meta stores the network structure information in text form.
snapshot_done is an empty folder. If it exists, the network training has finished.
Snapshots of the training steps are stored in System-Train-TrainStep-train_job.
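Given this layout, the variables in a saved model can be listed with ordinary file-system calls. The sketch below is illustrative only: list_saved_variables is a hypothetical helper, and it runs against a small mock directory built on the spot (with placeholder file contents) rather than a real OneFlow model. Any subdirectory holding an out file is reported, so System-Train-TrainStep-train_job would appear in the listing for a real model too.

```python
import os
import tempfile


def list_saved_variables(model_dir):
    """Map each variable name to the size in bytes of its `out` file."""
    sizes = {}
    for entry in sorted(os.listdir(model_dir)):
        out_path = os.path.join(model_dir, entry, "out")
        # Entries without an `out` file (such as snapshot_done) are skipped.
        if os.path.isfile(out_path):
            sizes[entry] = os.path.getsize(out_path)
    return sizes


# Build a tiny mock directory that mimics the layout shown above.
with tempfile.TemporaryDirectory() as model_dir:
    mock = [("conv1-weight", b"\x00" * 16), ("conv1-bias", b"\x00" * 4)]
    for name, payload in mock:
        os.makedirs(os.path.join(model_dir, name))
        with open(os.path.join(model_dir, name, "out"), "wb") as f:
            f.write(payload)
        with open(os.path.join(model_dir, name, "meta"), "w") as f:
            f.write("placeholder")
    os.makedirs(os.path.join(model_dir, "snapshot_done"))
    listing = list_saved_variables(model_dir)
    print(listing)  # {'conv1-bias': 4, 'conv1-weight': 16}
```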

Model Finetune and Transfer Learning

In model finetuning and transfer learning, we usually need to:

Load some of the parameters from the original model
Initialize the remaining parameters of the model

We can use oneflow.load_variables to complete the process above. Here is a simple example to illustrate the concept.

First, we define a model and save it to ./mlp_models_1 after training:

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("cpu", "0:0"):
        initializer = flow.truncated_normal(0.1)
        reshape = flow.reshape(images, [images.shape[0], -1])
        hidden = flow.layers.dense(
            reshape,
            512,
            activation=flow.nn.relu,
            kernel_initializer=initializer,
            name="dense1",
        )
        dense2 = flow.layers.dense(
            hidden, 10, kernel_initializer=initializer, name="dense2"
        )
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(labels, dense2)
    lr_scheduler = flow.optimizer.PiecewiseConstantScheduler([], [0.1])
    flow.optimizer.SGD(lr_scheduler, momentum=0).minimize(loss)
    return loss

Then we extend the network by adding one more layer (dense3) to the model above:

@flow.global_function(type="train")
def train_job(
    images: tp.Numpy.Placeholder((BATCH_SIZE, 1, 28, 28), dtype=flow.float),
    labels: tp.Numpy.Placeholder((BATCH_SIZE,), dtype=flow.int32),
) -> tp.Numpy:
    with flow.scope.placement("cpu", "0:0"):
        #... original structure
        dense3 = flow.layers.dense(
            dense2, 10, kernel_initializer=initializer, name="dense3"
        )
        loss = flow.nn.sparse_softmax_cross_entropy_with_logits(labels, dense3)
    #...

Finally, load parameters from original model and start training:

if __name__ == "__main__":
    # Load the saved parameters; dense3 is absent from the saved model.
    flow.load_variables(flow.checkpoint.get("./mlp_models_1"))
    (train_images, train_labels), (test_images, test_labels) = flow.data.load_mnist(
        BATCH_SIZE, BATCH_SIZE
    )
    for i, (images, labels) in enumerate(zip(train_images, train_labels)):
        loss = train_job(images, labels)
        if i % 20 == 0:
            print(loss.mean())
    # Save the extended model to a new path.
    flow.checkpoint.save("./mlp_ext_models_1")

The parameters of the new dense3 layer do not exist in the original model, so OneFlow initializes them automatically.
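The split between parameters that are loaded and parameters that are freshly initialized can be made explicit by comparing variable names. The following sketch works on plain lists of names and does not call OneFlow: split_for_finetune is a hypothetical helper, and the names follow the layer-weight and layer-bias pattern seen in the saved-model directory listing earlier in this article.

```python
def split_for_finetune(saved_names, model_names):
    """Split the new model's variables into loadable and fresh ones."""
    saved = set(saved_names)
    loadable = [name for name in model_names if name in saved]
    fresh = [name for name in model_names if name not in saved]
    return loadable, fresh


# Variables present in the saved backbone model.
saved_names = ["dense1-weight", "dense1-bias", "dense2-weight", "dense2-bias"]
# Variables of the extended model, which adds the dense3 layer.
model_names = saved_names + ["dense3-weight", "dense3-bias"]

loadable, fresh = split_for_finetune(saved_names, model_names)
print(loadable)  # the four dense1 and dense2 variables
print(fresh)     # ['dense3-weight', 'dense3-bias']
```

Printing the two lists before training is a quick check that a renamed layer has not silently fallen into the freshly initialized group.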

Code

The following code is from mlp_mnist_origin.py, which serves as the backbone network. The trained model is stored in ./mlp_models_1.

Run:

wget https://docs.oneflow.org/master/code/basics_topics/mlp_mnist_origin.py
python3 mlp_mnist_origin.py

When training is complete, mlp_models_1 appears in the current working directory.

The following code is from mlp_mnist_finetune.py. After extending the backbone network with one more layer (dense3), it loads ./mlp_models_1 and continues training.