Distributed Training

One of Caffe2’s most significant features is easy, built-in distributed training. This means that you can very quickly scale up or down without refactoring your design.

For a deeper dive and examples of distributed training, check out SynchronousSGD, which covers the programming principles for using Caffe2's data_parallel_model.

Under the Hood

  • Gloo: Caffe2 leverages Gloo, a communications library for multi-machine training.
  • NCCL: Caffe2 also utilizes NVIDIA's NCCL for multi-GPU communications.
  • Redis: To facilitate management of nodes in distributed training, Caffe2 can use a simple NFS share between nodes, or you can provide a Redis server to handle the nodes' communications.

For an example of distributed training with Caffe2, you can run the resnet50_trainer script on a single-GPU machine. The defaults assume that you've already loaded the training data into an LMDB database, but you also have the option of using LevelDB. A guide for using the script is below.

Try Distributed Training

We’re assuming that you’ve successfully built Caffe2 and that you have a system with at least one GPU, but preferably more to test out the distributed features.

First, get a training image database ready in LMDB or LevelDB format. You can browse and download a variety of datasets here.

The resnet50_trainer.py script's default output:

Usage

  usage: resnet50_trainer.py [-h] --train_data TRAIN_DATA
                             [--test_data TEST_DATA] [--db_type DB_TYPE]
                             [--gpus GPUS] [--num_gpus NUM_GPUS]
                             [--num_channels NUM_CHANNELS]
                             [--image_size IMAGE_SIZE] [--num_labels NUM_LABELS]
                             [--batch_size BATCH_SIZE] [--epoch_size EPOCH_SIZE]
                             [--num_epochs NUM_EPOCHS]
                             [--base_learning_rate BASE_LEARNING_RATE]
                             [--weight_decay WEIGHT_DECAY]
                             [--num_shards NUM_SHARDS] [--shard_id SHARD_ID]
                             [--run_id RUN_ID] [--redis_host REDIS_HOST]
                             [--redis_port REDIS_PORT]
                             [--file_store_path FILE_STORE_PATH]
Required

--train_data the path to the database of training data

Optional

--test_data the path to the database of test data
--db_type either lmdb or leveldb, defaults to lmdb
--gpus a comma-separated list of GPU IDs, where 0 is the first GPU device
--num_gpus an integer for the total number of GPUs; an alternative to listing them with --gpus
--num_channels number of color channels, defaults to 3
--image_size the height or width in pixels of the input images (assumed to be square), defaults to 227, might not handle small sizes
--num_labels number of labels, defaults to 1000
--batch_size batch size, total over all GPUs, defaults to 32; increase it as you add GPUs
--epoch_size number of images per epoch, defaults to 1.5 million (1500000); definitely change this to match your dataset
--num_epochs number of epochs to run
--base_learning_rate initial learning rate, defaults to 0.1 (based on a global batch size of 256)
--weight_decay weight decay (L2 regularization)
--num_shards number of machines in a distributed run, defaults to 1 (see the multi-node example after this list)
--shard_id shard/node ID, defaults to 0; the next node would be 1, and so forth
--run_id unique run identifier, e.g. a uuid
--redis_host host of the Redis server (for rendezvous)
--redis_port Redis port for rendezvous
--file_store_path alternative to Redis: an (NFS) path to a shared directory used for rendezvous temp files to coordinate between the shards/nodes
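
As a sketch of how the sharding and rendezvous options fit together, here is a hypothetical two-node run; the Redis host, database locations, and run ID are placeholders you would substitute with your own values. Each node runs the same command, changing only --shard_id.

  # Node 0 of 2, using a Redis server for rendezvous (6379 is the default Redis port)
  python resnet50_trainer.py --train_data <location of lmdb training database> --num_gpus 2 --batch_size 64 --num_shards 2 --shard_id 0 --run_id <unique run id> --redis_host <redis server host> --redis_port 6379

  # Node 1 of 2: the same command, with --shard_id 1
  python resnet50_trainer.py --train_data <location of lmdb training database> --num_gpus 2 --batch_size 64 --num_shards 2 --shard_id 1 --run_id <unique run id> --redis_host <redis server host> --redis_port 6379

  # Alternative rendezvous: drop the Redis flags and give both nodes a shared (NFS) directory
  # with --file_store_path <path to shared directory>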

Arguments for preliminary testing:

  • --train_data (required)
  • --db_type (default=lmdb)
  • --num_gpus <#> (use this instead of listing out each one with --gpus)
  • --batch_size (default=32)
  • --test_data (optional)

The only required parameter is the training database. You can try that out first with no other parameters if you have your training set already in lmdb:
  python resnet50_trainer.py --train_data <location of lmdb training database>

Using LevelDB:

  python resnet50_trainer.py --train_data <location of leveldb training database> --db_type leveldb

The script uses a default batch size of 32. When using multiple GPUs, increase the batch size in proportion to the number of GPUs so that you use as much of each GPU's memory as possible. In the 2-GPU example below, we double the batch size to 64:

  python resnet50_trainer.py --train_data <location of lmdb training database> --num_gpus 2 --batch_size 64

You will notice that when you add the second GPU and double the batch size, the number of iterations per epoch is halved.
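
Following the same scaling rule, a machine with 4 GPUs would double the batch size again (this example assumes you actually have 4 GPUs available):

  python resnet50_trainer.py --train_data <location of lmdb training database> --num_gpus 4 --batch_size 128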

Using nvidia-smi you can examine the GPUs' current status and see whether you're properly maxing them out in each run. Try running watch -n1 nvidia-smi to continuously report the status while you run different experiments.
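
For example, in a second terminal:

  # Refresh the GPU utilization and memory report every second while training runs
  watch -n1 nvidia-smi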

If you add the --test_data parameter, you will get occasional test runs interleaved with training, which can provide a nice metric on how well the neural network is doing at that point. It will give you accuracy numbers and help you assess convergence.
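
For example, adding test data to the earlier 2-GPU run (database locations are placeholders):

  python resnet50_trainer.py --train_data <location of lmdb training database> --test_data <location of lmdb test database> --num_gpus 2 --batch_size 64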

Logging

As you run the script and training progresses, you will notice log files are deposited in the same folder. The naming convention will give you an idea of what that particular run’s parameters were. For example:

  resnet50_gpu2_b64_L1000_lr0.10_v2_20170411_141617.log

You can infer from this filename that the run used 2 GPUs with --batch_size 64, --num_labels 1000, and --base_learning_rate 0.10, followed by a timestamp.

When you open the log file you will find the parameters used for that run, a header explaining what the comma-separated values represent, and finally the log data.

The list of values recorded in the log:

  time_spent,cumulative_time_spent,input_count,cumulative_input_count,cumulative_batch_count,inputs_per_sec,accuracy,epoch,learning_rate,loss,test_accuracy

test_accuracy will be set to -1 if you don’t use test data, otherwise it will populate with an accuracy number.
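
Because the data rows are plain comma-separated values with the columns listed above, you can pull a quick progress summary out of a log with standard shell tools. A minimal sketch, assuming the example filename above and that only the header and data rows contain exactly 11 comma-separated fields:

  # Print epoch, loss, accuracy, and test_accuracy (columns 8, 10, 7, and 11), skipping the header row
  awk -F, 'NF == 11 && $1 != "time_spent" { print "epoch " $8 ": loss " $10 ", accuracy " $7 ", test_accuracy " $11 }' resnet50_gpu2_b64_L1000_lr0.10_v2_20170411_141617.log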

Conclusion

There are many other parameters you can tinker with using this script. We look forward to hearing about your experiments and to seeing your models appear in Caffe2's Model Zoo!