Object detection is an important task in computer vision. This tutorial implements the SSD object detection model [2] with the Jittor framework, following the reference implementation [1].

SSD paper: https://arxiv.org/pdf/1512.02325.pdf

Full code: https://github.com/Jittor/ssd-jittor

1. Dataset

1.1 Data Preparation

The VOC dataset is one of the datasets most commonly used for object detection, semantic segmentation, and related tasks. This tutorial uses the VOC 2007 trainval and 2012 trainval splits as the training set, and the 2007 test split as the validation and test set. You can download the data from the links below.

The VOC dataset covers 20 object classes: 'aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike', 'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor'.
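In the reference implementation [1], label_map.json maps each class name to an integer index, with index 0 reserved for the background class; a minimal sketch of that assumed mapping:

    # Assumed structure of label_map.json, following [1]: background is class 0
    # and the 20 VOC classes take indices 1..20, giving n_classes = 21.
    voc_labels = ('aeroplane', 'bicycle', 'bird', 'boat', 'bottle', 'bus', 'car',
                  'cat', 'chair', 'cow', 'diningtable', 'dog', 'horse', 'motorbike',
                  'person', 'pottedplant', 'sheep', 'sofa', 'train', 'tvmonitor')
    label_map = {c: i + 1 for i, c in enumerate(voc_labels)}
    label_map['background'] = 0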

[Figure 1]

Extract the three archives into the same folder data, then call the create_data_lists() function in utils.py to generate the JSON files needed for training. Its arguments voc07_path and voc12_path should be ./data/VOCdevkit/VOC2007/ and ./data/VOCdevkit/VOC2012/ respectively, while output_folder can be any path you like, e.g. ./dataset/. After it runs, output_folder will contain five files: label_map.json, TEST_images.json, TEST_objects.json, TRAIN_images.json, and TRAIN_objects.json.
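A minimal sketch of the call, assuming utils.py from the repository above is on the Python path:

    # Generate label_map.json and the TRAIN/TEST JSON files under ./dataset/.
    from utils import create_data_lists

    if __name__ == '__main__':
        create_data_lists(voc07_path='./data/VOCdevkit/VOC2007/',
                          voc12_path='./data/VOCdevkit/VOC2012/',
                          output_folder='./dataset/')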

The final directory layout is as follows.

    # Directory layout
    root
    |----data
    | |----VOCdevkit
    | | |----VOC2007
    | | | |----Annotations
    | | | |----ImageSets
    | | | |----JPEGImages
    | | | |----SegmentationClass
    | | | |----SegmentationObject
    | | |----VOC2012
    | | | |----Annotations
    | | | |----ImageSets
    | | | |----JPEGImages
    | | | |----SegmentationClass
    | | | |----SegmentationObject
    |----dataset
    | |----label_map.json
    | |----TEST_images.json
    | |----TEST_objects.json
    | |----TRAIN_images.json
    | |----TRAIN_objects.json

1.2 Data Loading

You can build a custom dataset on top of the Dataset base class from jittor.dataset.dataset; it requires implementing __init__, __getitem__, __len__, and collate_batch.

  • __init__: defines the data paths; data_folder here should be set to the output_folder you chose earlier. It must also call self.set_attrs to specify the loading parameters batch_size, total_len, and shuffle.
  • __getitem__: returns the data for a single item.
  • __len__: returns the total number of samples in the dataset.
  • collate_batch: because training images contain different numbers of ground-truth boxes, this function must be overridden to gather each item's boxes and labels into Python lists and return a full batch.
    from jittor.dataset.dataset import Dataset
    import json
    import os
    import cv2
    import numpy as np
    from utils import random_crop, random_bright, random_swap, random_contrast, random_saturation, random_hue, random_flip, random_expand
    import random

    class PascalVOCDataset(Dataset):
        def __init__(self, data_folder, split, keep_difficult=False, batch_size=1, shuffle=False):
            self.split = split.upper()
            assert self.split in {'TRAIN', 'TEST'}
            self.data_folder = data_folder
            self.keep_difficult = keep_difficult
            self.batch_size = batch_size
            self.shuffle = shuffle
            self.mean = [0.485, 0.456, 0.406]
            self.std = [0.229, 0.224, 0.225]
            with open(os.path.join(data_folder, self.split + '_images.json'), 'r') as j:
                self.images = json.load(j)
            with open(os.path.join(data_folder, self.split + '_objects.json'), 'r') as j:
                self.objects = json.load(j)
            assert len(self.images) == len(self.objects)
            self.total_len = len(self.images)
            self.set_attrs(batch_size=self.batch_size, total_len=self.total_len, shuffle=self.shuffle)

        def __getitem__(self, i):
            image = cv2.imread(self.images[i])
            image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype("float32")
            objects = self.objects[i]
            boxes = np.array(objects['boxes']).astype("float32")
            labels = np.array(objects['labels'])
            difficulties = np.array(objects['difficulties'])
            if not self.keep_difficult:
                # keep only non-difficult objects (boolean mask, not integer indexing)
                keep = difficulties == 0
                boxes = boxes[keep]
                labels = labels[keep]
                difficulties = difficulties[keep]
            # data augmentation (training split only): photometric distortions
            # in random order, then expand, crop and flip
            if self.split == 'TRAIN':
                data_enhance = [random_bright, random_swap, random_contrast, random_saturation, random_hue]
                random.shuffle(data_enhance)
                for d in data_enhance:
                    image = d(image)
                image, boxes = random_expand(image, boxes, filler=self.mean)
                image, boxes, labels, difficulties = random_crop(image, boxes, labels, difficulties)
                image, boxes = random_flip(image, boxes)
            height, width, _ = image.shape
            image = cv2.resize(image, (300, 300))
            image /= 255.
            image = (image - self.mean) / self.std
            image = image.transpose((2, 0, 1)).astype("float32")
            # normalize box coordinates to [0, 1]
            boxes[:, [0, 2]] /= width
            boxes[:, [1, 3]] /= height
            return image, boxes, labels, difficulties

        def __len__(self):
            return len(self.images)

        def collate_batch(self, batch):
            # images can be stacked (all 300x300), but the number of boxes varies
            # per image, so boxes/labels/difficulties stay as lists
            images = list()
            boxes = list()
            labels = list()
            difficulties = list()
            for b in batch:
                images.append(b[0])
                boxes.append(b[1])
                labels.append(b[2])
                difficulties.append(b[3])
            images = np.stack(images, axis=0)
            return images, boxes, labels, difficulties
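With the class above, a loader can be constructed and iterated directly; a usage sketch (the batch size here is arbitrary):

    # Each batch is a stacked image array plus per-image lists of boxes,
    # labels and difficulties (variable length, hence plain lists).
    train_data = PascalVOCDataset('./dataset/', split='train',
                                  keep_difficult=False, batch_size=8, shuffle=True)
    for images, boxes, labels, difficulties in train_data:
        print(images.shape)  # (8, 3, 300, 300)
        print(len(boxes))    # 8, one array of ground-truth boxes per image
        break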

2. Model Definition

[Figure 2: SSD network architecture]

The figure above shows the network architecture from the SSD paper.

This tutorial uses VGG-16 [3] as the backbone, with an input image size of 300×300. Six feature maps are used for prediction: conv4_3 and conv7 from the VGG-16 backbone, plus conv8_2, conv9_2, conv10_2, and conv11_2 from the extra feature layers (AuxiliaryConvolutions). The spatial sizes of conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11_2 are 38×38, 19×19, 10×10, 5×5, 3×3, and 1×1 respectively; the prior (anchor) scales are 0.1, 0.2, 0.375, 0.55, 0.725, and 0.9; each location on these feature maps generates 4, 6, 6, 6, 4, and 4 priors respectively, so the six feature maps contribute 5776, 2166, 600, 150, 36, and 4 priors, for a total of 8732.
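The total can be double-checked with a few lines (a standalone sketch that just restates the numbers above):

    # Sanity check of the prior counts: priors per feature map and their sum.
    fmap_dims = {'conv4_3': 38, 'conv7': 19, 'conv8_2': 10,
                 'conv9_2': 5, 'conv10_2': 3, 'conv11_2': 1}
    boxes_per_loc = {'conv4_3': 4, 'conv7': 6, 'conv8_2': 6,
                     'conv9_2': 6, 'conv10_2': 4, 'conv11_2': 4}
    counts = {k: fmap_dims[k] ** 2 * boxes_per_loc[k] for k in fmap_dims}
    print(counts)                # {'conv4_3': 5776, 'conv7': 2166, ..., 'conv11_2': 4}
    print(sum(counts.values()))  # 8732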

    import jittor as jt
    from jittor import nn

    class VGGBase(nn.Module):
        def __init__(self):
            super(VGGBase, self).__init__()
            self.conv1_1 = nn.Conv(3, 64, kernel_size=3, padding=1)
            self.conv1_2 = nn.Conv(64, 64, kernel_size=3, padding=1)
            self.pool1 = nn.Pool(kernel_size=2, stride=2, op='maximum')
            self.conv2_1 = nn.Conv(64, 128, kernel_size=3, padding=1)
            self.conv2_2 = nn.Conv(128, 128, kernel_size=3, padding=1)
            self.pool2 = nn.Pool(kernel_size=2, stride=2, op='maximum')
            self.conv3_1 = nn.Conv(128, 256, kernel_size=3, padding=1)
            self.conv3_2 = nn.Conv(256, 256, kernel_size=3, padding=1)
            self.conv3_3 = nn.Conv(256, 256, kernel_size=3, padding=1)
            self.pool3 = nn.Pool(kernel_size=2, stride=2, ceil_mode=True, op='maximum')  # 75 -> 38
            self.conv4_1 = nn.Conv(256, 512, kernel_size=3, padding=1)
            self.conv4_2 = nn.Conv(512, 512, kernel_size=3, padding=1)
            self.conv4_3 = nn.Conv(512, 512, kernel_size=3, padding=1)
            self.pool4 = nn.Pool(kernel_size=2, stride=2, op='maximum')
            self.conv5_1 = nn.Conv(512, 512, kernel_size=3, padding=1)
            self.conv5_2 = nn.Conv(512, 512, kernel_size=3, padding=1)
            self.conv5_3 = nn.Conv(512, 512, kernel_size=3, padding=1)
            self.pool5 = nn.Pool(kernel_size=3, stride=1, padding=1, op='maximum')  # keeps 19x19
            self.conv6 = nn.Conv(512, 1024, kernel_size=3, padding=6, dilation=6)  # atrous convolution
            self.conv7 = nn.Conv(1024, 1024, kernel_size=1)

        def execute(self, image):
            out = nn.relu(self.conv1_1(image))
            out = nn.relu(self.conv1_2(out))
            out = self.pool1(out)
            out = nn.relu(self.conv2_1(out))
            out = nn.relu(self.conv2_2(out))
            out = self.pool2(out)
            out = nn.relu(self.conv3_1(out))
            out = nn.relu(self.conv3_2(out))
            out = nn.relu(self.conv3_3(out))
            out = self.pool3(out)
            out = nn.relu(self.conv4_1(out))
            out = nn.relu(self.conv4_2(out))
            out = nn.relu(self.conv4_3(out))
            conv4_3_feats = out  # (N, 512, 38, 38)
            out = self.pool4(out)
            out = nn.relu(self.conv5_1(out))
            out = nn.relu(self.conv5_2(out))
            out = nn.relu(self.conv5_3(out))
            out = self.pool5(out)
            out = nn.relu(self.conv6(out))
            conv7_feats = nn.relu(self.conv7(out))  # (N, 1024, 19, 19)
            return (conv4_3_feats, conv7_feats)

    class AuxiliaryConvolutions(nn.Module):
        def __init__(self):
            super(AuxiliaryConvolutions, self).__init__()
            self.conv8_1 = nn.Conv(1024, 256, kernel_size=1, padding=0)
            self.conv8_2 = nn.Conv(256, 512, kernel_size=3, stride=2, padding=1)
            self.conv9_1 = nn.Conv(512, 128, kernel_size=1, padding=0)
            self.conv9_2 = nn.Conv(128, 256, kernel_size=3, stride=2, padding=1)
            self.conv10_1 = nn.Conv(256, 128, kernel_size=1, padding=0)
            self.conv10_2 = nn.Conv(128, 256, kernel_size=3, padding=0)
            self.conv11_1 = nn.Conv(256, 128, kernel_size=1, padding=0)
            self.conv11_2 = nn.Conv(128, 256, kernel_size=3, padding=0)

        def execute(self, conv7_feats):
            out = nn.relu(self.conv8_1(conv7_feats))
            out = nn.relu(self.conv8_2(out))
            conv8_2_feats = out  # (N, 512, 10, 10)
            out = nn.relu(self.conv9_1(out))
            out = nn.relu(self.conv9_2(out))
            conv9_2_feats = out  # (N, 256, 5, 5)
            out = nn.relu(self.conv10_1(out))
            out = nn.relu(self.conv10_2(out))
            conv10_2_feats = out  # (N, 256, 3, 3)
            out = nn.relu(self.conv11_1(out))
            conv11_2_feats = nn.relu(self.conv11_2(out))  # (N, 256, 1, 1)
            return (conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats)

PredictionConvolutions applies a 3×3 convolution to each of the six feature maps above and concatenates the flattened results into locs of shape [bs, 8732, 4] and classes_scores of shape [bs, 8732, n_classes].

    class PredictionConvolutions(nn.Module):
        def __init__(self, n_classes):
            super(PredictionConvolutions, self).__init__()
            self.n_classes = n_classes
            # number of priors per location on each feature map
            n_boxes = {
                'conv4_3': 4,
                'conv7': 6,
                'conv8_2': 6,
                'conv9_2': 6,
                'conv10_2': 4,
                'conv11_2': 4,
            }
            # localization heads: 4 offsets per prior
            self.loc_conv4_3 = nn.Conv(512, (n_boxes['conv4_3'] * 4), kernel_size=3, padding=1)
            self.loc_conv7 = nn.Conv(1024, (n_boxes['conv7'] * 4), kernel_size=3, padding=1)
            self.loc_conv8_2 = nn.Conv(512, (n_boxes['conv8_2'] * 4), kernel_size=3, padding=1)
            self.loc_conv9_2 = nn.Conv(256, (n_boxes['conv9_2'] * 4), kernel_size=3, padding=1)
            self.loc_conv10_2 = nn.Conv(256, (n_boxes['conv10_2'] * 4), kernel_size=3, padding=1)
            self.loc_conv11_2 = nn.Conv(256, (n_boxes['conv11_2'] * 4), kernel_size=3, padding=1)
            # classification heads: n_classes scores per prior
            self.cl_conv4_3 = nn.Conv(512, (n_boxes['conv4_3'] * n_classes), kernel_size=3, padding=1)
            self.cl_conv7 = nn.Conv(1024, (n_boxes['conv7'] * n_classes), kernel_size=3, padding=1)
            self.cl_conv8_2 = nn.Conv(512, (n_boxes['conv8_2'] * n_classes), kernel_size=3, padding=1)
            self.cl_conv9_2 = nn.Conv(256, (n_boxes['conv9_2'] * n_classes), kernel_size=3, padding=1)
            self.cl_conv10_2 = nn.Conv(256, (n_boxes['conv10_2'] * n_classes), kernel_size=3, padding=1)
            self.cl_conv11_2 = nn.Conv(256, (n_boxes['conv11_2'] * n_classes), kernel_size=3, padding=1)

        def execute(self, conv4_3_feats, conv7_feats, conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats):
            batch_size = conv4_3_feats.shape[0]
            # for every feature map: predict, move channels last, flatten to (N, n_priors, 4)
            l_conv4_3 = self.loc_conv4_3(conv4_3_feats)
            l_conv4_3 = jt.transpose(l_conv4_3, [0, 2, 3, 1])
            l_conv4_3 = jt.reshape(l_conv4_3, [batch_size, -1, 4])
            l_conv7 = self.loc_conv7(conv7_feats)
            l_conv7 = jt.transpose(l_conv7, [0, 2, 3, 1])
            l_conv7 = jt.reshape(l_conv7, [batch_size, -1, 4])
            l_conv8_2 = self.loc_conv8_2(conv8_2_feats)
            l_conv8_2 = jt.transpose(l_conv8_2, [0, 2, 3, 1])
            l_conv8_2 = jt.reshape(l_conv8_2, [batch_size, -1, 4])
            l_conv9_2 = self.loc_conv9_2(conv9_2_feats)
            l_conv9_2 = jt.transpose(l_conv9_2, [0, 2, 3, 1])
            l_conv9_2 = jt.reshape(l_conv9_2, [batch_size, -1, 4])
            l_conv10_2 = self.loc_conv10_2(conv10_2_feats)
            l_conv10_2 = jt.transpose(l_conv10_2, [0, 2, 3, 1])
            l_conv10_2 = jt.reshape(l_conv10_2, [batch_size, -1, 4])
            l_conv11_2 = self.loc_conv11_2(conv11_2_feats)
            l_conv11_2 = jt.transpose(l_conv11_2, [0, 2, 3, 1])
            l_conv11_2 = jt.reshape(l_conv11_2, [batch_size, -1, 4])
            # same for class scores, flattened to (N, n_priors, n_classes)
            c_conv4_3 = self.cl_conv4_3(conv4_3_feats)
            c_conv4_3 = jt.transpose(c_conv4_3, [0, 2, 3, 1])
            c_conv4_3 = jt.reshape(c_conv4_3, [batch_size, -1, self.n_classes])
            c_conv7 = self.cl_conv7(conv7_feats)
            c_conv7 = jt.transpose(c_conv7, [0, 2, 3, 1])
            c_conv7 = jt.reshape(c_conv7, [batch_size, -1, self.n_classes])
            c_conv8_2 = self.cl_conv8_2(conv8_2_feats)
            c_conv8_2 = jt.transpose(c_conv8_2, [0, 2, 3, 1])
            c_conv8_2 = jt.reshape(c_conv8_2, [batch_size, -1, self.n_classes])
            c_conv9_2 = self.cl_conv9_2(conv9_2_feats)
            c_conv9_2 = jt.transpose(c_conv9_2, [0, 2, 3, 1])
            c_conv9_2 = jt.reshape(c_conv9_2, [batch_size, -1, self.n_classes])
            c_conv10_2 = self.cl_conv10_2(conv10_2_feats)
            c_conv10_2 = jt.transpose(c_conv10_2, [0, 2, 3, 1])
            c_conv10_2 = jt.reshape(c_conv10_2, [batch_size, -1, self.n_classes])
            c_conv11_2 = self.cl_conv11_2(conv11_2_feats)
            c_conv11_2 = jt.transpose(c_conv11_2, [0, 2, 3, 1])
            c_conv11_2 = jt.reshape(c_conv11_2, [batch_size, -1, self.n_classes])
            # concatenate in a fixed order so priors line up across feature maps
            locs = jt.contrib.concat([l_conv4_3, l_conv7, l_conv8_2, l_conv9_2, l_conv10_2, l_conv11_2], dim=1)
            classes_scores = jt.contrib.concat([c_conv4_3, c_conv7, c_conv8_2, c_conv9_2, c_conv10_2, c_conv11_2], dim=1)
            return (locs, classes_scores)
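A quick shape check of the three modules chained together (a sketch, assuming n_classes = 21 for the 20 VOC classes plus background; the full SSD300 model additionally rescales conv4_3_feats with a learned L2-normalization factor before prediction, which is skipped here):

    # Run a dummy 300x300 image through the three modules and inspect the shapes.
    base = VGGBase()
    aux = AuxiliaryConvolutions()
    head = PredictionConvolutions(n_classes=21)  # 20 VOC classes + background
    image = jt.random([1, 3, 300, 300])
    conv4_3_feats, conv7_feats = base(image)
    conv8_2_feats, conv9_2_feats, conv10_2_feats, conv11_2_feats = aux(conv7_feats)
    locs, classes_scores = head(conv4_3_feats, conv7_feats, conv8_2_feats,
                                conv9_2_feats, conv10_2_feats, conv11_2_feats)
    print(locs.shape)            # [1, 8732, 4]
    print(classes_scores.shape)  # [1, 8732, 21]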

3. Model Training

The training hyperparameters are set as follows:

    # parameters
    batch_size = 20                # batch size
    iterations = 120000            # total number of training iterations
    decay_lr_at = [80000, 100000]  # multiply the learning rate by 0.1 at these iterations
    start_epoch = 0                # starting epoch
    print_freq = 20                # print training info every print_freq iterations
    lr = 3e-4                      # learning rate
    momentum = 0.9                 # SGD momentum
    weight_decay = 5e-4            # SGD weight decay
    grad_clip = 1                  # clamp gradients to [-grad_clip, grad_clip]; None disables clipping

Define the model, optimizer, loss function, and the training/validation data loaders.

    model = SSD300(n_classes=n_classes)
    optimizer = nn.SGD(model.parameters(),
                       lr,
                       momentum=momentum,
                       weight_decay=weight_decay)
    criterion = MultiBoxLoss(priors_cxcy=model.priors_cxcy)
    train_loader = PascalVOCDataset(data_folder,
                                    split='train',
                                    keep_difficult=keep_difficult,
                                    batch_size=batch_size,
                                    shuffle=False)
    val_loader = PascalVOCDataset(data_folder,
                                  split='test',
                                  keep_difficult=keep_difficult,
                                  batch_size=batch_size,
                                  shuffle=False)
    # the paper trains for 120000 iterations with batch_size 32;
    # convert that schedule to epochs
    epochs = iterations // (len(train_loader) // 32)
    # and convert the lr-decay iterations to epochs as well
    decay_lr_at = [it // (len(train_loader) // 32) for it in decay_lr_at]
    for epoch in range(epochs):
        if epoch in decay_lr_at:
            optimizer.lr *= 0.1
        train(train_loader=train_loader,
              model=model,
              criterion=criterion,
              optimizer=optimizer,
              epoch=epoch)
        # evaluate on the validation set every 5 epochs
        if epoch % 5 == 0 and epoch > 0:
            evaluate(test_loader=val_loader, model=model)

Loss design: an L1 loss (L1Loss) supervises the gap between predicted_locs and the ground-truth boxes, and a cross-entropy loss (CrossEntropyLoss) supervises the gap between predicted_scores and the ground-truth labels. Hard negative mining keeps positive and negative samples at a ratio of 1:3.

    class L1Loss(nn.Module):
        def __init__(self, size_average=None, reduce=None, reduction='mean'):
            super(L1Loss, self).__init__()
            self.size_average = size_average
            self.reduce = reduce
            self.reduction = reduction

        def execute(self, input, target):
            ret = jt.abs(input - target)
            if self.reduction is not None:
                ret = jt.mean(ret) if self.reduction == 'mean' else jt.sum(ret)
            return ret

    class CrossEntropyLoss(nn.Module):
        def __init__(self, weight=None, size_average=None, ignore_index=-100, reduce=None, reduction='mean'):
            super(CrossEntropyLoss, self).__init__()
            self.ignore_index = ignore_index
            self.reduction = reduction

        def execute(self, input, target):
            # pick out the negative log-probability of the target class per row
            bs_idx = jt.array(range(input.shape[0]))
            ret = (- jt.log(nn.softmax(input, dim=1)))[bs_idx, target]
            if self.reduction is not None:
                ret = jt.mean(ret) if self.reduction == 'mean' else jt.sum(ret)
            return ret

    class MultiBoxLoss(nn.Module):
        def __init__(self, priors_cxcy, threshold=0.5, neg_pos_ratio=3, alpha=1.0):
            super(MultiBoxLoss, self).__init__()
            self.priors_cxcy = priors_cxcy
            self.priors_xy = cxcy_to_xy(priors_cxcy)
            self.threshold = threshold
            self.neg_pos_ratio = neg_pos_ratio
            self.alpha = alpha
            self.smooth_l1 = L1Loss()
            self.cross_entropy = CrossEntropyLoss(reduce=False, reduction=None)

        def execute(self, predicted_locs, predicted_scores, boxes, labels):
            # ... matching of priors to ground truth (which produces true_locs,
            # true_classes and the positive_priors mask) is omitted here
            # L1 loss between predicted_locs and the encoded ground-truth boxes,
            # restricted to positive priors
            loc_loss = self.smooth_l1(
                (predicted_locs * positive_priors.broadcast([1, 1, 4], [2])),
                (true_locs * positive_priors.broadcast([1, 1, 4], [2]))
            )
            # cross-entropy between predicted_scores and the ground-truth labels
            conf_loss_all = self.cross_entropy(
                jt.reshape(predicted_scores, [-1, n_classes]), jt.reshape(true_classes, [-1, ])
            )
            # ... hard negative mining over conf_loss_all is omitted
            conf_loss = ((conf_loss_hard_neg.sum() + conf_loss_pos.sum()) / n_positives.float32().sum())
            # return the total loss along with its two components
            return conf_loss + (self.alpha * loc_loss), conf_loss, loc_loss
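The omitted mining step keeps, per batch, only the neg_pos_ratio hardest negatives for each positive prior. A toy NumPy illustration of the idea (not the repository's actual code):

    import numpy as np

    # Hard negative mining with neg_pos_ratio = 3: given per-prior confidence
    # losses and a positive mask, keep the 3 * n_positives highest-loss negatives.
    conf_loss = np.array([0.2, 2.5, 0.1, 1.7, 0.05, 0.9, 3.0, 0.3])
    positive = np.array([False, False, False, False, True, False, False, False])
    n_hard_neg = 3 * int(positive.sum())               # 1 positive -> keep 3 negatives
    neg_loss = np.where(positive, -np.inf, conf_loss)  # exclude positives from ranking
    hard_neg_idx = np.argsort(-neg_loss)[:n_hard_neg]  # hardest negatives first
    print(sorted(hard_neg_idx.tolist()))               # [1, 3, 6]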

The training function is defined as follows.

    def train(train_loader, model, criterion, optimizer, epoch):
        global best_mAP, exp_id
        model.train()
        for i, (images, boxes, labels, _) in enumerate(train_loader):
            images = jt.array(images)
            predicted_locs, predicted_scores = model(images)
            loss, conf_loss, loc_loss = criterion(predicted_locs, predicted_scores, boxes, labels)
            if grad_clip is not None:
                optimizer.grad_clip = grad_clip
            optimizer.step(loss)
            if i % print_freq == 0:
                print(f'Experiment id: {exp_id} || Epochs: [{epoch}/{epochs}] || Iters: [{i}/{length}] || Loss: {loss} || Best mAP: {best_mAP}')
                writer.add_scalar('Train/Loss', loss.data[0], global_step=i + epoch * length)
                writer.add_scalar('Train/Loss_conf', conf_loss.data[0], global_step=i + epoch * length)
                writer.add_scalar('Train/Loss_loc', loc_loss.data[0], global_step=i + epoch * length)

4. Results

We measured mAP on the VOC 2007 test set (4952 images). The table below compares the per-class AP of the PyTorch and Jittor implementations; the last row is the mean over all classes (mAP).

Category | AP (PyTorch) | AP (Jittor)
--- | --- | ---
aeroplane | 0.7886 | 0.7900
bicycle | 0.8303 | 0.8285
bird | 0.7593 | 0.7452
boat | 0.7100 | 0.6903
bottle | 0.4408 | 0.4369
bus | 0.8403 | 0.8495
car | 0.8470 | 0.8464
cat | 0.8768 | 0.8725
chair | 0.5668 | 0.5737
cow | 0.8267 | 0.8204
diningtable | 0.7365 | 0.7546
dog | 0.8494 | 0.8550
horse | 0.8689 | 0.8769
motorbike | 0.8246 | 0.8214
person | 0.7724 | 0.7694
pottedplant | 0.4872 | 0.4987
sheep | 0.7448 | 0.7643
sofa | 0.7637 | 0.7532
train | 0.8377 | 0.8517
tvmonitor | 0.7499 | 0.7373
average (mAP) | 0.756 | 0.757

Some qualitative detection results are shown below.

[Figures 3-10: sample detection results]

References

[1] https://github.com/sgrvinod/a-PyTorch-Tutorial-to-Object-Detection

[2] Liu, Wei, et al. "SSD: Single shot multibox detector." European Conference on Computer Vision. Springer, Cham, 2016.

[3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).