The OFRecord Data Format

Deep learning applications need complex, multi-stage data preprocessing pipelines, and the first step of any data pipeline is data loading. OneFlow supports multiple data formats for data loading, among which OFRecord is OneFlow's native format.

The definition of OFRecord is similar to that of TensorFlow's TFRecord. Users familiar with TFRecord can get started with OneFlow's OFRecord quickly.

Key points of this article:

  • The data types used in OFRecord

  • How to convert data to an OFRecord object and serialize it

  • The file format of OFRecord

Understanding these points should help when learning how to make an OFRecord dataset.

Data Types of OFRecord

Internally, OneFlow uses Protocol Buffers to describe the serialization format of OFRecord. The related definitions can be found in the oneflow/core/record/record.proto file:

    syntax = "proto2";
    package oneflow;

    message BytesList {
      repeated bytes value = 1;
    }

    message FloatList {
      repeated float value = 1 [packed = true];
    }

    message DoubleList {
      repeated double value = 1 [packed = true];
    }

    message Int32List {
      repeated int32 value = 1 [packed = true];
    }

    message Int64List {
      repeated int64 value = 1 [packed = true];
    }

    message Feature {
      oneof kind {
        BytesList bytes_list = 1;
        FloatList float_list = 2;
        DoubleList double_list = 3;
        Int32List int32_list = 4;
        Int64List int64_list = 5;
      }
    }

    message OFRecord {
      map<string, Feature> feature = 1;
    }

Let's look at these data types in detail:

  • OFRecord: the top-level message, which holds all the data to be serialized. It consists of key-value pairs that map a string to a Feature;

  • Feature: stores exactly one of the following list types: BytesList, FloatList, DoubleList, Int32List, or Int64List;

  • Protocol Buffers generates interfaces with the same names (OFRecord, Feature, XXXList, and so on), which makes it possible to build the corresponding objects in Python.
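
To make the shape of these messages concrete, here is a plain-Python stand-in. It does not use the generated protobuf classes: FeatureSketch and LIST_KINDS are hypothetical names introduced only for illustration. The point it tries to show is that a record maps string keys to features, and that each feature carries a single list kind, mirroring the oneof in record.proto.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

# The five list kinds named in the `oneof kind` block of record.proto.
LIST_KINDS = ("bytes_list", "float_list", "double_list", "int32_list", "int64_list")


@dataclass
class FeatureSketch:
    """Hypothetical stand-in for Feature: one list kind plus its values."""

    kind: str
    values: List[Union[bytes, float, int]]

    def __post_init__(self):
        # A oneof holds a single member, so only a known kind is accepted here.
        if self.kind not in LIST_KINDS:
            raise ValueError(f"unknown list kind: {self.kind!r}")


# Stand-in for OFRecord: a map from string keys to features.
record: Dict[str, FeatureSketch] = {
    "images": FeatureSketch("float_list", [0.0, 0.5, 1.0]),
    "labels": FeatureSketch("int64_list", [7]),
}

for key, feature in record.items():
    print(key, feature.kind, len(feature.values))
```

The real classes generated by Protocol Buffers enforce the same constraint: setting one member of the oneof clears any other member that was set before.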

Convert Data into Feature Format

Users can convert data to the Feature format by calling ofrecord.XXXList and ofrecord.Feature. The helper functions below wrap the interfaces generated by Protocol Buffers to make this more convenient.

    import six
    import oneflow.core.record.record_pb2 as ofrecord


    def int32_feature(value):
        # Wrap a single value in a list so scalars and sequences are both accepted.
        if not isinstance(value, (list, tuple)):
            value = [value]
        return ofrecord.Feature(int32_list=ofrecord.Int32List(value=value))


    def int64_feature(value):
        if not isinstance(value, (list, tuple)):
            value = [value]
        return ofrecord.Feature(int64_list=ofrecord.Int64List(value=value))


    def float_feature(value):
        if not isinstance(value, (list, tuple)):
            value = [value]
        return ofrecord.Feature(float_list=ofrecord.FloatList(value=value))


    def double_feature(value):
        if not isinstance(value, (list, tuple)):
            value = [value]
        return ofrecord.Feature(double_list=ofrecord.DoubleList(value=value))


    def bytes_feature(value):
        if not isinstance(value, (list, tuple)):
            value = [value]
        # On Python 3, encode str items to bytes before storing them.
        if not six.PY2:
            if isinstance(value[0], str):
                value = [x.encode() for x in value]
        return ofrecord.Feature(bytes_list=ofrecord.BytesList(value=value))
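
The input handling shared by these helpers can be tried without OneFlow installed. The sketch below isolates that step. as_list and as_bytes_list are hypothetical names, not part of OneFlow, and they return plain Python lists rather than Feature objects. Unlike bytes_feature, which inspects only the first item, this sketch checks every item, so it also accepts an empty list or a mix of str and bytes.

```python
def as_list(value):
    """Wrap a scalar in a list; turn a tuple into a list; copy a list."""
    if not isinstance(value, (list, tuple)):
        return [value]
    return list(value)


def as_bytes_list(value):
    """Like as_list, but encode str items to bytes (UTF-8 by default)."""
    return [x.encode() if isinstance(x, str) else x for x in as_list(value)]


print(as_list(5))                      # [5]
print(as_list((1, 2)))                 # [1, 2]
print(as_bytes_list("cat"))            # [b'cat']
print(as_bytes_list([b"raw", "dog"]))  # [b'raw', b'dog']
```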

Creating and Serializing OFRecord Object

In the following example, we create an OFRecord object that contains two features and then serialize it with its SerializeToString method.

    import random

    observations = 28 * 28

    f = open("./dataset/part-0", "wb")

    for loop in range(0, 3):
        # One random 28 * 28 image and one random label per sample.
        image = [random.random() for x in range(0, observations)]
        label = [random.randint(0, 9)]

        topack = {
            "images": float_feature(image),
            "labels": int64_feature(label),
        }

        ofrecord_features = ofrecord.OFRecord(feature=topack)
        serilizedBytes = ofrecord_features.SerializeToString()

With the above example, we can summarize the steps for serializing data:

  • First, convert the data to be serialized into Feature objects by calling ofrecord.Feature and ofrecord.XXXList

  • Second, store the Feature objects from the previous step in a Python dict as string->Feature key-value pairs

  • Third, create an OFRecord object by calling ofrecord.OFRecord

  • Last, get the serialized result of the OFRecord object with its SerializeToString method

The serialized result can then be saved to a file in the OFRecord format.

OFRecord Format File

Once an OFRecord object has been serialized, it can be written to an OFRecord file that follows OneFlow's format convention.

Multiple OFRecord objects can be stored in one OFRecord file, which can then be used in a OneFlow data pipeline. The specific steps are described in how to make ofrecord dataset.

According to the OneFlow convention, each OFRecord object is stored in the following format.

    uint64 length
    byte data[length]

The length of the data is stored in the first eight bytes, followed by the serialized data itself.

    # Continues the loop from the previous example; requires `import struct`.
    length = ofrecord_features.ByteSize()
    f.write(struct.pack("q", length))
    f.write(serilizedBytes)
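
The framing can be checked with the standard library alone. In this sketch, write_record is a hypothetical helper and the payloads are placeholder bytes standing in for the output of SerializeToString. For a real OFRecord, len(payload) equals the value returned by ByteSize().

```python
import io
import struct


def write_record(stream, payload):
    """Write one length-prefixed record: 8-byte length, then the payload."""
    # "q" packs an 8-byte integer in native byte order, as in the snippet above.
    stream.write(struct.pack("q", len(payload)))
    stream.write(payload)


buffer = io.BytesIO()  # in-memory stand-in for a file such as ./dataset/part-0
for payload in (b"first", b"second record"):
    write_record(buffer, payload)

raw = buffer.getvalue()
(first_length,) = struct.unpack("q", raw[:8])
print(len(raw), first_length)  # 34 5
```

Because "q" uses the native byte order, a file written this way is read back most safely on a machine with the same endianness.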

Code

The following complete code shows how to generate an OFRecord file and then read the data back manually by calling the OFRecord interface generated by protobuf.

OneFlow also provides flow.data.decode_ofrecord and other interfaces that make it easier to extract the contents of OFRecord files (datasets). See how to make ofrecord dataset for details.

Write OFRecord Object to File

In the following code, we randomly generate 3 samples and their corresponding labels. Each sample is a 28 * 28 picture. After the three samples are converted into OFRecord objects, they are written to a file in the format that the OneFlow convention specifies.

Complete code: ofrecord_to_string.py

Read Data from OFRecord File

The code below shows how to parse and read data from the OFRecord file generated in the example above. It deserializes the file contents into OFRecord objects by calling the FromString method and then displays the data:

Complete code: ofrecord_from_string.py
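
Reading reverses the framing: read eight bytes, unpack the length, then read that many bytes. The sketch below uses only the standard library. iter_records is a hypothetical helper, not a OneFlow interface, and the payloads are placeholder bytes. With OneFlow installed, each payload would instead be passed to ofrecord.OFRecord.FromString to recover the record.

```python
import io
import struct

LENGTH_SIZE = struct.calcsize("q")  # size of the length prefix: 8 bytes


def iter_records(stream):
    """Yield each payload from a stream of length-prefixed records."""
    while True:
        header = stream.read(LENGTH_SIZE)
        if not header:
            return  # clean end of stream
        if len(header) < LENGTH_SIZE:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("q", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        yield payload


# Build a small in-memory file with two placeholder records.
stream = io.BytesIO()
for data in (b"alpha", b"beta"):
    stream.write(struct.pack("q", len(data)) + data)
stream.seek(0)

records = list(iter_records(stream))
print(records)  # [b'alpha', b'beta']
```

Checking for short reads is a design choice of this sketch: it turns a truncated file into an explicit error rather than a partially parsed record.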