The OFRecord Data Format
Deep learning applications need a complex, multi-stage data preprocessing pipeline, and the first stage of that pipeline is data loading. OneFlow supports multiple data formats for data loading, among which OFRecord is OneFlow's native format.
The OFRecord format definition is similar to TensorFlow's TFRecord, so users familiar with TFRecord can get started with OneFlow's OFRecord quickly.
Key points of this article:
The data types used in OFRecord
How to convert data to an OFRecord object and serialize it
The file format of OFRecord
After covering these points, it should be easier to learn how to make an OFRecord dataset.
Data Types of OFRecord
Internally, OneFlow uses Protocol Buffers to describe the serialization format of OFRecord. The related definitions can be found in the oneflow/core/record/record.proto file:
syntax = "proto2";
package oneflow;

message BytesList {
  repeated bytes value = 1;
}

message FloatList {
  repeated float value = 1 [packed = true];
}

message DoubleList {
  repeated double value = 1 [packed = true];
}

message Int32List {
  repeated int32 value = 1 [packed = true];
}

message Int64List {
  repeated int64 value = 1 [packed = true];
}

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    DoubleList double_list = 3;
    Int32List int32_list = 4;
    Int64List int64_list = 5;
  }
}

message OFRecord {
  map<string, Feature> feature = 1;
}
The important data types above are explained in detail below:
OFRecord: the top-level message, which stores all the data that needs to be serialized. It consists of string->Feature key-value pairs;
Feature: stores one of the list types BytesList, FloatList, DoubleList, Int32List, or Int64List;
Protocol Buffers generates interfaces with the same names as OFRecord, Feature, the XXXList types, and the other data types, making it possible to build the corresponding objects at the Python level.
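To make the shape of these messages concrete, here is a minimal pure-Python sketch of the same structure. The names FeatureSketch, OFRecordSketch, and KINDS are hypothetical stand-ins invented for illustration. They are not the classes that Protocol Buffers generates, and they only model the idea that a record maps string keys to features, each of which carries one kind of list.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical stand-ins for illustration only; NOT the generated protobuf classes.
KINDS = ("bytes_list", "float_list", "double_list", "int32_list", "int64_list")


@dataclass
class FeatureSketch:
    kind: str    # which member of the "oneof kind" is set
    value: list  # the repeated "value" field of that list message

    def __post_init__(self):
        # A Feature can only hold one of the five list types.
        if self.kind not in KINDS:
            raise ValueError(f"unknown kind: {self.kind!r}")


@dataclass
class OFRecordSketch:
    # Mirrors "map<string, Feature> feature = 1;"
    feature: Dict[str, FeatureSketch] = field(default_factory=dict)


record = OFRecordSketch()
record.feature["images"] = FeatureSketch("float_list", [0.1, 0.2, 0.3])
record.feature["labels"] = FeatureSketch("int64_list", [7])
```

The real generated classes also handle serialization and type checking of the list elements, which this sketch leaves out.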
Convert Data into Feature Format
Users can convert data to the Feature format by calling ofrecord.XXXList and ofrecord.Feature. The helper functions below wrap the interfaces generated by Protocol Buffers to make this more convenient.
import six

import oneflow.core.record.record_pb2 as ofrecord


def int32_feature(value):
    # Wrap a single value in a list so that scalars are accepted too.
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int32_list=ofrecord.Int32List(value=value))


def int64_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int64_list=ofrecord.Int64List(value=value))


def float_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(float_list=ofrecord.FloatList(value=value))


def double_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(double_list=ofrecord.DoubleList(value=value))


def bytes_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    if not six.PY2:
        # On Python 3, str values must be encoded to bytes first.
        # The "value and" guard avoids an IndexError on an empty list.
        if value and isinstance(value[0], str):
            value = [x.encode() for x in value]
    return ofrecord.Feature(bytes_list=ofrecord.BytesList(value=value))
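The input handling shared by these helpers can be tried without OneFlow installed. The sketch below isolates it in two hypothetical functions, normalize and normalize_bytes, which are names introduced here for illustration and assume Python 3 only (so the six check is dropped).

```python
def normalize(value):
    """Wrap anything that is not a list or tuple in a one-element list."""
    if not isinstance(value, (list, tuple)):
        value = [value]
    return value


def normalize_bytes(value):
    """Same wrapping, plus encoding to bytes when the first element is a str."""
    value = normalize(value)
    if value and isinstance(value[0], str):
        value = [x.encode() for x in value]
    return value


scalar = normalize(7)               # a scalar becomes a one-element list
sequence = normalize((0.5, 1.5))    # lists and tuples pass through unchanged
text = normalize_bytes("cat")       # a str is wrapped, then encoded
raw = normalize_bytes([b"\x00\x01"])  # bytes are left as they are
```

One consequence of the isinstance check is that an object which is neither a list nor a tuple (a NumPy array, for example) is treated as a single value and wrapped as one element, so it is probably safer to convert such objects to a list before passing them to the helpers.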
Creating and Serializing OFRecord Object
In the following example, we create an OFRecord object that contains two features and then serialize it with its SerializeToString method:
observations = 28 * 28  # number of pixels in one 28*28 image

f = open("./dataset/part-0", "wb")

for loop in range(0, 3):
    # One random image and one random label per sample.
    image = [random.random() for x in range(0, observations)]
    label = [random.randint(0, 9)]

    # Map each feature name to a Feature object.
    topack = {
        "images": float_feature(image),
        "labels": int64_feature(label),
    }

    ofrecord_features = ofrecord.OFRecord(feature=topack)
    serilizedBytes = ofrecord_features.SerializeToString()
With the above example, we can summarize the steps for serializing data:
First, convert the data to be serialized into Feature objects by calling ofrecord.Feature and ofrecord.XXXList.
Second, store the Feature objects obtained in the previous step in a Python dict as string->Feature key-value pairs.
Third, create an OFRecord object by calling ofrecord.OFRecord.
Last, get the serialized result of the OFRecord object with its SerializeToString method.
The serialized result can then be saved to a file in the OFRecord format.
OFRecord Format File
Following OneFlow's format convention, serialized OFRecord objects can be written to an OFRecord file. One OFRecord file can store multiple OFRecord objects and can be used in the OneFlow data pipeline. For the specific operations, see how to make ofrecord dataset.
According to the OneFlow convention, each OFRecord object is stored in the following format.
uint64 length
byte data[length]
The length of the data is stored in the first eight bytes, followed by the serialized data itself.
length = ofrecord_features.ByteSize()
f.write(struct.pack("q", length))
f.write(serilizedBytes)
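The length-prefixed layout can be checked in isolation with the standard library alone. The sketch below uses arbitrary byte strings in place of serialized OFRecord objects and an in-memory buffer in place of a file. The function name write_record is hypothetical, and len(payload) plays the role that ByteSize() plays above.

```python
import io
import struct


def write_record(stream, payload: bytes) -> None:
    """Write one length-prefixed record: an 8-byte length, then the payload."""
    stream.write(struct.pack("q", len(payload)))  # same "q" format as above
    stream.write(payload)


buf = io.BytesIO()
for payload in (b"first", b"second-record"):
    write_record(buf, payload)

raw = buf.getvalue()

# The first eight bytes hold the length of the first payload.
(first_len,) = struct.unpack("q", raw[:8])
first_payload = raw[8:8 + first_len]
```

Note that "q" packs a signed 64-bit integer in native byte order, while the layout above declares the length as uint64. For the non-negative lengths that occur in practice the two produce the same bytes. A format such as "<q" would pin the byte order explicitly, but whether OneFlow expects that on every platform is not stated here.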
Code
The following complete code shows how to generate an OFRecord file and then read the data back manually by calling the OFRecord interface generated by protobuf.
In practice, OneFlow provides flow.data.decode_ofrecord and other interfaces that make it easier to extract the contents of OFRecord files (datasets). See how to make ofrecord dataset for details.
Write OFRecord Object to File
In the following code, we randomly generate 3 samples and their corresponding labels. Each sample is a 28*28 picture. After the three samples are converted into OFRecord objects, they are stored in a file according to the OneFlow format convention.
Complete code: ofrecord_to_string.py
Read Data from OFRecord File
The code below shows how to parse and read data from the OFRecord file generated in the above example. It obtains each OFRecord object by calling the FromString method to deserialize the file contents, then displays the data:
Complete code: ofrecord_from_string.py
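The linked script is not reproduced here. As a rough illustration of the reading side, the following sketch walks a stream of length-prefixed records using only the standard library. The function iter_records and its parse parameter are hypothetical names introduced for this example, and the demo data consists of plain byte strings rather than real serialized OFRecord objects.

```python
import io
import struct


def iter_records(stream, parse=lambda data: data):
    """Yield each record from a stream of 8-byte-length-prefixed payloads."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("q", header)
        if length < 0:
            raise ValueError("negative record length")
        data = stream.read(length)
        if len(data) < length:
            raise ValueError("truncated record body")
        yield parse(data)


# Build a small in-memory "file" with two records, then read it back.
demo = io.BytesIO()
for payload in (b"abc", b"defgh"):
    demo.write(struct.pack("q", len(payload)) + payload)
demo.seek(0)

records = list(iter_records(demo))
```

With OneFlow installed, passing parse=ofrecord.OFRecord.FromString (where ofrecord is oneflow.core.record.record_pb2, as imported earlier) should yield OFRecord objects instead of raw bytes, so that a field could then be read with an expression like record.feature["labels"].int64_list.value. This is an assumption based on the FromString method named above, not a copy of ofrecord_from_string.py.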