- Why readers?
- Usage
- Listeners
- Types of readers
- ConcatenatingRecordReader
- FileRecordReader
- LineRecordReader
- CollectionRecordReader
- CollectionSequenceRecordReader
- ListStringRecordReader
- CSVRecordReader
- CSVRegexRecordReader
- CSVSequenceRecordReader
- CSVVariableSlidingWindowRecordReader
- LibSvmRecordReader
- MatlabRecordReader
- SVMLightRecordReader
- RegexLineRecordReader
- RegexSequenceRecordReader
- TransformProcessRecordReader
- TransformProcessSequenceRecordReader
- NativeAudioRecordReader
- WavFileRecordReader
- ImageRecordReader
- TfidfRecordReader
Why readers?
Readers iterate over the records of a dataset in storage and load the data into DataVec. They are useful beyond reading individual entries: for example, you may want to train a text generator on a whole corpus, or programmatically compose two entries to form a new record. Reader implementations are also useful for complex file types and distributed storage mechanisms.
Readers return Writable classes that describe each column in a Record. These classes are used to convert each record to a tensor/ND-Array format.
Usage
Each reader implementation extends BaseRecordReader and provides a simple API for selecting the next record in a dataset, acting similarly to iterators.
Useful methods include:
- next: Return a batch of Writable.
- nextRecord: Return a single Record, optionally with RecordMetaData.
- reset: Reset the underlying iterator.
- hasNext: Iterator method to determine if another record is available.
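The methods above follow the familiar iterator pattern. The following is a minimal sketch of that pattern in plain Java, without DataVec on the classpath. InMemoryReaderSketch is a hypothetical stand-in for a reader, and each record is modelled as a List<String> rather than a list of Writable; it illustrates the calling convention only, not how any DataVec reader is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical stand-in for a record reader; this is not a DataVec class.
class InMemoryReaderSketch {
    private final List<List<String>> records;
    private int cursor = 0;

    InMemoryReaderSketch(List<List<String>> records) {
        this.records = records;
    }

    // True while at least one unread record remains.
    boolean hasNext() {
        return cursor < records.size();
    }

    // Returns the next record (one value per column) and advances the cursor.
    List<String> next() {
        if (!hasNext()) {
            throw new NoSuchElementException("no more records");
        }
        return records.get(cursor++);
    }

    // Rewinds to the first record so the data can be read again.
    void reset() {
        cursor = 0;
    }

    // Drains the reader and returns how many records were still unread.
    static int countRemaining(InMemoryReaderSketch reader) {
        int count = 0;
        while (reader.hasNext()) {
            reader.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        InMemoryReaderSketch reader = new InMemoryReaderSketch(Arrays.asList(
                Arrays.asList("5.1", "3.5", "0"),
                Arrays.asList("6.2", "2.9", "1")));
        System.out.println("records: " + countRemaining(reader));
        reader.reset(); // start over from the first record
        System.out.println("first record again: " + reader.next());
    }
}
```

Calling reset after a full pass is what allows the same reader to be reused for another pass over the data.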
Listeners
You can hook a custom RecordListener to a record reader for debugging or visualization purposes. Pass your custom listener to the addListener base method immediately after initializing your class.
Types of readers
initialize
public void initialize(InputSplit split) throws IOException, InterruptedException
RecordReader for each pipeline. An individual record is a concatenation of the two collections. Create a record reader that takes record readers, iterates over them, and concatenates them:
- hasNext would be the & of all the record readers
- concatenation would be next & addAll on the collection
- return one record
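The column-wise concatenation described above (hasNext as the AND of all readers, next as addAll over their values) can be sketched in plain Java as follows. The class name and the use of Iterator<List<String>> in place of record readers are assumptions made for illustration; this is not DataVec code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration; readers are modelled as iterators over string columns.
class ColumnwiseCompositionSketch {
    // A combined record exists only if every reader still has a record.
    static boolean hasNext(List<Iterator<List<String>>> readers) {
        if (readers.isEmpty()) {
            return false;
        }
        for (Iterator<List<String>> reader : readers) {
            if (!reader.hasNext()) {
                return false;
            }
        }
        return true;
    }

    // Takes one record from each reader and joins their columns into one record.
    static List<String> next(List<Iterator<List<String>>> readers) {
        List<String> combined = new ArrayList<>();
        for (Iterator<List<String>> reader : readers) {
            combined.addAll(reader.next());
        }
        return combined;
    }

    public static void main(String[] args) {
        List<List<String>> features = Arrays.asList(
                Arrays.asList("5.1", "3.5"), Arrays.asList("6.2", "2.9"));
        List<List<String>> labels = Arrays.asList(
                Arrays.asList("0"), Arrays.asList("1"));
        List<Iterator<List<String>>> readers =
                Arrays.asList(features.iterator(), labels.iterator());
        while (hasNext(readers)) {
            System.out.println(next(readers));
        }
    }
}
```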
ConcatenatingRecordReader
Combine multiple readers into a single reader. Records are read sequentially, so if the first reader has 100 records and the second reader has 200 records, ConcatenatingRecordReader will have 300 records.
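As a rough illustration of sequential concatenation, the sketch below drains one source completely before moving to the next, using plain Java iterators in place of record readers. The class and method names are hypothetical and the code does not use DataVec.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration of reading several sources one after another.
class SequentialConcatenationSketch {
    // Reads each source to exhaustion before moving on to the next one.
    static <T> List<T> readSequentially(List<Iterator<T>> sources) {
        List<T> out = new ArrayList<>();
        for (Iterator<T> source : sources) {
            while (source.hasNext()) {
                out.add(source.next());
            }
        }
        return out;
    }

    // Builds n placeholder records named prefix0 .. prefix(n-1).
    static List<String> makeRecords(String prefix, int n) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            records.add(prefix + i);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> first = makeRecords("a", 100);
        List<String> second = makeRecords("b", 200);
        List<String> all = readSequentially(
                Arrays.asList(first.iterator(), second.iterator()));
        System.out.println(all.size()); // prints 300
    }
}
```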
FileRecordReader
File reader/writer
getCurrentLabel
public int getCurrentLabel()
Return the current label: the index of the current file’s parent directory in the label list.
- return The index of the current file’s parent directory
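The idea of a label being the index of a file's parent directory in a label list can be sketched with java.io.File alone. The class name, the example paths, and the choice to return -1 for an unknown directory are assumptions for illustration, not documented FileRecordReader behaviour.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not the FileRecordReader implementation.
class ParentDirLabelSketch {
    // Returns the position of the file's parent directory name in the label list,
    // or -1 if the file has no parent or the name is not a known label.
    static int labelIndex(File file, List<String> labels) {
        File parent = file.getParentFile();
        if (parent == null) {
            return -1;
        }
        return labels.indexOf(parent.getName());
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("cat", "dog");
        System.out.println(labelIndex(new File("data/dog/0001.txt"), labels));
    }
}
```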
LineRecordReader
Reads files line by line
CollectionRecordReader
Collection record reader. Mainly used for testing.
CollectionSequenceRecordReader
Collection record reader for sequences. Mainly used for testing.
initialize
public void initialize(InputSplit split) throws IOException, InterruptedException
- param records Collection of sequences. For example, List<List<List<Writable>>> where the inner two lists are a sequence, and the outer list/collection is a list of sequences
ListStringRecordReader
Iterates through a list of strings and returns each one as a record.
initialize
public void initialize(InputSplit split) throws IOException, InterruptedException
Called once at initialization.
- param split the split that defines the range of records to read
- throws IOException
- throws InterruptedException
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
Called once at initialization.
- param conf a configuration for initialization
- param split the split that defines the range of records to read
- throws IOException
- throws InterruptedException
hasNext
public boolean hasNext()
Determine whether another record is available.
- return true if another record is available
reset
public void reset()
Reset the underlying iterator.
nextRecord
public Record nextRecord()
Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
- param uri
- param dataInputStream
- throws IOException if error occurs during reading from the input stream
close
public void close() throws IOException
Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.
- throws IOException if an I/O error occurs
setConf
public void setConf(Configuration conf)
Set the configuration to be used by this object.
- param conf
getConf
public Configuration getConf()
Return the configuration used by this object.
CSVRecordReader
Simple CSV record reader.
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
Skip first n lines
- param skipNumLines the number of lines to skip
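Skipping the first n lines is typically used to drop a header row. The sketch below shows the idea on in-memory lines with a naive delimiter split; it does not handle quoted fields, and the class and method names are hypothetical rather than part of CSVRecordReader.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of header skipping; not the CSVRecordReader implementation.
class CsvSkipLinesSketch {
    // Drops the first skipNumLines lines, then splits each remaining line on the delimiter.
    static List<List<String>> read(List<String> lines, int skipNumLines, String delimiter) {
        List<List<String>> records = new ArrayList<>();
        String splitRegex = Pattern.quote(delimiter); // treat the delimiter literally
        for (int i = Math.max(0, skipNumLines); i < lines.size(); i++) {
            // A limit of -1 keeps trailing empty columns such as the one in "2,".
            records.add(Arrays.asList(lines.get(i).split(splitRegex, -1)));
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "1,alpha", "2,beta");
        System.out.println(read(lines, 1, ","));
    }
}
```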
CSVRegexRecordReader
A CSVRecordReader that can split each column into additional columns using regexes.
CSVSequenceRecordReader
CSV Sequence Record Reader. This reader is intended to read sequences of data in CSV format, where each sequence is defined in its own file (and there are multiple files). Each line in the file represents one time step.
CSVVariableSlidingWindowRecordReader
A sliding window of variable size across an entire CSV.
In practice the sliding window size starts at 1, then linearly increases to the maximum number of lines per sequence, then linearly decreases back to 1.
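One way to reproduce the size pattern described above (1, 2, ... up to the maximum, then back down to 1) is to let each window end at successive positions and cap its length. This is a sketch under that assumption; the class name is hypothetical and the actual reader's windowing order may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a variable-size sliding window over lines.
class SlidingWindowSketch {
    // Window k ends at line k and holds at most maxLines lines, so window sizes
    // ramp up from 1, plateau at maxLines, and ramp back down to 1.
    static List<List<String>> windows(List<String> lines, int maxLines) {
        List<List<String>> out = new ArrayList<>();
        int n = lines.size();
        if (n == 0 || maxLines < 1) {
            return out;
        }
        for (int end = 0; end < n + maxLines - 1; end++) {
            int from = Math.max(0, end - maxLines + 1);
            int to = Math.min(n, end + 1);
            out.add(new ArrayList<>(lines.subList(from, to)));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("l0", "l1", "l2", "l3", "l4");
        for (List<String> window : windows(lines, 3)) {
            System.out.println(window.size() + " " + window);
        }
    }
}
```

With five lines and a maximum of three, the window sizes come out as 1, 2, 3, 3, 3, 2, 1.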
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
No-arg constructor with the default number of lines per sequence (10)
LibSvmRecordReader
Record reader for libsvm format, which is closely related to SVMLight format. Similar to scikit-learn, we use a single reader for both formats, so this class is a subclass of SVMLightRecordReader.
Further details on the format can be found at
- http://svmlight.joachims.org/
- http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html
- http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html
MatlabRecordReader
Matlab record reader
SVMLightRecordReader
Record reader for SVMLight format, which can generally be described as
LABEL INDEX:VALUE INDEX:VALUE …
SVMLight format is well-suited to sparse data (e.g., bag-of-words) because it omits all features with value zero.
We support an “extended” version that allows for multiple targets (or labels) separated by a comma, as follows:
LABEL1,LABEL2,… INDEX:VALUE INDEX:VALUE …
This can be used to represent either multitask problems or multilabel problems with sparse binary labels (controlled via the “MULTILABEL” configuration option).
Like scikit-learn, we support both zero-based and one-based indexing.
Further details on the format can be found at
- http://svmlight.joachims.org/
- http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multilabel.html
- http://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_svmlight_file.html
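To make the line format concrete, here is a minimal parsing sketch in plain Java. It assumes well-formed lines with no comments, reads the comma-separated labels as numbers, and exposes the zero-based versus one-based choice as a boolean. The class and method names are hypothetical; this is not the SVMLightRecordReader implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of parsing one SVMLight-style line.
class SvmLightLineSketch {
    // Parsed form of one line: target values plus a sparse index -> value map.
    static final class Parsed {
        final List<Double> labels = new ArrayList<>();
        final Map<Integer, Double> features = new TreeMap<>();
    }

    static Parsed parse(String line) {
        Parsed parsed = new Parsed();
        String[] tokens = line.trim().split("\\s+");
        // The first token holds one label, or several labels separated by commas.
        for (String label : tokens[0].split(",")) {
            parsed.labels.add(Double.parseDouble(label));
        }
        // Every remaining token is an INDEX:VALUE pair.
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":", 2);
            parsed.features.put(Integer.parseInt(pair[0]), Double.parseDouble(pair[1]));
        }
        return parsed;
    }

    // Expands the sparse map into a dense array; omitted features stay at zero.
    // Indices outside the array bounds raise ArrayIndexOutOfBoundsException.
    static double[] toDense(Parsed parsed, int numFeatures, boolean zeroBased) {
        double[] dense = new double[numFeatures];
        for (Map.Entry<Integer, Double> e : parsed.features.entrySet()) {
            int position = zeroBased ? e.getKey() : e.getKey() - 1;
            dense[position] = e.getValue();
        }
        return dense;
    }

    public static void main(String[] args) {
        Parsed parsed = parse("1,3 2:0.5 4:1.0");
        System.out.println(parsed.labels + " " + Arrays.toString(toDense(parsed, 4, false)));
    }
}
```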
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
Must be called before attempting to read records.
- param conf DataVec configuration
- param split FileSplit
- throws IOException
- throws InterruptedException
setConf
public void setConf(Configuration conf)
Set configuration.
- param conf DataVec configuration
- throws IOException
- throws InterruptedException
hasNext
public boolean hasNext()
Helper function to help detect lines that are commented out. May read ahead and cache a line.
- return
nextRecord
public Record nextRecord()
Return next record as list of Writables.
- return
RegexLineRecordReader
RegexLineRecordReader: Read a file, one line at a time, and split it into fields using a regex. To load an entire file using a
Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]
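The split in this example can be reproduced with java.util.regex alone. In the sketch below each capture group becomes one field; the final group is written as (.*) so that it captures the whole message, and the dot before the milliseconds is escaped, both of which are assumptions about the intended pattern. The class name is hypothetical and plain strings stand in for Text writables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of regex-based field splitting.
class RegexLineSketch {
    static final Pattern LOG_LINE = Pattern.compile(
            "(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}\\.\\d{3}) (\\d+) ([A-Z]+) (.*)");

    // Returns one string per capture group, or empty when the whole line does not match.
    static Optional<List<String>> split(String line) {
        Matcher matcher = LOG_LINE.matcher(line);
        if (!matcher.matches()) {
            return Optional.empty();
        }
        List<String> fields = new ArrayList<>();
        for (int group = 1; group <= matcher.groupCount(); group++) {
            fields.add(matcher.group(group));
        }
        return Optional.of(fields);
    }

    public static void main(String[] args) {
        System.out.println(split("2016-01-01 23:59:59.001 1 DEBUG First entry message!"));
    }
}
```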
RegexSequenceRecordReader
RegexSequenceRecordReader: Read an entire file (as a sequence), one line at a time, and split each line into fields using a regex.
Example: Data in format “2016-01-01 23:59:59.001 1 DEBUG First entry message!” using regex String “(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3}) (\d+) ([A-Z]+) (.*)” would be split into 4 Text writables: [“2016-01-01 23:59:59.001”, “1”, “DEBUG”, “First entry message!”]
Lines that don’t match the provided regex can result in an exception (FailOnInvalid), be skipped silently (SkipInvalid), or be skipped with a logged warning (SkipInvalidWithWarning).
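The three ways of handling non-matching lines can be sketched as a small enum-driven loop. The enum constants reuse the names mentioned above, while the class, the method, the exception type, and the use of java.util.logging are assumptions for illustration rather than the reader's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of the three invalid-line policies.
class InvalidLineHandlingSketch {
    enum Mode { FailOnInvalid, SkipInvalid, SkipInvalidWithWarning }

    private static final Logger LOG =
            Logger.getLogger(InvalidLineHandlingSketch.class.getName());

    // Splits every matching line into its capture groups; non-matching lines
    // are handled according to the chosen mode.
    static List<List<String>> readAll(List<String> lines, Pattern pattern, Mode mode) {
        List<List<String>> out = new ArrayList<>();
        for (String line : lines) {
            Matcher matcher = pattern.matcher(line);
            if (matcher.matches()) {
                List<String> fields = new ArrayList<>();
                for (int group = 1; group <= matcher.groupCount(); group++) {
                    fields.add(matcher.group(group));
                }
                out.add(fields);
            } else if (mode == Mode.FailOnInvalid) {
                throw new IllegalArgumentException("Line does not match regex: " + line);
            } else if (mode == Mode.SkipInvalidWithWarning) {
                LOG.warning("Skipping line that does not match regex: " + line);
            }
            // Mode.SkipInvalid: the line is dropped without any message.
        }
        return out;
    }

    public static void main(String[] args) {
        Pattern pattern = Pattern.compile("(\\d+),(\\w+)");
        List<String> lines = Arrays.asList("1,a", "garbage", "2,b");
        System.out.println(readAll(lines, pattern, Mode.SkipInvalidWithWarning));
    }
}
```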
TransformProcessRecordReader
Wraps a record reader so that each record has a transform process applied before being returned.
initialize
public void initialize(InputSplit split) throws IOException, InterruptedException
Called once at initialization.
- param split the split that defines the range of records to read
- throws IOException
- throws InterruptedException
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
Called once at initialization.
- param conf a configuration for initialization
- param split the split that defines the range of records to read
- throws IOException
- throws InterruptedException
hasNext
public boolean hasNext()
Determine whether another record is available.
- return true if another record is available
reset
public void reset()
Reset the underlying iterator.
nextRecord
public Record nextRecord()
Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
- param uri
- param dataInputStream
- throws IOException if error occurs during reading from the input stream
loadFromMetaData
public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOException
Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List).
- param recordMetaData Metadata for the record that we want to load from
- return Single record for the given RecordMetaData instance
- throws IOException If I/O error occurs during loading
setListeners
public void setListeners(RecordListener... listeners)
Load multiple records from the given list of RecordMetaData instances.
- param recordMetaDatas Metadata for the records that we want to load from
- return Multiple records for the given RecordMetaData instances
- throws IOException If I/O error occurs during loading
setListeners
public void setListeners(Collection<RecordListener> listeners)
Set the record listeners for this record reader.
- param listeners
close
public void close() throws IOException
Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.
- throws IOException if an I/O error occurs
setConf
public void setConf(Configuration conf)
Set the configuration to be used by this object.
- param conf
getConf
public Configuration getConf()
Return the configuration used by this object.
TransformProcessSequenceRecordReader
Wraps a sequence record reader so that each sequence record is transformed before being returned.
setConf
public void setConf(Configuration conf)
Set the configuration to be used by this object.
- param conf
getConf
public Configuration getConf()
Return the configuration used by this object.
batchesSupported
public boolean batchesSupported()
Returns a sequence record.
- return a sequence of records
nextSequence
public SequenceRecord nextSequence()
Load a sequence record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
- param uri
- param dataInputStream
- throws IOException if error occurs during reading from the input stream
loadSequenceFromMetaData
public SequenceRecord loadSequenceFromMetaData(RecordMetaData recordMetaData) throws IOException
Load a single sequence record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadSequenceFromMetaData(List).
- param recordMetaData Metadata for the sequence record that we want to load from
- return Single sequence record for the given RecordMetaData instance
- throws IOException If I/O error occurs during loading
initialize
public void initialize(InputSplit split) throws IOException, InterruptedException
Load multiple sequence records from the given list of RecordMetaData instances.
- param recordMetaDatas Metadata for the records that we want to load from
- return Multiple sequence records for the given RecordMetaData instances
- throws IOException If I/O error occurs during loading
initialize
public void initialize(Configuration conf, InputSplit split) throws IOException, InterruptedException
Called once at initialization.
- param conf a configuration for initialization
- param split the split that defines the range of records to read
- throws IOException
- throws InterruptedException
hasNext
public boolean hasNext()
Determine whether another record is available.
- return true if another record is available
reset
public void reset()
Reset the underlying iterator.
nextRecord
public Record nextRecord()
Load the record from the given DataInputStream. Unlike next(), the internal state of the RecordReader is not modified. Implementations of this method should not close the DataInputStream.
- param uri
- param dataInputStream
- throws IOException if error occurs during reading from the input stream
loadFromMetaData
public Record loadFromMetaData(RecordMetaData recordMetaData) throws IOException
Load a single record from the given RecordMetaData instance. Note that for data that isn’t splittable (i.e., text data that needs to be scanned/split), it is more efficient to load multiple records at once using loadFromMetaData(List).
- param recordMetaData Metadata for the record that we want to load from
- return Single record for the given RecordMetaData instance
- throws IOException If I/O error occurs during loading
setListeners
public void setListeners(RecordListener... listeners)
Load multiple records from the given list of RecordMetaData instances.
- param recordMetaDatas Metadata for the records that we want to load from
- return Multiple records for the given RecordMetaData instances
- throws IOException If I/O error occurs during loading
setListeners
public void setListeners(Collection<RecordListener> listeners)
Set the record listeners for this record reader.
- param listeners
close
public void close() throws IOException
Closes this stream and releases any system resources associated with it. If the stream is already closed then invoking this method has no effect.
As noted in AutoCloseable#close(), cases where the close may fail require careful attention. It is strongly advised to relinquish the underlying resources and to internally mark the Closeable as closed, prior to throwing the IOException.
- throws IOException if an I/O error occurs
NativeAudioRecordReader
Native audio file loader using FFmpeg.
WavFileRecordReader
Wav file loader
ImageRecordReader
Image record reader. Reads a local file system and parses images of a given height and width. All images are rescaled and converted to the given height, width, and number of channels.
Also appends the label if specified (one-of-k encoding based on the directory structure, where each subdirectory of the root is an indexed label).
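One-of-k encoding from a directory structure can be illustrated without any image handling. The sketch below derives the label list from the parent directory names and builds a vector with a single 1.0 at the label's index. Sorting the labels alphabetically, the class name, and the example paths are assumptions for illustration, not documented ImageRecordReader behaviour.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical illustration of directory-based one-of-k labels.
class OneOfKLabelSketch {
    // Collects the distinct parent-directory names in sorted order,
    // so each label gets a stable index.
    static List<String> labelsFrom(List<File> files) {
        TreeSet<String> names = new TreeSet<>();
        for (File file : files) {
            File parent = file.getParentFile();
            if (parent != null) {
                names.add(parent.getName());
            }
        }
        return new ArrayList<>(names);
    }

    // Returns a vector of length k with 1.0 at the label's index and 0.0 elsewhere.
    static double[] oneOfK(File file, List<String> labels) {
        File parent = file.getParentFile();
        int index = parent == null ? -1 : labels.indexOf(parent.getName());
        if (index < 0) {
            throw new IllegalArgumentException("No label for " + file);
        }
        double[] encoded = new double[labels.size()];
        encoded[index] = 1.0;
        return encoded;
    }

    public static void main(String[] args) {
        List<File> files = Arrays.asList(
                new File("root/dog/1.png"), new File("root/cat/2.png"));
        List<String> labels = labelsFrom(files);
        System.out.println(labels + " " + Arrays.toString(oneOfK(files.get(0), labels)));
    }
}
```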
TfidfRecordReader
TFIDF record reader (wraps a TF-IDF vectorizer for delivering labels and conforming to the record reader interface).