Developer Guide
Codebase Structure
The codebase is organized in a modular, datatype-centric way, so that adding a feature for a new datatype requires only isolated code changes. All datatype-specific logic lives in the corresponding feature module, all of which are under `ludwig/features/`.
Feature classes contain raw data preprocessing logic specific to each datatype. All input features implement a `build_input` method, which builds the encodings, while all output features implement a `build_output` method, which decodes the outputs. Output features also contain datatype-specific logic to compute output measures such as loss, accuracy, etc.
Encoders and decoders are modularized as well (they are under `ludwig/models/modules`) so that they can be used by multiple features. For example, sequence encoders are shared among text, sequence, and timeseries features.
Reusable model architecture components (for example convolutional modules, fully connected modules, etc.) are also split into dedicated modules.
The bulk of the training logic resides in `ludwig/models/model.py`, which initializes a TensorFlow session, feeds the data, and executes training.
Adding an Encoder
1. Add a new encoder class
Source code for encoders lives under `ludwig/models/modules`. New encoder objects should be defined in the corresponding files; for example, all new sequence encoders should be added to `ludwig/models/modules/sequence_encoders.py`.
All encoder parameters should be provided as arguments in the constructor, with their default values set. For example, the `RNN` encoder takes the following list of arguments in its constructor:
```python
def __init__(
        self,
        should_embed=True,
        vocab=None,
        representation='dense',
        embedding_size=256,
        embeddings_trainable=True,
        pretrained_embeddings=None,
        embeddings_on_cpu=False,
        num_layers=1,
        state_size=256,
        cell_type='rnn',
        bidirectional=False,
        dropout=False,
        initializer=None,
        regularize=True,
        reduce_output='last',
        **kwargs
):
```
Typically, all dependencies are initialized in the encoder's constructor (in the case of the `RNN` encoder these are the `EmbedSequence` and `RecurrentStack` modules), so that by the end of the constructor call all the layers are fully described.
The actual creation of TensorFlow variables takes place inside the `__call__` method of the encoder. All encoders should have the following signature:
```python
__call__(
    self,
    input_placeholder,
    regularizer,
    dropout,
    is_training
)
```
Inputs

- `input_placeholder` (tf.Tensor): input tensor.
- `regularizer` (a (Tensor -> Tensor or None) function): regularizer function passed to the `tf.get_variable` method.
- `dropout` (tf.Tensor(dtype: tf.float32)): dropout rate.
- `is_training` (tf.Tensor(dtype: tf.bool), default: `True`): boolean indicating whether this is a training dataset.

Return

- `hidden` (tf.Tensor(dtype: tf.float32)): feature encodings.
- `hidden_size` (int): feature encodings size.
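Putting the constructor and `__call__` conventions together, a minimal new encoder could look like the sketch below. The `MeanEncoder` class and its behavior are hypothetical (it is not part of Ludwig), and it assumes the input is already a dense float tensor of shape `[batch, sequence, features]`; real sequence encoders also handle embedding of integer token inputs.

```python
import tensorflow as tf


class MeanEncoder:
    """Hypothetical encoder sketch: averages inputs over the sequence
    dimension and projects the result to state_size."""

    def __init__(self, state_size=256, **kwargs):
        # Only describe the layers here; no variables are created yet.
        self.state_size = state_size

    def __call__(self, input_placeholder, regularizer, dropout, is_training):
        # TensorFlow variables are created inside __call__, not __init__.
        weights = tf.get_variable(
            'mean_encoder_weights',
            [input_placeholder.shape[-1].value, self.state_size],
            regularizer=regularizer
        )
        pooled = tf.reduce_mean(input_placeholder, axis=1)
        hidden = tf.matmul(pooled, weights)
        hidden = tf.layers.dropout(
            hidden, rate=dropout, training=is_training
        )
        return hidden, self.state_size
```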
Encoders are initialized as class member variables in input feature object constructors and called inside `build_input` methods.
2. Add the new encoder class to the corresponding encoder registry
Mapping between encoder keywords in the model definition and encoder classes is done through encoder registries; for example, the sequence encoder registry is defined in `ludwig/features/sequence_feature.py`:
```python
sequence_encoder_registry = {
    'stacked_cnn': StackedCNN,
    'parallel_cnn': ParallelCNN,
    'stacked_parallel_cnn': StackedParallelCNN,
    'rnn': RNN,
    'cnnrnn': CNNRNN,
    'embed': EmbedEncoder
}
```
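Continuing the hypothetical `MeanEncoder` example from above, registering it under a new `mean` keyword only requires adding one entry:

```python
sequence_encoder_registry = {
    'stacked_cnn': StackedCNN,
    'parallel_cnn': ParallelCNN,
    'stacked_parallel_cnn': StackedParallelCNN,
    'rnn': RNN,
    'cnnrnn': CNNRNN,
    'embed': EmbedEncoder,
    'mean': MeanEncoder  # hypothetical new encoder
}
```

With this entry in place, the new encoder can then be selected for sequence features in the model definition with `encoder: mean`.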
Adding a Decoder
1. Add a new decoder class
Source code for decoders lives under `ludwig/models/modules`. New decoder objects should be defined in the corresponding files; for example, all new sequence decoders should be added to `ludwig/models/modules/sequence_decoders.py`.
All decoder parameters should be provided as arguments in the constructor, with their default values set. For example, the `Generator` decoder takes the following list of arguments in its constructor:
```python
__init__(
    self,
    cell_type='rnn',
    state_size=256,
    embedding_size=64,
    beam_width=1,
    num_layers=1,
    attention_mechanism=None,
    tied_embeddings=None,
    initializer=None,
    regularize=True,
    **kwargs
)
```
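A new decoder follows the same constructor pattern. The `CopyDecoder` below is a hypothetical sketch showing only the parameter-handling convention, not a working decoder:

```python
class CopyDecoder:
    """Hypothetical decoder sketch: all parameters are declared with
    defaults so the model definition can override any subset of them."""

    def __init__(
            self,
            cell_type='rnn',
            state_size=256,
            initializer=None,
            regularize=True,
            **kwargs
    ):
        self.cell_type = cell_type
        self.state_size = state_size
        self.initializer = initializer
        self.regularize = regularize
```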
Decoders are initialized as class member variables in output feature object constructors and called inside `build_output` methods.
2. Add the new decoder class to the corresponding decoder registry
Mapping between decoder keywords in the model definition and decoder classes is done through decoder registries; for example, the sequence decoder registry is defined in `ludwig/features/sequence_feature.py`:
```python
sequence_decoder_registry = {
    'generator': Generator,
    'tagger': Tagger
}
```
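As with encoders, the hypothetical `CopyDecoder` from above would be exposed under a new keyword by adding an entry to the registry:

```python
sequence_decoder_registry = {
    'generator': Generator,
    'tagger': Tagger,
    'copy': CopyDecoder  # hypothetical new decoder
}
```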
Adding a new Feature Type
1. Add a new feature class
Source code for feature classes lives under `ludwig/features`. Input and output feature classes are defined in the same file; for example, `CategoryInputFeature` and `CategoryOutputFeature` are defined in `ludwig/features/category_feature.py`.
Input features inherit from the `InputFeature` and corresponding base feature classes; for example, `CategoryInputFeature` inherits from `CategoryBaseFeature` and `InputFeature`.
Similarly, output features inherit from the `OutputFeature` and corresponding base feature classes; for example, `CategoryOutputFeature` inherits from `CategoryBaseFeature` and `OutputFeature`.
Feature parameters are provided in a dictionary of key-value pairs as an argument to the input or output feature constructor; this dictionary contains default parameter values as well.
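For example, a sequence input feature could be constructed from a dictionary like the one below. This is an illustrative sketch only: the exact set of keys and the constructor interface may differ from the actual implementation.

```python
feature = SequenceInputFeature({
    'name': 'utterance',  # column name in the dataset
    'type': 'sequence',
    'encoder': 'rnn',     # keyword from sequence_encoder_registry
    'state_size': 512     # overrides the constructor default of 256
})
```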
All input and output features should implement `build_input` and `build_output` methods respectively, with the following signatures:
build_input
```python
build_input(
    self,
    regularizer,
    dropout_rate,
    is_training=False,
    **kwargs
)
```
Inputs

- `regularizer` (a (Tensor -> Tensor or None) function): regularizer function passed to the `tf.get_variable` method.
- `dropout_rate` (tf.Tensor(dtype: tf.float32)): dropout rate.
- `is_training` (tf.Tensor(dtype: tf.bool), default: `False`): boolean indicating whether this is a training dataset.

Return

- `feature_representation` (dict): the following dictionary
```python
{
    'type': self.type,                         # str
    'representation': feature_representation,  # tf.Tensor(dtype: tf.float32)
    'size': feature_representation_size,       # int
    'placeholder': placeholder                 # tf.Tensor(dtype: tf.float32)
}
```
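A skeletal `build_input` that wires the encoder call to this return dictionary might look as follows. This is a hypothetical sketch: the placeholder shape, the `self.length` attribute, and the `encoder_obj` attribute name are assumptions for illustration, not the actual Ludwig implementation.

```python
import tensorflow as tf


class MyTypeInputFeature(InputFeature):  # hypothetical feature class
    def build_input(self, regularizer, dropout_rate, is_training=False,
                    **kwargs):
        # placeholder fed with this feature's preprocessed data
        placeholder = tf.placeholder(
            tf.float32, shape=[None, self.length], name=self.name
        )
        # self.encoder_obj is assumed to be initialized in the constructor
        feature_representation, feature_representation_size = \
            self.encoder_obj(placeholder, regularizer, dropout_rate,
                             is_training)
        return {
            'type': self.type,
            'representation': feature_representation,
            'size': feature_representation_size,
            'placeholder': placeholder
        }
```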
build_output
```python
build_output(
    self,
    hidden,
    hidden_size,
    regularizer=None,
    **kwargs
)
```
Inputs

- `hidden` (tf.Tensor(dtype: tf.float32)): output feature representation.
- `hidden_size` (int): output feature representation size.
- `regularizer` (a (Tensor -> Tensor or None) function): regularizer function passed to the `tf.get_variable` method.

Return

- `train_mean_loss` (tf.Tensor(dtype: tf.float32)): mean loss for the train dataset.
- `eval_loss` (tf.Tensor(dtype: tf.float32)): mean loss for the evaluation dataset.
- `output_tensors` (dict): dictionary containing feature-specific output tensors (predictions, probabilities, losses, etc.).
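The sketch below shows the general shape of a `build_output` implementation for a hypothetical numerical-style output feature; the target placeholder, the squared-error loss, and the tensor names are assumptions for illustration only, not Ludwig's actual code.

```python
import tensorflow as tf


class MyTypeOutputFeature(OutputFeature):  # hypothetical feature class
    def build_output(self, hidden, hidden_size, regularizer=None, **kwargs):
        # placeholder for this output feature's ground truth values
        targets = tf.placeholder(tf.float32, [None], name=self.name)
        weights = tf.get_variable(
            'output_weights', [hidden_size, 1], regularizer=regularizer
        )
        predictions = tf.squeeze(tf.matmul(hidden, weights), axis=-1)
        # per-example losses and their means for train and evaluation
        losses = tf.squared_difference(predictions, targets)
        train_mean_loss = tf.reduce_mean(losses)
        eval_loss = train_mean_loss
        output_tensors = {
            'predictions_' + self.name: predictions,
            'losses_' + self.name: losses
        }
        return train_mean_loss, eval_loss, output_tensors
```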
2. Add the new feature class to the corresponding feature registry
Input and output feature registries are defined in `ludwig/features/feature_registries.py`.
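Registration follows the same pattern as for encoders and decoders. Assuming registries named `input_type_registry` and `output_type_registry` (the registry names, feature classes, and `my_type` keyword here are illustrative), the new feature classes would be added as entries keyed by the datatype keyword:

```python
input_type_registry = {
    # ... existing input feature types ...
    'my_type': MyTypeInputFeature    # hypothetical new input feature
}

output_type_registry = {
    # ... existing output feature types ...
    'my_type': MyTypeOutputFeature   # hypothetical new output feature
}
```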
Style Guidelines
We expect contributions to mimic existing patterns in the codebase and demonstrate good practices: the code should be concise, readable, PEP8-compliant, and conform to the 80 character line length limit.
Tests
We use `pytest` to run tests. Current test coverage is limited to several integration tests that ensure end-to-end functionality, but we are planning to expand it.
Checklist
Before running tests, make sure:

1. Your environment is properly set up.
2. You have write access on the machine. Some of the tests require saving data to disk.
Running tests
To run all tests, just run `python -m pytest` from the ludwig root directory. Note that you don't need to have the ludwig module installed; in that case, code changes will take effect immediately.
To run a single test, run:

```
python -m pytest path_to_filename::test_method_name
```
Example:

```
python -m pytest tests/integration_tests/test_experiment.py::test_visual_question_answering
```