Examples

This section contains several examples of how to build models with Ludwig for a variety of tasks. For each task we show an example dataset and a sample model definition that can be used to train a model from that data.

Text Classification

This example shows how to build a text classifier with Ludwig. It can be performed using the Reuters-21578 dataset, in particular the version available on CMU's Text Analytics course website. Other datasets available on the same webpage, like OHSUMED, a well-known medical abstracts dataset, and Epinions.com, a dataset of product reviews, can be used too, as the column names are the same.

| text | class |
|------|-------|
| Toronto Feb 26 - Standard Trustco said it expects earnings in 1987 to increase at least 15… | earnings |
| New York Feb 26 - American Express Co remained silent on market rumors… | acquisition |
| BANGKOK March 25 - Vietnam will resettle 300000 people on state farms known as new economic… | coffee |

    ludwig experiment \
      --data_csv text_classification.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: text
            type: text
            level: word
            encoder: parallel_cnn

    output_features:
        -
            name: class
            type: category
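
News text routinely contains commas and quotes, so it is safer to write the input CSV with a CSV library than by joining strings. The following is a minimal sketch using Python's standard csv module. The helper name write_text_classification_csv and the two sample rows are hypothetical; only the text and class column names come from the example above.

```python
import csv
import io


def write_text_classification_csv(rows, out):
    """Write (text, class) pairs to `out` as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["text", "class"])  # column names used in model_definition.yaml
    for text, label in rows:
        # csv.writer quotes fields that contain commas, quotes or newlines
        writer.writerow([text, label])


# Hypothetical rows, for illustration only
sample_rows = [
    ("Profits rose, the company said", "earnings"),
    ('The board called the bid "friendly"', "acquisition"),
]

# Write to an in-memory buffer and read it back to confirm the round trip
buffer = io.StringIO()
write_text_classification_csv(sample_rows, buffer)
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

To write to disk instead, pass a file opened with open('text_classification.csv', 'w', newline='') as the second argument.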

Named Entity Recognition Tagging

| utterance | tag |
|-----------|-----|
| Blade Runner is a 1982 neo-noir science fiction film directed by Ridley Scott | Movie Movie O O Date O O O O O O Person Person |
| Harrison Ford and Rutger Hauer starred in it | Person Person O Person Person O O O |
| Philip Dick 's novel Do Androids Dream of Electric Sheep ? was published in 1968 | Person Person O O Book Book Book Book Book Book Book O O O Date |

    ludwig experiment \
      --data_csv sequence_tags.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: utterance
            type: text
            level: word
            encoder: rnn
            cell_type: lstm
            reduce_output: null
            preprocessing:
                word_format: space

    output_features:
        -
            name: tag
            type: sequence
            decoder: tagger
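
A tagging dataset of this shape only makes sense if every utterance has exactly one tag per space-separated token. The sketch below is a small sanity check that could be run before training. It assumes that word_format: space splits on single spaces, and the helper name find_misaligned_rows is hypothetical.

```python
def find_misaligned_rows(rows):
    """Return the indexes of rows whose token count differs from their tag count."""
    bad = []
    for index, (utterance, tags) in enumerate(rows):
        # Split both columns on single spaces, mirroring the assumed tokenization
        if len(utterance.split(" ")) != len(tags.split(" ")):
            bad.append(index)
    return bad


# The three sample rows from the table above
rows = [
    ("Blade Runner is a 1982 neo-noir science fiction film directed by Ridley Scott",
     "Movie Movie O O Date O O O O O O Person Person"),
    ("Harrison Ford and Rutger Hauer starred in it",
     "Person Person O Person Person O O O"),
    ("Philip Dick 's novel Do Androids Dream of Electric Sheep ? was published in 1968",
     "Person Person O O Book Book Book Book Book Book Book O O O Date"),
]

misaligned = find_misaligned_rows(rows)
```

An empty result means every row is aligned; any index returned points at a row worth inspecting by hand.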

Natural Language Understanding

| utterance | intent | slots |
|-----------|--------|-------|
| I want a pizza | order_food | O O O B-Food_type |
| Book a flight to Boston | book_flight | O O O O B-City |
| Book a flight at 7pm to London | book_flight | O O O O B-Departure_time O B-City |

    ludwig experiment \
      --data_csv nlu.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: utterance
            type: text
            level: word
            encoder: rnn
            cell_type: lstm
            bidirectional: true
            num_layers: 2
            reduce_output: null
            preprocessing:
                word_format: space

    output_features:
        -
            name: intent
            type: category
            reduce_input: sum
            num_fc_layers: 1
            fc_size: 64
        -
            name: slots
            type: sequence
            decoder: tagger
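
The slots column uses B- prefixed tags, which suggests the common BIO tagging scheme. As a rough illustration of how such a tag sequence maps back to slot values, here is a sketch of a decoder. The function name extract_slots is hypothetical, and support for I- continuation tags is an assumption, since the sample rows only contain B- and O tags.

```python
def extract_slots(utterance, slots):
    """Turn a space-separated BIO tag string into (slot_name, text) pairs."""
    tokens = utterance.split()
    tags = slots.split()
    if len(tokens) != len(tags):
        raise ValueError("utterance and slots must have the same number of items")

    spans = []
    name, words = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new slot begins; close the previous one if it is still open
            if name is not None:
                spans.append((name, " ".join(words)))
            name, words = tag[2:], [token]
        elif tag.startswith("I-") and name == tag[2:]:
            # Continuation of the slot that is currently open (assumed convention)
            words.append(token)
        else:
            # "O" or an inconsistent tag closes any open slot
            if name is not None:
                spans.append((name, " ".join(words)))
            name, words = None, []
    if name is not None:
        spans.append((name, " ".join(words)))
    return spans


flight_slots = extract_slots(
    "Book a flight at 7pm to London",
    "O O O O B-Departure_time O B-City",
)
```

For the third sample row this yields one Departure_time slot and one City slot, in the order they appear in the utterance.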

Machine Translation

| english | italian |
|---------|---------|
| Hello! How are you doing? | Ciao, come stai? |
| I got promoted today | Oggi sono stato promosso! |
| Not doing well today | Oggi non mi sento bene |

    ludwig experiment \
      --data_csv translation.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: english
            type: text
            level: word
            encoder: rnn
            cell_type: lstm
            reduce_output: null
            preprocessing:
                word_format: english_tokenize

    output_features:
        -
            name: italian
            type: text
            level: word
            decoder: generator
            cell_type: lstm
            attention: bahdanau
            loss:
                type: sampled_softmax_cross_entropy
            preprocessing:
                word_format: italian_tokenize

    training:
        batch_size: 96

Chit-Chat Dialogue Modeling through Sequence2Sequence

| user1 | user2 |
|-------|-------|
| Hello! How are you doing? | Doing well, thanks! |
| I got promoted today | Congratulations! |
| Not doing well today | I’m sorry, can I do something to help you? |

    ludwig experiment \
      --data_csv chitchat.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: user1
            type: text
            level: word
            encoder: rnn
            cell_type: lstm
            reduce_output: null

    output_features:
        -
            name: user2
            type: text
            level: word
            decoder: generator
            cell_type: lstm
            attention: bahdanau
            loss:
                type: sampled_softmax_cross_entropy

    training:
        batch_size: 96

Sentiment Analysis

| review | sentiment |
|--------|-----------|
| The movie was fantastic! | positive |
| Great acting and cinematography | positive |
| The acting was terrible! | negative |

    ludwig experiment \
      --data_csv sentiment.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: review
            type: text
            level: word
            encoder: parallel_cnn

    output_features:
        -
            name: sentiment
            type: category

Image Classification

| image_path | class |
|------------|-------|
| images/image_000001.jpg | car |
| images/image_000002.jpg | dog |
| images/image_000003.jpg | boat |

    ludwig experiment \
      --data_csv image_classification.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: image_path
            type: image
            encoder: stacked_cnn

    output_features:
        -
            name: class
            type: category

Image Classification (MNIST)

This is a complete example of training an image classification model on the MNIST dataset.

Download the MNIST dataset.

    git clone https://github.com/myleott/mnist_png.git
    cd mnist_png/
    tar -xf mnist_png.tar.gz
    cd mnist_png/

Create train and test CSVs.

Open a Python shell in the same directory and run this:

    import os

    for name in ['training', 'testing']:
        with open('mnist_dataset_{}.csv'.format(name), 'w') as output_file:
            print('=== creating {} dataset ==='.format(name))
            output_file.write('image_path,label\n')
            # each digit has its own subdirectory, named after its label
            for i in range(10):
                path = '{}/{}'.format(name, i)
                for file in os.listdir(path):
                    if file.endswith(".png"):
                        output_file.write('{},{}\n'.format(os.path.join(path, file), str(i)))

Now you should have mnist_dataset_training.csv and mnist_dataset_testing.csv, containing 60000 and 10000 examples respectively, in the following format:

| image_path | label |
|------------|-------|
| training/0/16585.png | 0 |
| training/0/24537.png | 0 |
| training/0/25629.png | 0 |

Train a model.

From the directory where you have a virtual environment with Ludwig installed:

    ludwig train \
      --data_train_csv <PATH_TO_MNIST_DATASET_TRAINING_CSV> \
      --data_test_csv <PATH_TO_MNIST_DATASET_TEST_CSV> \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: image_path
            type: image
            encoder: stacked_cnn
            conv_layers:
                -
                    num_filters: 32
                    filter_size: 3
                    pool_size: 2
                    pool_stride: 2
                -
                    num_filters: 64
                    filter_size: 3
                    pool_size: 2
                    pool_stride: 2
                    dropout: true
            fc_layers:
                -
                    fc_size: 128
                    dropout: true

    output_features:
        -
            name: label
            type: category

    training:
        dropout_rate: 0.4

Image Captioning

| image_path | caption |
|------------|---------|
| imagenet/image_000001.jpg | car driving on the street |
| imagenet/image_000002.jpg | dog barking at a cat |
| imagenet/image_000003.jpg | boat sailing in the ocean |

    ludwig experiment \
      --data_csv image_captioning.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: image_path
            type: image
            encoder: stacked_cnn

    output_features:
        -
            name: caption
            type: text
            level: word
            decoder: generator
            cell_type: lstm

One-shot Learning with Siamese Networks

This example can be considered a simple baseline for one-shot learning on the Omniglot dataset. The task is, given two images of handwritten characters, to recognize whether they are two instances of the same character or not.

| image_path_1 | image_path_2 | similarity |
|--------------|--------------|------------|
| balinese/character01/0108_13.png | balinese/character01/0108_18.png | 1 |
| balinese/character01/0108_13.png | balinese/character08/0115_12.png | 0 |
| balinese/character01/0108_04.png | balinese/character01/0108_08.png | 1 |
| balinese/character01/0108_11.png | balinese/character05/0112_02.png | 0 |

    ludwig experiment \
      --data_csv balinese_characters.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: image_path_1
            type: image
            encoder: stacked_cnn
            preprocessing:
                width: 28
                height: 28
                resize_image: true
        -
            name: image_path_2
            type: image
            encoder: stacked_cnn
            preprocessing:
                resize_image: true
                width: 28
                height: 28
            tied_weights: image_path_1

    combiner:
        type: concat
        num_fc_layers: 2
        fc_size: 256

    output_features:
        -
            name: similarity
            type: binary
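
The example does not say how the pairs file was produced. One possible way to build such a file from a mapping of characters to image paths is sketched below. The helper name make_pairs and the toy file names are hypothetical, and enumerating every pair is only one choice: it yields far more negative pairs than positive ones, so a real dataset would probably subsample.

```python
import itertools


def make_pairs(files_by_character):
    """Return (image_path_1, image_path_2, similarity) rows for all unordered image pairs."""
    # Flatten to (path, character) tuples in a deterministic order
    labelled = [
        (path, character)
        for character, paths in sorted(files_by_character.items())
        for path in paths
    ]
    rows = []
    for (path_1, char_1), (path_2, char_2) in itertools.combinations(labelled, 2):
        # similarity is 1 when both images show the same character, else 0
        rows.append((path_1, path_2, 1 if char_1 == char_2 else 0))
    return rows


# Hypothetical toy input: two images of one character, one image of another
pairs = make_pairs({
    "character01": ["character01/a.png", "character01/b.png"],
    "character08": ["character08/c.png"],
})
```

Each resulting tuple corresponds to one row of the CSV, with columns image_path_1, image_path_2 and similarity.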

Visual Question Answering

| image_path | question | answer |
|------------|----------|--------|
| imdata/image_000001.jpg | Is there snow on the mountains? | yes |
| imdata/image_000002.jpg | What color are the wheels | blue |
| imdata/image_000003.jpg | What kind of utensil is in the glass bowl | knife |

    ludwig experiment \
      --data_csv vqa.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: image_path
            type: image
            encoder: stacked_cnn
        -
            name: question
            type: text
            level: word
            encoder: parallel_cnn

    output_features:
        -
            name: answer
            type: text
            level: word
            decoder: generator
            cell_type: lstm
            loss:
                type: sampled_softmax_cross_entropy

Speaker Verification

This example describes how to use Ludwig for a simple speaker verification task. We assume the following data, with label 0 corresponding to an audio file of an unauthorized voice and label 1 corresponding to an audio file of an authorized voice. The sample data looks as follows:

| audio_path | label |
|------------|-------|
| audiodata/audio_000001.wav | 0 |
| audiodata/audio_000002.wav | 0 |
| audiodata/audio_000003.wav | 1 |
| audiodata/audio_000004.wav | 1 |

    ludwig experiment \
      --data_csv speaker_verification.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: audio_path
            type: audio
            preprocessing:
                audio_file_length_limit_in_s: 7.0
                audio_feature:
                    type: stft
                    window_length_in_s: 0.04
                    window_shift_in_s: 0.02
            encoder: cnnrnn

    output_features:
        -
            name: label
            type: binary

Kaggle's Titanic: Predicting survivors

This example describes how to use Ludwig to train a model for the Kaggle competition on predicting a passenger's probability of surviving the Titanic disaster. Here's a sample of the data:

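| Pclass | Sex | Age | SibSp | Parch | Fare | Survived | Embarked |
|--------|-----|-----|-------|-------|------|----------|----------|
| 3 | male | 22 | 1 | 0 | 7.2500 | 0 | S |
| 1 | female | 38 | 1 | 0 | 71.2833 | 1 | C |
| 3 | female | 26 | 0 | 0 | 7.9250 | 0 | S |
| 3 | male | 35 | 0 | 0 | 8.0500 | 0 | S |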

The full data and the column descriptions can be found on the Kaggle competition page.

After downloading the data, train a model on this dataset using Ludwig:

    ludwig experiment \
      --data_csv <PATH_TO_TITANIC_CSV> \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: Pclass
            type: category
        -
            name: Sex
            type: category
        -
            name: Age
            type: numerical
            preprocessing:
                missing_value_strategy: fill_with_mean
        -
            name: SibSp
            type: numerical
        -
            name: Parch
            type: numerical
        -
            name: Fare
            type: numerical
            preprocessing:
                missing_value_strategy: fill_with_mean
        -
            name: Embarked
            type: category

    output_features:
        -
            name: Survived
            type: binary

Better results can be obtained with more refined feature transformations and preprocessing, but the only aim of this example is to show how this type of task and data can be used in Ludwig.
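
The Age and Fare features in the model definition use missing_value_strategy: fill_with_mean. Conceptually, this replaces each missing entry with the average of the observed ones. The sketch below illustrates that idea in plain Python. It is not Ludwig's implementation, which may differ in details such as which rows the mean is computed over, and the function name and sample values are hypothetical.

```python
def fill_with_mean(values):
    """Replace None entries with the mean of the non-missing entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot compute a mean without observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical column with one missing value
filled = fill_with_mean([20.0, None, 40.0])
```

Here the missing entry becomes 30.0, the mean of the two observed values.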

Time series forecasting

While direct timeseries prediction is a work in progress, Ludwig can ingest timeseries input feature data and make numerical predictions. Below is an example of a model trained to forecast a timeseries at five different horizons.

| timeseries_data | y1 | y2 | y3 | y4 | y5 |
|-----------------|----|----|----|----|----|
| 15.07 14.89 14.45 … | 16.92 | 16.67 | 16.48 | 17.00 | 17.02 |
| 14.89 14.45 14.30 … | 16.67 | 16.48 | 17.00 | 17.02 | 16.48 |
| 14.45 14.3 14.94 … | 16.48 | 17.00 | 17.02 | 16.48 | 15.82 |

    ludwig experiment \
      --data_csv timeseries_data.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: timeseries_data
            type: timeseries

    output_features:
        -
            name: y1
            type: numerical
        -
            name: y2
            type: numerical
        -
            name: y3
            type: numerical
        -
            name: y4
            type: numerical
        -
            name: y5
            type: numerical
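
A table like the one above can be derived from a single series with a sliding window: each row holds a window of past values as a space-separated string, plus the next few values as targets. The sketch below shows one way to do this. It assumes the targets directly follow the window, which the sample table does not confirm because the window contents are elided, and the function name and toy series are hypothetical.

```python
def make_forecast_rows(series, window, horizons):
    """Build rows with a space-separated history window and y1..yN targets."""
    rows = []
    # The last usable start index leaves room for the window and all targets
    last_start = len(series) - window - horizons
    for start in range(last_start + 1):
        history = series[start:start + window]
        targets = series[start + window:start + window + horizons]
        row = {"timeseries_data": " ".join(str(value) for value in history)}
        for k, target in enumerate(targets, start=1):
            row["y{}".format(k)] = target
        rows.append(row)
    return rows


# Toy series, for illustration only
forecast_rows = make_forecast_rows([1, 2, 3, 4, 5, 6, 7, 8], window=3, horizons=2)
```

Consecutive rows are shifted by one step, which matches the pattern visible in the sample data.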

Time series forecasting (weather data example)

This example illustrates univariate timeseries forecasting using historical temperature data for Los Angeles.

Download and unpack the historical hourly weather data available on Kaggle: https://www.kaggle.com/selfishgene/historical-hourly-weather-data

Run the following Python script to prepare the training dataset:

    import pandas as pd
    from ludwig.utils.data_utils import add_sequence_feature_column

    df = pd.read_csv(
        '<PATH_TO_FILE>/temperature.csv',
        usecols=['Los Angeles']
    ).rename(
        columns={"Los Angeles": "temperature"}
    ).fillna(method='backfill').fillna(method='ffill')

    # normalize to zero mean and unit standard deviation
    df.temperature = ((df.temperature - df.temperature.mean()) /
                      df.temperature.std())

    train_size = int(0.6 * len(df))
    vali_size = int(0.2 * len(df))

    # train, validation, test split (0 = train, 1 = validation, 2 = test)
    df['split'] = 0
    df.loc[
        (
            (df.index.values >= train_size) &
            (df.index.values < train_size + vali_size)
        ),
        ('split')
    ] = 1
    df.loc[
        df.index.values >= train_size + vali_size,
        ('split')
    ] = 2

    # prepare timeseries input feature column
    # (here we are using 20 preceding values to predict the target)
    add_sequence_feature_column(df, 'temperature', 20)
    df.to_csv('<PATH_TO_FILE>/temperature_la.csv')

Then run the experiment:

    ludwig experiment \
      --data_csv <PATH_TO_FILE>/temperature_la.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: temperature_feature
            type: timeseries
            encoder: rnn
            embedding_size: 32
            state_size: 32

    output_features:
        -
            name: temperature
            type: numerical
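
Because the preparation script standardizes the temperature column, the model's predictions are in standardized units too. To read them as temperatures again, the same mean and standard deviation have to be kept and applied in reverse. The sketch below shows that round trip with the standard library. The function names and toy values are hypothetical, and statistics.stdev is used because, like the pandas std() call in the script, it computes the sample standard deviation.

```python
import statistics


def fit_zscore(values):
    """Return the mean and sample standard deviation of the values."""
    return statistics.mean(values), statistics.stdev(values)


def normalize(values, mean, std):
    """Map raw values to standardized units."""
    return [(value - mean) / std for value in values]


def denormalize(values, mean, std):
    """Map standardized values (for example predictions) back to raw units."""
    return [value * std + mean for value in values]


# Toy readings, for illustration only
raw = [280.0, 285.0, 290.0]
mean, std = fit_zscore(raw)
scaled = normalize(raw, mean, std)
restored = denormalize(scaled, mean, std)
```

In practice the mean and standard deviation would be saved next to the prepared CSV so that predictions can be converted back later.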

Movie rating prediction

| year | duration | nominations | categories | rating |
|------|----------|-------------|------------|--------|
| 1921 | 3240 | 0 | comedy drama | 8.4 |
| 1925 | 5700 | 1 | adventure comedy | 8.3 |
| 1927 | 9180 | 4 | drama comedy scifi | 8.4 |

    ludwig experiment \
      --data_csv movie_ratings.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: year
            type: numerical
        -
            name: duration
            type: numerical
        -
            name: nominations
            type: numerical
        -
            name: categories
            type: set

    output_features:
        -
            name: rating
            type: numerical

Multi-label classification

| image_path | tags |
|------------|------|
| images/image_000001.jpg | car man |
| images/image_000002.jpg | happy dog tie |
| images/image_000003.jpg | boat water |

    ludwig experiment \
      --data_csv image_data.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: image_path
            type: image
            encoder: stacked_cnn

    output_features:
        -
            name: tags
            type: set
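
In the tags column each cell is a space-separated collection of labels, and any number of them can apply to one image. A common way to think about such a target is as a multi-hot vector over the tag vocabulary. The sketch below illustrates that view only. It is not a description of Ludwig's internal representation, and the function names are hypothetical.

```python
def build_vocabulary(tag_strings):
    """Map every distinct tag to an index, in alphabetical order."""
    vocabulary = sorted({tag for cell in tag_strings for tag in cell.split()})
    return {tag: index for index, tag in enumerate(vocabulary)}


def multi_hot(tag_string, vocabulary):
    """Return a 0/1 vector with a 1 for each tag present in the cell."""
    vector = [0] * len(vocabulary)
    for tag in tag_string.split():
        vector[vocabulary[tag]] = 1
    return vector


# The tags column from the sample table above
cells = ["car man", "happy dog tie", "boat water"]
vocabulary = build_vocabulary(cells)
first_row = multi_hot(cells[0], vocabulary)
```

With the three sample rows the vocabulary has seven tags, and the first row sets the positions for car and man.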

Multi-Task Learning

This example is inspired by the classic paper Natural Language Processing (Almost) from Scratch by Collobert et al.

| sentence | chunks | part_of_speech | named_entities |
|----------|--------|----------------|----------------|
| San Francisco is very foggy | B-NP I-NP B-VP B-ADJP I-ADJP | NNP NNP VBZ RB JJ | B-Loc I-Loc O O O |
| My dog likes eating sausage | B-NP I-NP B-VP B-VP B-NP | PRP NN VBZ VBG NN | O O O O O |
| Brutus Killed Julius Caesar | B-NP B-VP B-NP I-NP | NNP VBD NNP NNP | B-Per O B-Per I-Per |

    ludwig experiment \
      --data_csv nl_data.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: sentence
            type: sequence
            encoder: rnn
            cell_type: lstm
            bidirectional: true
            reduce_output: null

    output_features:
        -
            name: chunks
            type: sequence
            decoder: tagger
        -
            name: part_of_speech
            type: sequence
            decoder: tagger
        -
            name: named_entities
            type: sequence
            decoder: tagger

Simple Regression: Fuel Efficiency Prediction

This example replicates the Keras example at https://www.tensorflow.org/tutorials/keras/basic_regression to predict the miles per gallon of a car given its characteristics in the Auto MPG dataset.

| MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | ModelYear | Origin |
|-----|-----------|--------------|------------|--------|--------------|-----------|--------|
| 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
| 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
| 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
| 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |

    ludwig experiment \
      --data_csv auto_mpg.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    training:
        batch_size: 32
        epochs: 1000
        early_stop: 50
        learning_rate: 0.001
        optimizer:
            type: rmsprop
    input_features:
        -
            name: Cylinders
            type: numerical
        -
            name: Displacement
            type: numerical
        -
            name: Horsepower
            type: numerical
        -
            name: Weight
            type: numerical
        -
            name: Acceleration
            type: numerical
        -
            name: ModelYear
            type: numerical
        -
            name: Origin
            type: category
    output_features:
        -
            name: MPG
            type: numerical
            loss:
                type: mean_squared_error
            num_fc_layers: 2
            fc_size: 64

Binary Classification: Fraud Transactions Identification

| transaction_id | card_id | customer_id | customer_zipcode | merchant_id | merchant_name | merchant_category | merchant_zipcode | merchant_country | transaction_amount | authorization_response_code | atm_network_xid | cvv_2_response_xflg | fraud_label |
|----------------|---------|-------------|------------------|-------------|---------------|-------------------|------------------|------------------|--------------------|-----------------------------|-----------------|---------------------|-------------|
| 469483 | 9003 | 1085 | 23039 | 893 | Wright Group | 7917 | 91323 | GB | 1962 | C | C | N | 0 |
| 926515 | 9009 | 1001 | 32218 | 1011 | Mums Kitchen | 5813 | 10001 | US | 1643 | C | D | M | 1 |
| 730021 | 9064 | 1174 | 9165 | 916 | Keller | 7582 | 38332 | DE | 1184 | D | B | M | 0 |

    ludwig experiment \
      --data_csv transactions.csv \
      --model_definition_file model_definition.yaml

With model_definition.yaml:

    input_features:
        -
            name: customer_id
            type: category
        -
            name: card_id
            type: category
        -
            name: merchant_id
            type: category
        -
            name: merchant_category
            type: category
        -
            name: merchant_zipcode
            type: category
        -
            name: transaction_amount
            type: numerical
        -
            name: authorization_response_code
            type: category
        -
            name: atm_network_xid
            type: category
        -
            name: cvv_2_response_xflg
            type: category

    combiner:
        type: concat
        num_fc_layers: 1
        fc_size: 48

    output_features:
        -
            name: fraud_label
            type: binary