# Examples

This section contains several examples of how to build models with Ludwig for a variety of tasks. For each task we show an example dataset and a sample model definition that can be used to train a model from that data.
## Text Classification

This example shows how to build a text classifier with Ludwig. It can be performed using the Reuters-21578 dataset, in particular the version available on CMU's Text Analytics course website. Other datasets available on the same webpage, such as OHSUMED, a well-known medical abstracts dataset, and Epinions.com, a dataset of product reviews, can be used too, as the column names are the same.
text | class |
---|---|
Toronto Feb 26 - Standard Trustco said it expects earnings in 1987 to increase at least 15… | earnings |
New York Feb 26 - American Express Co remained silent on market rumors… | acquisition |
BANGKOK March 25 - Vietnam will resettle 300000 people on state farms known as new economic… | coffee |
```
ludwig experiment \
  --data_csv text_classification.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: text
        type: text
        level: word
        encoder: parallel_cnn

output_features:
    -
        name: class
        type: category
```
## Named Entity Recognition Tagging
utterance | tag |
---|---|
Blade Runner is a 1982 neo-noir science fiction film directed by Ridley Scott | Movie Movie O O Date O O O O O O Person Person |
Harrison Ford and Rutger Hauer starred in it | Person Person O Person Person O O O |
Philip Dick 's novel Do Androids Dream of Electric Sheep ? was published in 1968 | Person Person O O Book Book Book Book Book Book Book O O O Date |
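Because the `tagger` decoder emits one tag per input token, every row must have exactly as many space-separated tags as the utterance has tokens; misaligned rows are a common source of preprocessing errors. A quick pure-Python alignment check over rows shaped like the table above:

```python
# Check that every row has one tag per space-separated token,
# as the tagger decoder requires.
rows = [
    ("Blade Runner is a 1982 neo-noir science fiction film directed by Ridley Scott",
     "Movie Movie O O Date O O O O O O Person Person"),
    ("Harrison Ford and Rutger Hauer starred in it",
     "Person Person O Person Person O O O"),
]

for utterance, tag in rows:
    tokens = utterance.split(' ')
    tags = tag.split(' ')
    assert len(tokens) == len(tags), (utterance, len(tokens), len(tags))

print("all rows aligned")
```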
```
ludwig experiment \
  --data_csv sequence_tags.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: utterance
        type: text
        level: word
        encoder: rnn
        cell_type: lstm
        reduce_output: null
        preprocessing:
            word_format: space

output_features:
    -
        name: tag
        type: sequence
        decoder: tagger
```
## Natural Language Understanding
utterance | intent | slots |
---|---|---|
I want a pizza | order_food | O O O B-Food_type |
Book a flight to Boston | book_flight | O O O O B-City |
Book a flight at 7pm to London | book_flight | O O O O B-Departure_time O B-City |
```
ludwig experiment \
  --data_csv nlu.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: utterance
        type: text
        level: word
        encoder: rnn
        cell_type: lstm
        bidirectional: true
        num_layers: 2
        reduce_output: null
        preprocessing:
            word_format: space

output_features:
    -
        name: intent
        type: category
        reduce_input: sum
        num_fc_layers: 1
        fc_size: 64
    -
        name: slots
        type: sequence
        decoder: tagger
```
## Machine Translation
english | italian |
---|---|
Hello! How are you doing? | Ciao, come stai? |
I got promoted today | Oggi sono stato promosso! |
Not doing well today | Oggi non mi sento bene |
```
ludwig experiment \
  --data_csv translation.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: english
        type: text
        level: word
        encoder: rnn
        cell_type: lstm
        reduce_output: null
        preprocessing:
            word_format: english_tokenize

output_features:
    -
        name: italian
        type: text
        level: word
        decoder: generator
        cell_type: lstm
        attention: bahdanau
        loss:
            type: sampled_softmax_cross_entropy
        preprocessing:
            word_format: italian_tokenize

training:
    batch_size: 96
```
## Chit-Chat Dialogue Modeling through Sequence2Sequence
user1 | user2 |
---|---|
Hello! How are you doing? | Doing well, thanks! |
I got promoted today | Congratulations! |
Not doing well today | I’m sorry, can I do something to help you? |
```
ludwig experiment \
  --data_csv chitchat.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: user1
        type: text
        level: word
        encoder: rnn
        cell_type: lstm
        reduce_output: null

output_features:
    -
        name: user2
        type: text
        level: word
        decoder: generator
        cell_type: lstm
        attention: bahdanau
        loss:
            type: sampled_softmax_cross_entropy

training:
    batch_size: 96
```
## Sentiment Analysis
review | sentiment |
---|---|
The movie was fantastic! | positive |
Great acting and cinematography | positive |
The acting was terrible! | negative |
```
ludwig experiment \
  --data_csv sentiment.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: review
        type: text
        level: word
        encoder: parallel_cnn

output_features:
    -
        name: sentiment
        type: category
```
## Image Classification
image_path | class |
---|---|
images/image_000001.jpg | car |
images/image_000002.jpg | dog |
images/image_000003.jpg | boat |
```
ludwig experiment \
  --data_csv image_classification.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: image_path
        type: image
        encoder: stacked_cnn

output_features:
    -
        name: class
        type: category
```
## Image Classification (MNIST)

This is a complete example of training an image classification model on the MNIST dataset.
Download the MNIST dataset.
```
git clone https://github.com/myleott/mnist_png.git
cd mnist_png/
tar -xf mnist_png.tar.gz
cd mnist_png/
```
Create train and test CSVs.
Open a Python shell in the same directory and run this:
```python
import os

for name in ['training', 'testing']:
    with open('mnist_dataset_{}.csv'.format(name), 'w') as output_file:
        print('=== creating {} dataset ==='.format(name))
        output_file.write('image_path,label\n')
        for i in range(10):
            path = '{}/{}'.format(name, i)
            for file in os.listdir(path):
                if file.endswith(".png"):
                    output_file.write('{},{}\n'.format(os.path.join(path, file), str(i)))
```
Now you should have `mnist_dataset_training.csv` and `mnist_dataset_testing.csv`, containing 60000 and 10000 examples respectively, in the following format:
image_path | label |
---|---|
training/0/16585.png | 0 |
training/0/24537.png | 0 |
training/0/25629.png | 0 |
Train a model.
From the directory where you have the virtual environment with Ludwig installed:
```
ludwig train \
  --data_train_csv <PATH_TO_MNIST_DATASET_TRAINING_CSV> \
  --data_test_csv <PATH_TO_MNIST_DATASET_TEST_CSV> \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: image_path
        type: image
        encoder: stacked_cnn
        conv_layers:
            -
                num_filters: 32
                filter_size: 3
                pool_size: 2
                pool_stride: 2
            -
                num_filters: 64
                filter_size: 3
                pool_size: 2
                pool_stride: 2
                dropout: true
        fc_layers:
            -
                fc_size: 128
                dropout: true

output_features:
    -
        name: label
        type: category

training:
    dropout_rate: 0.4
```
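As a sanity check on the stack above, you can trace the tensor shape a 28×28 MNIST image takes through the two conv blocks. The arithmetic below assumes same-padded, stride-1 convolutions (our assumption; the example does not state its padding), so only the 2×2 / stride-2 pools shrink the feature maps:

```python
# Trace the spatial size of a 28x28 MNIST input through the two conv
# blocks above. With same-padded, stride-1 convolutions, only the
# 2x2 / stride-2 pooling layers shrink the feature maps.
size = 28
for _ in range(2):            # two conv + pool blocks
    size //= 2                # each 2x2 pool with stride 2 halves the size
flattened = size * size * 64  # 64 filters in the last conv layer
print(size, flattened)        # 7 3136
```

The 3136-element flattened map is then what feeds the 128-unit fully connected layer.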
## Image Captioning
image_path | caption |
---|---|
imagenet/image_000001.jpg | car driving on the street |
imagenet/image_000002.jpg | dog barking at a cat |
imagenet/image_000003.jpg | boat sailing in the ocean |
```
ludwig experiment \
  --data_csv image_captioning.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: image_path
        type: image
        encoder: stacked_cnn

output_features:
    -
        name: caption
        type: text
        level: word
        decoder: generator
        cell_type: lstm
```
## One-shot Learning with Siamese Networks

This example can be considered a simple baseline for one-shot learning on the Omniglot dataset. The task is, given two images of two handwritten characters, to recognize whether they are two instances of the same character or not.
image_path_1 | image_path_2 | similarity |
---|---|---|
balinese/character01/0108_13.png | balinese/character01/0108_18.png | 1 |
balinese/character01/0108_13.png | balinese/character08/0115_12.png | 0 |
balinese/character01/0108_04.png | balinese/character01/0108_08.png | 1 |
balinese/character01/0108_11.png | balinese/character05/0112_02.png | 0 |
```
ludwig experiment \
  --data_csv balinese_characters.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: image_path_1
        type: image
        encoder: stacked_cnn
        preprocessing:
            width: 28
            height: 28
            resize_image: true
    -
        name: image_path_2
        type: image
        encoder: stacked_cnn
        preprocessing:
            width: 28
            height: 28
            resize_image: true
        tied_weights: image_path_1

combiner:
    type: concat
    num_fc_layers: 2
    fc_size: 256

output_features:
    -
        name: similarity
        type: binary
```
## Visual Question Answering
image_path | question | answer |
---|---|---|
imdata/image_000001.jpg | Is there snow on the mountains? | yes |
imdata/image_000002.jpg | What color are the wheels | blue |
imdata/image_000003.jpg | What kind of utensil is in the glass bowl | knife |
```
ludwig experiment \
  --data_csv vqa.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: image_path
        type: image
        encoder: stacked_cnn
    -
        name: question
        type: text
        level: word
        encoder: parallel_cnn

output_features:
    -
        name: answer
        type: text
        level: word
        decoder: generator
        cell_type: lstm
        loss:
            type: sampled_softmax_cross_entropy
```
## Speaker Verification

This example describes how to use Ludwig for a simple speaker verification task. We assume the following data, where label 0 corresponds to an audio file of an unauthorized voice and label 1 corresponds to an audio file of an authorized voice. The sample data looks as follows:
audio_path | label |
---|---|
audiodata/audio_000001.wav | 0 |
audiodata/audio_000002.wav | 0 |
audiodata/audio_000003.wav | 1 |
audiodata/audio_000004.wav | 1 |
```
ludwig experiment \
  --data_csv speaker_verification.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: audio_path
        type: audio
        preprocessing:
            audio_file_length_limit_in_s: 7.0
            audio_feature:
                type: stft
                window_length_in_s: 0.04
                window_shift_in_s: 0.02
        encoder: cnnrnn

output_features:
    -
        name: label
        type: binary
```
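With a 0.04 s window and a 0.02 s shift, each (up to) 7 s clip yields a predictable number of STFT frames. The sketch below uses the standard sliding-window frame count, computed in integer milliseconds to avoid float rounding; it is our back-of-the-envelope estimate, not something Ludwig reports:

```python
# Frames produced by a sliding STFT window:
#   frames = floor((length - window) / shift) + 1
# Computed in integer milliseconds to avoid float rounding.
def num_frames(length_ms, window_ms=40, shift_ms=20):
    if length_ms < window_ms:
        return 0
    return (length_ms - window_ms) // shift_ms + 1

print(num_frames(7000))  # a full 7 s clip -> 349 frames
```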
## Kaggle's Titanic: Predicting survivors

This example describes how to use Ludwig to train a model for the kaggle competition on predicting a passenger's probability of surviving the Titanic disaster. Here's a sample of the data:
Pclass | Sex | Age | SibSp | Parch | Fare | Survived | Embarked |
---|---|---|---|---|---|---|---|
3 | male | 22 | 1 | 0 | 7.2500 | 0 | S |
1 | female | 38 | 1 | 0 | 71.2833 | 1 | C |
3 | female | 26 | 0 | 0 | 7.9250 | 0 | S |
3 | male | 35 | 0 | 0 | 8.0500 | 0 | S |
The full data and the column descriptions can be found here.
After downloading the data, to train a model on this dataset using Ludwig,
```
ludwig experiment \
  --data_csv <PATH_TO_TITANIC_CSV> \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: Pclass
        type: category
    -
        name: Sex
        type: category
    -
        name: Age
        type: numerical
        preprocessing:
            missing_value_strategy: fill_with_mean
    -
        name: SibSp
        type: numerical
    -
        name: Parch
        type: numerical
    -
        name: Fare
        type: numerical
        preprocessing:
            missing_value_strategy: fill_with_mean
    -
        name: Embarked
        type: category

output_features:
    -
        name: Survived
        type: binary
```
Better results can be obtained with more refined feature transformations and preprocessing, but the only aim of this example is to show how these types of tasks and data can be handled in Ludwig.
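For intuition, the `fill_with_mean` strategy used for `Age` and `Fare` replaces each missing value with the mean of the observed values in that column. A minimal sketch with made-up values (here `None` stands for a missing CSV entry):

```python
# Impute missing values (None) with the mean of the observed values,
# as the fill_with_mean preprocessing strategy does.
def fill_with_mean(values):
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [22.0, 38.0, None, 35.0]
print(fill_with_mean(ages))  # the None is replaced by the mean of the rest
```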
## Time series forecasting

While direct timeseries prediction is a work in progress, Ludwig can ingest timeseries input feature data and make numerical predictions. Below is an example of a model trained to forecast a timeseries at five different horizons.
timeseries_data | y1 | y2 | y3 | y4 | y5 |
---|---|---|---|---|---|
15.07 14.89 14.45 … | 16.92 | 16.67 | 16.48 | 17.00 | 17.02 |
14.89 14.45 14.30 … | 16.67 | 16.48 | 17.00 | 17.02 | 16.48 |
14.45 14.3 14.94 … | 16.48 | 17.00 | 17.02 | 16.48 | 15.82 |
```
ludwig experiment \
  --data_csv timeseries_data.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: timeseries_data
        type: timeseries

output_features:
    -
        name: y1
        type: numerical
    -
        name: y2
        type: numerical
    -
        name: y3
        type: numerical
    -
        name: y4
        type: numerical
    -
        name: y5
        type: numerical
```
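Rows like the ones in the table above can be derived from a single series by sliding a window over it: the window contents become the `timeseries_data` string and the following five values become `y1`…`y5`. A minimal sketch with illustrative values and a deliberately short window of 3:

```python
# Build (input window, 5 future targets) rows from a single series:
# the window becomes the timeseries input string, the next five
# values become the y1..y5 targets.
def make_rows(series, window, horizons=5):
    rows = []
    for i in range(len(series) - window - horizons + 1):
        x = ' '.join(str(v) for v in series[i:i + window])
        ys = series[i + window:i + window + horizons]
        rows.append((x, ys))
    return rows

series = [15.07, 14.89, 14.45, 14.30, 14.94, 16.92, 16.67, 16.48, 17.00, 17.02]
rows = make_rows(series, window=3)
# each row pairs a 3-value input window with the 5 values that follow it
```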
## Time series forecasting (weather data example)

This example illustrates univariate timeseries forecasting using historical temperature data for Los Angeles.

Download and unpack the historical hourly weather data available on Kaggle: https://www.kaggle.com/selfishgene/historical-hourly-weather-data

Run the following python script to prepare the training dataset:
```python
import pandas as pd
from ludwig.utils.data_utils import add_sequence_feature_column

df = pd.read_csv(
    '<PATH_TO_FILE>/temperature.csv',
    usecols=['Los Angeles']
).rename(
    columns={"Los Angeles": "temperature"}
).fillna(method='backfill').fillna(method='ffill')

# normalize
df.temperature = ((df.temperature - df.temperature.mean()) /
                  df.temperature.std())

train_size = int(0.6 * len(df))
vali_size = int(0.2 * len(df))

# train, validation, test split
df['split'] = 0
df.loc[
    (
        (df.index.values >= train_size) &
        (df.index.values < train_size + vali_size)
    ),
    ('split')
] = 1
df.loc[
    df.index.values >= train_size + vali_size,
    ('split')
] = 2

# prepare timeseries input feature column
# (here we are using 20 preceding values to predict the target)
add_sequence_feature_column(df, 'temperature', 20)
df.to_csv('<PATH_TO_FILE>/temperature_la.csv')
```
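For intuition, `add_sequence_feature_column` turns the scalar `temperature` column into a string column of its preceding values (written into a new column, `temperature_feature` in this example). The sketch below is our rough approximation of that windowing, not the helper's exact implementation:

```python
# Rough sketch: for each position, join the `window` preceding values
# into a space-separated string, usable as a timeseries input feature.
def preceding_window_strings(values, window):
    out = []
    for i in range(len(values)):
        start = max(0, i - window)
        out.append(' '.join(str(v) for v in values[start:i]))
    return out

temps = [0.1, 0.2, 0.3, 0.4, 0.5]
print(preceding_window_strings(temps, window=3))
```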
```
ludwig experiment \
  --data_csv <PATH_TO_FILE>/temperature_la.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: temperature_feature
        type: timeseries
        encoder: rnn
        embedding_size: 32
        state_size: 32

output_features:
    -
        name: temperature
        type: numerical
```
## Movie rating prediction
year | duration | nominations | categories | rating |
---|---|---|---|---|
1921 | 3240 | 0 | comedy drama | 8.4 |
1925 | 5700 | 1 | adventure comedy | 8.3 |
1927 | 9180 | 4 | drama comedy scifi | 8.4 |
```
ludwig experiment \
  --data_csv movie_ratings.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: year
        type: numerical
    -
        name: duration
        type: numerical
    -
        name: nominations
        type: numerical
    -
        name: categories
        type: set

output_features:
    -
        name: rating
        type: numerical
```
## Multi-label classification
image_path | tags |
---|---|
images/image_000001.jpg | car man |
images/image_000002.jpg | happy dog tie |
images/image_000003.jpg | boat water |
```
ludwig experiment \
  --data_csv image_data.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: image_path
        type: image
        encoder: stacked_cnn

output_features:
    -
        name: tags
        type: set
```
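A `set` output makes this a multi-label problem: each example's tags form a subset of a fixed tag vocabulary, typically represented as a multi-hot vector of independent binary targets. A minimal sketch of that encoding over the sample tags above:

```python
# Encode space-separated tag strings as multi-hot vectors over the
# tag vocabulary, as used for set (multi-label) outputs.
def multi_hot(tag_strings):
    vocab = sorted({t for s in tag_strings for t in s.split()})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for s in tag_strings:
        vec = [0] * len(vocab)
        for t in s.split():
            vec[index[t]] = 1
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = multi_hot(["car man", "happy dog tie", "boat water"])
print(vocab)  # sorted tag vocabulary
```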
## Multi-Task Learning

This example is inspired by the classic paper Natural Language Processing (Almost) from Scratch by Collobert et al.
sentence | chunks | part_of_speech | named_entities |
---|---|---|---|
San Francisco is very foggy | B-NP I-NP B-VP B-ADJP I-ADJP | NNP NNP VBZ RB JJ | B-Loc I-Loc O O O |
My dog likes eating sausage | B-NP I-NP B-VP B-VP B-NP | PRP NN VBZ VBG NN | O O O O O |
Brutus Killed Julius Caesar | B-NP B-VP B-NP I-NP | NNP VBD NNP NNP | B-Per O B-Per I-Per |
```
ludwig experiment \
  --data_csv nl_data.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: sentence
        type: sequence
        encoder: rnn
        cell_type: lstm
        bidirectional: true
        reduce_output: null

output_features:
    -
        name: chunks
        type: sequence
        decoder: tagger
    -
        name: part_of_speech
        type: sequence
        decoder: tagger
    -
        name: named_entities
        type: sequence
        decoder: tagger
```
## Simple Regression: Fuel Efficiency Prediction
This example replicates the Keras example at https://www.tensorflow.org/tutorials/keras/basic_regression to predict the miles per gallon of a car given its characteristics in the Auto MPG dataset.
MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | ModelYear | Origin |
---|---|---|---|---|---|---|---|
18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
```
ludwig experiment \
  --data_csv auto_mpg.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
training:
    batch_size: 32
    epochs: 1000
    early_stop: 50
    learning_rate: 0.001
    optimizer:
        type: rmsprop

input_features:
    -
        name: Cylinders
        type: numerical
    -
        name: Displacement
        type: numerical
    -
        name: Horsepower
        type: numerical
    -
        name: Weight
        type: numerical
    -
        name: Acceleration
        type: numerical
    -
        name: ModelYear
        type: numerical
    -
        name: Origin
        type: category

output_features:
    -
        name: MPG
        type: numerical
        loss:
            type: mean_squared_error
        num_fc_layers: 2
        fc_size: 64
```
## Binary Classification: Fraud Transactions Identification
transaction_id | card_id | customer_id | customer_zipcode | merchant_id | merchant_name | merchant_category | merchant_zipcode | merchant_country | transaction_amount | authorization_response_code | atm_network_xid | cvv_2_response_xflg | fraud_label |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
469483 | 9003 | 1085 | 23039 | 893 | Wright Group | 7917 | 91323 | GB | 1962 | C | C | N | 0 |
926515 | 9009 | 1001 | 32218 | 1011 | Mums Kitchen | 5813 | 10001 | US | 1643 | C | D | M | 1 |
730021 | 9064 | 1174 | 9165 | 916 | Keller | 7582 | 38332 | DE | 1184 | D | B | M | 0 |
```
ludwig experiment \
  --data_csv transactions.csv \
  --model_definition_file model_definition.yaml
```

With `model_definition.yaml`:

```yaml
input_features:
    -
        name: customer_id
        type: category
    -
        name: card_id
        type: category
    -
        name: merchant_id
        type: category
    -
        name: merchant_category
        type: category
    -
        name: merchant_zipcode
        type: category
    -
        name: transaction_amount
        type: numerical
    -
        name: authorization_response_code
        type: category
    -
        name: atm_network_xid
        type: category
    -
        name: cvv_2_response_xflg
        type: category

combiner:
    type: concat
    num_fc_layers: 1
    fc_size: 48

output_features:
    -
        name: fraud_label
        type: binary
```