SGD (Hogwild)
To use the original Hogwild dataset we must download it and parse it into a Faasm-friendly form.
This can be done once and uploaded to S3 (see below). From then on it can just be downloaded directly from S3 on the relevant machines.
From there it must be uploaded into the relevant state storage for running the algorithm.
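In summary, the local end-to-end flow looks like the following, with each step detailed in the sections below:
# Download the pre-processed data from S3
inv data.reuters-download-s3
# Load it into the local state store (containers must be running)
inv data.reuters-state-upload localhost
# Upload the function and invoke it
inv upload sgd reuters_svm --prebuilt
inv invoke sgd reuters_svm --input=10 --poll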
Downloading from S3
To download the pre-processed data from S3, run the following:
inv data.reuters-download-s3
If running on a remote host you can then move the files:
scp -r ~/faasm/data <USER>@<HOST>:/home/<USER>/faasm/data
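Once copied, the state upload step below can be run on the remote host itself. A minimal sketch, assuming this repo and its inv tasks are already set up on that host (the checkout path here is hypothetical):
ssh <USER>@<HOST>
cd /usr/local/code/faasm   # hypothetical checkout location on the remote host
inv data.reuters-state-upload localhost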
State upload
The data can then be put into the relevant state store:
# Locally (make sure containers are running)
inv data.reuters-state-upload localhost
# K8s
inv data.reuters-state-upload <k8s_service_host>
# AWS - load into S3, then into Redis via Lambda function
inv data.reuters-state-upload-s3
inv data.reuters-prepare-aws
Invoking the function
First make sure the latest function is in place:
inv upload sgd reuters_svm --prebuilt
Clear the worker set if you've restarted the application:
inv redis.clear-queue
Invoke the process with:
# input = number of worker processes to run
inv invoke sgd reuters_svm --input=10 --poll
Native run
To run the SGD code natively, you need to download the pre-processed data, upload it to the local Redis state storage, then build and execute the main SGD function natively.
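A rough sketch of that sequence, assuming a CMake build with binaries output under bin/ (as with reuters_parse below); the native reuters_svm target name here is an assumption:
# Fetch the pre-processed data and load it into local Redis state
inv data.reuters-download-s3
inv data.reuters-state-upload localhost
# Build and run the main SGD function natively (target name assumed)
mkdir -p build && cd build
cmake .. && make reuters_svm
./bin/reuters_svm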
Preparing data from scratch (one off)
To actually generate the parsed data in the first place, we must use exactly the same RCV1 data as the original Hogwild experiments. The original code can be found on their website. To set up locally you can run:
# Clone fork of Hogwild
cd ansible
ansible-playbook hogwild.yml
# Put the data into a suitable format (output at ~/faasm/rcv1/test)
cd /usr/local/code/hogwild
./bin/svm_data
# Parse (run from build dir for this repo)
./bin/reuters_parse
# Upload data from ~/faasm/data/reuters
inv data.reuters-upload-s3