- API Reference
- SharedTrainingMaster
- fromJson
- fromYaml
- collectTrainingStats
- repartitionData
- repartitionStrategy
- storageLevel
- rddTrainingApproach
- exportDirectory
- rngSeed
- updatesThreshold
- thresholdAlgorithm
- residualPostProcessor
- batchSizePerWorker
- workersPerNode
- debugLongerIterations
- transport
- workerPrefetchNumBatches
- repartitioner
- workerTogglePeriodicGC
- workerPeriodicGCFrequency
- encodingDebugMode
- SharedTrainingMaster
- SparkComputationGraph
- getSparkContext
- getNetwork
- getTrainingMaster
- setNetwork
- getDefaultEvaluationWorkers
- setDefaultEvaluationWorkers
- fit
- fit
- fit
- fit
- fitPaths
- fitPathsMultiDataSet
- fitMultiDataSet
- getScore
- calculateScore
- calculateScore
- calculateScoreMultiDataSet
- calculateScoreMultiDataSet
- scoreExamples
- scoreExamples
- scoreExamplesMultiDataSet
- scoreExamplesMultiDataSet
- evaluate
- evaluate
- evaluateROCMDS
- SparkDl4jMultiLayer
- getSparkContext
- getNetwork
- getTrainingMaster
- setNetwork
- getDefaultEvaluationWorkers
- setDefaultEvaluationWorkers
- setCollectTrainingStats
- getSparkTrainingStats
- predict
- predict
- fit
- fit
- fit
- fit
- fitPaths
- fitLabeledPoint
- fitContinuousLabeledPoint
- getScore
- calculateScore
- calculateScore
- calculateScore
- scoreExamples
- scoreExamples
- scoreExamples
- scoreExamples
- ParameterAveragingTrainingMaster
API Reference
This page provides the API reference for key classes required to do distributed training with DL4J on Spark. Before going through these, make sure you have read the introduction guide to Deeplearning4j Spark training.
SharedTrainingMaster
SharedTrainingMaster implements distributed training of neural networks using a compressed, quantized update (gradient) sharing implementation based on the Strom 2015 paper "Scalable Distributed DNN Training Using Commodity GPU Cloud Computing": https://s3-us-west-2.amazonaws.com/amazon.jobs-public-documents/strom_interspeech2015.pdf. The Deeplearning4j implementation makes a number of modifications, such as having the option to use a parameter-server based implementation for fault tolerance and execution where multicast networking support is not available.
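A minimal configuration sketch follows (not part of the original listing; the port, network mask, controller address, and batch size are placeholder assumptions, and package locations may differ between DL4J versions):

import org.deeplearning4j.optimize.solvers.accumulation.encoding.threshold.AdaptiveThresholdAlgorithm;
import org.deeplearning4j.spark.api.TrainingMaster;
import org.deeplearning4j.spark.parameterserver.training.SharedTrainingMaster;
import org.nd4j.parameterserver.distributed.conf.VoidConfiguration;

int batchSizePerWorker = 32;   // placeholder minibatch size

// Networking configuration for the gradient-sharing transport
VoidConfiguration voidConfiguration = VoidConfiguration.builder()
        .unicastPort(40123)               // port used by workers for update sharing
        .networkMask("10.0.0.0/16")       // placeholder: the cluster's network mask
        .controllerAddress("10.0.2.4")    // placeholder: IP of the Spark master/driver
        .build();

// Second Builder argument: number of examples in each DataSet of the RDD
TrainingMaster tm = new SharedTrainingMaster.Builder(voidConfiguration, batchSizePerWorker)
        .batchSizePerWorker(batchSizePerWorker)
        .workersPerNode(1)                // 1 worker per GPU is the usual setting
        .thresholdAlgorithm(new AdaptiveThresholdAlgorithm(1e-3))
        .build();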
fromJson
public static SharedTrainingMaster fromJson(String jsonStr)
Create a SharedTrainingMaster instance by deserializing a JSON string that has been serialized with {- link #toJson()}
- param jsonStr SharedTrainingMaster configuration serialized as JSON
fromYaml
public static SharedTrainingMaster fromYaml(String yamlStr)
Create a SharedTrainingMaster instance by deserializing a YAML string that has been serialized with {- link #toYaml()}
- param yamlStr SharedTrainingMaster configuration serialized as YAML
collectTrainingStats
public Builder collectTrainingStats(boolean enable)
Set whether training statistics should be collected for debugging purposes (disabled by default).
- param enable If true: collect training statistics
repartitionData
public Builder repartitionData(Repartition repartition)
This parameter defines when repartition is applied (if applied).
- param repartition Repartition setting
- deprecated Use {- link #repartitioner(Repartitioner)}
repartitionStrategy
public Builder repartitionStrategy(RepartitionStrategy repartitionStrategy)
Used in conjunction with {- link #repartitionData(Repartition)} (which defines when repartitioning should be conducted), repartitionStrategy defines how the repartitioning should be done. See {- link RepartitionStrategy} for details.
- param repartitionStrategy Repartitioning strategy to use
- deprecated Use {- link #repartitioner(Repartitioner)}
storageLevel
public Builder storageLevel(StorageLevel storageLevel)
Set the storage level for {- code RDD<DataSet>}s.
Note: Spark's StorageLevel.MEMORY_ONLY() and StorageLevel.MEMORY_AND_DISK() can be problematic when it comes to off-heap data (which DL4J/ND4J uses extensively). Spark does not account for off-heap memory when deciding if/when to drop blocks to ensure enough free memory; consequently, for DataSet RDDs that are larger than the total amount of (off-heap) memory, this can lead to OOM issues. Put another way: Spark counts the on-heap size of DataSet and INDArray objects only (which is negligible), resulting in a significant underestimate of the true DataSet object sizes. More DataSets are thus kept in memory than we can really afford. Note also that fitting directly from an {- code RDD<DataSet>} is discouraged for large data; prefer the export-based approach (see {- link #rddTrainingApproach(RDDTrainingApproach)}).
- param storageLevel Storage level to use for DataSet RDDs
rddTrainingApproach
public Builder rddTrainingApproach(RDDTrainingApproach rddTrainingApproach)
The approach to use when training on a {- code RDD<DataSet>} or {- code RDD<MultiDataSet>}.
- param rddTrainingApproach Training approach to use when training from a {- code RDD<DataSet>} or {- code RDD<MultiDataSet>}
exportDirectory
public Builder exportDirectory(String exportDirectory)
When {- link #rddTrainingApproach(RDDTrainingApproach)} is set to {- link RDDTrainingApproach#Export} (as it is by default), the data is exported to a temporary directory first.
Default: null -> use {hadoop.tmp.dir}/dl4j/. In this case, data is exported to {hadoop.tmp.dir}/dl4j/SOME_UNIQUE_ID/. If you specify a directory, the directory {exportDirectory}/SOME_UNIQUE_ID/ will be used instead.
- param exportDirectory Base directory to export data
rngSeed
public Builder rngSeed(long rngSeed)
Random number generator seed, used mainly for enforcing repeatable splitting/repartitioning on RDDs. Default: no seed set (i.e., random seed)
- param rngSeed RNG seed
updatesThreshold
public Builder updatesThreshold(double updatesThreshold)
- deprecated Use {- link #thresholdAlgorithm(ThresholdAlgorithm)} with (for example) {- link AdaptiveThresholdAlgorithm}
thresholdAlgorithm
public Builder thresholdAlgorithm(ThresholdAlgorithm thresholdAlgorithm)
Algorithm to use to determine the threshold for update encoding. Lower values might improve convergence, but increase the amount of network communication; values that are too low may also impact network convergence. If convergence problems are observed, try increasing or decreasing this by a factor of 10, say to 1e-4 or 1e-2. For technical details, see the paper Scalable Distributed DNN Training Using Commodity GPU Cloud Computing. See also {- link ThresholdAlgorithm}. Default: {- link AdaptiveThresholdAlgorithm} with default parameters.
- param thresholdAlgorithm Threshold algorithm to use to determine encoding threshold
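For example, continuing the builder sketch above (FixedThresholdAlgorithm is assumed to be available alongside AdaptiveThresholdAlgorithm in your DL4J version):

// Adaptive threshold, starting from 1e-3 (a common starting point):
builder.thresholdAlgorithm(new AdaptiveThresholdAlgorithm(1e-3));
// Or, assuming FixedThresholdAlgorithm is available, a constant threshold:
builder.thresholdAlgorithm(new FixedThresholdAlgorithm(1e-3));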
residualPostProcessor
public Builder residualPostProcessor(ResidualPostProcessor residualPostProcessor)
Residual post processor. See {- link ResidualPostProcessor} for details.
Default: {- code new ResidualClippingPostProcessor(5.0, 5)} - i.e., a {- link ResidualClippingPostProcessor} that clips the residual to +/- 5x the current threshold, every 5 iterations.
- param residualPostProcessor Residual post processor to use
batchSizePerWorker
public Builder batchSizePerWorker(int batchSize)
Minibatch size to use when training workers. In principle, the source data (i.e., the {- code RDD<DataSet>}) can have a different number of examples in each DataSet than this value.
- param batchSize Minibatch size to use when fitting each worker
workersPerNode
public Builder workersPerNode(int numWorkers)
This method allows you to configure the number of network training threads per cluster node. Default: -1, meaning the number of workers is selected automatically based on the hardware present in the system (i.e., the number of GPUs, if training on a GPU-enabled system).
When training on GPUs, you should use 1 worker per GPU (the default). For CPUs, 1 worker per node is usually preferred, though multi-CPU nodes (i.e., multiple physical CPUs) or CPUs with large core counts may have better throughput (i.e., more examples per second) when increasing the number of workers, at the expense of more memory consumed. Note that if you increase the number of workers on a CPU system, you should set the number of OpenMP threads using the {- code OMP_NUM_THREADS} property - see {- link org.nd4j.config.ND4JEnvironmentVars#OMP_NUM_THREADS} for more details. For example, a machine with 32 physical cores could use 4 workers with {- code OMP_NUM_THREADS=8}.
- param numWorkers Number of workers on each node.
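A sketch of the 32-core CPU scenario above (spark.executorEnv is standard Spark configuration; builder continues the earlier sketch):

// Set OMP_NUM_THREADS=8 on each executor, e.g. via spark-submit:
//   --conf spark.executorEnv.OMP_NUM_THREADS=8
// Then use 4 workers per node (4 workers x 8 OpenMP threads = 32 cores):
builder.workersPerNode(4);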
debugLongerIterations
public Builder debugLongerIterations(long timeMs)
This method allows you to artificially extend the iteration time using Thread.sleep() for a given number of milliseconds.
PLEASE NOTE: Never use this option in a production environment; it is suited for debugging purposes only.
- param timeMs Time to sleep per iteration, in milliseconds
- return This builder
transport
public Builder transport(Transport transport)
Optional method: Transport implementation to be used as TransportType.CUSTOM for the VoidParameterAveraging method. Generally not used by users.
- param transport Transport to use
- return This builder
workerPrefetchNumBatches
public Builder workerPrefetchNumBatches(int prefetchNumBatches)
Number of minibatches to asynchronously prefetch on each worker when training. Default: 2, which is usually suitable in most cases. Increasing this might help in some cases of ETL (data loading) bottlenecks, at the expense of greater memory consumption.
- param prefetchNumBatches Number of batches to prefetch
repartitioner
public Builder repartitioner(Repartitioner repartitioner)
Repartitioner to use to repartition data before fitting. DL4J performs a MapPartitions operation for training, hence how the data is partitioned can matter a lot for performance: too few partitions (or very imbalanced partitions) can result in poor cluster utilization, due to some workers being idle. A larger number of smaller partitions can help to avoid the so-called "end-of-epoch" effect, where training can only complete once the last/slowest worker finishes its partition. The default repartitioner is {- link DefaultRepartitioner}, which repartitions equally up to a maximum of 5000 partitions, and is usually suitable for most purposes. In the worst case, the "end of epoch" effect when using this partitioner should be limited to a maximum of the amount of time required to process a single partition.
- param repartitioner Repartitioner to use
workerTogglePeriodicGC
public Builder workerTogglePeriodicGC(boolean workerTogglePeriodicGC)
Used to enable or disable the periodic garbage collection calls on the workers. Equivalent to {- code Nd4j.getMemoryManager().togglePeriodicGc(workerTogglePeriodicGC);}. Pass false to disable periodic GC on the workers, or true (the default) to keep it enabled.
- param workerTogglePeriodicGC Worker periodic garbage collection setting
workerPeriodicGCFrequency
public Builder workerPeriodicGCFrequency(int workerPeriodicGCFrequency)
Used to set the periodic garbage collection frequency on the workers. Equivalent to calling {- code Nd4j.getMemoryManager().setAutoGcWindow(workerPeriodicGCFrequency);} on each worker. Does not have any effect if {- link #workerTogglePeriodicGC(boolean)} is set to false.
- param workerPeriodicGCFrequency The periodic GC frequency to use on the workers
encodingDebugMode
public Builder encodingDebugMode(boolean enabled)
Enable debug mode for threshold encoding. When enabled, various statistics for the threshold and the residual will be calculated and logged on each worker (at info log level). This information can be used to check whether the encoding threshold is too big (for example, virtually all updates are much smaller than the threshold) or too small (the majority of updates are much larger than the threshold). encodingDebugMode is disabled by default.
IMPORTANT: enabling this has a performance overhead, and it should not be enabled unless the debug information is actually required.
- param enabled True to enable
SparkComputationGraph
Main class for training ComputationGraph networks using Spark. Also used for performing distributed evaluation and inference on these networks.
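A usage sketch (conf, tm, and the training data are placeholders: conf is your ComputationGraphConfiguration, tm a TrainingMaster such as those shown above, sc an existing JavaSparkContext):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.spark.impl.graph.SparkComputationGraph;
import org.nd4j.linalg.dataset.DataSet;

ComputationGraph net = new ComputationGraph(conf);   // conf: your ComputationGraphConfiguration
net.init();

SparkComputationGraph sparkNet = new SparkComputationGraph(sc, net, tm);

JavaRDD<DataSet> trainData = /* your prepared data */;
for (int epoch = 0; epoch < 10; epoch++) {
    sparkNet.fit(trainData);   // one call = one pass (epoch) over the training data
}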
getSparkContext
public JavaSparkContext getSparkContext()
Get the Spark context associated with this SparkComputationGraph instance.
- return The JavaSparkContext
getNetwork
public ComputationGraph getNetwork()
- return The trained ComputationGraph
getTrainingMaster
public TrainingMaster getTrainingMaster()
- return The TrainingMaster for this network
setNetwork
public void setNetwork(ComputationGraph network)
- param network The network to be used for any subsequent training, inference and evaluation steps
getDefaultEvaluationWorkers
public int getDefaultEvaluationWorkers()
Returns the currently set default number of evaluation workers/threads. Note that when the number of workers is provided explicitly in an evaluation method, the default value is not used. In many cases, we may want this to be smaller than the number of Spark threads, to reduce memory requirements. For example, with 32 Spark threads and a large network, we don't want to spin up 32 instances of the network to perform evaluation. It is better (for memory requirements, and reduced cache thrashing) to use, say, 4 workers. If it is not set explicitly, {- link #DEFAULT_EVAL_WORKERS} will be used.
- return Default number of evaluation workers (threads).
setDefaultEvaluationWorkers
public void setDefaultEvaluationWorkers(int workers)
Set the default number of evaluation workers/threads. Note that when the number of workers is provided explicitly in an evaluation method, the default value is not used. In many cases, we may want this to be smaller than the number of Spark threads, to reduce memory requirements. For example, with 32 Spark threads and a large network, we don't want to spin up 32 instances of the network to perform evaluation. It is better (for memory requirements, and reduced cache thrashing) to use, say, 4 workers. If it is not set explicitly, {- link #DEFAULT_EVAL_WORKERS} will be used.
- param workers Default number of evaluation workers (threads).
fit
public ComputationGraph fit(RDD<DataSet> rdd)
Fit the ComputationGraph with the given data set
- param rdd Data to train on
- return Trained network
fit
public ComputationGraph fit(JavaRDD<DataSet> rdd)
Fit the ComputationGraph with the given data set
- param rdd Data to train on
- return Trained network
fit
public ComputationGraph fit(String path)
Fit the SparkComputationGraph network using a directory of serialized DataSet objects. The assumption is that the directory contains a number of {- link DataSet} objects, each serialized using {- link DataSet#save(OutputStream)}.
- param path Path to the directory containing the serialized DataSet objects
- return The ComputationGraph after training
fit
public ComputationGraph fit(String path, int minPartitions)
- deprecated Use {- link #fit(String)}
fitPaths
public ComputationGraph fitPaths(JavaRDD<String> paths)
Fit the network using a list of paths for serialized DataSet objects.
- param paths List of paths
- return trained network
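For example, a hedged sketch of the export-then-fit workflow (SparkUtils.listPaths is assumed to be available in org.deeplearning4j.spark.util; the HDFS path is a placeholder):

import org.deeplearning4j.spark.util.SparkUtils;

// Directory contains files previously written with DataSet.save(...):
JavaRDD<String> paths = SparkUtils.listPaths(sc, "hdfs:///dl4j/exportedData/");
sparkNet.fitPaths(paths);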
fitPathsMultiDataSet
public ComputationGraph fitPathsMultiDataSet(JavaRDD<String> paths)
Fit the network using a list of paths for serialized MultiDataSet objects.
- param paths List of paths
- return Trained network
fitMultiDataSet
public ComputationGraph fitMultiDataSet(String path, int minPartitions)
- deprecated use {- link #fitMultiDataSet(String)}
getScore
public double getScore()
Gets the last (average) minibatch score from calling fit. This is the average score across all executors for the last minibatch executed in each worker.
calculateScore
public double calculateScore(JavaRDD<DataSet> data, boolean average)
Calculate the score for all examples in the provided {- code JavaRDD<DataSet>}, either by summing or averaging over the entire data set.
- param data Data to score
- param average Whether to sum the scores, or average them
calculateScore
public double calculateScore(JavaRDD<DataSet> data, boolean average, int minibatchSize)
Calculate the score for all examples in the provided {- code JavaRDD<DataSet>}, either by summing or averaging over the entire data set, using the specified minibatch size when scoring.
- param data Data to score
- param average Whether to sum the scores, or average them
- param minibatchSize The number of examples to use in each minibatch when scoring. If more examples are in a partition than this, multiple scoring operations will be done (to avoid using too much memory by doing the whole partition in one go)
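For example, continuing the sketch above:

// Average loss over all examples in the test set, scored in minibatches of 64:
JavaRDD<DataSet> testData = /* your test data */;
double avgScore = sparkNet.calculateScore(testData, true, 64);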
calculateScoreMultiDataSet
public double calculateScoreMultiDataSet(JavaRDD<MultiDataSet> data, boolean average)
Calculate the score for all examples in the provided {- code JavaRDD<MultiDataSet>}, either by summing or averaging over the entire data set.
- param data Data to score
- param average Whether to sum the scores, or average them
calculateScoreMultiDataSet
public double calculateScoreMultiDataSet(JavaRDD<MultiDataSet> data, boolean average, int minibatchSize)
Calculate the score for all examples in the provided {- code JavaRDD<MultiDataSet>}, either by summing or averaging over the entire data set, using the specified minibatch size when scoring.
- param data Data to score
- param average Whether to sum the scores, or average them
- param minibatchSize The number of examples to use in each minibatch when scoring. If more examples are in a partition than this, multiple scoring operations will be done (to avoid using too much memory by doing the whole partition in one go)
scoreExamples
public JavaDoubleRDD scoreExamples(JavaRDD<DataSet> data, boolean includeRegularizationTerms)
DataSet version of {- link #scoreExamples(JavaRDD, boolean)}
scoreExamples
public JavaDoubleRDD scoreExamples(JavaRDD<DataSet> data, boolean includeRegularizationTerms, int batchSize)
DataSet version of {- link #scoreExamples(JavaPairRDD, boolean, int)}
scoreExamplesMultiDataSet
public JavaDoubleRDD scoreExamplesMultiDataSet(JavaRDD<MultiDataSet> data, boolean includeRegularizationTerms)
DataSet version of {- link #scoreExamples(JavaPairRDD, boolean)}
scoreExamplesMultiDataSet
public JavaDoubleRDD scoreExamplesMultiDataSet(JavaRDD<MultiDataSet> data, boolean includeRegularizationTerms,
int batchSize)
Score the examples individually, using a specified batch size. Unlike {- link #calculateScore(JavaRDD, boolean)}, this method returns a score for each example separately. If scoring is needed for specific examples, use either {- link #scoreExamples(JavaPairRDD, boolean)} or {- link #scoreExamples(JavaPairRDD, boolean, int)}, which can have a key for each example.
- param data Data to score
- param includeRegularizationTerms If true: include the l1/l2 regularization terms with the score (if any)
- param batchSize Batch size to use when doing scoring
- return A JavaDoubleRDD containing the scores of each example
- see ComputationGraph#scoreExamples(MultiDataSet, boolean)
evaluate
public Evaluation evaluate(String path, DataSetLoader loader)
Evaluate the single-output network on a directory containing a set of DataSet objects to be loaded with a {- link DataSetLoader}. Uses the default batch size of {- link #DEFAULT_EVAL_SCORE_BATCH_SIZE}.
- param path Path/URI to the directory containing the datasets to load
- return Evaluation
evaluate
public Evaluation evaluate(String path, MultiDataSetLoader loader)
Evaluate the single-output network on a directory containing a set of MultiDataSet objects to be loaded with a {- link MultiDataSetLoader}. Uses the default batch size of {- link #DEFAULT_EVAL_SCORE_BATCH_SIZE}.
- param path Path/URI to the directory containing the datasets to load
- return Evaluation
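For example, a hedged sketch of the DataSetLoader overload above (SerializedDataSetLoader, which reads files written with DataSet.save(...), is an assumption here; substitute your own loader if your data is stored differently):

// Distributed evaluation on serialized DataSet files in a directory:
Evaluation eval = sparkNet.evaluate("hdfs:///dl4j/testData/", new SerializedDataSetLoader());
System.out.println(eval.stats());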
evaluateROCMDS
public ROC evaluateROCMDS(JavaRDD<MultiDataSet> data)
Perform distributed ROC evaluation on the network for the provided {- code JavaRDD<MultiDataSet>} data, using the default evaluation batch size.
SparkDl4jMultiLayer
Main class for training MultiLayerNetwork networks using Spark. Also used for performing distributed evaluation and inference on these networks.
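A usage sketch, parallel to the SparkComputationGraph example earlier (conf, tm, sc, and trainData are placeholders):

import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.spark.impl.multilayer.SparkDl4jMultiLayer;

MultiLayerNetwork net = new MultiLayerNetwork(conf);   // conf: your MultiLayerConfiguration
net.init();

SparkDl4jMultiLayer sparkNet = new SparkDl4jMultiLayer(sc, net, tm);
sparkNet.fit(trainData);   // trainData: JavaRDD<DataSet>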
getSparkContext
public JavaSparkContext getSparkContext()
Get the Spark context associated with this SparkDl4jMultiLayer instance.
- return The JavaSparkContext
getNetwork
public MultiLayerNetwork getNetwork()
- return The MultiLayerNetwork underlying the SparkDl4jMultiLayer
getTrainingMaster
public TrainingMaster getTrainingMaster()
- return The TrainingMaster for this network
setNetwork
public void setNetwork(MultiLayerNetwork network)
Set the network that underlies this SparkDl4jMultiLayer instance.
- param network network to set
getDefaultEvaluationWorkers
public int getDefaultEvaluationWorkers()
Returns the currently set default number of evaluation workers/threads. Note that when the number of workers is provided explicitly in an evaluation method, the default value is not used. In many cases, we may want this to be smaller than the number of Spark threads, to reduce memory requirements. For example, with 32 Spark threads and a large network, we don't want to spin up 32 instances of the network to perform evaluation. It is better (for memory requirements, and reduced cache thrashing) to use, say, 4 workers. If it is not set explicitly, {- link #DEFAULT_EVAL_WORKERS} will be used.
- return Default number of evaluation workers (threads).
setDefaultEvaluationWorkers
public void setDefaultEvaluationWorkers(int workers)
Set the default number of evaluation workers/threads. Note that when the number of workers is provided explicitly in an evaluation method, the default value is not used. In many cases, we may want this to be smaller than the number of Spark threads, to reduce memory requirements. For example, with 32 Spark threads and a large network, we don't want to spin up 32 instances of the network to perform evaluation. It is better (for memory requirements, and reduced cache thrashing) to use, say, 4 workers. If it is not set explicitly, {- link #DEFAULT_EVAL_WORKERS} will be used.
- param workers Default number of evaluation workers (threads).
setCollectTrainingStats
public void setCollectTrainingStats(boolean collectTrainingStats)
Set whether training statistics should be collected for debugging purposes. Statistics collection is disabled by default
- param collectTrainingStats If true: collect training statistics. If false: don’t collect.
getSparkTrainingStats
public SparkTrainingStats getSparkTrainingStats()
Get the training statistics, after collection of stats has been enabled using {- link #setCollectTrainingStats(boolean)}
- return Training statistics
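Putting the two methods together (a sketch; the exportStatsAsHTML signature follows the StatsUtils reference later on this page):

import java.io.FileOutputStream;
import org.deeplearning4j.spark.api.stats.SparkTrainingStats;
import org.deeplearning4j.spark.stats.StatsUtils;

sparkNet.setCollectTrainingStats(true);
sparkNet.fit(trainData);

SparkTrainingStats stats = sparkNet.getSparkTrainingStats();
StatsUtils.exportStatsAsHTML(stats, new FileOutputStream("SparkTrainingStats.html"));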
predict
public Matrix predict(Matrix features)
Predict the given feature matrix
- param features the given feature matrix
- return the predictions
predict
public Vector predict(Vector point)
Predict the given vector
- param point the vector to predict
- return the predicted vector
fit
public MultiLayerNetwork fit(RDD<DataSet> trainingData)
Fit the DataSet RDD. Equivalent to fit(trainingData.toJavaRDD())
- param trainingData the training data RDD to fit
- return the MultiLayerNetwork after training
fit
public MultiLayerNetwork fit(JavaRDD<DataSet> trainingData)
Fit the DataSet RDD
- param trainingData the training data RDD to fit
- return the MultiLayerNetwork after training
fit
public MultiLayerNetwork fit(String path)
Fit the SparkDl4jMultiLayer network using a directory of serialized DataSet objects. The assumption is that the directory contains a number of {- link DataSet} objects, each serialized using {- link DataSet#save(OutputStream)}.
- param path Path to the directory containing the serialized DataSet objects
- return The MultiLayerNetwork after training
fit
public MultiLayerNetwork fit(String path, int minPartitions)
- deprecated Use {- link #fit(String)}
fitPaths
public MultiLayerNetwork fitPaths(JavaRDD<String> paths)
Fit the network using a list of paths for serialized DataSet objects.
- param paths List of paths
- return trained network
fitLabeledPoint
public MultiLayerNetwork fitLabeledPoint(JavaRDD<LabeledPoint> rdd)
Fit a MultiLayerNetwork using Spark MLLib LabeledPoint instances. This will convert the labeled points to the internal DL4J data format and train the model on that.
- param rdd the RDD of LabeledPoint instances to fit
- return the MultiLayerNetwork after training
fitContinuousLabeledPoint
public MultiLayerNetwork fitContinuousLabeledPoint(JavaRDD<LabeledPoint> rdd)
Fits a MultiLayerNetwork using Spark MLLib LabeledPoint instances. This will convert labeled points that have continuous labels (used for regression) to the internal DL4J data format and train the model on that.
- param rdd the javaRDD containing the labeled points
- return a MultiLayerNetwork
getScore
public double getScore()
Gets the last (average) minibatch score from calling fit. This is the average score across all executors for the last minibatch executed in each worker.
calculateScore
public double calculateScore(RDD<DataSet> data, boolean average)
Overload of {- link #calculateScore(JavaRDD, boolean)} for {- code RDD<DataSet>} instead of {- code JavaRDD<DataSet>}.
calculateScore
public double calculateScore(JavaRDD<DataSet> data, boolean average)
Calculate the score for all examples in the provided {- code JavaRDD<DataSet>}, either by summing or averaging over the entire data set.
- param data Data to score
- param average Whether to sum the scores, or average them
calculateScore
public double calculateScore(JavaRDD<DataSet> data, boolean average, int minibatchSize)
Calculate the score for all examples in the provided {- code JavaRDD<DataSet>}, either by summing or averaging over the entire data set, using the specified minibatch size when scoring.
- param data Data to score
- param average Whether to sum the scores, or average them
- param minibatchSize The number of examples to use in each minibatch when scoring. If more examples are in a partition than this, multiple scoring operations will be done (to avoid using too much memory by doing the whole partition in one go)
scoreExamples
public JavaDoubleRDD scoreExamples(RDD<DataSet> data, boolean includeRegularizationTerms)
{- code RDD<DataSet>} overload of {- link #scoreExamples(JavaRDD, boolean)}
scoreExamples
public JavaDoubleRDD scoreExamples(JavaRDD<DataSet> data, boolean includeRegularizationTerms)
Score the examples individually, using the default batch size {- link #DEFAULT_EVAL_SCORE_BATCH_SIZE}. Unlike {- link #calculateScore(JavaRDD, boolean)}, this method returns a score for each example separately. If scoring is needed for specific examples, use either {- link #scoreExamples(JavaPairRDD, boolean)} or {- link #scoreExamples(JavaPairRDD, boolean, int)}, which can have a key for each example.
- param data Data to score
- param includeRegularizationTerms If true: include the l1/l2 regularization terms with the score (if any)
- return A JavaDoubleRDD containing the scores of each example
- see MultiLayerNetwork#scoreExamples(DataSet, boolean)
scoreExamples
public JavaDoubleRDD scoreExamples(RDD<DataSet> data, boolean includeRegularizationTerms, int batchSize)
{- code RDD<DataSet>} overload of {- link #scoreExamples(JavaRDD, boolean, int)}
scoreExamples
public JavaDoubleRDD scoreExamples(JavaRDD<DataSet> data, boolean includeRegularizationTerms, int batchSize)
Score the examples individually, using a specified batch size. Unlike {- link #calculateScore(JavaRDD, boolean)}, this method returns a score for each example separately. If scoring is needed for specific examples, use either {- link #scoreExamples(JavaPairRDD, boolean)} or {- link #scoreExamples(JavaPairRDD, boolean, int)}, which can have a key for each example.
- param data Data to score
- param includeRegularizationTerms If true: include the l1/l2 regularization terms with the score (if any)
- param batchSize Batch size to use when doing scoring
- return A JavaDoubleRDD containing the scores of each example
- see MultiLayerNetwork#scoreExamples(DataSet, boolean)
ParameterAveragingTrainingMaster
A TrainingMaster implementation for training networks on Spark. This is standard parameter averaging with a configurable averaging period.
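A minimal configuration sketch (the values are placeholders):

import org.deeplearning4j.spark.api.TrainingMaster;
import org.deeplearning4j.spark.impl.paramavg.ParameterAveragingTrainingMaster;

// Builder argument: number of examples in each DataSet object of the RDD (often 1)
TrainingMaster tm = new ParameterAveragingTrainingMaster.Builder(1)
        .batchSizePerWorker(32)          // minibatch size used by each worker
        .averagingFrequency(5)           // average parameters every 5 minibatches
        .workerPrefetchNumBatches(2)     // asynchronously prefetch 2 minibatches per worker
        .saveUpdater(true)               // keep updater state (momentum etc.) when averaging
        .build();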
removeHook
public void removeHook(TrainingHook trainingHook)
Remove a training hook from the master.
- param trainingHook the training hook to remove
addHook
public void addHook(TrainingHook trainingHook)
Add a hook to the master for pre- and post-training.
- param trainingHook the training hook to add
fromJson
public static ParameterAveragingTrainingMaster fromJson(String jsonStr)
Create a ParameterAveragingTrainingMaster instance by deserializing a JSON string that has been serialized with {- link #toJson()}
- param jsonStr ParameterAveragingTrainingMaster configuration serialized as JSON
fromYaml
public static ParameterAveragingTrainingMaster fromYaml(String yamlStr)
Create a ParameterAveragingTrainingMaster instance by deserializing a YAML string that has been serialized with {- link #toYaml()}
- param yamlStr ParameterAveragingTrainingMaster configuration serialized as YAML
trainingHooks
public Builder trainingHooks(Collection<TrainingHook> trainingHooks)
Adds training hooks to the master. The training master will set up the workers with the desired hooks for training. This can allow for things like parameter servers and async updates, as well as collecting statistics.
- param trainingHooks the training hooks to add
- return This builder
trainingHooks
public Builder trainingHooks(TrainingHook... hooks)
Adds training hooks to the master. The training master will set up the workers with the desired hooks for training. This can allow for things like parameter servers and async updates, as well as collecting statistics.
- param hooks the training hooks to add
- return This builder
batchSizePerWorker
public Builder batchSizePerWorker(int batchSizePerWorker)
Batch size (in number of examples) to use for each worker, for each fit.
- param batchSizePerWorker Number of examples to use per worker, per fit
averagingFrequency
public Builder averagingFrequency(int averagingFrequency)
Frequency with which to average worker parameters. Note: too high or too low can be bad for different reasons:
- Too low (such as 1) can result in a lot of network traffic
- Too high (>> 20 or so) can result in accuracy issues or problems with network convergence
- param averagingFrequency Frequency (in number of minibatches of size 'batchSizePerWorker') to average parameters
aggregationDepth
public Builder aggregationDepth(int aggregationDepth)
The number of levels in the aggregation tree for parameter synchronization (default: 2). Note: for large models trained with many partitions, increasing this number will reduce the load on the driver and help prevent it from becoming a bottleneck.
- param aggregationDepth RDD tree aggregation channels to use when averaging parameter updates
workerPrefetchNumBatches
public Builder workerPrefetchNumBatches(int prefetchNumBatches)
Set the number of minibatches to asynchronously prefetch in the worker.
Default: 0 (no prefetching)
- param prefetchNumBatches Number of minibatches (DataSets of size batchSizePerWorker) to fetch
saveUpdater
public Builder saveUpdater(boolean saveUpdater)
Set whether the updater (i.e., historical state for momentum, adagrad, etc.) should be saved. NOTE: This can double (or more) the amount of network traffic in each direction, but might improve network training performance (and can be more stable for certain updaters such as adagrad).
This is enabled by default.
- param saveUpdater If true: retain the updater state (default). If false: don't retain it (updaters will be reinitialized in each worker after averaging).
repartionData
public Builder repartionData(Repartition repartition)
Set if/when repartitioning should be conducted for the training data. Default value: always repartition (if required to guarantee the correct number of partitions and the correct number of examples in each partition).
- param repartition Setting for repartitioning
repartitionStrategy
public Builder repartitionStrategy(RepartitionStrategy repartitionStrategy)
Used in conjunction with {- link #repartionData(Repartition)} (which defines when repartitioning should be conducted), repartitionStrategy defines how the repartitioning should be done. See {- link RepartitionStrategy} for details.
- param repartitionStrategy Repartitioning strategy to use
storageLevel
public Builder storageLevel(StorageLevel storageLevel)
Set the storage level for {- code RDD<DataSet>}s.
Note: Spark's StorageLevel.MEMORY_ONLY() and StorageLevel.MEMORY_AND_DISK() can be problematic when it comes to off-heap data (which DL4J/ND4J uses extensively). Spark does not account for off-heap memory when deciding if/when to drop blocks to ensure enough free memory; consequently, for DataSet RDDs that are larger than the total amount of (off-heap) memory, this can lead to OOM issues. Put another way: Spark counts the on-heap size of DataSet and INDArray objects only (which is negligible), resulting in a significant underestimate of the true DataSet object sizes. More DataSets are thus kept in memory than we can really afford.
- param storageLevel Storage level to use for DataSet RDDs
storageLevelStreams
public Builder storageLevelStreams(StorageLevel storageLevelStreams)
Set the storage level for RDDs used when fitting data from streams: either PortableDataStreams (sc.binaryFiles, via {- link SparkDl4jMultiLayer#fit(String)} and {- link SparkComputationGraph#fit(String)}) or String paths (via {- link SparkDl4jMultiLayer#fitPaths(JavaRDD)}, {- link SparkComputationGraph#fitPaths(JavaRDD)} and {- link SparkComputationGraph#fitPathsMultiDataSet(JavaRDD)}).
Default storage level is StorageLevel.MEMORY_ONLY() which should be appropriate in most cases.
- param storageLevelStreams Storage level to use
rddTrainingApproach
public Builder rddTrainingApproach(RDDTrainingApproach rddTrainingApproach)
The approach to use when training on a {- code RDD<DataSet>} or {- code RDD<MultiDataSet>}.
- param rddTrainingApproach Training approach to use when training from a {- code RDD<DataSet>} or {- code RDD<MultiDataSet>}
exportDirectory
public Builder exportDirectory(String exportDirectory)
When {- link #rddTrainingApproach(RDDTrainingApproach)} is set to {- link RDDTrainingApproach#Export} (as it is by default), the data is exported to a temporary directory first.
Default: null -> use {hadoop.tmp.dir}/dl4j/. In this case, data is exported to {hadoop.tmp.dir}/dl4j/SOME_UNIQUE_ID/. If you specify a directory, the directory {exportDirectory}/SOME_UNIQUE_ID/ will be used instead.
- param exportDirectory Base directory to export data
rngSeed
public Builder rngSeed(long rngSeed)
Random number generator seed, used mainly for enforcing repeatable splitting on RDDs. Default: no seed set (i.e., random seed)
- param rngSeed RNG seed
- return This builder
collectTrainingStats
public Builder collectTrainingStats(boolean collectTrainingStats)
Whether training stats collection should be enabled (disabled by default).
- see ParameterAveragingTrainingMaster#setCollectTrainingStats(boolean)
- see org.deeplearning4j.spark.stats.StatsUtils#exportStatsAsHTML(SparkTrainingStats, OutputStream)
- param collectTrainingStats If true: collect training statistics