Regression - RandomForestRegressorTrain(batch) - 《Alink v1.0.1 Document》

Description
Parameters
Script Example
- Code
- Result

Description

The random forest use the bagging to prevent the overfitting.

In the operator, we implement three type of decision tree to increase diversity of the forest.

id3 cart c4.5 and the criteria is

information gini information ratio mse

Parameters

Name	Description	Type	Required？	Default Value
featureSubsamplingRatio	Ratio of the features used in each tree, in range (0, 1].	Double		0.2
numSubsetFeatures	The number of features to consider for splits at each tree node.	Integer		2147483647
numTrees	Number of decision trees.	Integer		10
subsamplingRatio	Ratio of the training samples used for learning each decision tree.	Double		100000.0
treeType	treeType	String		“avg”
maxDepth	depth of the tree	Integer		2147483647
minSamplesPerLeaf	Minimal number of sample in one leaf.	Integer		2
createTreeMode	series or parallel	String		“series”
maxBins	MAX number of bins for continuous feature	Integer		128
maxMemoryInMB	max memory usage in tree histogram aggregate.	Integer		64
featureCols	Names of the feature columns used for training in the input table	String[]	✓
labelCol	Name of the label column in the input table	String	✓
categoricalCols	Names of the categorical columns used for training in the input table	String[]
weightCol	Name of the column indicating weight	String		null
maxLeaves	max leaves of tree	Integer		2147483647
minSampleRatioPerChild	Minimal value of: (num of samples in child)/(num of samples in its parent).	Double		0.0
minInfoGain	minimum info gain when performing split	Double		0.0

Script Example

Code

import numpy as np
import pandas as pd
from pyalink.alink import *
def exampleData():
    return np.array([
        [1.0, "A", 0, 0, 0],
        [2.0, "B", 1, 1, 0],
        [3.0, "C", 2, 2, 1],
        [4.0, "D", 3, 3, 1]
    ])
def sourceFrame():
    data = exampleData()
    return pd.DataFrame({
        "f0": data[:, 0],
        "f1": data[:, 1],
        "f2": data[:, 2],
        "f3": data[:, 3],
        "label": data[:, 4]
    })
def batchSource():
    return dataframeToOperator(
        sourceFrame(),
        schemaStr='''
    f0 double, 
    f1 string, 
    f2 int, 
    f3 int, 
    label int
    ''',
        op_type='batch'
    )
def streamSource():
    return dataframeToOperator(
        sourceFrame(),
        schemaStr='''
    f0 double, 
    f1 string, 
    f2 int, 
    f3 int, 
    label int
    ''',
        op_type='stream'
    )
trainOp = (
    RandomForestRegTrainBatchOp()
    .setLabelCol('label')
    .setFeatureCols(['f0', 'f1', 'f2', 'f3'])
)
predictBatchOp = (
    RandomForestRegPredictBatchOp()
    .setPredictionCol('pred')
)
(
    predictBatchOp
    .linkFrom(
        batchSource().link(trainOp),
        batchSource()
    )
    .print()
)
predictStreamOp = (
    RandomForestRegPredictStreamOp(
        batchSource().link(trainOp)
    )
    .setPredictionCol('pred')
)
(
    predictStreamOp
    .linkFrom(
        streamSource()
    )
    .print()
)
StreamOperator.execute()

Result

Batch prediction

    f0 f1  f2  f3  label  pred
0  1.0  A   0   0      0   0.0
1  2.0  B   1   1      0   0.0
2  3.0  C   2   2      1   1.0
3  4.0  D   3   3      1   1.0

Stream Prediction

f0    f1    f2    f3    label    pred
0    2.0    B    1    1    0    0.0
1    4.0    D    3    3    1    1.0
2    1.0    A    0    0    0    0.0
3    3.0    C    2    2    1    1.0